Hello! I am trying to implement sgemm matrix multiplication on multiple physical cores, and I am a little confused about how to do so.
Say I have obtained 9 physical cores from an HPC system and I want sgemm to use all of them for the matrix multiplication. I do not want several threads running on each core; I want the 9 cores as a whole, with one thread per core. So in a way, I guess you could say that the 9 cores are the threads to be used by sgemm. Below is some code I have created, which I believe implements what I want to do. Is this implementation correct?
program mkl_matrixmul
    use mpi
    implicit none
    integer :: N, max_threads, mkl_get_max_threads
    real, allocatable, dimension(:,:) :: A, B, C
    integer :: ierror, num_cores, my_rank
    double precision :: time1, time2

    call MPI_Init(ierror)                                  ! start MPI; ierror receives the error code
    call MPI_Comm_size(MPI_COMM_WORLD, num_cores, ierror)  ! number of MPI processes (ranks), stored in num_cores
    call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)    ! rank of this process
    call MPI_Barrier(MPI_COMM_WORLD, ierror)
    if (my_rank == 0) then
        ! starting the timer
        time1 = MPI_Wtime()
    end if

    N = 61740
    allocate(A(N,N), B(N,N), C(N,N))   ! each MPI process allocates its own full copies
    A = 1.0
    B = 2.0
    C = 0.0

    call mkl_set_num_threads(num_cores)   ! request num_cores MKL threads in each MPI process
    ! C = 1.0*A*B + 1.0*C, computed in full by every MPI process
    call sgemm('N', 'N', N, N, N, 1.0, A, N, B, N, 1.0, C, N)

    call MPI_Barrier(MPI_COMM_WORLD, ierror)
    if (my_rank == 0) then
        ! printing the elapsed time
        time2 = MPI_Wtime()
        print *, 'elapsed time', time2 - time1
        print *, C(1,2)
    end if
    call MPI_Finalize(ierror)
end program mkl_matrixmul
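For comparison, here is a minimal sketch of the single-process variant of the same idea: no MPI, with MKL's internal threading spreading one sgemm call over the cores. It assumes the threaded MKL library is linked and that all 9 cores sit on one shared-memory node. The program name, the small matrix size, and the use of system_clock for timing are illustrative choices and not part of the code above.

```fortran
program sgemm_threads_sketch
    ! Sketch only: one process, no MPI; MKL's own threads do the parallel work.
    ! Assumes threaded MKL is linked and all cores are on one shared-memory node.
    implicit none
    integer, parameter :: n = 2000          ! small illustrative size (assumption)
    integer, parameter :: n_threads = 9     ! one MKL thread per physical core
    integer, external :: mkl_get_max_threads
    real, allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: count0, count1, count_rate

    allocate(a(n,n), b(n,n), c(n,n))
    a = 1.0
    b = 2.0

    ! Ask MKL to use up to n_threads threads for subsequent calls.
    call mkl_set_num_threads(n_threads)
    print *, 'MKL max threads:', mkl_get_max_threads()

    call system_clock(count0, count_rate)
    ! c = 1.0*a*b + 0.0*c; with beta = 0.0 the old contents of c are not used
    call sgemm('N', 'N', n, n, n, 1.0, a, n, b, n, 0.0, c, n)
    call system_clock(count1)

    print *, 'elapsed time (s):', real(count1 - count0) / real(count_rate)
    print *, 'c(1,2) =', c(1,2)   ! each entry is the sum of n products 1.0*2.0
end program sgemm_threads_sketch
```

The thread count could also be set outside the program through the MKL_NUM_THREADS environment variable instead of the mkl_set_num_threads call.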
Also, if it helps, I am using a Sandy Bridge node with 256 GB of memory.
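The memory figure matters for the matrix size used in the program. The following is a rough estimate only, assuming a 4-byte default real (no compiler flag promoting reals) and assuming all 9 MPI processes are placed on that one node; the program and variable names are illustrative.

```fortran
program sgemm_memory_estimate
    ! Back-of-the-envelope memory estimate for three N x N single-precision matrices.
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    integer(int64), parameter :: n = 61740_int64            ! matrix order from the post
    integer(int64), parameter :: bytes_per_real = 4_int64   ! assumption: default real is 4 bytes
    integer(int64), parameter :: n_matrices = 3_int64       ! A, B, and C
    integer(int64), parameter :: n_ranks = 9_int64          ! assumption: all 9 ranks share one node
    integer(int64) :: bytes_per_rank, bytes_total

    ! 64-bit integers are needed: n*n alone exceeds the 32-bit integer range.
    bytes_per_rank = n_matrices * n * n * bytes_per_real
    bytes_total = n_ranks * bytes_per_rank

    print '(a,f8.1,a)', 'per MPI process: ', real(bytes_per_rank, real64) / 1.0e9_real64, ' GB'
    print '(a,f8.1,a)', 'all processes:   ', real(bytes_total, real64) / 1.0e9_real64, ' GB'
end program sgemm_memory_estimate
```

Under those assumptions the three matrices take about 45.7 GB per process and about 411.7 GB for nine processes, which exceeds 256 GB whether that figure is read as decimal or binary gigabytes.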
Thank you,
Brandon