Hello
I have been trying to optimize matrix multiplication on NUMA systems, but so far without much luck.
I have experimented with the dgemm routine and first-touch initialization.
A snippet of my code looks like this:
print*, 'initiating first touch'
A=0
B=0
C=0
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
!$OMP DO SCHEDULE(STATIC)
do i=1,dim
   rA(:,i)=0.d0
   rB(:,i)=0.d0
   rC(:,i)=0.d0
   cA(i,:)=0.d0
   cB(i,:)=0.d0
   cC(i,:)=0.d0
end do
!$OMP END DO
!$OMP END PARALLEL
print*, 'first touch done'

Runtimebegin=omp_get_wtime()
do i=1,10
   call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C,dim)
end do
Runtimeend=omp_get_wtime()
print*, 'ABC', runtimeend-runtimebegin

call amat(rA,y,z,n,m,mm)

Runtimebegin=omp_get_wtime()
do i=1,10
   call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
end do
Runtimeend=omp_get_wtime()
print*, 'cAcBcC', runtimeend-runtimebegin

Runtimebegin=omp_get_wtime()
do i=1,10
   call dgemm('N','N',dim,dim,dim,1.d0,rA,dim,rB,dim,0.d0,rC,dim)
end do
Runtimeend=omp_get_wtime()
print*, 'rArBrC', runtimeend-runtimebegin

Runtimebegin=omp_get_wtime()
do i=1,10
   call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,rB,dim,0.d0,cC,dim)
end do
Runtimeend=omp_get_wtime()
print*, 'cArBcC', runtimeend-runtimebegin
From this I get the following timings (in seconds, for 10 dgemm calls each) on a NUMA architecture:
ABC 37.0858174243185
cAcBcC 9.20657384615333
rArBrC 9.42917347316688
cArBcC 9.22702269622823
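As we can see, it is important to spread the data across the NUMA system: the serially initialized matrices (ABC) are about four times slower than the ones initialized in parallel. How the data is spread, however, seems to have little influence, or at least I haven't found an optimal way to spread it.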
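To make the two first-touch patterns concrete, here is a minimal standalone sketch of the rA(:,i) and cA(i,:) style loops. It is only an illustration, not the original program: the names n, a_cols, a_rows and owner are hypothetical, n is kept tiny, and no dgemm call is made. It records which thread touches each index under SCHEDULE(STATIC).

```fortran
! Sketch only: hypothetical names, not the original program.
program first_touch_sketch
  !$ use omp_lib
  implicit none
  integer, parameter :: n = 8              ! tiny size, for illustration only
  double precision, allocatable :: a_cols(:,:), a_rows(:,:)
  integer :: owner(n)
  integer :: i

  allocate(a_cols(n,n), a_rows(n,n))
  owner = 0                                ! stays 0 when built without OpenMP

  !$omp parallel default(shared) private(i)
  !$omp do schedule(static)
  do i = 1, n
     a_cols(:,i) = 0.d0                    ! thread writes a contiguous column
     a_rows(i,:) = 0.d0                    ! thread writes a strided row
     !$ owner(i) = omp_get_thread_num()
  end do
  !$omp end do
  !$omp end parallel

  do i = 1, n
     print '(a,i0,a,i0)', 'index ', i, ' first touched by thread ', owner(i)
  end do

  deallocate(a_cols, a_rows)
end program first_touch_sketch
```

One point to keep in mind when comparing the two patterns: first-touch placement happens per memory page, not per element. Assuming 4 KiB pages, one page holds 512 double-precision values, and since Fortran arrays are column-major these are consecutive elements of a column. In the row-wise loop a page is therefore placed by whichever thread first writes any row that falls inside it.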
So I ask, is there an optimal way to spread my matrices when using dgemm?
Or, if dgemm is not suitable on NUMA, are there any alternatives that handle matrix multiplication faster on a NUMA architecture?
Thanks in advance
Tue