Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

optimizing dgemm on NUMA systems


Hello

I have been trying to optimize matrix multiplication on NUMA systems but so far without much luck.

I have experimented with the dgemm routine and with first-touch initialization of the matrices.

A snippet of my code looks like this:

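    print*, 'initiating first touch'
    ! A, B and C are initialized serially, so a single thread touches them
    A=0
    B=0
    C=0
    ! the r* and c* matrices are initialized in parallel, one chunk of i per thread
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
    !$OMP DO SCHEDULE(STATIC)
    do i=1,dim
        ! rA, rB, rC: each thread touches whole columns (contiguous in memory)
        rA(:,i)=0.d0
        rB(:,i)=0.d0
        rC(:,i)=0.d0
        ! cA, cB, cC: each thread touches whole rows (strided in memory)
        cA(i,:)=0.d0
        cB(i,:)=0.d0
        cC(i,:)=0.d0
    end do
    !$OMP END DO
    !$OMP END PARALLEL

    print*, 'first touch done'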

    ! time 10 dgemm calls on the serially initialized matrices
    Runtimebegin=omp_get_wtime()
    do i=1,10
        call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'ABC', Runtimeend-Runtimebegin

    call amat(rA,y,z,n,m,mm)

    ! time 10 dgemm calls on the matrices touched row by row
    Runtimebegin=omp_get_wtime()
    do i=1,10
        call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'cAcBcC', Runtimeend-Runtimebegin

    ! time 10 dgemm calls on the matrices touched column by column
    Runtimebegin=omp_get_wtime()
    do i=1,10
        call dgemm('N','N',dim,dim,dim,1.d0,rA,dim,rB,dim,0.d0,rC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'rArBrC', Runtimeend-Runtimebegin

    ! time 10 dgemm calls on a mix of the two layouts
    Runtimebegin=omp_get_wtime()
    do i=1,10
        call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,rB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'cArBcC', Runtimeend-Runtimebegin

This gives the following timings (in seconds, for 10 dgemm calls each) on a NUMA architecture:

 ABC   37.0858174243185
 cAcBcC   9.20657384615333
 rArBrC   9.42917347316688
 cArBcC   9.22702269622823

As the timings show, it is important to distribute the data across the NUMA nodes, but how it is distributed seems to have little influence; at least I haven't found an optimal distribution.
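
To put rough numbers on this, here is the arithmetic using only the timings listed above, rounded to four decimals. The "speedup" is simply the ABC time divided by each of the other times, and the "spread" is the relative gap between the slowest and fastest of the three parallel-touched cases:

```latex
\begin{align*}
\text{speedup}_{\text{cAcBcC}} &= \frac{37.0858}{9.2066} \approx 4.03 \\
\text{speedup}_{\text{rArBrC}} &= \frac{37.0858}{9.4292} \approx 3.93 \\
\text{speedup}_{\text{cArBcC}} &= \frac{37.0858}{9.2270} \approx 4.02 \\
\text{spread} &= \frac{9.4292 - 9.2066}{9.2066} \approx 0.024
\end{align*}
```

So the three parallel-touched variants are within about 2.4% of each other, which may well be within run-to-run noise, since each case was timed only once.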

So I ask, is there an optimal way to spread my matrices when using dgemm?

Or, if dgemm is not suitable on NUMA, are there any alternatives that handle matrix multiplication faster on NUMA architectures?

Thanks in advance

Tue

 

