Quantcast
Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
Viewing all articles
Browse latest Browse all 3005

Benchmarking MKL Lapack on ccNUMA systen

$
0
0

I work with a large ccNUMA SGI Altix system. One of our users is trying to benchmark some LAPACK routines on our system and is getting some disappointing scaling - stops scaling after 4 threads.

The test I am running is of diagonalizing a 4097x4097 matrix of double precision floats. It uses the routine DSYEV.

From analysing the hotspots in VTune, I find that almost all the time is spent in overhead and spin time from the functions:

[OpenMP dispatcher]<- pthread_create_child and in [OpenMP fork].

The code was compiled using ifort with the options: -O3 -openmp -g -traceback -xHost -align -ansi-alias -mkl=parallel.  Using version 13.1.0.146 of the compiler and version 11 of MKL. The system is made up of 8 core Xeon sandy bridge sockets.

The code was ran with the envars:

OMP_NUM_THREADS=16
MKL_NUM_THREADS=16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled

It is also ran with the SGI command for NUMA systems 'dplace -x2' which locks the threads to their cores.

So I suspect that there is something up with the options for the MKL, or the library isn't configured properly for our system. I have attached the code used.

Does anybody have any ideas on this?

Jim

AttachmentSize
Downloadcode.tar.gz2.02 KB

Viewing all articles
Browse latest Browse all 3005

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>