I work with a large ccNUMA SGI Altix system. One of our users is benchmarking some LAPACK routines on it and is seeing disappointing scaling: performance stops improving beyond 4 threads.
The test I am running diagonalizes a 4097x4097 matrix of double-precision floats using the routine DSYEV.
From analysing the hotspots in VTune, I find that almost all the time is overhead and spin time, attributed to the functions:
[OpenMP dispatcher] <- pthread_create_child and [OpenMP fork].
The code was compiled using ifort with the options: -O3 -openmp -g -traceback -xHost -align -ansi-alias -mkl=parallel. I am using version 13.1.0.146 of the compiler and version 11 of MKL. The system is made up of 8-core Xeon Sandy Bridge sockets.
The code was run with the environment variables:
OMP_NUM_THREADS=16
MKL_NUM_THREADS=16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled
It is also run under the SGI command for NUMA systems 'dplace -x2', which locks the threads to their cores.
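For anyone who wants to reproduce the scaling test, a thread-count sweep along the following lines should be close to what I run. This is only a sketch: `./dsyev_bench` is a placeholder name for the attached benchmark binary, `run_with_threads` is a helper I made up for illustration, and the thread counts in the loop are example values.

```shell
#!/bin/sh
# Sketch of a thread-count sweep using the settings listed above.
# BENCH is a placeholder for the benchmark binary (hypothetical name).
BENCH="${BENCH:-./dsyev_bench}"

# Fixed settings, as given in the post.
export KMP_STACKSIZE=2gb
export OMP_NESTED=FALSE
export MKL_DYNAMIC=FALSE
export KMP_LIBRARY=turnaround
export KMP_AFFINITY=disabled

# Run a command with both OpenMP and MKL thread counts set to $1.
# Usage: run_with_threads <count> <command> [args...]
run_with_threads() {
    _count="$1"
    shift
    OMP_NUM_THREADS="$_count" MKL_NUM_THREADS="$_count" "$@"
}

# Only run the sweep if the benchmark binary is actually present.
if [ -x "$BENCH" ]; then
    for n in 1 2 4 8 16; do
        echo "threads=$n"
        run_with_threads "$n" dplace -x2 "$BENCH"
    done
fi
```

Setting both variables to the same value for each run keeps the OpenMP and MKL thread counts consistent, so any change in run time between iterations should come from the thread count alone.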
So I suspect that something is wrong with the MKL options, or that the library isn't configured properly for our system. I have attached the code used.
Does anybody have any ideas on this?
Jim