Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Significant Overhead if threaded MKL is called from OpenMP parallel region


Hello,

My aim is to diagonalize square matrices of different sizes d×d in parallel. To this end I wrote a for loop. In each iteration, aligned memory (depending on the dimension d) is allocated with mkl_malloc(). The matrix is filled, and dsyev is called once to determine the optimal workspace size. I then allocate the aligned workspace with mkl_malloc(), call dsyev a second time to diagonalize the matrix, and deallocate both the workspace and the matrix memory with mkl_free().

Since the diagonalizations are independent of each other, I want to run them in parallel using OpenMP. For this I used the pragma #pragma omp parallel for with appropriate scheduling. The memory belonging to one diagonalization is never accessed by a different thread.

I run the code with OMP_NESTED=true, MKL_DYNAMIC=false and OMP_DYNAMIC=false. With OMP_NUM_THREADS=1 and MKL_NUM_THREADS=4, 8 or 16, no significant overhead (%sys in the Linux top command) is observed. With OMP_NUM_THREADS=4 and MKL_NUM_THREADS=1, i.e. calling the sequential version of MKL dsyev, there is also no significant overhead, and roughly the same performance is achieved as in the opposite case (MKL_NUM_THREADS=4, OMP_NUM_THREADS=1).
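For reference, the environment for the mixed (nested) case would be set up like this, using the values from the measurements above:

```shell
# Nested parallelism on, dynamic thread-count adjustment off,
# as described in the post.
export OMP_NESTED=true
export OMP_DYNAMIC=false
export MKL_DYNAMIC=false
export OMP_NUM_THREADS=4   # outer threads for the omp parallel for
export MKL_NUM_THREADS=4   # inner threads inside each MKL dsyev call
```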

BUT, if I now try to actually exploit the OpenMP parallelization, for example with OMP_NUM_THREADS=2 or 4 and MKL_NUM_THREADS=4, I get a huge slowdown: up to 30% of the processor's capacity is spent in system calls (kernel time), and the more OpenMP threads I use, the greater the slowdown. I tried different scheduling strategies to ensure the best possible load balancing; changing the scheduling does not make the overhead go away.

Are the frequent calls to mkl_malloc() and mkl_free() from different threads the reason for this? If so, I could allocate the maximum memory needed as one big block before entering the parallel region. Unfortunately, the MKL routines also have their own internal memory management to tune their performance. Is it likely that the internal memory management of the threaded MKL dsyev can also cause such a large overhead? Are there any other possible reasons for this slowdown?

Best regards,

Felix Kaiser

