Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Significant Overhead if threaded MKL is called from OpenMP parallel region


Hello,

My aim is to diagonalize square matrices of different sizes d×d in parallel. To this end I wrote a for loop. In each iteration, aligned memory (depending on the dimension d) is allocated with mkl_malloc(). The matrix is filled, and dsyev is called once to determine the optimal workspace size. I then allocate the aligned workspace with mkl_malloc(), call dsyev a second time to diagonalize the matrix, and deallocate both the workspace and the matrix memory with mkl_free().

Since the diagonalizations are independent of each other, I want to run them in parallel using OpenMP. For this I used the pragma #pragma omp parallel for with appropriate scheduling. The memory belonging to one diagonalization is never accessed by a different thread.

I run the code with OMP_NESTED=true, MKL_DYNAMIC=false and OMP_DYNAMIC=false. With OMP_NUM_THREADS=1 and MKL_NUM_THREADS=4, 8 or 16, no significant overhead (%sys in the Linux top command) is observed. With OMP_NUM_THREADS=4 and MKL_NUM_THREADS=1, i.e. calling the sequential version of MKL dsyev, there is also no significant overhead, and roughly the same performance is achieved as in the opposite case (MKL_NUM_THREADS=4, OMP_NUM_THREADS=1).
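For reference, the environment for the mixed (nested) case would be set up like this, using the values from the measurements above:

```shell
# Nested parallelism on, dynamic thread-count adjustment off,
# as described in the post.
export OMP_NESTED=true
export OMP_DYNAMIC=false
export MKL_DYNAMIC=false
export OMP_NUM_THREADS=4   # outer threads for the omp parallel for
export MKL_NUM_THREADS=4   # inner threads inside each MKL dsyev call
```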

BUT, if I now try to actually exploit the OpenMP parallelization, for example with OMP_NUM_THREADS=2 or 4 and MKL_NUM_THREADS=4, I get a huge slowdown: up to 30% of the processor's capacity is spent in system calls (kernel time), and the more OpenMP threads I use, the greater the slowdown. I tried different scheduling strategies to ensure the best possible load balancing; changing the scheduling does not make the overhead go away.

Are the frequent calls to mkl_malloc() and mkl_free() from different threads the reason for this? If so, I could allocate the maximum memory needed as one big block before entering the parallel region. Unfortunately, the MKL routines also have their own internal memory management to tune their performance. Is it likely that the internal memory management of the threaded MKL dsyev can also cause such a large overhead? Are there any other possible reasons for this slowdown?

Best regards,

Felix Kaiser

