I am trying to parallelise calls to MKL from within an OpenMP parallel region to test whether the code runs faster. Parallelising only part of the code does not give a linear speed-up, so a mixed approach seems to make sense. An outline of the code is as follows:
    #pragma omp parallel for
    for (int i = 0; i < N; i += 2) {
        some_function(i);
    }
where some_function makes calls to zgesvd. To start with, I would like the OpenMP region to run on 2 threads and each call to zgesvd inside it to also run on 2 threads (4 active threads in total). To achieve this I make the following calls at the beginning of the program:
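    omp_set_num_threads(2);        // outer OpenMP team: 2 threads
    mkl_set_num_threads(2);        // threads requested for each MKL call
    mkl_set_dynamic(false);        // do not let MKL adjust its thread count
    omp_set_nested(true);          // allow nested parallel regions
    omp_set_max_active_levels(2);  // two active levels: my loop, then MKL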
I have also tried setting the number of OpenMP threads to 4 and then adding a num_threads(2) clause to the pragma, with no success. Currently the program creates 3 threads (not the expected 4) on both Windows and Linux, using the latest MKL and Intel compilers. Changing the argument of omp_set_max_active_levels to 3 produces 4 threads on Windows and 3 threads on Linux. However, I can only see how many threads exist; I don't know what each of them is doing.
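To see what the threads are doing rather than just counting them, something like the following might help. It is only a minimal sketch and does not call MKL at all: it opens an outer team and an inner team with plain OpenMP and records which (outer id, inner id) pairs actually ran. The name probe_nested_teams is made up for this illustration, and the serial fallbacks are there only so that it still compiles when OpenMP is not enabled.

```cpp
#include <cstdio>
#include <set>
#include <utility>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch still compiles without -fopenmp / -qopenmp.
static int omp_get_thread_num() { return 0; }
static int omp_get_ancestor_thread_num(int) { return 0; }
static void omp_set_max_active_levels(int) {}
static void omp_set_dynamic(int) {}
#endif

// Hypothetical helper: requests an outer team of `outer` threads, each opening
// an inner team of `inner` threads, and returns the (outer id, inner id) pairs
// that were actually executed.
std::set<std::pair<int, int>> probe_nested_teams(int outer, int inner) {
    std::set<std::pair<int, int>> seen;
    omp_set_dynamic(0);            // ask the runtime not to shrink the teams
    omp_set_max_active_levels(2);  // allow two active levels of parallelism
    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            int o = omp_get_ancestor_thread_num(1);  // id in the outer team
            int i = omp_get_thread_num();            // id in the inner team
            #pragma omp critical
            {
                seen.insert(std::make_pair(o, i));
                std::printf("outer %d, inner %d\n", o, i);
            }
        }
    }
    return seen;
}
```

If probe_nested_teams(2, 2) returns four pairs, the runtime is presumably opening the inner teams; fewer pairs would suggest the inner regions are being serialised. This says nothing about how many threads MKL itself starts inside zgesvd, so it only narrows down whether the OpenMP side of the setup behaves as intended. It needs to be built with OpenMP enabled (e.g. -fopenmp for GCC or -qopenmp for the Intel compiler) to be meaningful.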
Best regards
P.S. I noticed that by default MKL only tries to use 4 threads on a quad-core CPU with hyper-threading enabled. According to top (which I assume is reliable, though I am not sure), the 4 threads do not always run one per core, although that may be up to the OS. So why the limit?