Good day, I have a parallel program based on the MKL library. Recently I got access to a 28-core machine with two Intel Xeon E5-2690 processors and 256 GB of RAM. I soon noticed that my program's execution time depends on the number of threads in an unexpected way: starting from 14 threads, it grows as the number of threads increases.
I then ran some simple parallel scalability tests, given in the attached main.cpp, and got the following interesting results (number of threads against wall time):
gesv, inversion of a matrix of linear size 45'000
2 12:12.40
3 12:09.93
4 6:16.89
5 6:19.99
6 4:16.09
8 3:11.99
10 2:36.16
12 2:11.92
15 2:07.66
18 1:36.31
23 1:24.16
27 1:18.03
gesv_multipl, 3'000'000 inversions of a matrix of linear size 100
2 5:18.72
3 5:14.52
4 5:13.01
5 5:14.08
6 5:18.92
8 5:29.00
10 6:05.00
12 6:09.86
15 13:48.54
18 12:50.62
23 6:30.92
27 5:48.64
sparse_mv_multipl, 5'000'000 matvec products of a CSR matrix of linear size 100'000 with 1'000'000'000 nonzero elements
2 9:09.22
3 6:53.14
4 5:42.93
5 5:01.77
6 4:29.68
8 3:55.02
10 3:34.94
12 3:22.22
15 3:07.52
18 7:40.43
23 4:42.99
27 2:46.48
As one can see, in the gesv_multipl and sparse_mv_multipl tests the execution time stops scaling once the number of threads exceeds 10-12, and it even jumps sharply at 15-18 threads before coming back down.
Additionally, I ran VTune hotspot and threading analyses for two cases:
1. gesv test, 15 threads
2. gesv_multipl, 15 threads
Case 2 is an example of very poor behaviour, as the timings above show. For this case I obtained the hotspot results shown in the files 1.png and 2.png. The key observation, as far as I can tell, is the enormous spin time, mostly produced by the kmp_fork_barrier call. In case 1, in contrast, no problems are reported.
Why is this so? Am I doing something wrong, or is this just the way it works? Thanks in advance.
Some additional information:
Operating system and version - CentOS Linux release 7.6.1810 (Core)
Library version - 2019.4.243
Compiler version - icpc (ICC) 19.0.4.243 20190416
GNU Compiler Collection (GCC) - gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
Steps to reproduce the error (include makefiles, command lines, small test cases, and build instructions) - compile main.cpp
icpc -o a.out main.cpp -O3 -mkl=parallel -std=c++17
and execute on a machine with two Intel Xeon E5-2690 processors and 256 GB of RAM, with the specified number of threads.