Hi everyone,
I am testing the performance of some code that calls MKL's cblas_cgemm() from within a parallel TBB section. I am using MKL 2017 Update 1 and linking with mkl_intel_thread.dll on a Windows machine. My machine has 4 physical cores (8 logical threads).
I have about 10k matrices to multiply using this program, each of size ~500x500. tbb::parallel_for is used to parallelize the workload, with each thread taking a chunk of the matrices and doing the calculation using cblas_cgemm().
To avoid oversubscription, I call mkl_set_num_threads(1).
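To make the setup concrete, here is a minimal stand-alone sketch of the pattern, not the actual program. std::thread stands in for tbb::parallel_for and a naive triple loop stands in for the single-threaded cblas_cgemm() call, so it builds without TBB or MKL. The names multiply_one and multiply_batch are made up for illustration.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

using cfloat = std::complex<float>;
using Matrix = std::vector<cfloat>;  // row-major n x n storage

// Stand-in for one single-threaded cblas_cgemm() call: C = A * B.
// In the real program MKL would do this work, with mkl_set_num_threads(1)
// called beforehand so each call stays on the calling thread.
void multiply_one(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            cfloat sum(0.0f, 0.0f);
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Stand-in for tbb::parallel_for: split the batch into contiguous chunks
// and give each chunk to one worker thread. Every matrix in cs must
// already hold n * n elements.
void multiply_batch(const std::vector<Matrix>& as, const std::vector<Matrix>& bs,
                    std::vector<Matrix>& cs, std::size_t n, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    const std::size_t count = as.size();
    const std::size_t chunk = (count + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(count, begin + chunk);
        if (begin >= end) break;  // more threads than remaining work
        workers.emplace_back([&, begin, end] {
            // The products are independent, so no locking is needed here.
            for (std::size_t m = begin; m < end; ++m) {
                multiply_one(as[m], bs[m], cs[m], n);
            }
        });
    }
    for (std::thread& w : workers) w.join();
}
```

The timings are taken around the equivalent of multiply_batch, varying only the number of worker threads.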
Here are the timings I collected using 1, 2, 4 and 8 threads:
1 thread: 2350
2 threads: 1222
4 threads: 781
8 threads: 720
I was hoping to see close-to-linear speed-up up to at least 4 threads, since I have 4 physical cores. However, as you can see, the speed-up at 4 threads is quite poor, only about 3x, and going to 8 threads adds very little on top of that (about 3.3x; I suppose that could be attributed to hyper-threading, but I am not sure that is correct).
So my question is: is the 3x speed-up at 4 threads normal? Did I do something wrong? I can understand that the speed-up would saturate as the number of cores keeps increasing, but 4 seems to be way too early.
I tried some other matrix dimensions, but saw largely the same pattern, or sometimes even worse (2.5x speed-up at 4 threads), depending on the matrix size.
Can anybody please shed some light on this? Thanks!
Ling