Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Performance issue with using MKL within TBB


Hi everyone,

I am testing the performance of some code that calls MKL's cblas_cgemm() from within a parallel TBB section. I am using MKL 2017 Update 1, linking against mkl_intel_thread.dll on a Windows machine. My machine has 4 physical cores (8 logical threads).

I have about 10k matrices to multiply with this program, each of size roughly 500x500. tbb::parallel_for is used to parallelize the workload: each thread takes a chunk of the matrices and does the calculation using cblas_cgemm().

To avoid oversubscription, I call mkl_set_num_threads(1).
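For readers following along, the setup described above might look roughly like the sketch below. It is not the original code: the names `Matrix`, `multiply_one` and `multiply_all`, the row-major `std::vector` storage, and the use of one matrix pair per loop index are all assumptions made for illustration. The `#else` branches are a plain sequential fallback, added only so the sketch still builds on a machine without MKL and TBB.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

#if __has_include(<mkl.h>) && __has_include(<tbb/parallel_for.h>)
#include <mkl.h>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#define USE_MKL_TBB 1
#else
#define USE_MKL_TBB 0
#endif

using cfloat = std::complex<float>;   // same memory layout as MKL's single-precision complex
using Matrix = std::vector<cfloat>;   // hypothetical storage: row-major n x n

// C = A * B for a single n x n pair, computed by one thread.
inline void multiply_one(const Matrix& a, const Matrix& b, Matrix& c, int n) {
#if USE_MKL_TBB
    const cfloat alpha(1.0f, 0.0f), beta(0.0f, 0.0f);
    cblas_cgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                &alpha, a.data(), n, b.data(), n, &beta, c.data(), n);
#else
    // Fallback without MKL: naive triple loop (slow, for illustration only).
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            cfloat sum(0.0f, 0.0f);
            for (int k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
#endif
}

// Multiply every pair (a[i], b[i]); the pairs are independent of each other.
inline std::vector<Matrix> multiply_all(const std::vector<Matrix>& a,
                                        const std::vector<Matrix>& b, int n) {
    std::vector<Matrix> c(a.size(), Matrix(static_cast<std::size_t>(n) * n));
#if USE_MKL_TBB
    mkl_set_num_threads(1);  // keep each GEMM sequential; TBB supplies the parallelism
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, a.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            // Each TBB task handles one chunk of the matrix pairs.
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                multiply_one(a[i], b[i], c[i], n);
        });
#else
    // Fallback without TBB: plain sequential loop over the pairs.
    for (std::size_t i = 0; i < a.size(); ++i)
        multiply_one(a[i], b[i], c[i], n);
#endif
    return c;
}
```

A call such as `multiply_all(a, b, 500)` would cover the case in the post. To repeat the 1, 2, 4 and 8 thread runs, the TBB worker count has to be capped separately, for example with `tbb::task_scheduler_init` in TBB releases of that era or `tbb::global_control` in later ones.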

Here are the timings I collected using 1, 2, 4 and 8 threads:

1 thread: 2350

2 threads: 1222

4 threads: 781

8 threads: 720

I was hoping to see close-to-linear speedup up to at least 4 threads, since I have 4 physical cores. However, as you can see, the speedup at 4 threads is quite poor, only about 3x, and going from 4 to 8 threads adds very little on top of that (I suppose that could be attributed to hyper-threading, though I am not sure if I am correct).

So my question is: is a 3x speedup at 4 threads normal? Did I do something wrong? I can understand that the speedup would saturate as the number of cores keeps increasing, but 4 seems far too early for that.

I tried some other matrix dimensions, but got largely the same data pattern, or sometimes even worse (2.5x speed-up) at 4 threads, depending on the matrix size.

Can anybody please shed some light on this? Thanks!

Ling

 

