Hello,
I am developing on a 24-core machine (E5-2697-v2). When I launch a single DGEMM with large matrices (m=n=k=15,000), performance improves as I increase the number of threads, which is expected. For reference, I get about 467 GFLOP/sec using 24 cores.
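For anyone checking the numbers: a minimal sketch of how a GFLOP/sec figure like the one above is usually derived, assuming the conventional 2*m*n*k operation count for DGEMM. The function names are made up for illustration, and the timing convention is an assumption rather than something measured here.

```python
def dgemm_flops(m, n, k):
    # Conventional DGEMM count: one multiply and one add per inner-product term.
    return 2 * m * n * k


def gflops_per_sec(m, n, k, seconds):
    # Throughput in GFLOP/sec for one DGEMM that took `seconds` of wall time.
    return dgemm_flops(m, n, k) / seconds / 1e9


def seconds_at_rate(m, n, k, gflops):
    # Wall time implied by a given GFLOP/sec rate for one DGEMM.
    return dgemm_flops(m, n, k) / (gflops * 1e9)


if __name__ == "__main__":
    m = n = k = 15_000
    print("FLOPs per DGEMM:", dgemm_flops(m, n, k))
    print("Seconds at 467 GFLOP/sec:", round(seconds_at_rate(m, n, k, 467.0), 2))
```

Under that convention, one m=n=k=15,000 DGEMM is 6.75e12 floating-point operations, so 467 GFLOP/sec corresponds to roughly 14.5 seconds per call.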
Next, in an OpenMP parallel region, I have each thread launch an independent call to DGEMM, again with large matrices (m=n=k=15,000). Each thread has its own matrices for its DGEMM. In this case, the overall performance improves as I increase the number of threads, but only up to about 18 threads. Beyond that, the overall performance decreases. What hardware limitation could be causing this? For reference, here are the performance results I got:
#threads   Compute Speed Overall (GFLOP/sec)
 1          26.3
 2          52.6741
 3          76.6518
 4         102.413
 5         124.401
 6         148.394
 7         168.022
 8         190.557
 9         210.165
10         232.156
11         249.77
12         271.149
13         291.211
14         313.747
15         327.467
16         349.917
17         361.444
18         377.498
19         346.558
20         368.453
21         356.597
22         319.446
23         301.81
24         277.273
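In case it helps with the diagnosis, here is a small sketch that turns the table above into per-thread throughput and scaling efficiency relative to the 1-thread run. The data is copied from the table; the function names are hypothetical, and "efficiency" here simply means aggregate rate divided by (threads x single-thread rate), which is one common definition and not the only one.

```python
# Aggregate GFLOP/sec by thread count, copied from the table in the post.
measured = {
    1: 26.3, 2: 52.6741, 3: 76.6518, 4: 102.413, 5: 124.401, 6: 148.394,
    7: 168.022, 8: 190.557, 9: 210.165, 10: 232.156, 11: 249.77, 12: 271.149,
    13: 291.211, 14: 313.747, 15: 327.467, 16: 349.917, 17: 361.444,
    18: 377.498, 19: 346.558, 20: 368.453, 21: 356.597, 22: 319.446,
    23: 301.81, 24: 277.273,
}


def per_thread_rate(results):
    # Average GFLOP/sec achieved by each thread's DGEMM.
    return {t: total / t for t, total in results.items()}


def scaling_efficiency(results):
    # Aggregate rate relative to perfect linear scaling from the 1-thread run.
    base = results[1]
    return {t: total / (t * base) for t, total in results.items()}


def peak_thread_count(results):
    # Thread count with the highest aggregate GFLOP/sec.
    return max(results, key=results.get)


if __name__ == "__main__":
    rates = per_thread_rate(measured)
    eff = scaling_efficiency(measured)
    print("threads  total     per-thread  efficiency")
    for t in sorted(measured):
        print(f"{t:7d}  {measured[t]:8.3f}  {rates[t]:10.2f}  {eff[t]:10.2f}")
    print("peak at", peak_thread_count(measured), "threads")
```

Viewed this way, the aggregate rate peaks at 18 threads, and the per-thread rate falls from 26.3 GFLOP/sec with one thread to about 11.6 GFLOP/sec with 24 threads.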