Here is my problem:
I create 4 pthreads, and each thread calls MKL dgemm. I don't want to use sequential MKL; I want to use multithreaded MKL, since I am running on an Intel Xeon Phi. I want to map the first MKL dgemm to cores 1-15, the second dgemm to cores 16-30, the third dgemm to cores 31-45, and the last dgemm to cores 46-60.
The reason I want to do it this way is that I am running small dgemms, and I think running these dgemms in parallel would make the best use of the hardware resources.
How can I achieve this? I tried kmp_set_affinity, but I didn't get the performance I expected.
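To make the layout concrete, here is a minimal sketch of the core ranges I have in mind, written for Linux with glibc. The helper names (group_first_core, group_last_core, group_cpu_set) are hypothetical and made up for illustration. It only builds the CPU set for each group and does not call MKL. It also assumes that the core numbers above correspond directly to OS CPU ids, which may not be true on the Xeon Phi since each core exposes several hardware threads.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: first core (1-based, as in the question) of group g,
 * where every group owns per_group consecutive cores. */
static int group_first_core(int g, int per_group)
{
    assert(g >= 0 && per_group > 0);
    return g * per_group + 1;
}

/* Hypothetical helper: last core (inclusive) of group g. */
static int group_last_core(int g, int per_group)
{
    assert(g >= 0 && per_group > 0);
    return (g + 1) * per_group;
}

/* Fill *set with the cores of group g.
 * Assumption: core number == OS CPU id, which may not hold on real hardware. */
static void group_cpu_set(int g, int per_group, cpu_set_t *set)
{
    CPU_ZERO(set);
    for (int c = group_first_core(g, per_group);
         c <= group_last_core(g, per_group); ++c) {
        CPU_SET(c, set);
    }
}
```

With 15 cores per group, groups 0 to 3 give the ranges 1-15, 16-30, 31-45 and 46-60. Each pthread could pass its set to pthread_setaffinity_np before calling dgemm, but that pins only the calling thread, and I don't know whether the threads MKL creates internally would stay inside that range, which is really what I am asking.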