I'm using Intel's MKL to perform large matrix-matrix multiplications (Y = A*X). I noticed a significant performance drop when I increased the dimension p from 4900 to 4950 while keeping the other dimensions fixed: the average runtime over 10 runs is around 0.6 s for p = 4900 and 8 s for p = 4950. Here's the code:
#include <iostream>
#include <chrono>
#include <string>   // std::stoi
#include <mkl.h>

using namespace std::chrono;

int main(int argc, char** argv){
    int N = 240000;
    int p = std::stoi(argv[1]);   // inner dimension, passed on the command line
    int K = 20;
    double *A, *X, *Y;
    double alpha = 1.0;
    double beta = 0.0;

    // 64-byte aligned, column-major buffers: A is N x p, X is p x K, Y is N x K
    A = (double *)mkl_malloc(N * p * sizeof(double), 64);
    X = (double *)mkl_malloc(p * K * sizeof(double), 64);
    Y = (double *)mkl_malloc(N * K * sizeof(double), 64);

    for(int i = 0; i < (N*p); ++i){
        A[i] = 1.0;
    }
    for(int i = 0; i < (p*K); ++i){
        X[i] = 0.5;
    }
    for(int i = 0; i < (N*K); ++i){
        Y[i] = 0.0;
    }

    // Time 10 back-to-back calls of Y = alpha*A*X + beta*Y
    auto start = high_resolution_clock::now();
    for(int i = 0; i < 10; ++i){
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    N, K, p, alpha, A, N, X, p, beta, Y, N);
    }
    auto stop = high_resolution_clock::now();

    // Print the average time per call in seconds
    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << (double)duration.count()/(1e6*10.0) << std::endl;

    mkl_free(X);
    mkl_free(A);
    mkl_free(Y);
    return 0;
}
Does anyone know the reason for this? It happens with both MKL 2020.1.217 and MKL 2019. I'm using CentOS 7 and compiled the above code with g++ 6.3.0:
g++ main.cpp -o main -DMKL_ILP64 -m64 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
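The program above reports a single average over all 10 calls, so one very slow call cannot be told apart from ten uniformly slow calls. As a rough sketch, each call could be timed separately. Note that `time_each_call` is a hypothetical helper introduced here for illustration; it is not part of MKL or of the original program.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an MKL function): runs `call` `reps` times and
// returns the wall-clock duration of each individual run in seconds.
template <typename F>
std::vector<double> time_each_call(F&& call, int reps) {
    using clock = std::chrono::steady_clock;  // monotonic clock for intervals
    std::vector<double> seconds;
    if (reps > 0) {
        seconds.reserve(static_cast<std::size_t>(reps));
    }
    for (int i = 0; i < reps; ++i) {
        const auto start = clock::now();
        call();
        const auto stop = clock::now();
        // duration<double> with the default period is expressed in seconds
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

In the program above, the existing call would be wrapped in a lambda, for example `time_each_call([&]{ cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, N, K, p, alpha, A, N, X, p, beta, Y, N); }, 10)`, and the returned values printed one per line to see whether the slowdown is spread evenly across calls.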
Thanks!
Edit: This problem seems to be specific to the cluster node I was using, since it disappeared when I switched to a different cluster. The cluster node that has the problem uses an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz. The other cluster node, which does not have this problem, uses an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz.
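To put the reported timings in perspective, the arithmetic can be sketched as below. The helper names are made up for this illustration, and the figures use the approximate runtimes quoted above (0.6 s and 8 s), so they are only rough. With N = 240000 and K = 20, A occupies about 9.41 GB for p = 4900 and about 9.50 GB for p = 4950, a growth of roughly 1%, while the implied throughput falls from roughly 78 GFLOP/s to roughly 6 GFLOP/s.

```cpp
#include <cstdint>

// Illustrative helpers; the names are assumptions, not part of any library.

// Bytes occupied by a dense rows x cols matrix of doubles.
constexpr std::uint64_t matrix_bytes(std::uint64_t rows, std::uint64_t cols) {
    return rows * cols * sizeof(double);
}

// Floating-point operations in Y = A*X with A (n x p) and X (p x k):
// one multiply and one add per term of each inner product.
constexpr double gemm_flops(double n, double k, double p) {
    return 2.0 * n * k * p;
}

// Achieved throughput in GFLOP/s for one multiplication taking `seconds`.
constexpr double gflops_per_second(double n, double k, double p, double seconds) {
    return gemm_flops(n, k, p) / seconds / 1e9;
}
```

For example, `matrix_bytes(240000, 4900)` gives the size of A in bytes, and `gflops_per_second(240000, 20, 4950, 8.0)` gives the throughput implied by the slow case.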