Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

dgemm large performance difference for matrices of similar size

Copied from this link.

I'm using Intel's MKL to perform large matrix-matrix multiplications (Y = A*X, where A is N x p, X is p x K, and Y is N x K). I noticed a significant performance drop when I increased the inner dimension p from 4900 to 4950 while keeping N and K fixed: the average runtime over 10 runs is around 0.6 s for p = 4900 and 8 s for p = 4950. Here's the code:

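#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>   // std::stoll
#include <mkl.h>
using namespace std::chrono;


int main(int argc, char** argv){
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " p" << std::endl;
    return 1;
  }

  // MKL_INT is a 64-bit integer when compiling with -DMKL_ILP64.
  const MKL_INT N = 240000;
  const MKL_INT p = std::stoll(argv[1]);
  const MKL_INT K = 20;

  // Element counts as size_t so the byte sizes cannot overflow int.
  const std::size_t nA = (std::size_t)N * p;
  const std::size_t nX = (std::size_t)p * K;
  const std::size_t nY = (std::size_t)N * K;

  double alpha = 1.0;
  double beta = 0.0;

  // 64-byte aligned buffers; A is N x p, X is p x K, Y is N x K.
  double *A = (double *)mkl_malloc(nA * sizeof(double), 64);
  double *X = (double *)mkl_malloc(nX * sizeof(double), 64);
  double *Y = (double *)mkl_malloc(nY * sizeof(double), 64);
  if (!A || !X || !Y) {
    std::cerr << "allocation failed" << std::endl;
    return 1;
  }

  for(std::size_t i = 0; i < nA; ++i){
    A[i] = 1.0;
  }

  for(std::size_t i = 0; i < nX; ++i){
    X[i] = 0.5;
  }

  for(std::size_t i = 0; i < nY; ++i){
    Y[i] = 0.0;
  }

  // Time 10 back-to-back calls of Y = alpha*A*X + beta*Y (column-major,
  // leading dimensions N, p, N).
  auto start = high_resolution_clock::now();
  for(int i = 0; i < 10; ++i){
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, K, p, alpha, A, N, X, p, beta, Y, N);
  }
  auto stop = high_resolution_clock::now();

  // Print the average time per call in seconds.
  auto duration = duration_cast<microseconds>(stop - start);
  std::cout << (double)duration.count()/(1e6*10.0) << std::endl;

  mkl_free(X);
  mkl_free(A);
  mkl_free(Y);

  return 0;
}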

Does anyone know the reason for this? It happens with both MKL 2020.1.217 and MKL 2019. I'm using CentOS 7 and compiled the above code with g++ 6.3.0:

g++ main.cpp -o main -DMKL_ILP64 -m64 -I${MKLROOT}/include  -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

Thanks! 

 

Edit: This problem seems to be specific to the cluster node I was using, since it disappeared when I switched to a different cluster. The node that shows the problem has an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz. The node that does not show it has an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz.

