Hi,
I'm using cblas_dgemm for matrix multiplication. For a randomly generated matrix X of size N x N (N could be 100), I compute Y = X^T * X, where X^T is the transpose of X. I can do this in two ways: (1) call cblas_dgemm once to compute Y directly, or (2) use a for loop, for i = 1:N, Y += X[i] * X[i]^T, where X[i] is the i-th row of X taken as a column vector (summing the outer products of the columns instead would give X * X^T).
In theory, both approaches have the same O(N^3) complexity. In practice, however, approach (2) can take about 4 times longer than approach (1). Could you help me understand why?
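For reference, here is a minimal sketch of the two approaches, written with plain loops rather than BLAS calls so that it is self-contained. It assumes row-major storage and takes X[i] to be the i-th row of X, since that is what makes the sum equal to X^T * X. The names gram_direct and gram_rank1 are made up for illustration. With CBLAS, approach (1) would be a single cblas_dgemm call with the first operand transposed, and each step of approach (2) would be a rank-1 update (for example via cblas_dger, though other calls are possible).

```c
#include <stddef.h>

/* Approach (1): Y = X^T * X computed in one pass.
   X and Y are n x n, row-major. Each entry of Y is written once. */
void gram_direct(size_t n, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (size_t k = 0; k < n; ++k)
                acc += x[k * n + i] * x[k * n + j];  /* (X^T)[i][k] * X[k][j] */
            y[i * n + j] = acc;
        }
    }
}

/* Approach (2): Y = sum over k of r_k * r_k^T, where r_k is the k-th row
   of X. Every step is a rank-1 update that touches all of Y.
   Returns the number of stores made to Y, for comparison with (1). */
size_t gram_rank1(size_t n, const double *x, double *y)
{
    size_t stores = 0;

    for (size_t i = 0; i < n * n; ++i)
        y[i] = 0.0;

    for (size_t k = 0; k < n; ++k) {
        const double *r = x + k * n;          /* k-th row of X */
        for (size_t i = 0; i < n; ++i) {
            for (size_t j = 0; j < n; ++j) {
                y[i * n + j] += r[i] * r[j];  /* outer product term */
                ++stores;
            }
        }
    }
    return stores;
}
```

Both functions do n^3 multiply-adds, but gram_rank1 reads and writes all of Y on every step (n^3 stores in total), while gram_direct writes each entry of Y once (n^2 stores). That extra memory traffic, plus the overhead of N separate calls when using a library, may be part of the difference I'm seeing, but I'm not sure.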
Thanks