Hello,
I am using cblas_dgemv to compute A^T*x. The matrix A has about 10000 rows and 20000 columns. I am storing A in row-major format, so A(i,j+1) is stored next to A(i,j).
My questions are as follows (the goal is the fastest execution time):
1. Which is the better way to store A: row-major or column-major format? Does it matter?
2. Is it better to store A and set TransA=CblasTrans, or to store A^T directly and use it with TransA=CblasNoTrans?
3. If the answer to #2 is to use A^T directly, is it better to store A^T in row-major or column-major format?
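To make the layout question concrete, here is a minimal plain-C sketch of the operation I want, y = A^T*x with A stored row-major. The function name atx_rowmajor is made up for illustration; as far as I understand the CBLAS interface, the equivalent call is cblas_dgemv(CblasRowMajor, CblasTrans, m, n, 1.0, A, lda, x, 1, 0.0, y, 1).

```c
#include <stddef.h>

/* Reference (unoptimized) y = A^T * x.
 * A is m x n, row-major, with leading dimension lda >= n,
 * so element A(i,j) lives at A[i*lda + j].
 * x has m elements, y has n elements. */
void atx_rowmajor(size_t m, size_t n, const double *A, size_t lda,
                  const double *x, double *y)
{
    /* beta = 0: y is overwritten, not accumulated into */
    for (size_t j = 0; j < n; ++j)
        y[j] = 0.0;

    /* Walk A one row at a time so memory is read contiguously */
    for (size_t i = 0; i < m; ++i) {
        const double *row = A + i * lda;
        const double xi = x[i];
        for (size_t j = 0; j < n; ++j)
            y[j] += xi * row[j];   /* alpha = 1 */
    }
}
```

Note that a row-major m x n array A with leading dimension lda occupies exactly the same memory as a column-major n x m array A^T with the same lda. So, if I read the interface correctly, cblas_dgemv(CblasColMajor, CblasNoTrans, n, m, 1.0, A, lda, x, 1, 0.0, y, 1) describes the same computation on the same buffer.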
Another related question has to do with byte alignment. Let us say we are storing A in row-major format, with m rows and n columns. I have read that, when multithreading with OpenMP, it helps to avoid false sharing if each row of A starts on a 64-byte (cache-line) boundary. A common way of doing that is to pad the number of columns up to the next multiple of 8 (8 doubles = 64 bytes), assuming the start of the array is itself 64-byte aligned. So LDA = 8*ceil(n/8), i.e. (n + 7)/8*8 in integer arithmetic; note that n + (8 - n%8) would add 8 unnecessary columns when n is already divisible by 8. Does doing this help dgemv run faster?
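For clarity, this is roughly the padding scheme I have in mind, as a minimal sketch. The helper names padded_lda and alloc_padded_matrix are hypothetical, and I am assuming a 64-byte cache line and a C11 compiler (for aligned_alloc).

```c
#include <stddef.h>
#include <stdlib.h>

/* Round n up to the next multiple of 8 doubles (64 bytes).
 * Returns n unchanged when it is already a multiple of 8. */
size_t padded_lda(size_t n)
{
    return (n + 7) / 8 * 8;
}

/* Allocate an m x n row-major matrix whose rows each start on a
 * 64-byte boundary. The leading dimension is written to *lda_out.
 * Returns NULL on failure; release the result with free(). */
double *alloc_padded_matrix(size_t m, size_t n, size_t *lda_out)
{
    size_t lda = padded_lda(n);
    /* lda is a multiple of 8, so the byte count is a multiple of 64,
     * which is what aligned_alloc requires of its size argument */
    size_t bytes = m * lda * sizeof(double);
    double *A = aligned_alloc(64, bytes);
    if (A != NULL)
        *lda_out = lda;
    return A;
}
```

The padding columns are left uninitialized here. My understanding is that this is fine as long as dgemv is called with the true column count n and the padded value only as lda, since the padding is then never read.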
Finally, for my calculation I need alpha=1 and beta=0. Does cblas_dgemv optimize for this trivial case, or does it still perform the extra (and in this case unnecessary) calculations?
Thanks in advance for any help.