Hi,
I wrote a simple function and ran it on a KNL processor (68 cores, Flat memory mode with Quadrant clustering, data in MCDRAM) using a single thread and n = 10,000,000. I run the function 100 times, take the average execution time, and calculate GFLOPS with the following formula: gflops = (1e-9 * 2.0 * n) / execution_time
double multiplyAccum(long n, double *A, double *B)
{
    long i;
    double result = 0;
    #pragma novector
    //#pragma simd
    for ( i = 0; i < n; i++ )
    {
        result += A[i] * B[i];
    }
    return result;
}
1) When I use #pragma novector, I get 0.839571 GFLOPS.
This is the compiler report for the loop:
remark #15319: loop was not vectorized: novector directive used
remark #25439: unrolled with remainder by 8
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
remark #25457: Number of partial sums replaced: 1
2) When I use #pragma simd, I get 1.495788 GFLOPS.
This is the compiler report for the loop:
remark #15388: vectorization support: reference A_34279 has aligned access [ multiplyAccum.cpp(64,3) ]
remark #15388: vectorization support: reference B_34279 has aligned access [ multiplyAccum.cpp(64,3) ]
remark #15305: vectorization support: vector length 8
remark #15399: vectorization support: unroll factor set to 8
remark #15309: vectorization support: normalized vectorization overhead 0.446
remark #15301: SIMD LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 9
remark #15477: vector loop cost: 0.870
remark #15478: estimated potential speedup: 10.280
remark #15488: --- end vector loop cost summary ---
remark #25015: Estimate of max trip count of loop=156250
The compiler estimates a potential speedup of about 10x, but I only measure about 1.8x. What explains this gap?
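To make the numbers concrete, here is a small sketch of the arithmetic behind them. The helper names are hypothetical, and the data-rate figure assumes that every element of A and B is read from memory exactly once per call (16 bytes per iteration for 2 floating-point operations), which I have not verified with hardware counters.

```cpp
// Each loop iteration reads one double from A and one from B
// and performs one multiply and one add.
constexpr double kBytesPerIter = 2.0 * sizeof(double);
constexpr double kFlopsPerIter = 2.0;

// Measured speedup of the vectorized loop over the scalar one.
double observedSpeedup(double scalarGflops, double vectorGflops) {
    return vectorGflops / scalarGflops;
}

// Data rate (GB/s) implied by a GFLOPS figure, ASSUMING each element
// is streamed from memory once and nothing is reused from cache.
double impliedGBperSec(double gflops) {
    return gflops / kFlopsPerIter * kBytesPerIter;
}
```

With the measurements above, `observedSpeedup(0.839571, 1.495788)` is roughly 1.78, and under the stated assumption the two runs correspond to roughly 6.7 GB/s and 12.0 GB/s of data read by the single thread.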
Thanks,