Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Simple vectorization question


Hi,

I wrote a simple function and ran it on a KNL processor (68 cores, Flat-Quadrant mode, using MCDRAM) with a single thread and n = 10,000,000. I execute the function 100 times, take the average execution time, and calculate GFLOPS with the following formula: gflops = (1e-9 * 2.0 * n) / execution_time

double multiplyAccum(long n,double *A, double *B)
{
    long i;
    double result = 0;
    /* Case 1: novector is active. For case 2, comment it out and enable simd instead. */
    #pragma novector
    //#pragma simd
    for ( i = 0; i < n; i++ )
    {
        result += A[i] * B[i];
    }
    return result;
}
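
For reference, the measurement described above could be sketched roughly as follows. This is not the exact harness I ran, only a minimal illustration: the helper names (`dot`, `gflops_from_time`, `now_seconds`, `average_seconds`) are made up for this sketch, the timer is assumed to be POSIX `clock_gettime`, and the pragmas are left out so that it compiles with any C compiler.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Same loop shape as multiplyAccum in the post, without the pragmas. */
static double dot(long n, const double *a, const double *b)
{
    double result = 0.0;
    for (long i = 0; i < n; i++)
        result += a[i] * b[i];
    return result;
}

/* The formula from the post: one multiply and one add per element. */
static double gflops_from_time(long n, double seconds)
{
    assert(seconds > 0.0);
    return (1e-9 * 2.0 * (double)n) / seconds;
}

/* Wall-clock time in seconds (assumes a POSIX system). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Average time per call over `reps` calls. The results are added into
 * *sink so that the calls have a visible effect and are harder for the
 * compiler to discard. */
static double average_seconds(long n, const double *a, const double *b,
                              int reps, double *sink)
{
    double start = now_seconds();
    for (int r = 0; r < reps; r++)
        *sink += dot(n, a, b);
    return (now_seconds() - start) / reps;
}
```

With n = 10,000,000, an average of 0.02 s per call would correspond to 1.0 GFLOPS under this formula.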

1) When I use #pragma novector, I get 0.839571 GFLOPS.

This is the compiler report for the loop:

      remark #15319: loop was not vectorized: novector directive used
      remark #25439: unrolled with remainder by 8  
      remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
      remark #25457: Number of partial sums replaced: 1

2) When I use #pragma simd, I get 1.495788 GFLOPS.

This is the compiler report for the loop:

      remark #15388: vectorization support: reference A_34279 has aligned access   [ multiplyAccum.cpp(64,3) ]
      remark #15388: vectorization support: reference B_34279 has aligned access   [ multiplyAccum.cpp(64,3) ]
      remark #15305: vectorization support: vector length 8
      remark #15399: vectorization support: unroll factor set to 8
      remark #15309: vectorization support: normalized vectorization overhead 0.446
      remark #15301: SIMD LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 2 
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 9 
      remark #15477: vector loop cost: 0.870 
      remark #15478: estimated potential speedup: 10.280 
      remark #15488: --- end vector loop cost summary ---
      remark #25015: Estimate of max trip count of loop=156250

The estimated potential speedup is about 10x, but I only measure about 1.8x. What is the explanation for this?
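
To make the comparison concrete, the two measured rates can be turned into an observed speedup and into an implied memory read rate. The sketch below is only back-of-the-envelope arithmetic, not a diagnosis. The function names are hypothetical, and it assumes that every element of A and B is loaded from memory once per call with no cache reuse, i.e. 16 bytes read for every 2 floating-point operations.

```c
#include <assert.h>

/* Ratio of two measured rates, e.g. vectorized GFLOPS over scalar GFLOPS. */
static double observed_speedup(double scalar_gflops, double vector_gflops)
{
    assert(scalar_gflops > 0.0);
    return vector_gflops / scalar_gflops;
}

/* Assumption: each iteration loads one double from A and one from B and
 * performs 2 flops (multiply + add), with no reuse from cache. The read
 * rate in GB/s is then GFLOPS * bytes per iteration / flops per iteration. */
static double implied_read_gbytes_per_s(double gflops)
{
    const double bytes_per_iteration = 2.0 * sizeof(double);
    const double flops_per_iteration = 2.0;
    return gflops * bytes_per_iteration / flops_per_iteration;
}
```

With the numbers above, `observed_speedup(0.839571, 1.495788)` is about 1.78, and `implied_read_gbytes_per_s(1.495788)` is about 12 GB/s for the single thread (where a double is 8 bytes).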

 

Thanks,

