I have built an application that calls dgemm, ddot, and daxpy through the PETSc library, which was itself configured to use MKL (see below). Using MKL_VERBOSE, I confirmed that the DGEMM calls operate on very small matrices (9x9), so I figured that disabling error checking would improve performance.
I built PETSc twice, first without and then with the -DMKL_DIRECT_CALL_SEQ flag:
icc -fPIC -wd1572 -g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -shared -Wl,-soname,libpetsc.so
icc -fPIC -wd1572 -g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ -shared -Wl,-soname,libpetsc.so
Yet a performance profile shows no change for any of dgemm, ddot, or daxpy.
How can I prove that the direct path is actually being taken?
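One rough check I could run myself is sketched below. It rests on an assumption I have not verified against the MKL sources: that the direct-call path is switched on by preprocessor macros in the MKL headers, so the flag can only matter in a source file that actually includes mkl.h (or mkl_direct_call.h). The helper name `direct_call_possible` is made up for illustration, and it only looks at includes written directly in the file, not at headers pulled in indirectly.

```python
import re

# Flags that are documented to request the MKL direct-call path.
DIRECT_CALL_MACROS = {"MKL_DIRECT_CALL", "MKL_DIRECT_CALL_SEQ"}

# Matches a direct include of an MKL header that (by assumption) carries the
# direct-call macros. Indirect includes are not detected by this sketch.
MKL_INCLUDE = re.compile(
    r'^\s*#\s*include\s*[<"](?:mkl\.h|mkl_direct_call\.h)[>"]',
    re.MULTILINE,
)


def direct_call_possible(compile_flags, source_text):
    """Return True if the flag is set AND the source includes an MKL header.

    Hypothetical helper: a necessary condition under the stated assumption,
    not proof that the direct path is taken at run time.
    """
    flag_set = False
    for token in compile_flags.split():
        if token.startswith("-D"):
            # Handle both -DNAME and -DNAME=value.
            name = token[2:].split("=", 1)[0]
            if name in DIRECT_CALL_MACROS:
                flag_set = True
    header_included = MKL_INCLUDE.search(source_text) is not None
    return flag_set and header_included


if __name__ == "__main__":
    flags = "-g -O3 -DMKL_DIRECT_CALL_SEQ"
    with_header = '#include <mkl.h>\nvoid f(void);\n'
    without_header = 'extern void dgemm_(void);\nvoid f(void);\n'
    print(direct_call_possible(flags, with_header))
    print(direct_call_possible(flags, without_header))
```

If this returned False for every PETSc source file that calls BLAS, that would suggest (again, under the assumption above) that the flag never reaches a place where it can take effect, which would be consistent with the unchanged profile.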
icc version: 15.0.3.187
MKL version: 11.2.3
PETSc configure command:
./configure --prefix=${PETSC_DIR}/${PETSC_ARCH}/install --with-debugging=0 --with-shared-libraries=1 --with-cc=icc --with-fc=ifort --with-cxx=icpc --with-blas-lapack-dir=/nasa/intel/Compiler/2015.3.187/mkl/lib/intel64 --with-scalapack-include=/nasa/intel/Compiler/2015.3.187/mkl/include --with-scalapack-lib="/nasa/intel/Compiler/2015.3.187/mkl/lib/intel64/libmkl_scalapack_lp64.so /nasa/intel/Compiler/2015.3.187/mkl/lib/intel64/libmkl_blacs_sgimpt_lp64.so" --with-cpp=/usr/bin/cpp --with-gnu-compilers=0 --with-vendor-compilers=intel -COPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ" -CXXOPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ" -FOPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -fpp -DMKL_DIRECT_CALL_SEQ" --with-mpi-exec=mpiexec --with-mpi-compilers=0 --with-precision=double --with-sclar-type=real --with-dynamic-loading --with-x=0 --with-x11=0 --download-mumps --download-ptscotch --download-hypre