Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
Viewing all 3005 articles

cblas_sgemm_pack result is not consistent with cblas_sgemm


Hello,

I wrote a short program that calls sgemm_pack to speed things up, but the result is not consistent with cblas_sgemm.

For example,

Matrix A (2 x 2): [1.0, 2.0, 3.0, 4.0]

Matrix B (2 x 1): [1.0, 2.0]

With row-major storage, Matrix C (2 x 1) = A * B = [5, 11]. With sgemm_pack + sgemm_compute, however, the result is [0.0, 0.0].

Could you please take a look? Any advice is welcome.

Thanks

---

Environment: Parallel Studio XE, version 2017.1.132.

Build command: icc gemm_pack.c -I${MKLROOT}/include -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl -std=c99

---

The sample code:

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

void print(float* a, int length, const char* name)
{
  int i = 0;
  for (i = 0; i < length; i++) {
    printf("%s[%d] = %f\n", name, i, a[i]);
  }
}

int main(void)
{
  int m = 2;
  int n = 1;
  int k = 2;

  float *a, *b, *c;
  a = (float*)malloc(sizeof(float) * m * k);
  b = (float*)malloc(sizeof(float) * k * n);
  c = (float*)malloc(sizeof(float) * m * n);

  int i = 0;
  for (i = 0; i < m *k; i++) {
    a[i] = i + 1;
  }
  for (i = 0; i < k * n; i++) {
    b[i] = i + 1;
  }

  float alpha = 1.0f;
  float beta = 0.0f;
  int lda = k;
  int ldb = n;
  int ldc = n;

  printf("========================SGEMM_PACK========================\n");
  print(a, m * k, "a");
  print(b, k * n, "b");
  float *packA = cblas_sgemm_alloc(CblasAMatrix, m, n, k);
  cblas_sgemm_pack(CblasRowMajor, CblasAMatrix, CblasNoTrans, m, n, k, alpha, a, lda, packA);

  cblas_sgemm_compute(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, packA, lda, b, ldb, beta, c, ldc);

  cblas_sgemm_free(packA);
  print(c, m * n, "c");

  printf("========================SGEMM========================\n");
  print(a, m * k, "a");
  print(b, k * n, "b");
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
  print(c, m * n, "c");

  free(a);
  free(b);
  free(c);
  return 0;
}
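For comparison, a plain row-major multiply with no MKL dependency (a hypothetical reference helper, not part of the original post) reproduces the expected result C = A * B = [5, 11]:

```c
#include <assert.h>

/* Naive row-major GEMM: C(m x n) = A(m x k) * B(k x n).
   Reference implementation only -- no packing, no MKL. */
static void matmul_rowmajor(int m, int n, int k,
                            const float *a, const float *b, float *c)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

With the same inputs, cblas_sgemm also returns [5, 11], so the discrepancy lies on the pack/compute path. For what it's worth, the MKL documentation specifies CblasPacked as the transa value for a packed A in cblas_sgemm_compute; passing CblasNoTrans there, as the sample code does, is one plausible cause of the zero result.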

 


Incorrect result with both FFTW and MKL FFT


Hello everyone,

I've noticed that both MKL FFT and FFTW are giving me a wrong result when computing a 2D complex-to-complex BACKWARD FFT. I am attaching the source code I used so the experts on this forum can take a look.

The input array is:

 1.0+0.0i  1.0+0.0i  1.0+0.0i
 1.0+0.0i  1.0+0.0i  1.0+0.0i
 1.0+0.0i  1.0+0.0i  1.0+0.0i

The forward transform should be:

 9.0+0.0i  0.0+0.0i  0.0+0.0i
 0.0+0.0i  0.0+0.0i  0.0+0.0i
 0.0+0.0i  0.0+0.0i  0.0+0.0i

And the backward transform should again be the initial array. However, it is:

 9.0+0.0i  9.0+0.0i  9.0+0.0i
 9.0+0.0i  9.0+0.0i  9.0+0.0i
 9.0+0.0i  9.0+0.0i  9.0+0.0i

I attached my test code here. Could you please let me know why this is happening?

 

Is it normal that the best performance occurs with only half the number of threads?


Hi,

I am running FFTs using MKL on an Intel CPU that has 36 physical cores and 72 hardware threads, as shown below.

I don't use OpenMP; I use a thread pool to drive the MKL FFT.

The problem is that the thread pool gives the best performance when the number of threads is set to 36, not 72. Adding threads always improves performance while the count is below 36, but using more than 36 threads gives no further improvement.

I noticed the advice "To achieve higher performance, set the number of threads to the number of processors or physical cores": https://software.intel.com/en-us/mkl-linux-developer-guide-improving-per.... That guidance concerns OpenMP, but the behavior with a thread pool is the same: the best performance comes from setting the number of threads to the number of physical cores, not the number of hardware threads.

Why is it like this? Is the computational intensity of the FFT too high?

If so, what do the other 36 hardware threads do? In what situations would all 72 threads be fully employed?

Sorry for so many questions!

Any hint will be appreciated!


Which algorithm is implemented in DGEMM?


Interestingly, I've been unable to find an answer to this simple question. What algorithm is used for matrix-matrix multiplication (e.g., DGEMM) in MKL? Is it the classical O(N^3) algorithm, Strassen (O(N^2.81)), or something else? Thanks.

zgemm3m using 1 thread ( MKL 2017 and 2018)


I am seeing a performance regression with MKL 2017/2018 in zgemm3m.

In some cases, zgemm3m appears to use only one thread (with a negative impact on elapsed time) despite the matrices being large.

This behaviour appears in MKL 2017 and MKL 2018 but is not present in MKL 2015.

The call to zgemm3m takes two 4122x4122 double-complex matrices, on a Windows 7 machine with a 4-core Xeon with HT.

transa=transb='N', m=n=k=4122. lda=4122,ldb=4122,alpha=1,beta=0,ldc=4122

We are essentially looping and calling zgemm3m with the same dimensions and matrix structure each time through the loop.

The loop is not OpenMP parallelized. Running in the "main" thread.

First time through the loop, zgemm3m uses all cores

Second time through the loop, zgemm3m uses only one core (and runs MUCH slower than the first call).

It's very obvious in the debugger that zgemm3m is not using multiple threads the second time it is called. I tried to 'force' the correct # of threads before the call, with no change in behaviour.

int numThreads = MKL_Get_Max_Threads();
cout << "MKL Threads " << numThreads << endl;
MKL_Set_Num_Threads(numThreads);
int numOMPThreads = omp_get_max_threads();
cout << "OMP Threads " << numOMPThreads << endl;
omp_set_num_threads(numOMPThreads);
mkl_set_dynamic(false);
zgemm3m(....)

The output of above code trying to force the expected behaviour is always

MKL Threads 4
OMP Threads 8

What would cause zgemm3m to "turn off" threading?

 

Andrew

What can I use instead of MKL on unsupported platforms such as iOS & Android?


If I write my Windows, Mac, and Linux code to utilise MKL, how can I then port those apps to, say, iOS or Android, given that those platforms are not supported?

Are there any "swap-in" alternatives or is it possible to get MKL working on those platforms?

Intel® MKL 2018 Parcel with Cloudera* CDH is available

Normalized cross-correlation using MKL


I'm trying to use MKL to do normalized cross-correlation (NCC) in order to find a pattern in a whole image.
It seems that vsldCorrExecX can only do plain correlation, not NCC.

For example, with

pattern = [1 2;
3 4];
image = [1 2;
5 6]

we expect to get 0.9762 from sum((pattern-mean(pattern)).*(image-mean(image))) / sqrt(sum((pattern-mean(pattern)).^2) * sum((image-mean(image)).^2)).

However, we actually get sum(pattern.*image) = 44.

Is there any way to do NCC directly?


Unable to link statically with intel mkl


My program contains C/C++/Fortran code.

I am able to dynamically link with mkl and run my program.

I am using the following command to statically link with mkl.

/appl/intelv2017/bin/ifort -Wl,--start-group /appl/intelv2017/mkl/lib/intel64/libmkl_intel_lp64.a /appl/intelv2017/mkl/lib/intel64/libmkl_sequential.a /appl/intelv2017/mkl/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl -nofor-main -cxxlib *.o -o main.exe

I get the following error from the above command.

crossval.o: In function `crossval':
/tmp/test/crossval.f:51: undefined reference to `dpotrs_'
/tmp/test/crossval.f:59: undefined reference to `ddot_'
/tmp/test/crossval.f:74: undefined reference to `ddot_'
/tmp/test/crossval.f:83: undefined reference to `ddot_'
loglik.o: In function `loglik':
/tmp/test/loglik.f90:52: undefined reference to `dpotrf_'
/tmp/test/loglik.f90:103: undefined reference to `dpotrs_'
/tmp/test/loglik.f90:114: undefined reference to `ddot_'
/tmp/test/loglik.f90:120: undefined reference to `dpotrf_'
/tmp/test/loglik.f90:132: undefined reference to `ddot_'
/tmp/test/loglik.f90:134: undefined reference to `dpotrs_'
/tmp/test/loglik.f90:141: undefined reference to `ddot_'
/tmp/test/loglik.f90:153: undefined reference to `dpotrs_'
/tmp/test/loglik.f90:159: undefined reference to `ddot_'
/tmp/test/loglik.f90:182: undefined reference to `dpotri_'
make: *** [main.exe] Error 1

All of the above routines are called from Fortran code; the linker is unable to resolve the MKL calls.

How do I resolve this issue?
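One common cause of exactly these undefined references is link order: GNU ld resolves symbols left to right, so the object files must appear before the library group that satisfies them, and in the command above *.o comes after the MKL group. A sketch of the reordered command (same paths and flags as above, not verified on this system):

```shell
/appl/intelv2017/bin/ifort -nofor-main -cxxlib *.o \
  -Wl,--start-group \
  /appl/intelv2017/mkl/lib/intel64/libmkl_intel_lp64.a \
  /appl/intelv2017/mkl/lib/intel64/libmkl_sequential.a \
  /appl/intelv2017/mkl/lib/intel64/libmkl_core.a \
  -Wl,--end-group -lpthread -lm -ldl -o main.exe
```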

 

error while using zgetri


Dear all,

I am running a program that has run many times on a cluster.

Perhaps because the cluster went through a software upgrade, errors now occur while running the executable a.out.

There is no problem compiling and linking; the error only shows up partway through the run:

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
libifcoremt.so.5   00002B6454D7A6D4  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B6452C20370  Unknown               Unknown  Unknown
libmkl_avx512_mic  00002B646F5370BE  mkl_blas_avx512_m     Unknown  Unknown
libmkl_avx512_mic  00002B646F544B61  mkl_blas_avx512_m     Unknown  Unknown
libmkl_avx512_mic  00002B646F541935  mkl_blas_avx512_m     Unknown  Unknown
libmkl_intel_thre  00002B644F2FF714  mkl_blas_ztrsm_ho     Unknown  Unknown
libmkl_intel_thre  00002B644F319606  mkl_blas_ztrsm        Unknown  Unknown
libmkl_core.so     00002B64515C1F74  mkl_lapack_ztrtri     Unknown  Unknown
libmkl_core.so     00002B64514B032C  mkl_lapack_zgetri     Unknown  Unknown
libmkl_intel_lp64  00002B644E98683D  ZGETRI                Unknown  Unknown

Now we are using intel/17.0.4, impi/17.0.3.

      call ZGETRF( N_LEN_2, N_LEN_2, BQ , N_LEN_2, IPIV , INFO )

      call ZGETRI( N_LEN_2, BQ, N_LEN_2, IPIV, WORK, N_LEN_2, INFO )

The first routine, ZGETRF, completes fine. But the second, ZGETRI, always raises a floating-invalid error.

I just do not understand, because the inputs of ZGETRI are exactly the outputs of ZGETRF.

*********updates********

I found the following in the Intel® Math Kernel Library (Intel® MKL) 2017 Release Notes:

Fixed irregular division by zero and invalid floating point exceptions
in {C/Z}TRSM for Intel® Xeon Phi™ processor x200 (aka KNL) and Intel® Xeon®
Processor supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) code path

This may be relevant, because my error message mentions TRSM, a floating-invalid exception, and the AVX-512 code path.

********updates2********

It seems the error has something to do with the MKL library:

1. The code has been running unchanged for a long time.

2. When I run the code against an older MKL version, it works well.

I think the current MKL library (17.0.4) must have something incorrect.

 

 

MKL c datatype problem


Hi,

I am using CBLAS from the MKL library, but I get wrong results after redefining MKL_INT to long.

Is there something wrong?

#define MKL_INT long

#include "mkl.h"

....

mkl_dcscmv(...)

....

 

Thanks!
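On LP64 Linux, long is 64 bits while the standard MKL interface library (libmkl_intel_lp64) is built for 32-bit integers, so redefining MKL_INT to long makes the caller write 8-byte indices that the library reads as 4-byte ones. A small sketch of the mismatch (pure C, no MKL; assumes a little-endian LP64 system such as x86-64 Linux):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A caller builds an index array of longs (8 bytes each on LP64),
   but a library built for 32-bit ints reads the same buffer as
   4-byte elements: every other "int" it sees is the (zero) upper
   half of a long. */
static int32_t read_as_int32(const void *buf, int elem)
{
    int32_t v;
    memcpy(&v, (const char *)buf + (size_t)elem * sizeof(int32_t),
           sizeof(int32_t));
    return v;
}
```

The supported route to 64-bit integer arguments is the ILP64 interface (link libmkl_intel_ilp64 and compile with -DMKL_ILP64) rather than redefining MKL_INT by hand.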

How to use multiple linear regression in MKL


Hi everyone,

I'm new to Linux and C++. I want to add a function to my code that implements multiple linear regression. Since the HPC system has MKL installed, I want to use that library. Any help is appreciated in advance!

Best Regards

Yi
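Within MKL, the LAPACK least-squares driver (?gels, via LAPACKE) is the usual tool for multiple linear regression. As a self-contained illustration of what it computes, the sketch below (a hypothetical helper, not MKL code) fits y = b0 + b1*x1 + b2*x2 by forming the 3x3 normal equations (X^T X) b = X^T y and solving them with plain Gaussian elimination:

```c
#include <assert.h>
#include <math.h>

/* Fit y = b[0] + b[1]*x1 + b[2]*x2 over n samples by solving the
   normal equations (X^T X) b = X^T y, where X has columns
   [1, x1, x2].  Gaussian elimination on the augmented 3x4 system
   (no pivoting -- fine for this small, well-conditioned sketch). */
static void ols2(int n, const double *x1, const double *x2,
                 const double *y, double b[3])
{
    double A[3][4] = {{0}};            /* augmented [X^T X | X^T y] */
    for (int i = 0; i < n; i++) {
        double row[3] = {1.0, x1[i], x2[i]};
        for (int r = 0; r < 3; r++) {
            for (int c = 0; c < 3; c++) A[r][c] += row[r] * row[c];
            A[r][3] += row[r] * y[i];
        }
    }
    for (int p = 0; p < 3; p++) {      /* forward elimination */
        for (int r = p + 1; r < 3; r++) {
            double f = A[r][p] / A[p][p];
            for (int c = p; c < 4; c++) A[r][c] -= f * A[p][c];
        }
    }
    for (int r = 2; r >= 0; r--) {     /* back substitution */
        double s = A[r][3];
        for (int c = r + 1; c < 3; c++) s -= A[r][c] * b[c];
        b[r] = s / A[r][r];
    }
}
```

For real workloads, building X and calling LAPACKE_dgels avoids the conditioning loss of forming X^T X explicitly.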

 

MKL Pardiso sparse right hand sides


I'm solving a sparse system of equations A*x = b.

Matrix A is a sparse finite-element matrix; say its size is 1 million by 1 million. This works fine with a limited number of right-hand sides (RHS). However, if the number of RHS grows, say to 1000, allocating the dense vector b requires a lot of memory. My RHS vectors are actually sparse: each has only a few non-zero entries. Is there any way to pass sparse RHS vectors into PARDISO?

Otherwise, the only approach I can think of is to store the RHS sparsely, divide the total set into groups, convert each group back to dense format, and pass the groups to PARDISO one at a time.

Thanks for your suggestions.
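The batching workaround described above can be sketched without any PARDISO calls: scatter each sparse RHS (index/value pairs) into one column of a dense n x nrhs block, hand the block to the solver, then reuse the buffer for the next group. The helper below is hypothetical; note that PARDISO expects the dense block column-by-column (Fortran order):

```c
#include <assert.h>
#include <string.h>

/* Scatter one sparse RHS (nnz index/value pairs, 0-based indices)
   into column `col` of a dense n x nrhs block stored column-major,
   as PARDISO expects.  The column is zeroed first so the buffer
   can be reused across RHS groups. */
static void scatter_rhs(int n, int nrhs, double *dense, int col,
                        int nnz, const int *idx, const double *val)
{
    (void)nrhs;  /* block width; kept to document the layout */
    memset(dense + (size_t)col * n, 0, (size_t)n * sizeof(double));
    for (int t = 0; t < nnz; t++) {
        dense[(size_t)col * n + idx[t]] = val[t];
    }
}
```

With a group size of, say, 32 columns, peak RHS memory drops from 1000 dense vectors to 32, at the cost of more solve calls against the same factorization.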

Does anyone know if the MKL Fast Poisson Solver can be used for the nonlinear Poisson eqn?


Hello,

Is it possible to adapt the Intel MKL Fast Poisson Solver for a problem of the type:

∇ · [K(u) ∇u] = f

where ∇ is the nabla (gradient) symbol and K(u) is a positive, differentiable, position-dependent function. Check the equation here:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.645.5026&rep=re...

The difference between the above equation and the Poisson equation demonstrated on the MKL Poisson solver page is the coefficient K(u).

 

Fortran MKL CSR sparse matrix storage: integer size of row/column index vectors for very large arrays


Hi there,

I ran into trouble with the default 32-bit integer size of the row-index vector of MKL CSR arrays. I create square sparse matrices with more than 56,000,000 rows and columns and more than 3,000,000,000 elements. The column-index vector is still fine, because its largest entries equal the column dimension. The row-index (row-pointer) vector, however, contains entries up to the length of the column-index vector, which cannot be held in a 32-bit integer. I could use a larger integer kind (64-bit), but then I am in trouble with all the MKL routines that deal with CSR matrices (e.g. dcsrmm): these routines have Fortran 77 interfaces and, as outlined in the MKL manual, they expect index vectors of default integer kind. I imagine one could set the default integer to 64 bits when installing MKL, but I am not sure whether that is possible. Any ideas?

Thanks a lot
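The overflow in question is easy to demonstrate: a CSR row-pointer entry equal to the number of stored elements (over 3,000,000,000 here) does not fit in a 32-bit integer, whose maximum is 2,147,483,647. A small sketch (pure C, no MKL):

```c
#include <assert.h>
#include <stdint.h>

/* Does a CSR index value fit in a default 32-bit integer? */
static int fits_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

MKL's supported answer is the ILP64 interface (libmkl_intel_ilp64, with Fortran compiled using -i8), which makes all integer arguments, including CSR index arrays, 64-bit; it is selected at link/compile time, not at installation.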

 


Error While Loading Shared Libraries: libiomp5.so


Hi All,

I am running CentOS 7.3 on an Intel Xeon Phi. I have successfully configured Parallel Studio XE 2017 Update 5 with all the libraries and tools that come with it. However, for some reason I keep getting the following error when I profile benchmarks like DeepBench, Intel LINPACK, or Intel Caffe with perf:

error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

Both perf and the benchmarks work standalone; the only time they fail is when I hook them together. This wasn't the case before I had to re-install CentOS on my system. I have sourced the MKL environment variables, along with other specific variables, from /opt/intel/bin/*.sh. I haven't found libiomp5.so installed anywhere on the system.

source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/mkl/bin/mklvars.sh intel64
source /opt/intel/impi/2017.4.239/intel64/bin/mpivars.sh intel64
source /opt/intel/bin/iccvars.sh intel64
source /opt/intel/bin/ifortvars.sh intel64

Can anyone please share steps or suggestions on how to solve this issue?

Thanks.

compile mkl example under qt creator 4.2.1 linux 64


Hi,

I am trying to compile the cblas_caxpy example in the following environment:

qt creator 4.2.1

qt 5.8.0

compiler : g++

os: linux debian 8.0 64 bits

I linked with -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 and added what I believe is the proper include path, but I get the following compile errors:

in function 'main':

undefined reference to GetIntegerParameters

undefined reference to GetScalarC

etc..

What did I miss?

Thanks

Agks

matlab no longer working after installing mkl


Hi

I installed MKL 2018 on Debian 8.0 64-bit, where I already had MATLAB installed. Before installing MKL, MATLAB worked fine.

Since installing MKL, MATLAB starts but crashes with the following error when I do a signal convolution:

Intel MKL FATAL ERROR: cannot load libmkl_avx.so

So I added

export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_avx.so

to my .bashrc and sourced it. Now I can no longer launch MATLAB at all; I get the following error message:

symbol lookup error: /opt/intel/mkl/lib/intel64/libmkl_avx.so: undefined symbol: mkl_parse_optimize_bsr_trsm_i8

Any idea ?

Thanks

Intel® HPC Developer Conference 2017


Dear MKL Forum Users, join us at the Intel® HPC Developer Conference in Denver, Colorado, November 11-12, 2017. This free technical training is open to the public and will feature industry luminaries sharing best practices and techniques for maximizing efficiency and getting the most from Intel architecture. Attendees can choose among technical sessions, hands-on tutorials, and poster sessions covering parallel programming, high-productivity languages, artificial intelligence, systems, enterprise, visualization development, and more. https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html

SVD speed of 'small' matrices in MKL 2018_0_124


I'm using SVD during least-squares fitting, typically operating on spectral data (1000-2000 data points) and fitting with very few parameters (2-5).

For this, I generally use a direct implementation of the SVD routines from "Numerical Recipes" (single-threaded).

When I started needing SVDs in other areas (bigger matrices with a less extreme aspect ratio, typically ~10000 x 1000), I switched to MKL LAPACKE, currently version 2017_4_210, and there the routines greatly outperform the NR routines.

So I also started using them for the fitting described above. However, when applied to the "extreme" case of very few parameters (typical matrix size 2048 x 3), the LAPACKE routines fell behind and the NR routines are simply faster.

As a guideline: running the same (iterative) fitting on a typical standard data set, my profiler tells me I spend about 4 s in the SVD routines with NR and about 7 s with MKL.

Now, when MKL 2018 was announced a month ago, I was quite excited to read in the release notes (https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-release-notes):

LAPACK:

  • Added the following improvements and optimizations for small matrices (N<16):
  • Added ?gesvd, ?geqr/?gemqr, ?gelq/?gemlq optimizations for tall-and-skinny/short-and-wide matrices

So I gave it a try, but was quite disappointed. Not only did the NR routines still outperform MKL, but for reasons unclear to me, performance actually dropped significantly in MKL 2018_0_124 compared to 2017_4_210.

The same data, as a guideline:
- NR routines: 4 s
- MKL 2017: 7 s
- MKL 2018: 14 s

The only change I made when comparing the two variants was to re-compile/link with the newer version and use the corresponding new DLLs.
Did I miss something? Or did I misunderstand the release notes? Does anybody have comparative data for running SVDs on matrices of size 2048 x 3 that would help me figure out whether the problem lies in the library or in my use of it?

I ran my tests with 8 threads enabled on a 4-core Hyper-Threaded i7-4712HQ.

 
