Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
Viewing all 3005 articles

cblas_sgemm_pack result is not consistent with cblas_sgemm


Hello,

I wrote a short program that calls sgemm_pack to speed things up, but the result is not consistent with cblas_sgemm.

For example,

Matrix A (2 x 2): [1.0, 2.0, 3.0, 4.0]

Matrix B (2 x 1): [1.0, 2.0]

With row-major storage, Matrix C (2 x 1) = A * B = [5, 11]. With sgemm_pack + sgemm_compute, however, the result is [0.0, 0.0].

Could you please take a look? Any advice is welcome.

Thanks

---

Environment: Parallel Studio XE, version 2017.1.132.

Build command: icc gemm_pack.c -I${MKLROOT}/include -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl -std=c99

---

The sample code:

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

void print(float* a, int length, const char* name)
{
  int i = 0;
  for (i = 0; i < length; i++) {
    printf("%s[%d] = %f\n", name, i, a[i]);
  }
}

int main(void)
{
  int m = 2;
  int n = 1;
  int k = 2;

  float *a, *b, *c;
  a = (float*)malloc(sizeof(float) * m * k);
  b = (float*)malloc(sizeof(float) * k * n);
  c = (float*)malloc(sizeof(float) * m * n);

  int i = 0;
  for (i = 0; i < m *k; i++) {
    a[i] = i + 1;
  }
  for (i = 0; i < k * n; i++) {
    b[i] = i + 1;
  }

  float alpha = 1.0f;
  float beta = 0.0f;
  int lda = k;
  int ldb = n;
  int ldc = n;

  printf("========================SGEMM_PACK========================\n");
  print(a, m * k, "a");
  print(b, k * n, "b");
  float *packA = cblas_sgemm_alloc(CblasAMatrix, m, n, k);
  cblas_sgemm_pack(CblasRowMajor, CblasAMatrix, CblasNoTrans, m, n, k, alpha, a, lda, packA);

  cblas_sgemm_compute(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, packA, lda, b, ldb, beta, c, ldc);

  cblas_sgemm_free(packA);
  print(c, m * n, "c");

  printf("========================SGEMM========================\n");
  print(a, m * k, "a");
  print(b, k * n, "b");
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
  print(c, m * n, "c");

  free(a);
  free(b);
  free(c);
  return 0;
}
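For comparison, a plain row-major multiply with no MKL dependency (a hypothetical reference helper, not part of the original post) reproduces the expected result C = A * B = [5, 11]:

```c
#include <assert.h>

/* Naive row-major GEMM: C(m x n) = A(m x k) * B(k x n).
   Reference implementation only -- no packing, no MKL. */
static void matmul_rowmajor(int m, int n, int k,
                            const float *a, const float *b, float *c)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int p = 0; p < k; p++) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

With the same inputs, cblas_sgemm also returns [5, 11], so the discrepancy lies on the pack/compute path. For what it's worth, the MKL documentation specifies CblasPacked as the transa value for a packed A in cblas_sgemm_compute; passing CblasNoTrans there, as the sample code does, is one plausible cause of the zero result.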

 


Incorrect result with both FFTW and MKL FFT


Hello everyone,

I've noticed that both MKL FFT and FFTW are giving me a wrong result when computing a 2D complex-to-complex BACKWARD FFT. I am attaching the source code I used so the experts on this forum can take a look.

The input array is:

 1.0+0.0i  1.0+0.0i  1.0+0.0i
 1.0+0.0i  1.0+0.0i  1.0+0.0i
 1.0+0.0i  1.0+0.0i  1.0+0.0i

The forward transform should be:

 9.0+0.0i  0.0+0.0i  0.0+0.0i
 0.0+0.0i  0.0+0.0i  0.0+0.0i
 0.0+0.0i  0.0+0.0i  0.0+0.0i

And the backward transform should again be the initial array. However, it is:

 9.0+0.0i  9.0+0.0i  9.0+0.0i
 9.0+0.0i  9.0+0.0i  9.0+0.0i
 9.0+0.0i  9.0+0.0i  9.0+0.0i

I attached my test code here. Could you please let me know why this is happening?

 

Is it normal that the best performance occurs with only half the number of threads?


Hi,

I am running FFTs using MKL on an Intel CPU that has 36 physical cores and 72 hardware threads, as shown below.

I don't use OpenMP; I use a thread pool to drive the MKL FFT.

The problem is that the thread pool gives the best performance when the number of threads is set to 36, not 72. Adding threads always improves performance while the count is below 36, but using more than 36 threads gives no further improvement.

I noticed the advice "To achieve higher performance, set the number of threads to the number of processors or physical cores": https://software.intel.com/en-us/mkl-linux-developer-guide-improving-per.... That guidance concerns OpenMP, but the behavior with a thread pool is the same: the best performance comes from setting the number of threads to the number of physical cores, not the number of hardware threads.

Why is it like this? Is the computational intensity of the FFT too high?

If so, what do the other 36 hardware threads do? In what situations would all 72 threads be fully employed?

Sorry for so many questions!

Any hint will be appreciated!


Which algorithm is implemented in DGEMM?


Interestingly, I've been unable to find an answer to this simple question. What algorithm is used for matrix-matrix multiplication (e.g., DGEMM) in MKL? Is it the classical O(N^3) algorithm, Strassen (O(N^2.81)), or something else? Thanks.

zgemm3m using 1 thread ( MKL 2017 and 2018)


I am seeing a performance regression with MKL 2017/2018 in zgemm3m.

In some cases, zgemm3m appears to use only one thread (with a negative impact on elapsed time) despite the matrices being large.

This behaviour appears in MKL 2017 and MKL 2018 but is not present in MKL 2015.

The call to zgemm3m takes two 4122x4122 double-complex matrices, on a Windows 7 machine with a 4-core Xeon with HT.

transa=transb='N', m=n=k=4122. lda=4122,ldb=4122,alpha=1,beta=0,ldc=4122

We are essentially looping and calling zgemm3m with the same dimensions and matrix structure each time through the loop.

The loop is not OpenMP parallelized. Running in the "main" thread.

First time through the loop, zgemm3m uses all cores

Second time through the loop, zgemm3m uses only one core (and runs MUCH slower than the first call).

It's very obvious in the debugger that zgemm3m is not using multiple threads the second time it is called. I tried to 'force' the correct # of threads before the call, with no change in behaviour.

int numThreads = MKL_Get_Max_Threads();
cout << "MKL Threads " << numThreads << endl;
MKL_Set_Num_Threads(numThreads);
int numOMPThreads = omp_get_max_threads();
cout << "OMP Threads " << numOMPThreads << endl;
omp_set_num_threads(numOMPThreads);
mkl_set_dynamic(false);
zgemm3m(....)

The output of above code trying to force the expected behaviour is always

MKL Threads 4
OMP Threads 8

What would cause zgemm3m to "turn off" threading?

 

Andrew

What can I use instead of MKL on unsupported platforms such as iOS & Android?


If I write my Windows, Mac, and Linux code to utilise MKL, how can I then port those apps to, say, iOS or Android, given that those platforms are not supported?

Are there any "swap-in" alternatives or is it possible to get MKL working on those platforms?

Intel® MKL 2018 Parcel with Cloudera* CDH is available

Normalized cross-correlation using MKL


I'm trying to use MKL to do normalized cross-correlation (NCC) in order to find a pattern in a whole image.
It seems that vsldCorrExecX can only do plain correlation, not NCC.

For example, with

pattern = [1 2;
3 4];
image = [1 2;
5 6]

we expect to get 0.9762 from sum((pattern-mean(pattern)).*(image-mean(image))) / sqrt(sum((pattern-mean(pattern)).^2) * sum((image-mean(image)).^2)).

However, we actually get sum(pattern.*image) = 44.

Is there any way to do NCC directly?


Unable to link statically with intel mkl


My program contains C/C++/Fortran code.

I am able to dynamically link with mkl and run my program.

I am using the following command to statically link with mkl.

/appl/intelv2017/bin/ifort -Wl,--start-group /appl/intelv2017/mkl/lib/intel64/libmkl_intel_lp64.a /appl/intelv2017/mkl/lib/intel64/libmkl_sequential.a /appl/intelv2017/mkl/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -lm -ldl -nofor-main -cxxlib *.o -o main.exe

I get the following error from the above command.

crossval.o: In function `crossval':
/tmp/test/crossval.f:51: undefined reference to `dpotrs_'
/tmp/test/crossval.f:59: undefined reference to `ddot_'
/tmp/test/crossval.f:74: undefined reference to `ddot_'
/tmp/test/crossval.f:83: undefined reference to `ddot_'
loglik.o: In function `loglik':
/tmp/test/loglik.f90:52: undefined reference to `dpotrf_'
/tmp/test/loglik.f90:103: undefined reference to `dpotrs_'
/tmp/test/loglik.f90:114: undefined reference to `ddot_'
/tmp/test/loglik.f90:120: undefined reference to `dpotrf_'
/tmp/test/loglik.f90:132: undefined reference to `ddot_'
/tmp/test/loglik.f90:134: undefined reference to `dpotrs_'
/tmp/test/loglik.f90:141: undefined reference to `ddot_'
/tmp/test/loglik.f90:153: undefined reference to `dpotrs_'
/tmp/test/loglik.f90:159: undefined reference to `ddot_'
/tmp/test/loglik.f90:182: undefined reference to `dpotri_'
make: *** [main.exe] Error 1

All of the above routines are called from Fortran code; the linker is unable to resolve the MKL calls.

How do I resolve this issue?
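One common cause of exactly these undefined references is link order: GNU ld resolves symbols left to right, so the object files must appear before the library group that satisfies them, and in the command above *.o comes after the MKL group. A sketch of the reordered command (same paths and flags as above, not verified on this system):

```shell
/appl/intelv2017/bin/ifort -nofor-main -cxxlib *.o \
  -Wl,--start-group \
  /appl/intelv2017/mkl/lib/intel64/libmkl_intel_lp64.a \
  /appl/intelv2017/mkl/lib/intel64/libmkl_sequential.a \
  /appl/intelv2017/mkl/lib/intel64/libmkl_core.a \
  -Wl,--end-group -lpthread -lm -ldl -o main.exe
```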

 

error while using zgetri


Dear all,

I am running a program that has run many times on a cluster.

Perhaps because the cluster went through a software upgrade, errors now occur while running the executable a.out.

There is no problem compiling and linking; the error only shows up partway through the run:

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
libifcoremt.so.5   00002B6454D7A6D4  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B6452C20370  Unknown               Unknown  Unknown
libmkl_avx512_mic  00002B646F5370BE  mkl_blas_avx512_m     Unknown  Unknown
libmkl_avx512_mic  00002B646F544B61  mkl_blas_avx512_m     Unknown  Unknown
libmkl_avx512_mic  00002B646F541935  mkl_blas_avx512_m     Unknown  Unknown
libmkl_intel_thre  00002B644F2FF714  mkl_blas_ztrsm_ho     Unknown  Unknown
libmkl_intel_thre  00002B644F319606  mkl_blas_ztrsm        Unknown  Unknown
libmkl_core.so     00002B64515C1F74  mkl_lapack_ztrtri     Unknown  Unknown
libmkl_core.so     00002B64514B032C  mkl_lapack_zgetri     Unknown  Unknown
libmkl_intel_lp64  00002B644E98683D  ZGETRI                Unknown  Unknown

Now we are using intel/17.0.4, impi/17.0.3.

      call ZGETRF( N_LEN_2, N_LEN_2, BQ , N_LEN_2, IPIV , INFO )

      call ZGETRI( N_LEN_2, BQ, N_LEN_2, IPIV, WORK, N_LEN_2, INFO )

The first routine, ZGETRF, completes fine. But the second, ZGETRI, always raises a floating-invalid error.

I just do not understand, because the inputs of ZGETRI are exactly the outputs of ZGETRF.

*********updates********

I found the following in the Intel® Math Kernel Library (Intel® MKL) 2017 Release Notes:

Fixed irregular division by zero and invalid floating point exceptions
in {C/Z}TRSM for Intel® Xeon Phi™ processor x200 (aka KNL) and Intel® Xeon®
Processor supporting Intel® Advanced Vector Extensions 512 (Intel® AVX-512) code path

This may be relevant, because my error message mentions TRSM, a floating-invalid exception, and the AVX-512 code path.

********updates2********

It seems the error has something to do with the MKL library:

1. The code has been running unchanged for a long time.

2. When I run the code against an older MKL version, it works well.

I think the current MKL library (17.0.4) must have something incorrect.

 

 

MKL c datatype problem


Hi,

I am using CBLAS from the MKL library, but I get wrong results after redefining MKL_INT to long.

Is there something wrong?

#define MKL_INT long

#include "mkl.h"

....

mkl_dcscmv(...)

....

 

Thanks!
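On LP64 Linux, long is 64 bits while the standard MKL interface library (libmkl_intel_lp64) is built for 32-bit integers, so redefining MKL_INT to long makes the caller write 8-byte indices that the library reads as 4-byte ones. A small sketch of the mismatch (pure C, no MKL; assumes a little-endian LP64 system such as x86-64 Linux):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A caller builds an index array of longs (8 bytes each on LP64),
   but a library built for 32-bit ints reads the same buffer as
   4-byte elements: every other "int" it sees is the (zero) upper
   half of a long. */
static int32_t read_as_int32(const void *buf, int elem)
{
    int32_t v;
    memcpy(&v, (const char *)buf + (size_t)elem * sizeof(int32_t),
           sizeof(int32_t));
    return v;
}
```

The supported route to 64-bit integer arguments is the ILP64 interface (link libmkl_intel_ilp64 and compile with -DMKL_ILP64) rather than redefining MKL_INT by hand.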

How to use multiple linear regression in MKL


Hi everyone,

I'm new to Linux and C++. I want to add a function to my code that implements multiple linear regression. Since the HPC system has MKL installed, I want to use that library. Any help is appreciated in advance!

Best Regards

Yi
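Within MKL, the LAPACK least-squares driver (?gels, via LAPACKE) is the usual tool for multiple linear regression. As a self-contained illustration of what it computes, the sketch below (a hypothetical helper, not MKL code) fits y = b0 + b1*x1 + b2*x2 by forming the 3x3 normal equations (X^T X) b = X^T y and solving them with plain Gaussian elimination:

```c
#include <assert.h>
#include <math.h>

/* Fit y = b[0] + b[1]*x1 + b[2]*x2 over n samples by solving the
   normal equations (X^T X) b = X^T y, where X has columns
   [1, x1, x2].  Gaussian elimination on the augmented 3x4 system
   (no pivoting -- fine for this small, well-conditioned sketch). */
static void ols2(int n, const double *x1, const double *x2,
                 const double *y, double b[3])
{
    double A[3][4] = {{0}};            /* augmented [X^T X | X^T y] */
    for (int i = 0; i < n; i++) {
        double row[3] = {1.0, x1[i], x2[i]};
        for (int r = 0; r < 3; r++) {
            for (int c = 0; c < 3; c++) A[r][c] += row[r] * row[c];
            A[r][3] += row[r] * y[i];
        }
    }
    for (int p = 0; p < 3; p++) {      /* forward elimination */
        for (int r = p + 1; r < 3; r++) {
            double f = A[r][p] / A[p][p];
            for (int c = p; c < 4; c++) A[r][c] -= f * A[p][c];
        }
    }
    for (int r = 2; r >= 0; r--) {     /* back substitution */
        double s = A[r][3];
        for (int c = r + 1; c < 3; c++) s -= A[r][c] * b[c];
        b[r] = s / A[r][r];
    }
}
```

For real workloads, building X and calling LAPACKE_dgels avoids the conditioning loss of forming X^T X explicitly.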

 

MKL Pardiso sparse right hand sides


I'm solving a sparse system of equations A*x = b.

Matrix A is a sparse finite-element matrix; say its size is 1 million by 1 million. This works fine with a limited number of right-hand sides (RHS). However, if the number of RHS grows, say to 1000, allocating the dense vector b requires a lot of memory. My RHS vectors are actually sparse: each has only a few non-zero entries. Is there any way to pass sparse RHS vectors into PARDISO?

Otherwise, the only approach I can think of is to store the RHS sparsely, divide the total set into groups, convert each group back to dense format, and pass the groups to PARDISO one at a time.

Thanks for your suggestions.
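The batching workaround described above can be sketched without any PARDISO calls: scatter each sparse RHS (index/value pairs) into one column of a dense n x nrhs block, hand the block to the solver, then reuse the buffer for the next group. The helper below is hypothetical; note that PARDISO expects the dense block column-by-column (Fortran order):

```c
#include <assert.h>
#include <string.h>

/* Scatter one sparse RHS (nnz index/value pairs, 0-based indices)
   into column `col` of a dense n x nrhs block stored column-major,
   as PARDISO expects.  The column is zeroed first so the buffer
   can be reused across RHS groups. */
static void scatter_rhs(int n, int nrhs, double *dense, int col,
                        int nnz, const int *idx, const double *val)
{
    (void)nrhs;  /* block width; kept to document the layout */
    memset(dense + (size_t)col * n, 0, (size_t)n * sizeof(double));
    for (int t = 0; t < nnz; t++) {
        dense[(size_t)col * n + idx[t]] = val[t];
    }
}
```

With a group size of, say, 32 columns, peak RHS memory drops from 1000 dense vectors to 32, at the cost of more solve calls against the same factorization.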

Does anyone know if the MKL Fast Poisson Solver can be used for the nonlinear Poisson eqn?


Hello,

Is it possible to adapt the Intel MKL Fast Poisson Solver for a problem of the type:

∇ · [K(u) ∇u] = f

where ∇ is the nabla (gradient) symbol and K(u) is a positive, differentiable, position-dependent function. Check the equation here:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.645.5026&rep=re...

The difference between the above equation and the Poisson equation demonstrated on the MKL Poisson solver page is the coefficient K(u).

 

Fortran MKL CSR sparse matrix storage: integer size of row/column index vectors for very large arrays


Hi there,

I ran into trouble with the default 32-bit integer size of the row-index vector of MKL CSR arrays. I create square sparse matrices with more than 56,000,000 rows and columns and more than 3,000,000,000 elements. The column-index vector is still fine, because its largest entries equal the column dimension. The row-index (row-pointer) vector, however, contains entries up to the length of the column-index vector, which cannot be held in a 32-bit integer. I could use a larger integer kind (64-bit), but then I am in trouble with all the MKL routines that deal with CSR matrices (e.g. dcsrmm): these routines have Fortran 77 interfaces and, as outlined in the MKL manual, they expect index vectors of default integer kind. I imagine one could set the default integer to 64 bits when installing MKL, but I am not sure whether that is possible. Any ideas?

Thanks a lot
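The overflow in question is easy to demonstrate: a CSR row-pointer entry equal to the number of stored elements (over 3,000,000,000 here) does not fit in a 32-bit integer, whose maximum is 2,147,483,647. A small sketch (pure C, no MKL):

```c
#include <assert.h>
#include <stdint.h>

/* Does a CSR index value fit in a default 32-bit integer? */
static int fits_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

MKL's supported answer is the ILP64 interface (libmkl_intel_ilp64, with Fortran compiled using -i8), which makes all integer arguments, including CSR index arrays, 64-bit; it is selected at link/compile time, not at installation.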

 


Error While Loading Shared Libraries: libiomp5.so


Hi All,

I am running CentOS 7.3 on an Intel Xeon Phi. I have successfully configured Parallel Studio XE 2017 Update 5 with all the libraries and tools that come with it. However, for some reason I keep getting the following error when I profile benchmarks like DeepBench, Intel LINPACK, or Intel Caffe with perf:

error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

Both perf and the benchmarks work standalone; the only time they fail is when I hook them together. This wasn't the case before I had to re-install CentOS on my system. I have sourced the MKL environment variables, along with other specific variables, from /opt/intel/bin/*.sh. I haven't found libiomp5.so installed anywhere on the system.

source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/mkl/bin/mklvars.sh intel64
source /opt/intel/impi/2017.4.239/intel64/bin/mpivars.sh intel64
source /opt/intel/bin/iccvars.sh intel64
source /opt/intel/bin/ifortvars.sh intel64

Can anyone please share steps or suggestions on how to solve this issue?

Thanks.

compile mkl example under qt creator 4.2.1 linux 64


Hi,

I am trying to compile the cblas_caxpy example in the following environment:

qt creator 4.2.1

qt 5.8.0

compiler : g++

os: linux debian 8.0 64 bits

I linked with -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 and added what I believe is the proper include path, but I get the following compile errors:

in function 'main':

undefined reference to GetIntegerParameters

undefined reference to GetScalarC

etc..

What did I miss?

Thanks

Agks

matlab no longer working after installing mkl


Hi

I installed MKL 2018 on Debian 8.0 64-bit, where I already had MATLAB installed. Before installing MKL, MATLAB worked fine.

Since installing MKL, MATLAB starts but crashes with the following error when I do a signal convolution:

Intel MKL FATAL ERROR: cannot load libmkl_avx.so

So I added

export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_avx.so

to my .bashrc and sourced it. Now I can no longer launch MATLAB at all; I get the following error message:

symbol lookup error: /opt/intel/mkl/lib/intel64/libmkl_avx.so: undefined symbol: mkl_parse_optimize_bsr_trsm_i8

Any idea ?

Thanks

Intel® HPC Developer Conference 2017


Dear MKL Forum Users, join us at the Intel® HPC Developer Conference in Denver, Colorado, November 11-12, 2017. This free technical training is open to the public and will feature industry luminaries sharing best practices and techniques for maximizing efficiency and getting the most from Intel architecture. Attendees can choose among technical sessions, hands-on tutorials, and poster sessions covering parallel programming, high-productivity languages, artificial intelligence, systems, enterprise, visualization development, and more. https://www.intel.com/content/www/us/en/events/hpcdevcon/overview.html

SVD speed of 'small' matrices in MKL 2018_0_124


I'm using SVD during least-squares fitting, typically operating on spectral data (1000-2000 data points) and fitting with very few parameters (2-5).

For this, I generally use a direct implementation of the SVD routines from "Numerical Recipes" (single-threaded).

When I started needing SVDs in other areas (bigger matrices with a less extreme aspect ratio, typically ~10000 x 1000), I switched to MKL LAPACKE, currently version 2017_4_210, and there the routines greatly outperform the NR routines.

So I also started using them for the fitting described above. However, when applied to the "extreme" case of very few parameters (typical matrix size 2048 x 3), the LAPACKE routines fell behind and the NR routines are simply faster.

As a guideline: running the same (iterative) fitting on a typical standard data set, my profiler tells me I spend about 4 s in the SVD routines with NR and about 7 s with MKL.

Now, when MKL 2018 was announced a month ago, I was quite excited to read in the release notes (https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-release-notes):

LAPACK:

  • Added the following improvements and optimizations for small matrices (N<16):
  • Added ?gesvd, ?geqr/?gemqr, ?gelq/?gemlq optimizations for tall-and-skinny/short-and-wide matrices

So I gave it a try, but was quite disappointed. Not only did the NR routines still outperform MKL, but for reasons unclear to me, performance actually dropped significantly in MKL 2018_0_124 compared to 2017_4_210.

The same data, as a guideline:
- NR routines: 4 s
- MKL 2017: 7 s
- MKL 2018: 14 s

The only change I made when comparing the two variants was to re-compile/link with the newer version and use the corresponding new DLLs.
Did I miss something? Or did I misunderstand the release notes? Does anybody have comparative data for running SVDs on matrices of size 2048 x 3 that would help me figure out whether the problem lies in the library or in my use of it?

I ran my tests with 8 threads enabled on a 4-core Hyper-Threaded i7-4712HQ.

 
