Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Wrong sgesvd answer


Hi,

I want to compute the SVD of two fairly small square matrices (32×32 and 16×16). I use the following code:

int iRows = m1; //=32
int iCols = m2;  //=16
int INFO;
char JOBU = 'A'; 
char JOBVT = 'A';
float* A_Z1 = new float[iRows*iRows];
float* U_Z1 = new float[iRows*iRows];
float* VT_Z1 = new float[iRows*iRows];
float* S_Z1 = new float[iRows];
float* A_Z2 = new float[iCols*iCols];
float* U_Z2 = new float[iCols*iCols];
float* VT_Z2 = new float[iCols*iCols];
float* S_Z2 = new float[iCols];
int LWORK_Z1 = std::max(1, 5*iRows); // sgesvd minimum: max(3*min(M,N)+max(M,N), 5*min(M,N))
float* WORK_Z1 = new float[LWORK_Z1];
int LWORK_Z2 = std::max(1, 5*iCols);
float* WORK_Z2 = new float[LWORK_Z2];
//Compute SVD of vZ1a[a]................................................
for(int i = 0; i < iRows; ++i)
     for(int j = 0; j < iRows; ++j)
          A_Z1[j+i*iRows] = vZ1a[a](j,i);
sgesvd(&JOBU, &JOBVT, &iRows, &iRows, A_Z1, &iRows, S_Z1, U_Z1, &iRows, VT_Z1, &iRows, WORK_Z1, &LWORK_Z1, &INFO);
//Compute SVD of vZ2a[a]................................................
for(int i = 0; i < iCols; ++i)
     for(int j = 0; j < iCols; ++j)
         A_Z2[j+i*iCols] = vZ2a[a](j,i);
sgesvd(&JOBU, &JOBVT, &iCols, &iCols, A_Z2, &iCols, S_Z2, U_Z2, &iCols, VT_Z2, &iCols, WORK_Z2, &LWORK_Z2, &INFO);
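
For reference, a safer way to size WORK is the LAPACK workspace query: calling sgesvd with LWORK = -1 performs no computation and returns the optimal size in WORK[0]. A minimal sketch, reusing the variables above:

int LWORK = -1;        // workspace query: no computation is performed
float wkopt;
sgesvd(&JOBU, &JOBVT, &iRows, &iRows, A_Z1, &iRows, S_Z1,
       U_Z1, &iRows, VT_Z1, &iRows, &wkopt, &LWORK, &INFO);
LWORK = static_cast<int>(wkopt);       // optimal size reported by the library
float* WORK = new float[LWORK];
sgesvd(&JOBU, &JOBVT, &iRows, &iRows, A_Z1, &iRows, S_Z1,
       U_Z1, &iRows, VT_Z1, &iRows, WORK, &LWORK, &INFO);
delete[] WORK;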

Later in my application I use the left and right singular vectors in a machine learning algorithm, and I have routines for measuring the accuracy of the results.

I linked the same program with ATLAS+CLAPACK and with MKL.

Surprisingly, I get very inaccurate results using MKL (the PSNR using MKL is 53, versus 73 using CLAPACK!).

Note that I am using the same code for both; all I do is link against different libraries.

What's wrong here? I don't see a difference between the LAPACK function declarations (obviously there shouldn't be any), so my function call should be correct. I also take care to pass the transpose of my input matrix, since sgesvd expects column-major ordering.

I compiled with both gcc and the Intel compiler, with the same results. I use the sequential libraries just to be safe.

I hope this is one of those stupid mistakes that I make because it has been bugging me for quite some time :)

Best regards,

-Ehsan


VML vcAbs precision vs. AVX


Hi,

I've been writing some vectorized C++ code to compute the magnitude of a vector of 8-byte complex samples using AVX instructions on an Intel Xeon SandyBridge CPU.  I recently noticed some minor differences between my output and that of the MKL vcAbs function.  I've been able to mimic the speed of the MKL function by using the _mm256_rcp_ps() and _mm256_rsqrt_ps() functions to approximate the square root of the magnitude squared (vs. using _mm256_sqrt_ps()) but for some reason, my answers are coming out slightly different than those produced by vcAbs.  For example, the hand-coded AVX might produce a magnitude value of 0.511695 and vcAbs will produce .511643 for that same sample.  When running my test, I have the VML mode set to VML_EP.  If I modify the code to use _mm256_sqrt_ps() instead and set the VML mode to VML_HA, I get answers that match.  I'm just curious if there is more to the algorithm or something else going on inside the VML function that is affecting the answer when precision mode is VML_EP beyond just using the reciprocal approximations.  I've tried re-ordering my function calls to use rcp followed by rsqrt as well as rsqrt followed by rcp but it doesn't seem to make a difference.
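
For illustration, a minimal sketch of the two approximation strategies discussed here (hand-written, not the actual VML code; the Newton refinement is an assumption about how full precision could be recovered):

#include <immintrin.h>

/* sqrt(x) via the ~12-bit reciprocal approximations, as described above:
   sqrt(x) = 1/rsqrt(x); both rcp and rsqrt contribute relative error. */
static inline __m256 sqrt_ep(__m256 x)
{
    return _mm256_rcp_ps(_mm256_rsqrt_ps(x));
}

/* One Newton-Raphson refinement of rsqrt recovers nearly full single
   precision: y' = y*(1.5 - 0.5*x*y*y); then sqrt(x) = x*y'. */
static inline __m256 sqrt_refined(__m256 x)
{
    __m256 y = _mm256_rsqrt_ps(x);
    __m256 t = _mm256_mul_ps(_mm256_mul_ps(x, y), y);          /* x*y*y */
    y = _mm256_mul_ps(y, _mm256_sub_ps(_mm256_set1_ps(1.5f),
                          _mm256_mul_ps(_mm256_set1_ps(0.5f), t)));
    return _mm256_mul_ps(x, y);
}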

For reference, I am working on a CentOS 6 platform, using MKL v11.0, and compiling with Intel Composer XE 2013. I've been using the compile flags -O3, -xHost, -no-prec-div, -fp-model fast, among others. Unfortunately this was all done at my office, and I don't have access to my source code to post, or to the specific input/output values I was providing to the algorithm. I'm just wondering if someone can help me understand what other factors might cause a difference in the answers, and what optimizations might be going on under the hood with VML_EP. I *think* the core algorithms (my rsqrt/rcp vs. VML) must be similar, as the execution speed is similar on a vector of 16384 elements, looping 1e6 times.

Thanks in advance,

Nick 

Substitution time consuming in Pardiso


Dear All,

I have a reactive transport model that uses PARDISO as the solver. The model is discretized by the control-volume method and solved by Newton iteration. I do the symbolic factorization once at first, then do numerical factorization and substitution in every time step. The model works, but I do have a question about the timing.

The parameters for PARDISO are as follows:
    iparm = 0
        iparm(1) = 1 ! no solver default
        iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
        iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);    
        iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion
        iparm(5) = 0 ! no user fill-in reducing permutation
        iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.
        iparm(7) = 0 ! Output, Number of iterative refinement steps performed
        iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions (phase = 331, 332, 333)
        iparm(9) = 0 ! not in use
        iparm(10) = 13 ! Default value 13, perturb the pivot elements with 1E-13
        iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
        iparm(12) = 0 ! not in use
        iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)
        iparm(14) = 0 ! Output: number of perturbed pivots
        iparm(15) = 0 ! Output, Peak memory on symbolic factorization.
        iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.
        iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.
        iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.
        iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.
        iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.
        iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.
        iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.         
        iparm(27) = 0 !check matrix error, 0-without check, 1-check

        maxfct = 1        
        mnum = 1        
        nrhs = 1        
        error = 0 ! initialize error flag        
        msglvl = 0 ! print statistical information        
        mtype = 11 ! real unsymmetric

First, I do symbolic factorization with phase = 11 (One time)
        phase = 11   ! only reordering and symbolic factorization
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, idum, nrhs, iparm, msglvl, ddum, ddum, error)
Then, in every time step, I do Newton iterations. For every Newton iteration, I do numerical factorization and substitution:
!Newton iteration
        phase = 22   ! only factorization
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, idum, nrhs, iparm, msglvl, ddum, ddum, error)
        phase = 33   ! only substitution
        call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, idum, nrhs, iparm, msglvl, b_in, x_out, error)
!Newton iteration

The matrix is 2,573,775 × 2,573,775 with 12,810,555 nonzero entries.
When I use 4 processors, the timings are as follows:
TimeStep  NewtonIteration  NumericalFactorization Substitution
1         1                2.673447               0.253894
1         2                0.000024               0.614713
1         3                0.000023               0.595046
2         1                0.000023               0.619408
2         2                0.000023               0.613553
...
My question is: why does substitution cost so much time after the first step? Normally, substitution takes much less time than numerical factorization.

System is WIN7 X64
Compiler: Intel Parallel Studio XE 2013

Thanks.

Symbolic factorization in Pardiso


Hi All,

I use PARDISO to solve a reactive transport problem. I do the symbolic factorization once at first, and then in the following Newton iterations only do numerical factorization and substitution.

For the simple problem this works fine, but for the complex problem things are quite different. The results are correct in the beginning, but after many iterations the results become incorrect, with large errors. When I instead do the symbolic factorization every step, it generates correct results.

The structure of the matrix (ia, ja) does not change for the whole simulation; only the coefficients (a) and the right-hand side (b) change. What's wrong with my settings?

The parameters for PARDISO are as follows:
        iparm = 0
        iparm(1) = 1 ! no solver default
        iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
        iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);    
        iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion
        iparm(5) = 0 ! no user fill-in reducing permutation
        iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.
        iparm(7) = 0 ! Output, Number of iterative refinement steps performed
        iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions (phase = 331, 332, 333)
        iparm(9) = 0 ! not in use
        iparm(10) = 13 ! Default value 13, perturb the pivot elements with 1E-13
        iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
        iparm(12) = 0 ! not in use
        iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)
        iparm(14) = 0 ! Output: number of perturbed pivots
        iparm(15) = 0 ! Output, Peak memory on symbolic factorization.
        iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.
        iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.
        iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.
        iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.
        iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.
        iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.
        iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.         
        iparm(27) = 0 !check matrix error, 0-without check, 1-check

        maxfct = 1        
        mnum = 1        
        nrhs = 1        
        error = 0 ! initialize error flag        
        msglvl = 0 ! print statistical information        
        mtype = 11 ! real unsymmetric

Thanks and regards,

Daniel

data fitting df?Interpolate1D


Hi,

I am new in the data fitting land of mkl and have a couple of questions.

a) For the main interpolate function, is the cell input used if provided, or is it recalculated (assuming the pointer is not NULL)?

The reason I am asking is that, in the case of calibration of a parametric function, the points over which the updated function is recalculated are given, and therefore the search for their cell positions is redundant. My guess is no, since I don't see a flag that would signify that the memory pointed to by the cell variable is valid (i.e., not write-only).

b) There are two flags for the type input. If one uses DF_CELL, is this different from a call to df?SearchCells1D, other than the fact that the latter can be done on a task that has no function information?
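
For concreteness, here is a minimal hedged sketch of requesting interpolation results and cell indices together via DF_CELL (constant and parameter names per my reading of mkl_df.h; the linear-spline data is only illustrative):

#include <stdio.h>
#include "mkl_df.h"

int main(void)
{
    /* Breakpoints and function values for an illustrative linear spline. */
    double x[] = {0.0, 1.0, 2.0, 3.0};
    double y[] = {0.0, 1.0, 4.0, 9.0};
    MKL_INT nx = 4, ny = 1;

    DFTaskPtr task;
    dfdNewTask1D(&task, nx, x, DF_NO_HINT, ny, y, DF_NO_HINT);

    double scoeff[2 * 3];  /* spline order * (nx - 1) coefficients */
    dfdEditPPSpline1D(task, DF_PP_LINEAR, DF_PP_DEFAULT, DF_NO_BC, NULL,
                      DF_NO_IC, NULL, scoeff, DF_NO_HINT);
    dfdConstruct1D(task, DF_PP_SPLINE, DF_METHOD_STD);

    /* Ask for interpolated values (DF_INTERP) and the cell index of
       each site (DF_CELL) in a single call. */
    double site[] = {0.5, 1.5, 2.5};
    MKL_INT nsite = 3, dorder[] = {1};  /* dorder[0]=1: value only */
    double r[3];
    MKL_INT cell[3];

    dfdInterpolate1D(task, DF_INTERP | DF_CELL, DF_METHOD_PP, nsite, site,
                     DF_NO_HINT, 1, dorder, NULL, r, DF_NO_HINT, cell);

    for (int i = 0; i < nsite; ++i)
        printf("site %g -> value %g, cell %d\n", site[i], r[i], (int)cell[i]);

    dfDeleteTask(&task);
    return 0;
}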

Thank you very much in advance for your help,

Petros

 

PS: Suggestion: the readability of the C API of the data fitting routines would greatly benefit from const qualifiers on pointers and data.

 

Memory cost and efficiency of Pardiso


Hi All,

I have implemented PARDISO to solve my problem and it works fine. But the memory cost and efficiency of PARDISO are not outstanding. We have another sparse solver (Solver A, not parallelized) that uses a level-based preconditioner and CGS-based acceleration. The memory cost for PARDISO is about three to four times that of Solver A. For most of my matrices, if I use only one processor for PARDISO, Solver A is much faster; if I use six processors, they run at almost the same speed. For some large matrices, Solver A is faster than PARDISO even when PARDISO uses 6 processors. I compared the runtimes of numerical factorization and substitution and found that Solver A is much more efficient. We do not expect PARDISO to beat Solver A when PARDISO runs in serial mode, but we do hope PARDISO beats Solver A when it runs in parallel on 6 processors.

The speedup for PARDISO itself seems good: I get a speedup of 4.5 with 6 processors.

Here are the parameters I use.

        maxfct = 1        
        mnum = 1        
        nrhs = 1        
        error = 0 ! initialize error flag        
        msglvl = 0 ! print statistical information        
        mtype = 11 ! real unsymmetric

        iparm= 0
        iparm(1) = 1 ! no solver default
        iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
        iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);    
        iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion
        iparm(5) = 0 ! no user fill-in reducing permutation
        iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.
        iparm(7) = 0 ! Output, Number of iterative refinement steps performed
        iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions
        iparm(9) = 0 ! not in use
        iparm(10) = 13 ! Default value 13, perturb the pivot elements with 1E-13
        iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
        iparm(12) = 0 ! not in use
        iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)
        iparm(14) = 0 ! Output: number of perturbed pivots
        iparm(15) = 0 ! Output, Peak memory on symbolic factorization.
        iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.
        iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.
        iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.
        iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.
        iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.
        iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.
        iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.

I tried changing iparm(4) and iparm(10), but that does not make much difference. How can I improve the efficiency of PARDISO?

Thanks and best regards,

Daniel

using all cores of a CPU with MKL


Hi,

I hope you are doing well. I am using Intel MKL to develop a multi-CPU version of my linear-system solver. The setup is as follows:

I have, say, 8 nodes connected via InfiniBand. Each node is fitted with dual quad-core Xeons. I divide my computation (spmv's, ddot's, daxpy's) into equal chunks across these nodes. The algorithm (preconditioned CG) runs on all the nodes, and the nodes have to communicate often within the iteration loop to update their information and collaborate to arrive at a solution.

I use Intel MKL to perform all the computations on each of these nodes. How can I make sure that each node (with its 8 cores) makes use of all of its cores when running, say, spmv, ddot, daxpy, or even dnrm2?
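
A minimal sketch with the standard MKL threading controls (hedged: the placeholder comment marks where the per-node computation would go):

#include <mkl.h>

int main(void)
{
    mkl_set_dynamic(0);      /* don't let MKL lower the thread count */
    mkl_set_num_threads(8);  /* one thread per core on a dual quad-core node */

    /* ... per-node chunk of work here: sparse MVs, ddot, daxpy, dnrm2.
       Note that level-1 BLAS (ddot/daxpy) may not thread for small
       vectors, so the main benefit is typically in the SpMV. */
    return 0;
}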

Kindly suggest/advise.

rohit

How to call sygvd(a, b, w [,itype] [,jobz] [,uplo] [,info])


Dear all,

I want to solve Ax = vBx by calling the subroutine sygvd(a, b, w [,itype] [,jobz] [,uplo] [,info]) from g95.
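
For reference, a minimal sketch of the call being attempted (assuming MKL's lapack95 module; the 2×2 data is only illustrative):

program test_sygvd
    use lapack95, only: sygvd
    implicit none
    real :: a(2,2), b(2,2), w(2)
    integer :: info

    a = reshape((/ 2.0, 1.0, 1.0, 3.0 /), (/ 2, 2 /))  ! symmetric A
    b = reshape((/ 1.0, 0.0, 0.0, 1.0 /), (/ 2, 2 /))  ! SPD B

    ! Solves A*x = lambda*B*x; w returns the eigenvalues and, with
    ! jobz='V', a is overwritten by the eigenvectors.
    call sygvd(a, b, w, itype=1, jobz='V', uplo='U', info=info)
    print *, 'info =', info, ' eigenvalues =', w
end program test_sygvd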
I referred to http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor and generated the following link and include options:

 -L$(MKLROOT)/lib/intel64 $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm   <<<< here I deleted '-lpthread -lm' because they do not exist on my Linux system

-I$(MKLROOT)/include/intel64/lp64 -I$(MKLROOT)/include

When I compile my program, I get the following error:

ifort -c -o He.o He.F90
ifort -c -o He_m.o He_m.F90
ifort -o runHeif.exe He_m.o He.o -I/opt/intel/composerxe-2011.4.191/mkl/include/intel64/lp64 -I/opt/intel/composerxe-2011.4.191/mkl/include -L/opt/intel/composerxe-2011.4.191/mkl/lib/intel64 /opt/intel/composerxe-2011.4.191/mkl/lib/intel64/libmkl_blas95_lp64.a /opt/intel/composerxe-2011.4.191/mkl/lib/intel64/libmkl_lapack95_lp64.a /opt/intel/composerxe-2011.4.191/mkl/lib/intel64/libmkl_intel_lp64.a /opt/intel/composerxe-2011.4.191/mkl/lib/intel64/libmkl_sequential.a /opt/intel/composerxe-2011.4.191/mkl/lib/intel64/libmkl_core.a
He.o: In function `MAIN__':
He.F90:(.text+0xc0e): undefined reference to `sygvd_'
He.o: In function `svm_':
He.F90:(.text+0x20b1): undefined reference to `sygvd_'
make: *** [runHeifort] Error 1

How can I solve this problem?

Notes: I use non-commercial l_fcompxe_intel64_2011.4.191.tgz and l_mkl_10.3.4.191_intel64.tgz
Is there anything wrong with my deleting '-lpthread -lm'?
My program is very simple; I have uploaded it.

Best regards.
sandf

Attachment: prog.tgz (3.8 KB)

SVD multithreading bug in MKL


Hi,

It seems that MKL has a serious bug in ?gesvd.

First, I tried to find singular values using the Armadillo library with the MKL backend, but the results differ depending on MKL_NUM_THREADS.

For example,

/* -*- mode: c++; -*- */
#include <armadillo>

int main(int argc, char *argv[])
{
    using namespace arma;

    mat M = randu<mat>(957, 957);
    mat U, V;
    vec s;

    svd(U, s, V, M);
    mat newM = U * diagmat(s) * V.t();

    printf("norm: %f\n", arma::norm(M - newM,2));
    return 0;
}

yields the following results

$ MKL_NUM_THREADS=1 ./main
norm: 0.000000
$ MKL_NUM_THREADS=2 ./main
norm: 0.000000
$ MKL_NUM_THREADS=3 ./main
norm: 0.000000
$ MKL_NUM_THREADS=4 ./main
norm: 0.000000
$ MKL_NUM_THREADS=5 ./main
norm: 371.303371
$ MKL_NUM_THREADS=6 ./main
norm: 138.622780
$ MKL_NUM_THREADS=7 ./main
norm: 138.622780

I wrote another program that uses mkl only, but it was worse.

/* -*- mode: c++; -*- */
#include <mkl.h>
#include <string.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>  // for rand()

int main(int argc, char *argv[])
{
    lapack_int M = 958;
    double *mat = new double[M * M];
    double *mat_orig = new double[M * M];
    double *s = new double[M];
    double *diags = new double[M*M];
    double *U = new double[M * M];
    double *Vt = new double[M * M];
    double *superb = new double[M - 1];  // LAPACKE_dgesvd needs min(M,N)-1 entries

    for (int i=0; i<M*M; ++i)
        mat[i] = static_cast<double>(rand())/RAND_MAX;

    memcpy(mat_orig, mat, sizeof(double) * M * M);
    LAPACKE_dgesvd(LAPACK_COL_MAJOR, 'a', 'a', M, M, mat, M, s, U, M, Vt, M, superb);

    memset(diags, 0, sizeof(double) * M * M);
    for (int i=0; i<M; ++i)
        diags[i * M + i] = s[i];

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, M, M, 1.0, U, M, diags, M, 0.0, mat, M);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, M, M, 1.0, mat, M, Vt, M, 0.0, U, M);

    double res = 0.0;
    for (int i=0; i<M*M; ++i) {
        double t = mat_orig[i] - U[i];
        res += (t*t);
    }
    printf("norm: %f\n", sqrt(res));

    delete[] superb;
    delete[] Vt;
    delete[] U;
    delete[] diags;
    delete[] s;
    delete[] mat_orig;
    delete[] mat;
    return 0;
}

$ MKL_NUM_THREADS=1 ./main2
norm: 0.000000
$ MKL_NUM_THREADS=2 ./main2
norm: 0.000000
$ MKL_NUM_THREADS=3 ./main2
norm: 0.000000
$ MKL_NUM_THREADS=4 ./main2
norm: 0.000000
$ MKL_NUM_THREADS=5 ./main2
norm: 457.024091
$ MKL_NUM_THREADS=6 ./main2
Segmentation fault (core dumped)
$ MKL_NUM_THREADS=7 ./main2
Segmentation fault (core dumped)

What's wrong here? I'm using the latest MKL + Intel compiler (11.0 update 3) on Ubuntu 12.04 LTS, on an Intel i7-990X machine.

how to generate random numbers over [0.0, 1.0) with MKL VSL


Hi, everyone!

The basic random number generators in MKL VSL can produce numbers within [0.0, 1.0], but how can I generate random numbers over [0.0, 1.0)?
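
For reference, a minimal sketch with vdRngUniform; its standard method is documented to generate on [a, b), so [0.0, 1.0) should fall out directly (worth verifying the boundary behavior on your MKL version):

#include <stdio.h>
#include "mkl_vsl.h"

int main(void)
{
    VSLStreamStatePtr stream;
    double r[10];

    /* Mersenne Twister stream with an arbitrary seed. */
    vslNewStream(&stream, VSL_BRNG_MT19937, 777);

    /* 10 uniform doubles on [0.0, 1.0). */
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 10, r, 0.0, 1.0);

    for (int i = 0; i < 10; ++i)
        printf("%f\n", r[i]);

    vslDeleteStream(&stream);
    return 0;
}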

Thanks in advance!

Pardiso sparse right hand side iparm(31) = 1 (or iparm[30]=1 in C++)


Dear All

I am testing an example with a sparse right-hand side and sparse solution (iparm[30] = 1 in my C++ code).

What I have noticed is that the solutions for (perm[0] = 0, b[0] != 0) and (perm[0] = 0, b[0] = 0) are different.

But I expected the solutions to be the same, since perm[0] = 0.

Furthermore, the solution x[0] is the 0th entry of the solution vector.

However, according to the PARDISO manual entry for iparm(31), x[0] should be a random number, since I didn't request that it be computed.

http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GU...

The attachment is my C++ code.

Could you tell me why this happened?

Hailong

Attachment: main.c (2.67 KB)

Inconsistencies with declaration of 64-bit integers ( signed and unsigned ) in many MKL functions


There are inconsistencies with declaration of 64-bit integers ( signed and unsigned ) in many MKL functions. In mkl_types.h there is a declaration:

/* MKL integer types for LP64 and ILP64 */
#if (!defined(__INTEL_COMPILER)) & defined(_MSC_VER)
    #define MKL_INT64 __int64
    #define MKL_UINT64 unsigned __int64
#else
    #define MKL_INT64 long long int
    #define MKL_UINT64 unsigned long long int
#endif

So, the MKL_INT64 type is a portable declaration. However, many MKL function declarations use the non-portable types long long int and unsigned long long int directly. This creates lots of problems when legacy C++ compilers are used; for example, Borland C++ v5.x doesn't support the long long int and unsigned long long int types.

 

Sparse addition


Hello!

My question is about adding a sparse matrix onto itself, e.g., in the dense sense, performing an operation like A = A + B, where A and B are dense matrices.

In the sparse CSR sense, A may be represented by ia, ja, and aval; B by ib, jb, and bval; and C by ic, jc, and cval.

1. operation: A = B + C

Since the number of non-zero entries in A is unknown, mkl_dcsradd is called first with request = 1. After this call, the number of non-zero entries is known, and mkl_dcsradd is then called with request = 2. At this point, the result of the addition is stored in ia, ja, and aval.
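
For reference, here is a minimal hedged sketch of this two-stage pattern, computing C = A + B for two tiny 2×2 CSR matrices with 1-based indexing (the nnz extraction from ic[m] follows the manual's convention; the array names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include "mkl_spblas.h"

int main(void)
{
    MKL_INT m = 2, n = 2;

    /* A = [1 0; 0 2] and B = [0 3; 4 0] in 1-based CSR. */
    double  a[]  = {1.0, 2.0};
    MKL_INT ja[] = {1, 2};
    MKL_INT ia[] = {1, 2, 3};
    double  b[]  = {3.0, 4.0};
    MKL_INT jb[] = {2, 1};
    MKL_INT ib[] = {1, 2, 3};

    char trans = 'N';
    double beta = 1.0;
    MKL_INT sort = 0, info, nzmax = 0;

    /* Stage 1 (request = 1): only ic is computed, so allocate the
       row-pointer array first; c and jc are dummies here. */
    MKL_INT *ic = malloc((m + 1) * sizeof(MKL_INT));
    double cdum[1]; MKL_INT jcdum[1];
    MKL_INT request = 1;
    mkl_dcsradd(&trans, &request, &sort, &m, &n, a, ja, ia,
                &beta, b, jb, ib, cdum, jcdum, ic, &nzmax, &info);

    MKL_INT nnz = ic[m] - 1;  /* 1-based: last row pointer minus 1 */

    /* Stage 2 (request = 2): now that nnz is known, fill c and jc. */
    double  *c  = malloc(nnz * sizeof(double));
    MKL_INT *jc = malloc(nnz * sizeof(MKL_INT));
    request = 2;
    mkl_dcsradd(&trans, &request, &sort, &m, &n, a, ja, ia,
                &beta, b, jb, ib, c, jc, ic, &nzmax, &info);

    printf("nnz(C) = %d, info = %d\n", (int)nnz, (int)info);
    free(c); free(jc); free(ic);
    return 0;
}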

Problem 1: I don't clearly understand the use of pointers here, but what I have seen is that although ia, ja, and aval are arrays per the CSR format, mkl_?csradd appears to accept and output ia, ja, and aval as scalars. Printing them gives the end user the impression that they are arrays, but they actually are scalars.

(code attached with output )

2. Considering operation: A = A + B

Problem 2: Suppose that after operation 1 is performed, I need to perform the operation A = A + B. It wouldn't be possible to do this straight away, because the input arguments to the mkl_?csradd call require the CSR representation of A (three 1-D arrays), whereas the data I would have is just three scalars (not arrays!). Is there a solution to this problem? Supposing we can circumvent it, would it then be a problem to perform A = A + B, given the inconsistency in the size of A on the LHS and RHS as MKL would see it? From my point of view, I'm just trying to make the matrix A bigger: I want to add all the entries of B to the same number of entries in A, hence no inconsistency, physically speaking.

I hope I have presented the problem clearly. The code and output attached should hopefully make things clear!

Many Thanks,

Amar

Incorrect results or crashes found in MKL 11.0 Update 3 LAPACK; Patch available soon


Problems have been found in a LAPACK internal function in Intel MKL 11.0 update 3 (available in Intel® Composer XE 2013 update 3) which can cause many other LAPACK functions to crash or produce incorrect results. The problem is related to an operation that divides work among threads for a certain class of LAPACK functions that are bandwidth-limited but can still benefit from some parallelism. The problem occurs when a thread count greater than 4 and not divisible by 4 is chosen. For a list of all affected functions, please refer to our knowledgebase article on this topic.

We will be making a patch available soon.

Workaround: explicitly set the number of threads to 1, 2, 3, or 4n (where n is an integer greater than 1).
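
For example, the workaround can be applied from the environment (4n with n = 2; the binary name is illustrative):

$ MKL_NUM_THREADS=8 ./my_lapack_app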

Parallel FGMRES


Hello!

What paradigm of parallel processing is FGMRES designed for: SPMD or MPMD?

Is it possible to run FGMRES in parallel in the Single Program Multiple Data paradigm?

I intend to partition my matrix into n parts and then invoke FGMRES sequentially on each of the processors (since using the parallel flag took more time than the sequential run for the example routine). Does it make sense to do this? Are there smarter ways of dealing with it?

Many Thanks,


ScaLAPACK pdpotrs


I want to use the pdpotrs function to solve a system of linear equations with a Cholesky-factored symmetric positive definite matrix. Can you explain the basic steps of using this ScaLAPACK function, and send me an example of pdpotrs in Fortran? Thanks.
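
The basic sequence is: initialize a BLACS process grid, create array descriptors for the distributed matrices, factor with pdpotrf, then solve with pdpotrs. Below is a minimal hedged sketch on a 1×1 process grid (block sizes equal to the matrix size, so everything lives on one process; run with one MPI rank; the 2×2 data is illustrative):

program test_pdpotrs
    implicit none
    integer, parameter :: n = 2, nrhs = 1
    integer :: ictxt, info, desca(9), descb(9)
    double precision :: a(n,n), b(n,nrhs)

    ! Set up a 1x1 BLACS process grid.
    call blacs_get(-1, 0, ictxt)
    call blacs_gridinit(ictxt, 'R', 1, 1)

    ! Descriptors: block sizes = matrix sizes, source process (0,0).
    call descinit(desca, n, n,    n, n,    0, 0, ictxt, n, info)
    call descinit(descb, n, nrhs, n, nrhs, 0, 0, ictxt, n, info)

    a = reshape((/ 4.0d0, 1.0d0, 1.0d0, 3.0d0 /), (/ n, n /))  ! SPD matrix
    b = reshape((/ 1.0d0, 2.0d0 /), (/ n, nrhs /))

    call pdpotrf('U', n, a, 1, 1, desca, info)                  ! Cholesky
    call pdpotrs('U', n, nrhs, a, 1, 1, desca, b, 1, 1, descb, info)
    print *, 'x =', b, ' info =', info

    call blacs_gridexit(ictxt)
    call blacs_exit(0)
end program test_pdpotrs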

More problems with Forum software


At the top of the MKL forum there is a link with the text "Click here for more information about private thread." Clicking on the link leads to a "Page Not Found" message.

Is it no longer possible to post privately?

A question on solving many similar matrices with PARDISO


Hello,

I have a quick question about solving many structurally similar sparse matrices with PARDISO.

In particular, I have many linear systems A_i x_i = b_i, where i = 1,...,N and all A_i have the same sparsity structure.

I know that only one pointer array pt is needed for solving all of the systems, and I can solve them with the following pseudocode:

-----------------------------------------------------

do i=1,N

PARDISO phase 1 (A_i ,pt)

PARDISO phase 2 (A_i, pt)

PARDISO phase 3 (A_i,b_i,pt)

end do

-----------------------------------------------------

My question is whether it is OK (and whether it is beneficial) to pull phase 1 and/or phase 2 of PARDISO out of the do loop. That is,

-----------------------------------------------------

PARDISO phase 1 (A_1 ,pt)

do i=1,N

PARDISO phase 2 (A_i, pt)

PARDISO phase 3 (A_i,b_i,pt)

end do

-----------------------------------------------------

or,

-----------------------------------------------------

PARDISO phase 1 (A_1 ,pt)

PARDISO phase 2 (A_1, pt)

do i=1,N

PARDISO phase 3 (A_i,b_i,pt)

end do

-----------------------------------------------------

I have tried all three possibilities above, and it seems that all of them give the same (correct) result. The computational times are also almost the same. This experimental result seems weird to me, because my intuition is that only phase 1 (symbolic factorization) can be done once for all the matrices; phase 2 (numerical factorization) should be done independently for each matrix. I cannot find an explanation or example in the MKL package/manual showing the proper way to do this.

Thank you.

MKL_FREEBUFFERS error


I recently purchased Intel Visual Fortran Composer XE 2013 and got an MKL_FREEBUFFERS error while compiling my code.

The problem did not occur when I used the earlier Visual Fortran compiler.

Any help would be greatly appreciated.

First solve with dense right hand side and following solve with sparse rhs


Dear All

I am trying to use PARDISO (the sparse direct solver) as my subdomain (substructure) solver in my domain decomposition application.

The first solve has a dense right-hand side (rhs) coming from source terms and boundary conditions.

The following solves reuse the factorized sparse matrices with a sparse right-hand side associated only with the boundary conditions.

With PARDISO, for the first solve I set IPARM(31) = 0 and PERM(i) = 0, with phase = 13.

For the following solves, I set IPARM(31) = 1 and only some PERM(i) = 1.

It seems that the entries b(m) affect my solution even when PERM(m) = 0 with phase = 33.

Here are my questions.

1. Is PERM used by pardiso() when phase = 33?

2. Is there a better way to achieve my goal, which is that the first solve (phase = 13) uses a dense right-hand side, but the following solves (phase = 33) reuse the factorized matrices with a sparse right-hand side?

Hailong
