Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Pardiso scaling inversely with number of threads for openmp


Hello. After a number of failed debugging attempts and tests, I'm hoping to get some input on using PARDISO in parallel with OpenMP. The software in question is a broader finite element code built with the Intel Fortran compiler (ifort) that uses the PARDISO solver.

I have attempted to run PARDISO in parallel via OpenMP on 1, 2, and 4 processors, but the solve time systematically increases as the number of processors increases. This behavior is repeatable with:

  1. two different computers (1 Linux desktop, 1 Linux cluster)
  2. multiple versions of the Intel Fortran compilers/MKL (11.1_080, 2013.5.192, 2013_sp1) on the Linux desktop
  3. repeated checking of different MKL link line options (https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/)
  4. tests with different matrix sizes - the number of equations varies from ~100,000 to more than 10,000,000

For example, PARDISO solve times might scale in the following way for a relatively small matrix (80,000 equations): 1 processor: 0.79 seconds, 2 processors: 1.17 seconds, 4 processors: 1.91 seconds.

So, something I'm doing across all the systems (hardware, compiler versions, etc.) is fundamentally wrong.

Before posting specifics for one example (iparm input parameters, MKL link line commands, compiler version, etc.), is there any documentation, previous posts, etc. I should look at that might shed some light on this issue? At this point I've gone through the MKL manual and forums and haven't found any clues to what the issue is. If there is no other documentation to look up, I'll go ahead and post whatever system/solver information is required.

Thanks in advance,

John
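
For reference, a minimal C sketch (the same service functions are callable from Fortran) of the thread-setting checks that usually come before any deeper PARDISO debugging; this is an illustration of the relevant calls, not a diagnosis:

    #include <stdio.h>
    #include <omp.h>
    #include "mkl.h"

    int main(void)
    {
        mkl_set_dynamic(0);        /* keep MKL from silently reducing its thread count */
        mkl_set_num_threads(4);    /* number of threads MKL (and PARDISO) may use      */

        printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
        printf("mkl_get_max_threads() = %d\n", mkl_get_max_threads());

        /* If mkl_get_max_threads() reports 1, the sequential MKL layer
           (mkl_sequential) is linked and PARDISO will not scale at all; if both
           report the expected count but pardiso is called from inside an
           OpenMP parallel region, oversubscription is a common culprit. */
        return 0;
    }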

 



Problem using pdhseqr


Hi,

I am having trouble using pdhseqr from ScaLAPACK (it is supposed to compute the eigenvalues and Schur factorization of a matrix already in Hessenberg form). The problem occurs with both pshseqr and pdhseqr, and it is the same whether I compile with MSVC 12 or Intel 14. Here is an (as minimal as possible) code sample. It is C++, but so far I haven't had any problems with other functions from PBLAS or ScaLAPACK.

#include "mpi.h"
#include <algorithm>    // std::max
#include <iostream>
#include "mkl.h"

#ifdef _WIN32 /* Win32 or Win64 environment */
#define numroc_ NUMROC
#define descinit_ DESCINIT
#endif

extern "C" {
	/* Cblacs declarations */
	void Cblacs_get(int, int, int*);
	void Cblacs_gridinit(int*, const char*, int, int);
	void Cblacs_pcoord(int, int, int*, int*);
	void Cblacs_gridexit(int);
	void Cblacs_exit(int);

	int numroc_(int*, int*, int*, int*, int*);
	void descinit_(int *, int *, int *, int *, int *, int *, int *,
		int *, int *, int *);

	void pdhseqr_(char *job, char* compz, int* n, int* ilo, int* ihi, double* h, int* desch, double* wr, double* wi, double* z, int* descz, double* work, int* lwork, int* iwork, int* liwork, int* info);
	void psgemm_(char*transa, char*transb, int*m, int*n, int*k, float*alpha, float*a, int*ia, int*ja, int*desca, float*b, int*ib, int*jb, int*descb, float*beta, float*c, int*ic, int*jc, int*descc);
	void pdgemm_(char*transa, char*transb, int*m, int*n, int*k, double*alpha, double*a, int*ia, int*ja, int*desca, double*b, int*ib, int*jb, int*descb, double*beta, double*c, int*ic, int*jc, int*descc);
	void pshseqr_(char *job, char* compz, int* n, int* ilo, int* ihi, float* h, int* desch, float* wr, float* wi, float* z, int* descz, float* work, int* lwork, int* iwork, int* liwork, int* info);
}

inline int indxl2g(int lidx, int sblock, int nprocs, int iproc)
{
	return nprocs*sblock*(lidx / sblock) + lidx%sblock + ((nprocs + iproc - 0) % nprocs)*sblock;
}


int main(int argc, char* argv[])
{
	// Initialize parallel stuff
	int ctxt, myrank, myrow, mycol, numproc;

	MPI::Init();
	myrank = MPI::COMM_WORLD.Get_rank();
	numproc = MPI::COMM_WORLD.Get_size();

	char major = 'R';	// the process grid will be row-major
	int iZERO = 0; int iONE = 1; int iMONE = -1;
	double dZERO = 0.0;  double dONE = 1.0; double dTWO = 2.0;
	float fZERO = 0.0;  float fONE = 1.0; float fTWO = 2.0;

	Cblacs_get(0, 0, &ctxt);	// get the system context
	Cblacs_gridinit(&ctxt, &major, numproc, 1);
	Cblacs_pcoord(ctxt, myrank, &myrow, &mycol);

	// Create a shared matrix
	// Size of the data on the local process
	int M = 10;	int N = 10;
	int block = 10;
	int m = numroc_(&M, &block, &myrow, &iZERO, &numproc);
	int n = numroc_(&N, &block, &mycol, &iZERO, &iONE);

	std::cout << "m, n: "<< m << ", "<< n << std::endl;
	float* H = new float[m*n];
	float* Q = new float[m*n];

	// Finally fill in the BLACS array descriptor
	int descH[9]; int descQ[9]; int info;
	m = std::max(1, m);	// necessary because descinit() will throw if llda is 0

	descinit_(descH, &m, &n, &block, &block, &iZERO, &iZERO, &ctxt, &m, &info);
	descinit_(descQ, &m, &n, &block, &block, &iZERO, &iZERO, &ctxt, &m, &info);


	//Make H and Q upper-Hessenberg
	for (int j = 0; j < m; ++j)
	{
		for (int k = 0; k < n; ++k)
		{
			if (indxl2g(j, block, numproc, myrow) > indxl2g(k, block, 1, mycol) + 0)
			{
				H[j + m*k] = 0;
				Q[j + m*k] = 0;
			}
			else{ H[j + m*k] = 1; Q[j + m*k] = 1; }
		}
	}

	char op = 'N';
	// Compute H <- Q*Q + H, just to check that is works
	psgemm_(&op, &op, &M, &N, &N, &fONE, Q, &iONE, &iONE, descQ, Q, &iONE, &iONE, descQ, &fONE, H, &iONE, &iONE, descH);

	// Check the output
	for (int j = 0; j < m; ++j)
	{
		for (int k = 0; k < n; ++k)
		{
			std::cout << H[j + m*k] << "\t";
		}
		std::cout << std::endl;
	}


	char job = 'S'; char compz = 'I';
	float * wr = new float[N];	float * wi = new float[N];
	float lwork = -42.0; int liwork = -42;

	pshseqr_(&job, &compz, &N, &iONE, &N, H, descH, wr, wi, Q, descQ, &lwork, &iMONE, &liwork, &iMONE, &info);

	std::cout << "lwork, liwork: "<< lwork << ", "<< liwork << std::endl;
	std::cout << "info: "<< info << std::endl;

	int work_size = 10000;
	float work[10000]; int iwork[10000];

	pshseqr_(&job, &compz, &N, &iONE, &N, H, descH, wr, wi, Q, descQ, work, &work_size, iwork, &work_size, &info);
	std::cout << "info: "<< info << std::endl;

	Cblacs_gridexit(ctxt);
	Cblacs_exit(0);
	return 0;
}

And here is the output:

D:\lib\pxhseqr_minimal\x64\Debug>mpiexec -n 1 pxhseqr_minimal.exe
m, n: 10, 10
2       3       4       5       6       7       8       9       10      11
0       2       3       4       5       6       7       8       9       10
0       0       2       3       4       5       6       7       8       9
0       0       0       2       3       4       5       6       7       8
0       0       0       0       2       3       4       5       6       7
0       0       0       0       0       2       3       4       5       6
0       0       0       0       0       0       2       3       4       5
0       0       0       0       0       0       0       2       3       4
0       0       0       0       0       0       0       0       2       3
0       0       0       0       0       0       0       0       0       2
lwork, liwork: 894262, -42
info: 0
{   -1,   -1}:  On entry to

job aborted:
rank: node: exit code[: error message]
0: Oz: -1073741819: process 0 exited without calling finalize

If I try to run with Visual Studio's debugger I get this when pdhseqr is called the second time:

Unhandled exception at 0x00007FF6CA2D8F1D in pxhseqr_minimal.exe: 0xC0000005: Access violation reading location 0x0000000000000007.

I don't know how MKL is implemented, but since the error message is garbled (the process coordinates are -1,-1), I think it might be based on an old source where pxerbla was called with only two arguments (see line 357 of http://www.netlib.org/scalapack/explore-html/d3/d3a/pdhseqr_8f_source.html).

Or am I doing something wrong? 

Any help would be very welcome; I have been stuck computing eigenvectors in serial for a while, so I would really like to get this routine to work.

Best,

Romain
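
For what it is worth, the usual ScaLAPACK workspace-query pattern allocates the sizes returned by the query instead of a fixed 10000-element array; below is a C-style fragment reusing the pshseqr_ prototype declared above (illustrative only, not a diagnosis of the crash; it assumes <stdlib.h> and the variables from the listing):

    float query_work = 0.0f;
    int   query_iwork = 0;
    int   lwork = -1, liwork = -1, info = 0;

    /* workspace query: lwork = liwork = -1 */
    pshseqr_(&job, &compz, &N, &iONE, &N, H, descH, wr, wi, Q, descQ,
             &query_work, &lwork, &query_iwork, &liwork, &info);

    lwork  = (int)query_work;                 /* 894262 in the run shown above      */
    liwork = (query_iwork > 0) ? query_iwork  /* the post reports this is not       */
                               : 7 * N;       /* filled in; this fallback is an     */
                                              /* arbitrary illustrative guard only  */
    float *work  = (float *)malloc((size_t)lwork  * sizeof(float));
    int   *iwork = (int   *)malloc((size_t)liwork * sizeof(int));

    pshseqr_(&job, &compz, &N, &iONE, &N, H, descH, wr, wi, Q, descQ,
             work, &lwork, iwork, &liwork, &info);

    free(work);
    free(iwork);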

Problems in running MKL fortran example


Dear all,

After the example Fortran code vdrnggaussian.f compiles successfully (in Visual Studio 2010 + Intel Parallel Studio XE 2013), the resulting executable returns the following error message:

MKL ERROR: Parameter 1 was incorrect on entry to vdRngGaussian.

Error: bad arguments (code    -3).

What should I do to get the right results? Many thanks for your help.
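
For comparison, here is a self-contained C sketch of the call sequence the Fortran example performs (the BRNG and method constants here are common defaults and are assumptions, not necessarily what vdrnggaussian.f uses). Parameter 1 of vdRngGaussian is the generation method, and passing an invalid or out-of-date method constant is one typical way to get error -3:

    #include <stdio.h>
    #include "mkl.h"

    int main(void)
    {
        VSLStreamStatePtr stream;
        double r[1000];
        int status;

        status = vslNewStream(&stream, VSL_BRNG_MT19937, 777);   /* BRNG + seed */
        if (status != VSL_STATUS_OK) { printf("vslNewStream failed: %d\n", status); return 1; }

        /* parameter 1 = method, then stream, count, output array, mean, sigma */
        status = vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, 1000, r, 0.0, 1.0);
        printf("vdRngGaussian status = %d, r[0] = %f\n", status, r[0]);

        vslDeleteStream(&stream);
        return 0;
    }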

 

 

pardiso multiple right hand side with sparse matrix


Hi All,

PARDISO can handle multiple right-hand sides, AX = B, where B is a matrix and every column of B is a right-hand side. If B is very sparse, is there any way to exploit this, instead of storing B as a full (dense) matrix? Thank you!
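
For illustration only, since PARDISO's documented interface takes the right-hand sides as a dense n x nrhs array (so this does not answer whether sparsity in B can be exploited internally): a C fragment that scatters a sparse B into a dense block and passes it to the solve phase. It assumes pt, iparm, maxfct, mnum, mtype, n, a, ia, ja, perm, and msglvl were already set up by the analysis/factorization phases, and b_csc_colptr/b_csc_rowind/b_csc_val are a hypothetical zero-based CSC storage of B:

    #include <stdlib.h>
    /* plus mkl.h (or mkl_pardiso.h / mkl_types.h) for pardiso and MKL_INT */

    MKL_INT nrhs = ncols_B;                     /* number of columns of B (assumed) */
    double *b_dense = (double *)calloc((size_t)n * nrhs, sizeof(double));
    double *x       = (double *)malloc((size_t)n * nrhs * sizeof(double));

    /* scatter column j of the sparse B into column j of the dense block */
    for (MKL_INT j = 0; j < nrhs; j++)
        for (MKL_INT p = b_csc_colptr[j]; p < b_csc_colptr[j + 1]; p++)
            b_dense[(size_t)j * n + b_csc_rowind[p]] = b_csc_val[p];

    MKL_INT phase = 33, error = 0;              /* solve + iterative refinement     */
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
            perm, &nrhs, iparm, &msglvl, b_dense, x, &error);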

Linking BLAS95 and LAPACK95 to a shared library in Linux


I'm in the process of porting an application from Windows to Linux, and one component of my application is a shared library (a DLL on Windows) compiled with Intel Fortran and with heavy use of the BLAS and LAPACK libraries. 

When trying to compile the library in Linux, my first preference was to link the MKL libraries to it statically. According to the link line advisor, the required definition for LINKLIBS in my makefile was to be:

 $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a -Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_sequential.a -Wl,--end-group -lpthread -lm

Using this, I obtained the following error message:

ld: /opt/intel/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_lapack95_lp64.a(dgesdd.o): relocation R_X86_64_32 against `__STRLITPACK_1' can not be used when making a shared object; recompile with -fPIC
/opt/intel/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_lapack95_lp64.a: could not read symbols: Bad value

What this implies is that libmkl_lapack95_lp64.a is only intended to be linked into main applications or other static libraries, not into shared libraries. I verified this by typing:

readelf --relocs libmkl_lapack95_lp64.a | egrep '(GOT|PLT|JU?MP|SLOT)'

as suggested here and, sure enough, there is no offset table (even though most of the other MKL static libraries do in fact have them).

This is frustrating (I have another C shared library which I link with the static versions of the IPP libraries without any issues) but I can live with it. So, I went back to the MKL link-line advisor to find how to link the MKL dynamically, and the advice is:

"Use this link line:

 $(MKLROOT)/lib/intel64/libmkl_blas95_lp64 $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64 -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm"

I obeyed this to the letter, and the result is:

ifort: error #10236: File not found:  '/opt/intel/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_blas95_lp64'
ifort: error #10236: File not found:  '/opt/intel/composer_xe_2013_sp1.3.174/mkl/lib/intel64/libmkl_lapack95_lp64'

Not altogether surprising, because indeed there aren't any files with these names that have either no suffix or a .so suffix. There are just the ".a" ones, and if I add the ".a" at the ends of the filenames in my link line then of course I'm using the same static libraries as before and I get the same relocation error as before. By the way, the same happens if I use the "-lmkl_blas95_lp64" syntax that I've seen documented elsewhere.

So, should I compile my own shared library? It seems that not even this is supported, since if I dig down into /opt/intel/composer_xe_2013_sp1.3.174/mkl/interfaces/blas95 and examine the makefile that is provided for doing a custom build, even that only supports static builds.

So, finally, my question: what should I do? I see three alternatives:

(1) Manually edit the Intel makefiles to recompile libmkl_blas95_lp64.a and libmkl_lapack95_lp64.a with the ifort equivalent to '-fPIC' (if there is one).

(2) Manually edit the Intel makefiles to create libmkl_blas95_lp64.so and libmkl_lapack95_lp64.so.

(3) Conclude that I'm doing something which, for some reason I can't discern, is unsupported and inadvisable.

Hopefully, there is a fourth alternative which is simpler than any of the above, but I don't know what that may be. I'd very much appreciate some advice. 

Using Intel® C++ Composer XE for Multiple Simple Random Sampling without Replacement



Introduction

Random sampling is often used when pre- or post-processing of every record of the entire data set is too expensive, as in the following examples: when the file of records or database is very large, the retrieval cost per record is high; when a record requires further physical examination of the real-world entity it describes, a fiscal audit of financial records, or a medical examination of sampled patients in an epidemiological study, post-processing a single record is too time-consuming. Random sampling is typically used to support statistical analysis of an entire data set and estimation of aggregate statistics (such as an average), to estimate parameters of interest, or to perform hypothesis testing. Typical applications of random sampling are financial audits, fissile materials audits, epidemiology, exploratory data analysis and graphics, statistical quality control, polling and marketing research, official surveys and censuses, statistical database security and privacy, etc.


Problem statement

Definitions:

  • The population to be sampled is assumed to be a set of records (tuples) of a known size N.
  • A fixed-size random sample is a random sample for which the sample size is a specified constant M.
  • A simple random sample without replacement (SRSWOR) is a subset of the elements of a population where each element is equally likely to be included in the sample and no duplicates are allowed.

We need to generate multiple fixed-size simple random samples without replacement. Each sample is unbiased, i.e., each item (record) in a sample is chosen from the whole population with equal probability 1/N, independently of the others. All samples are independent.

Note: We consider the special case where all records are numbered with the natural numbers from 1 to N, so we do not need access to the population items themselves (or we have an array of indexes of the population items).

In other words, we need to conduct a series of experiments, each generating a sequence of M unique random natural numbers from 1 to N (1≤M≤N).

The attached program uses M=6 and N=49, conducts 119 696 640 experiments, generates a large number of result samples (sequences of length M) in the single array RESULTS_ARRAY, and uses all available parallel threads. In the program, we call each experiment a “lottery M of N”.


Considered approaches to simulate one experiment

Algorithm 1

A straightforward algorithm to simulate one experiment is as follows:

A1.1: let RESULTS_ARRAY be empty
A1.2: for i from 1 to M do:
    A1.3: generate random natural number X from {1,...,N}
    A1.4: if X is already present in RESULTS_ARRAY (loop), then go to A1.3
    A1.5: put X at the end of RESULTS_ARRAY
End.

In more detail, step A1.4 is the “for” loop of length i-1:

A1.4.1: for k from 1 to i-1:
A1.4.2: if RESULTS_ARRAY[k]==X, then go to A1.3
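
For concreteness, a minimal C sketch of Algorithm 1 for a single experiment (rand() stands in for a real RNG here, and its slight modulo bias is ignored for brevity):

    #include <stdlib.h>

    /* Fill results[0..M-1] with M distinct random numbers from {1,...,N}. */
    void algorithm1(int M, int N, int *results)
    {
        int i, k, x, duplicate;
        for (i = 0; i < M; i++) {
            do {
                x = 1 + rand() % N;          /* A1.3: X uniform on {1,...,N}   */
                duplicate = 0;
                for (k = 0; k < i; k++)      /* A1.4: linear duplicate search  */
                    if (results[k] == x) { duplicate = 1; break; }
            } while (duplicate);             /* reject and redraw duplicates   */
            results[i] = x;                  /* A1.5: append X                 */
        }
    }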

Algorithm 2

This algorithm uses the partial “Fisher-Yates shuffle” algorithm. Each experiment is treated as a partial length-M random shuffle of the whole population of N elements. It needs M random numbers. The algorithm is as follows:

A2.1: (Initialization step) let PERMUT_BUF contain natural numbers 1, 2, ..., N
A2.2: for i from 1 to M do:
    A2.3: generate random integer X uniform on {i,...,N}
    A2.4: interchange PERMUT_BUF[i] and PERMUT_BUF[X]
A2.5: (Copy step) for i from 1 to M do: RESULTS_ARRAY[i]=PERMUT_BUF[i]
End.

Explanation: each iteration of loop A2.2 works like one step of a real lottery. Namely, in each step we extract a random item X from the items remaining in the bin PERMUT_BUF[i], ..., PERMUT_BUF[N] and put it at the end of the results row PERMUT_BUF[1],...,PERMUT_BUF[i]. The shuffle is partial because we do not generate a full permutation of length N, but only a part of length M.

At the cost of more memory and the extra Initialization and Copy steps (loops), Algorithm 2 needs fewer random numbers than Algorithm 1 and avoids the nested loop A1.4 with its “if” branching. Therefore, we chose Algorithm 2.

When simulating many experiments, the Initialization step is needed only once, because at the beginning of each experiment the order of the natural numbers 1...N in the PERMUT_BUF array does not matter (just as in a real lottery).

Note that in our C program (attached), zero-based arrays are used.


Optimization

We use Intel® C++ Compiler, with its OpenMP* implementation, and Intel® MKL shipped with Intel® Composer XE 2013 SP1.

Parallelization

We exploit all CPUs and all of their cores by using OpenMP* (see “#pragma omp parallel for” in the code, and see [4] for more details about OpenMP usage).

We use Intel® MKL MT2203 BRNG since it easily supports a parallel independent stream in each thread (see [3] for details).

    #pragma omp parallel for num_threads(THREADS_NUM)
    for( thr=0; thr<THREADS_NUM; thr++ ) { // thr is thread index
        VSLStreamStatePtr stream;

        // RNG initialization
        vslNewStream( &stream, VSL_BRNG_MT2203+thr, seed );

        ... // Generation of experiment samples (in thread number thr)

        vslDeleteStream( &stream );
    }

Generation of experiment samples

In each thread, we generate EXPERIM_NUM/THREADS_NUM experiment results. For each experiment we call the Fisher_Yates_shuffle function, which implements steps A2.2, A2.3, and A2.4 of the core algorithm to generate the next result sample. After that we copy the generated sample to RESULTS_ARRAY (step A2.5), as shown below:

    //  A3.1: (Initialization step) let PERMUT_BUF contain natural numbers 1, 2, ..., N
    for(i=0; i<N; i++) PERMUT_BUF[i]=i+1; // we will use the set {1,...,N}

    for(sample_num=0;sample_num<EXPERIM_NUM/THREADS_NUM;sample_num++) {
        Fisher_Yates_shuffle(...);

        for(i=0; i<M; i++)
            RESULTS_ARRAY[thr*ONE_THR_PORTION_SIZE + sample_num*M + i] = PERMUT_BUF[i];
    }

Fisher_Yates_shuffle function

The function implements steps A2.2, A2.3, and A2.4 of the core algorithm (it chooses a random item from the remaining part of PERMUT_BUF and places this item at the end of the output row, namely in PERMUT_BUF[i]):

    for(i=0; i<M; i++) {
        j = Next_Uniform_Int(...);

        tmp = PERMUT_BUF[i];
        PERMUT_BUF[i] = PERMUT_BUF[j];
        PERMUT_BUF[j] = tmp;
    }

Next_Uniform_Int function

In step A2.3 of the core algorithm, our program calls the Next_Uniform_Int function to generate the next random integer X, uniform on {i,...,N-1}.

To exploit the full power of the vectorized RNGs in Intel MKL while hiding the vectorization overhead, the generator is called to fill a sufficiently large vector D_UNIFORM01_BUF of size RNGBUFSIZE that still fits in the L1 cache. Each thread uses its own buffer D_UNIFORM01_BUF and an index D_UNIFORM01_IDX pointing just past the last random number used from that buffer. On the first call to the Next_Uniform_Int function (or when all random numbers in the buffer have been used), we regenerate the full buffer of random numbers by calling vdRngUniform with length RNGBUFSIZE and reset the index D_UNIFORM01_IDX to zero:

    vdRngUniform( ... RNGBUFSIZE, D_UNIFORM01_BUF ... );

 

Because Intel MKL provides only generators of random values with the same distribution, while step A2.3 needs random integers on different intervals, we fill our buffer with double-precision random numbers uniformly distributed on [0;1) and then, in the “integer scaling step”, convert these double-precision values to the needed integer intervals. Fortunately, we know that step A2.3 of our algorithm will need this sequence of numbers, distributed as follows:

    number 0   distributed on {0,...,N-1}   = 0   + {0,...,N-1}
    number 1   distributed on {1,...,N-1}   = 1   + {0,...,N-2}
    ...
    number M-1 distributed on {M-1,...,N-1} = M-1 + {0,...,N-M}
    (then repeat previous M steps)
    number M     distributed on: see (0)
    number M+1   distributed on: see (1)
    ...
    number 2*M-1 distributed on: see (M-1)

    (then again repeat previous M steps)
    ...
    etc.
Hence, the “integer scaling step” looks like this:
    // Integer scaling step: number k of each group of M must be uniform on
    // {k,...,N-1}, i.e., k plus an integer uniform on {0,...,N-1-k}, so the
    // [0;1) value is scaled by (N-k), the number of admissible values.
    for(i=0;i<RNGBUFSIZE/M;i++)
        for(k=0;k<M;k++)
            I_RNG_BUF[i*M+k] =
                k + (unsigned int)(D_UNIFORM01_BUF[i*M+k] * (double)(N-k));

Notes:

  • RNGBUFSIZE must be a multiple of M;
  • This double-nested loop is not well suited to vectorization, because M=6 is not a multiple of 8 (the number of 32-bit integers in an Intel® Advanced Vector Extensions (Intel® AVX) vector register);
  • Even if we interchange the “for i” and “for k” loops and choose RNGBUFSIZE/M to be a multiple of 8, the loop nest still does not vectorize well, because the results would not be stored contiguously in memory;
  • We put the scaled integers I_RNG_BUF[i*M+k] into the same buffer that holds the double-precision random values D_UNIFORM01_BUF[i*M+k]. Depending on the CPU type, it may be preferable to use a separate buffer for the integers, sized so that both buffers together still fit in the L1 cache. Separate buffers avoid the store-after-load forwarding penalty stalls that can occur because the size of the loaded double-precision values is not equal to the size of the stored integers.
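
Putting the pieces together, here is a compact sketch of a Next_Uniform_Int-style helper as described above (names and the buffer layout mirror the article; the per-thread stream comes from the parallel region shown earlier, the caller initializes *idx to zero, and error handling is omitted):

    #include "mkl.h"

    #define M 6
    #define N 49
    #define RNGBUFSIZE (1024 * M)              /* must be a multiple of M */

    static unsigned int Next_Uniform_Int(VSLStreamStatePtr stream,
                                         unsigned int *i_rng_buf,
                                         double *d_uniform01_buf,
                                         int *idx)
    {
        if (*idx == 0) {                       /* buffer exhausted: refill it */
            int i, k;
            vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream,
                         RNGBUFSIZE, d_uniform01_buf, 0.0, 1.0);
            /* integer scaling step: position k of each group of M gets an
               integer uniform on {k,...,N-1} */
            for (i = 0; i < RNGBUFSIZE / M; i++)
                for (k = 0; k < M; k++)
                    i_rng_buf[i*M + k] = k +
                        (unsigned int)(d_uniform01_buf[i*M + k] * (double)(N - k));
        }
        {
            unsigned int r = i_rng_buf[*idx];
            *idx = (*idx + 1) % RNGBUFSIZE;    /* wrapping to 0 triggers a refill */
            return r;
        }
    }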


Conclusions

The attached implementation of the algorithm presented in this article, built with Intel C++ Composer XE, runs roughly 320 times (about 24 x 13) faster on 119 696 640 “lottery 6 of 49” experiments than a sequential scalar implementation based on the GNU* Scientific Library (GSL) compiled with the GNU Compiler Collection (GCC).

Measured work time is:

  • 0.216 sec (algorithm presented in this article);
  • 69.321 sec (sequential scalar algorithm, based on GSL+GCC, i.e., using gsl_ran_choose function, sequential RNG gsl_rng_mt19937 from GSL, gcc 4.4.6 20110731 with options -O2 -mavx -I$GSL_ROOT/include -L$GSL_ROOT/lib -lgsl -lgslcblas).

The measurements were done on the following platform:

  • CPU: 2 x 3rd-generation Intel® Core™ i7 processor, 2.5 GHz, 2 x 12 cores, 30 MB L3 cache, Hyper-Threading off;
  • OS: Red Hat Enterprise Linux* Server release 6.2, x86_64;
  • Software: Intel® C++ Composer XE 2013 SP1 (with Intel C++ Compiler 13.1.1 and Intel MKL 11.0.3).


Program code attached (see lottery6of49.c file).


References

[1] D. Knuth. The Art of Computer Programming. Volume 2. Section 3.4.2 Random Sampling and Shuffling. Algorithm S, Algorithm P;

[2] Intel® Math Kernel Library Reference Manual, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Statistical Functions”, subsection “Random Number Generators”;

[3] Intel® MKL Vector Statistical Library Notes, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Independent Streams. Block-Splitting and Leapfrogging” about usage of several independent streams of VSL_BRNG_MT2203;

[4] User and Reference Guide for the Intel® C++ Compiler, available at https://software.intel.com/en-us/intel-software-technical-documentation?..., section “Key Features”, subsection “OpenMP support”;

[5] GNU Scientific Library (GSL), available at http://www.gnu.org/software/gsl, documentation section “18 Random Number Generation” about gsl_rng_alloc() and gsl_rng_mt19937 and subsection “20.38 Shuffling and Sampling” about gsl_ran_choose() function.

 

 

 

 

 

Attachment: lottery6of49_0.c (6.36 KB)

Large Scale Weighted Least Squares


Hello,

I would like to use Intel MKL to solve a large-scale weighted least squares problem.

The matrices are sparse, yet the number of elements can be huge.

This is a result of working on images, so the matrices are of size (M x N) by (M x N) for an M x N image.

My questions are:

1. Which solvers should I use for this?

2. Is there a solver to which, instead of creating the matrix, I can hand a pointer to a function that calculates each (i, j) element of the matrix, and hence eliminate the memory constraint?

3. Can MKL handle sparse matrices with (36e6)^2 elements? How much time should I expect it to take to solve Ax = b where A is 36e6 by 36e6?

Thank You.
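
For reference, and only to frame which solvers apply, the weighted least-squares problem above reduces to the normal equations

    \min_x \; \lVert W^{1/2} (A x - b) \rVert_2^2
    \quad\Longleftrightarrow\quad
    A^{\top} W A \, x = A^{\top} W b,

where A^T W A is sparse, symmetric, and positive semidefinite, the class of system targeted by MKL's sparse direct (PARDISO) and iterative (RCI CG) solvers; the RCI iterative solvers only ask the caller for matrix-vector products, so, regarding question 2, the matrix never has to be stored explicitly.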

 

The PZGESV subroutine


Hello everybody,

Please, I need your help to understand the ScaLAPACK subroutine PZGESV.

I am trying to use this subroutine to solve a linear system A*X = B where A is an (N by N) matrix distributed as follows:

each process has a local matrix Al of size (Ml by N), and Ml is not the same for the different processes, but of course the sum of the different Ml is equal to N.

I tried this:

MB_A = 4; NB_A = 4; MB_B = 4;
CALL DESCINIT(DESCA,K_total,K_total,MB_A,NB_A,IRSRC,ICSRC,ICTXT,K_proc,INFO)
CALL DESCINIT(DESCB,K_total,2*N_Emetteurs_Calcul,MB_B,2*N_Emetteurs_Calcul,IRSRC,ICSRC,ICTXT,K_proc,INFO)

But since the K_proc value is local (it depends on the process), I get an ILLEGAL VALUE error with INFO = -9.

Then I tried to put my matrices in a local matrix of size K_proc_max x K_total; I did not get an error and INFO = 0, but the results were totally wrong:

MB_A = 4; NB_A = 4; MB_B = 4;
K_tot_proc_max = maxval(K_total_procs)
CALL DESCINIT(DESCA,K_total,K_total,MB_A,NB_A,IRSRC,ICSRC,ICTXT,K_tot_proc_max,INFO)
CALL DESCINIT(DESCB,K_total,2*N_Emetteurs_Calcul,MB_B,2*N_Emetteurs_Calcul,IRSRC,ICSRC,ICTXT,K_tot_proc_max,INFO)

Allocate(IPIV(K_tot_proc_max+MB_A));
IA=1;JA=1;IB=1;JB=1;
Allocate(Zreduite_proc_inter(K_tot_proc_max,K_total))

Do ii=1,K_total
    Zreduite_proc_inter(1:K_total_proc,ii) = Zreduite_proc(:,ii)
EndDo
Allocate(Vreduit_proc_inter(K_tot_proc_max,2*N_Emetteurs_Calcul))

Do ii=1,2*N_Emetteurs_Calcul
    Vreduit_proc_inter(1:K_total_proc,ii) = Vreduit_proc(:,ii)
EndDo

CALL PZGESV(K_total,2*N_Emetteurs_Calcul,Zreduite_proc_inter,IA,JA,DESCA,IPIV,Vreduit_proc_inter,IB,JB,DESCB,INFO)

I really don't know how to fix it. Can anybody please help me get it right? Thank you.

MKL BLAS95 and LAPACK95 interface build


Hi,

I'm trying to run the following MKL interface makefiles on RHEL MRG (3.10.33-rt32.34.el6rt.x86_64) in order to interface to MKL through the BLAS95 and LAPACK95 interfaces (using the mkl_-prefixed header files):

/opt/intel/composer_xe_2013_sp1.1.106/mkl/interfaces/lapack95/makefile

and

/opt/intel/composer_xe_2013_sp1.1.106/mkl/interfaces/blas95/makefile

However, both builds fail because I do not have the ifort compiler:

make libintel64 INSTALL_DIR=../../lib95
make PLAT=lnx32e build interface= MKLROOT=../.. INSTALL_DIR=../../lib95 FC=ifort
make[1]: Entering directory `/opt/intel/composer_xe_2013_sp1.1.106/mkl/interfaces/lapack95'
mkdir -p ../../lib95/lib/intel64/obj_lapack95_intel64_/obj77
ifort -auto -module ../../lib95/lib/intel64/obj_lapack95_intel64_/obj77 -c -o ../../lib95/lib/intel64/obj_lapack95_intel64_/obj77/lapack_interfaces.obj source/lapack_interfaces.f90
make[1]: ifort: Command not found
make[1]: *** [../../lib95/lib/intel64/libmkl_lapack95_lp64.a] Error 127
make[1]: Leaving directory `/opt/intel/composer_xe_2013_sp1.1.106/mkl/interfaces/lapack95'
make: *** [libintel64] Error 2

which ifort
/usr/bin/which: no ifort in (/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/tester/bin)

I also don't have a bin directory in the /composer_xe_2013_sp1.1.106/mkl directory, which a few forums suggested would contain an ifort installer.

I tried using gfortran in the make command (FC=gfortran), but the build fails towards the end for one of the files.

Could you provide ifort installation instructions, or an alternative method for building these interface libraries?

Thanks,

Jay

transpose matrix stored in CSR format using MKL


All,

I have some legacy parallel code that uses the CSR format to store a very large, sparse matrix. In making some additions to the code, I need to transpose the matrix, storing the transpose in CSR format as well.

Is there a routine in MKL that would help me do this? I thought of using a BLAS routine to repeatedly multiply the matrix by column vectors with ones in successive locations and accumulate the results. Is there a better way?

Thanks,

Tom
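
One possible route, sketched below under the assumptions that the matrix is square (n x n) and stored zero-based: the CSC representation of A is exactly the CSR representation of A^T, and MKL's format converter mkl_dcsrcsc produces it in one call. The job[] settings follow the manual's conventions as I read them and should be double-checked there before use; the output arrays must be pre-allocated (at and jat of length nnz, iat of length n+1).

    #include "mkl.h"

    void transpose_csr(MKL_INT n,
                       double *a,  MKL_INT *ia,  MKL_INT *ja,   /* A   in CSR, 0-based          */
                       double *at, MKL_INT *iat, MKL_INT *jat)  /* A^T in CSR, 0-based (output) */
    {
        /* job[0]=0: convert CSR -> CSC;  job[1]=0: input indices are 0-based;
           job[2]=0: output indices are 0-based;  job[5]!=0: fill values too. */
        MKL_INT job[8] = {0, 0, 0, 0, 0, 1, 0, 0};
        MKL_INT info = 0;

        mkl_dcsrcsc(job, &n, a, ja, ia, at, jat, iat, &info);
        /* Because CSC(A) is CSR(A^T), (at, iat, jat) now hold the transpose. */
    }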

Octave 3.8.1 with MKL


I am trying to rebuild GNU Octave with the Intel MKL on Ubuntu 14.04.

uname -a

Linux zoli-Precision-WorkStation-T3500 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

 

based on this description:

https://software.intel.com/en-us/articles/using-intel-mkl-in-gnu-octave

The only thing I haven't done is edit the configure.in file, since there is no such file in the source directory of Octave 3.8.1. But that step has to do with FFT, which I don't need right now anyway (so I assume it's not a problem).

Having set the environment variables (CC, CXX, F77, CFLAGS, CPPFLAGS, LDFLAGS), I run configure with the suggested parameters and get the following error:

checking whether ifort has the intrinsic function ISNAN... yes

checking whether ifort generates correct size integers... no

configure: error: your Fortran compiler must have an option to make integers the same size as octave_idx_type (int).  See the file INSTALL for more information.

I have installed both the Intel C++ and Fortran compilers (l_ccompxe_2013_sp1.3.174 and l_fcompxe_2013_sp1.3.174), with only the 64-bit option.

Since the error message refers to Fortran, I posted this problem here, but I am not sure whether it is really a problem with the Fortran compiler.

Could anyone help me please?

 

 

 

 

 

pardiso_getdiag


When I do a

call pardiso_getdiag( pt, df, da, mnum, error )

the arrays DF and DA are filled with the 'factored' and 'original' diagonal pivots.

Are the pivots in the same order as in the original matrix, or are they permuted?

 

Intel® MKL Cookbook Recipes


Intel MKL Users,

We would like to introduce a new feature, the Intel® MKL Cookbook: an online document with recipes for assembling Intel MKL routines to solve complex problems. Please give us your valuable feedback on these Cookbook recipes, and let us know if you would like us to include more recipes and/or improve the existing ones.

Thank you for evaluating,

Intel MKL Team


Error in the documentation of feast


Hi,

I would like to report an error in the reference manual documentation for the extended eigensolver (FEAST). In the description of the RCI interface, the parameter ijob is detailed. When ijob = 30 or 40, it is written that i is fpm(25) and j is fpm(24)+fpm(25)-1, whereas i is actually equal to fpm(24). With this value of i, the solver works correctly.

Best regards.

 

Eigenvalue Solver Error (dfeast_scsrgv)


I'm trying to solve an eigenvalue problem for large sparse matrices using the dfeast_scsrgv function. The function works fine for small problems (e.g., an 8x8 sparse matrix), but it gives a System.StackOverflowException error for larger problems (e.g., a 200x200 sparse matrix). I'm using Visual Studio 2008 and MKL version 11 with the most recent updates installed. My system is 64-bit Windows and the programming language is C++. Below is the eigenvalue solver code I'm using. In debug mode, when I reach the dfeast_scsrgv line, it gives me a stack overflow error. I do not think I am using any infinite loop or unnecessarily large arrays. I would appreciate it if someone could help me fix the problem. Thanks!

 

   
//Convert stiffness and mass matrix to CSR format - Seldon library
int NumStiff = M_GStiff.GetDataSize();   
Vector<double> V_GStiffVal   (NumStiff);   
Vector<int>    V_GStiffColInd(NumStiff);
Vector<int>    V_GStiffRowPtr(PrbDim+1);
ConvertToCSR(M_GStiff, prop, V_GStiffRowPtr, V_GStiffColInd, V_GStiffVal);  
 
int NumMass = M_GMass.GetDataSize();
Vector<double> V_GMassVal   (NumMass);
Vector<int>    V_GMassColInd(NumMass);
Vector<int>    V_GMassRowPtr(PrbDim+1);
ConvertToCSR(M_GMass, prop, V_GMassRowPtr, V_GMassColInd, V_GMassVal);   
 
//Release memory
M_GStiff.Clear();
M_GMass.Clear();
 
//Convert Seldon format to typical C array
double* a = V_GStiffVal.GetData();
int*   ia = V_GStiffRowPtr.GetData();
int*   ja = V_GStiffColInd.GetData();   
 
double* b = V_GMassVal.GetData();
int*   ib = V_GMassRowPtr.GetData();
int*   jb = V_GMassColInd.GetData();
 
// Convert matrices from 0-based C notation to Fortran 1-based
int nnz = ia[PrbDim];
for (int i = 0; i < PrbDim+1; i++)  ia[i] += 1;
for (int i = 0; i < nnz; i++)       ja[i] += 1;

for (int i = 0; i < PrbDim+1; i++)  ib[i] += 1;
for (int i = 0; i < nnz; i++)       jb[i] += 1;
 
// Initialize variables for the solver
double Error    = 0;
int    Loop     = 0;
int    NumMode  = 10;
double Emin     = 0;
double Emax     = pow(10.0,10.0);    
int    Flag     = 0;
char   MTyp     = 'U';
int    NumEigen = NumMode;
 
vector<int>    V_FPM (128,0);
vector<double> V_Eigen(NumMode,0);
vector<double> V_Res (NumMode,0);    
 
V_FPM[0]  = 1;
V_FPM[1]  = 8;
V_FPM[2]  = 12;
V_FPM[3]  = 20;
V_FPM[4]  = 0;
V_FPM[5]  = 0;
V_FPM[6]  = 5;
V_FPM[13] = 0;
V_FPM[63] = 0;
 
int*    P_FPM          = &V_FPM[0];
double* P_Eigen        = &V_Eigen[0];  
double* P_Res          = &V_Res[0];    
double dDum;
 
// Call Eigenvalue Solver
dfeast_scsrgv (&MTyp, &PrbDim, a, ia, ja, b, ib, jb,
                P_FPM, &Error, &Loop, &Emin, &Emax, &NumMode, P_Eigen, &dDum, &NumEigen, P_Res, &Flag);
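
For reference, and explicitly not a diagnosis of the crash, here is a sketch of how the FEAST arrays are usually sized and initialized for dfeast_scsrgv, with the large output arrays placed on the heap (MSVC's default stack is only 1 MB); MTyp, PrbDim, a, ia, ja, b, ib, jb are assumed to be the quantities from the listing above, and <stdlib.h>/mkl.h are assumed included:

    MKL_INT fpm[128];
    feastinit(fpm);                       /* fill fpm with FEAST defaults first */
    fpm[0] = 1;                           /* e.g. print runtime status          */

    MKL_INT m0 = 10;                      /* search-subspace size (NumEigen)    */
    MKL_INT m = 0, loop = 0, info = 0;
    double  epsout = 0.0, Emin = 0.0, Emax = 1e10;
    double *E   = (double *)malloc((size_t)m0 * sizeof(double));
    double *res = (double *)malloc((size_t)m0 * sizeof(double));
    double *X   = (double *)malloc((size_t)PrbDim * m0 * sizeof(double));  /* n*m0 values */

    dfeast_scsrgv(&MTyp, &PrbDim, a, ia, ja, b, ib, jb,
                  fpm, &epsout, &loop, &Emin, &Emax, &m0, E, X, &m, res, &info);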

mkl_?csradd


Dear all

It looks as if I need to use mkl_?csradd (C++).

However, my CSR matrix is zero-based, and mkl_?csradd only supports one-based sparse matrices.

I would appreciate any suggestions for a pragmatic solution.

A simple application example of mkl_?csradd would be nice, too.

Kind regards

Wolfram
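
Since a simple example was asked for, below is a C sketch of one pragmatic route: shift the zero-based index arrays to one-based in place, call mkl_dcsradd with request = 0 (compute C = A + beta*B in one call into pre-allocated output arrays), and shift everything back. Array sizes and the nzmax bound are the caller's responsibility, and the parameter conventions should be checked against the mkl_dcsradd manual page:

    #include "mkl.h"

    /* delta = +1: zero-based -> one-based, delta = -1: back again */
    static void shift_csr(MKL_INT m, MKL_INT *ia, MKL_INT *ja, MKL_INT delta)
    {
        MKL_INT nnz = (delta > 0) ? ia[m] : ia[m] - 1;
        for (MKL_INT i = 0; i <= m; i++)  ia[i] += delta;
        for (MKL_INT k = 0; k <  nnz; k++) ja[k] += delta;
    }

    /* C = A + beta*B for two m x n CSR matrices that arrive (and leave) zero-based. */
    void csradd_zero_based(MKL_INT m, MKL_INT n, double beta,
                           double *a, MKL_INT *ja, MKL_INT *ia,
                           double *b, MKL_INT *jb, MKL_INT *ib,
                           double *c, MKL_INT *jc, MKL_INT *ic, MKL_INT nzmax)
    {
        char trans = 'N';
        MKL_INT request = 0;   /* 0: compute C in one call, output pre-allocated */
        MKL_INT sort = 0, info = 0;

        shift_csr(m, ia, ja, +1);
        shift_csr(m, ib, jb, +1);

        mkl_dcsradd(&trans, &request, &sort, &m, &n,
                    a, ja, ia, &beta, b, jb, ib, c, jc, ic, &nzmax, &info);

        shift_csr(m, ia, ja, -1);    /* restore the inputs ...               */
        shift_csr(m, ib, jb, -1);
        shift_csr(m, ic, jc, -1);    /* ... and return C zero-based as well  */
        /* info != 0 means nzmax was too small or another error occurred.    */
    }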

FGMRES


All,

I just started learning to use MKL and FGMRES in general, and I'm having trouble implementing/understanding FGMRES. I've looked at the example that came with the library and have been able to run it successfully, but I do have a couple of questions. I tried a simple example just to confirm that I understand what's going on; part of my code is reproduced below. Looking at the scheme in the manual, FGMRES does not use the coefficient matrix at all in any of the calls except when rci_request is not equal to zero. In the example provided in the manual, the expected solution was used to initialize the right-hand side. So at what point does the code actually use the original matrix? In my excerpt below, rci_request was zero when I called dfgmres.

character(1)::transa
integer ::n,m,lval,ndiag,openstatus,lvall,ndiagg,i,j
parameter (ndiag=5,lval=9,m=9,n=9)
integer :: idiag(ndiag)
double precision ::val(lval,ndiag),rhs(lval), y(m),computed_solution(n)
!-----------------------------------------------------------
! declarations used in the fgmres
!-----------------------------------------------------------
integer :: itercount
integer :: rci_request, rci_request2
integer :: size
parameter (size=128)
integer :: ipar(size)
double precision :: dpar(size),tmp(n*(2*n+1)+(n*(n+9))/2+1)
!-----------------------------------------------------------
idiag= (/-3,-1,0,1,3/)

!read array from file
open(unit=14, file="filetest",status="old",iostat=openstatus)
if(openstatus/=0) stop "cannot read data from file1"
read(14,*), ((val(lvall,ndiagg),lvall=1,lval),ndiagg=1,ndiag)    !This was successfully read

open(unit=17, file="vector",status="old",iostat=openstatus)
if(openstatus/=0) stop "cannot read data from file2"
read(17, *), (rhs(i),i=1,lval)                                   !successfully read

!------------------------------------------------------------
!Initializing the initial guess
!------------------------------------------------------------
do i=1,m
    computed_solution(i)=1.0
end do

!--------------------------------------------------------------
!Initialize the solver
!--------------------------------------------------------------
call dfgmres_init(n,computed_solution,rhs,rci_request,ipar,dpar,tmp)
!if (rci_request /= 0) goto 999

!-------------------------------------------------------------------
!   Setting the desired parameters
!   do the restart after 2 iterations
!   Logical parameters:
!   do not do the stopping test for the maximal number of iterations
!   do the preconditioned iterations of fgmres method
!   Double precision parameters:
!   set the relative tolerance to 1.0d-3 instead of default 1.0d-6
!-------------------------------------------------------------------
ipar(15)=2
ipar(8)=0
ipar(11)=1
dpar(1)=1.0d-3

!-------------------------------------------------------------------
! Check the correctness and consistency of the newly set parameters
!-------------------------------------------------------------------
call dfgmres_check(n,computed_solution,rhs,rci_request,ipar,dpar,tmp)
!if(rci_request /=0) goto 999

!----------------------------------------------------------------------
!compute the solution by rci (p)fgmres solver with preconditioning
!reverse communication starts here
!----------------------------------------------------------------------
call dfgmres(n,computed_solution, rhs, rci_request, ipar, dpar, tmp)

ipar(13)=0
call dfgmres_get(n,computed_solution,rhs,rci_request2,ipar,dpar,tmp,itercount)
!----------------------------------------------------------------------
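
To make "where does the matrix enter?" concrete, here is a C sketch of the reverse-communication loop (the C and Fortran interfaces mirror each other): dfgmres never sees A itself; whenever it returns with RCI_request equal to 1, the caller must multiply A by the vector dfgmres placed in tmp at offset ipar(22), store the product at offset ipar(23), and call dfgmres again. my_matvec below is a hypothetical user routine wrapping whatever storage format A uses (the diagonal storage in the excerpt above, for instance); only the calls shown are MKL's.

    #include <stdlib.h>
    #include "mkl.h"

    void solve_with_fgmres(MKL_INT n, double *rhs, double *x,
                           void (*my_matvec)(const double *in, double *out))
    {
        MKL_INT ipar[128], RCI_request, itercount;
        double dpar[128];
        /* tmp sized with the same formula the Fortran excerpt uses */
        double *tmp = (double *)malloc((n*(2*n+1) + (n*(n+9))/2 + 1) * sizeof(double));

        dfgmres_init(&n, x, rhs, &RCI_request, ipar, dpar, tmp);
        dpar[0] = 1.0e-3;                      /* relative tolerance, as in the post */
        dfgmres_check(&n, x, rhs, &RCI_request, ipar, dpar, tmp);

        for (;;) {
            dfgmres(&n, x, rhs, &RCI_request, ipar, dpar, tmp);
            if (RCI_request == 1) {
                /* the only place the coefficient matrix is ever used:
                   input vector starts at tmp[ipar[21]-1], result goes
                   to tmp[ipar[22]-1] (1-based ipar(22)/ipar(23))       */
                my_matvec(&tmp[ipar[21] - 1], &tmp[ipar[22] - 1]);
            } else {
                break;          /* 0 = converged/stopping test met, <0 = error */
            }
        }
        dfgmres_get(&n, x, rhs, &RCI_request, ipar, dpar, tmp, &itercount);
        free(tmp);
    }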

 

 

libiomp5md.dll location (release build)


Hello, where would I find the DLL without tracing code? Currently I use the one from MKL 11.1.2 (64-bit, version 5.0.2013.1126, file size 1043 kB, modified 2014-01-31 12:23), and the profiler shows this (judging by the OpenMP source I found, I assume that __kmp_print_storage_map_gtid is some printf-like tracing function, which eats 69.31% of the time):

Inclusive Samples %   Function Name
100.00   cs.exe
 99.25   - RtlUserThreadStart
 99.25   -- BaseThreadInitThunk
 80.50   --- __kmp_launch_worker(void *)
 80.50   ---- __kmp_launch_thread
 69.32   ----- __kmp_fork_barrier(int,int)
 69.31   ------ __kmp_print_storage_map_gtid
 11.16   ----- __kmp_invoke_task_func
 11.15   ------ __kmp_invoke_microtask
 10.85   ------- mkl_blas_dgemm
  0.19   ------- mkl_lapack_dlasr3
  0.05   ------- etc...
 18.76   __tmainCRTStartup
 18.76   - AfxWinMain(struct HINSTANCE__ *,struct HINSTANCE__ *,char *,int)

I did find:

C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler
C:\Program Files (x86)\Intel\Composer XE 2013 SP1\redist\intel64\compiler

but they are the wrong ones.
