How to delete mkl out-of-core temporary files

June 1, 2017, 12:26 am

Latest and popular articles on Intel Technologies

Hi,

I'm using pardiso_64 in MKL 11.3.3.1.

When run in out-of-core mode, temporary files are generated. I have set the environment variable MKL_PARDISO_OOC_KEEP_FILE = 1 according to https://software.intel.com/en-us/mkl-developer-reference-c-intel-mkl-par... so that the temporary files are deleted when computations are complete. (I have also tried = 0, since "keep file = 1" seemed counterintuitive, but that's beside the point.)

I assumed that files would be deleted when I called pardiso_64 with phase=-1 in order to release internal memory. But the files are not deleted.

How can I get pardiso to delete temporary out-of-core files?

Best,
Jens

Thread Topic:

Help Me

↧

problem when using cluster solver with distributed CSR format of input data

June 3, 2017, 10:49 pm

I encountered the following error when using the cluster solver:

ERROR during symbolic factorization: -2

The example (cl_solver_sym_distr_f.f) provided with the Cluster compiler works well, but the matrix I provided in the attachment produces the above error.

I used Intel MPI to compile the code (also in the attachment). The executable compiled with Intel MPI yields the above error, but it runs when launched with OpenMPI's mpirun.

Could you help me check my code? Both the source code and data are in the attachment.

Thank you very much!

Qian

Attachment	Size
Download intel_cluster_dist_solver.zip	20.46 KB

↧

Runtime error when building with mkl dylib

June 6, 2017, 12:36 am

When I built my program with MKL statically (.a files), everything was fine. But when I built the code with the .dylib files and ran it, I got the following errors:

dyld: Symbol not found: _MKL_Detect_Cpu_Global_Lock
Referenced from: /opt/intel/compilers_and_libraries_2017.4.181/mac/mkl/lib/libmkl_intel_lp64.dylib
Expected in: flat namespace
in /opt/intel/compilers_and_libraries_2017.4.181/mac/mkl/lib/libmkl_intel_lp64.dylib
Abort trap: 6

What does this usually imply? I tried to add the directory containing .dylib files into $PATH, but it didn't help.

Thread Topic:

Question

↧

config_number_of_transforms.c example typo

June 6, 2017, 6:09 am

There is a typo in the config_number_of_transforms.c example (https://software.intel.com/en-us/mkl-developer-reference-c-config_number_of_transforms). "DFTI_INPUT_DISTANCE" on line 107 (inside the commented code) should be "DFTI_OUTPUT_DISTANCE".

↧

Trust Region Size Parameter Choice

June 6, 2017, 6:36 am

Over this period, we have developed a calculation method that uses the MKL Trust Region API (with constraints).
We have built a procedure that works with this TR algorithm.
The procedure is called 4 times. Each time it is called, it uses the previous results as input, and the trust region size parameter changes accordingly: 100, 10, 1, 0.1.

We know there is no established criterion for choosing the "region size parameter", so we decided to adopt this rule.

Is this choice correct, or is it enough to always use a single value?

Maybe the order of these values changes something!
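
The staged procedure described above can be sketched roughly as follows. This is only an illustration of the calling pattern, not the poster's code: `StageSolver` and `run_staged` are hypothetical names, and the callable stands in for whatever wraps the MKL bound-constrained TR solver. It says nothing about whether the 100, 10, 1, 0.1 schedule is a good choice.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical signature: takes a starting point and an initial trust-region
// size, and returns the refined point. In the poster's setting this would wrap
// the MKL TR solver; here it is just a callable.
using StageSolver =
    std::function<std::vector<double>(const std::vector<double>&, double)>;

// Run the solver once per trust-region size, feeding each result into the
// next run (warm start).
std::vector<double> run_staged(const StageSolver& solve,
                               std::vector<double> x,
                               const std::vector<double>& sizes) {
    assert(static_cast<bool>(solve) && "solver must be callable");
    for (double rs : sizes) {
        x = solve(x, rs);  // previous result becomes the new starting point
    }
    return x;
}
```

Calling `run_staged(solve, x0, {100.0, 10.0, 1.0, 0.1})` reproduces the four calls; swapping in a single-element list makes it easy to compare against using only one value.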

Thank you very much

Gianluca

↧

KMP_AFFINITY

June 6, 2017, 11:16 am

Hi,

Does anyone know how I can change KMP_AFFINITY internally for a sub-process invoked from my program? My experiment shows that it does not work with the Intel compiler, but it does with gcc.

Here is the example:

Let's say I have KMP_AFFINITY=scatter for my main process. Then, inside the main process, before invoking another executable as a sub-process, putenv is used to set KMP_AFFINITY=none for the sub-process.

Is this supposed to work? My run shows that KMP_AFFINITY=none does not apply to the sub-process if the Intel compiler is used to compile and link my main program, but it does with gcc.

When I double-check the environment in the sub-process, there is one extra environment variable when my executable is built with the Intel compiler:

__KMP_REGISTERED_LIB_23907=0xacfa1d0-cafe8af0-libiomp5.a

What does this variable do, and how can this difference be explained? Thank you.
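
To check what a sub-process actually inherits, independent of any OpenMP runtime, a minimal sketch along these lines may help. It assumes a POSIX system (`setenv`, `popen`); the function name `child_sees_affinity` is made up for illustration, and the child is a trivial shell command standing in for the real executable. It does not explain the `__KMP_REGISTERED_LIB_...` variable.

```cpp
#include <cassert>
#include <stdio.h>   // popen, pclose, fgets (POSIX)
#include <stdlib.h>  // setenv (POSIX)
#include <string>

// Set KMP_AFFINITY in this process, then launch a child through the shell and
// return the value the child sees.
std::string child_sees_affinity(const char* value) {
    assert(value != nullptr);
    // Overwrite any existing value; children started afterwards inherit it.
    if (setenv("KMP_AFFINITY", value, 1) != 0) return "";
    // The child just prints its own view of the variable.
    FILE* pipe = popen("printf '%s' \"$KMP_AFFINITY\"", "r");
    if (!pipe) return "";
    std::string out;
    char buf[128];
    while (fgets(buf, sizeof buf, pipe)) out += buf;
    pclose(pipe);
    return out;
}
```

If the child reports the new value here but the real sub-process still behaves as if it were pinned, the difference is presumably in how the runtime in the child (or the CPU mask it inherits from the parent) is handled rather than in the environment itself; that part is a guess and would need to be confirmed.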

Hongwei

↧

MKL library

May 22, 2017, 7:13 am

Sir

I have to install a code. It requires linking against the LAPACK and BLAS libraries.
The code was written in 2009 using MKL version 8. According to it, the link paths are:

LROOT = /opt/intel/mkl/lib/intel64/
LAPACK = -lmkl_lapack -lmkl
BLAS = -L$(LROOT) -lmkl_intel64 -lguide -lpthread

LFLAGS = $(LIBSCE) $(BLAS) $(LAPACK)

Now I have the 2016 version of MKL. It does not have guide, mkl, pthread, etc.
I know that
-lmkl_lapack is replaced by -lmkl_lapack95_ilp64

How do I modify these commands to link and compile with the 2016 version?

Thanks

Thread Topic:

How-To

↧

is the implemented Trust Region method deterministic?

June 9, 2017, 12:44 am

In this case results cannot change if the initial conditions and constraints are the same ...

This is what happens in our implementation.

Thank you

Gianluca

↧

Direct Sparse Solver for Clusters Crash when using MPI Nested Dissection Algorithm

June 10, 2017, 11:39 am

I have a code that calls the Direct Sparse Solver for Clusters interface. I get an error when I run it with the option for MPI-based nested dissection. Documentation can be found here: https://software.intel.com/en-us/mkl-developer-reference-c-cluster-sparse-solver-iparm-parameter

When I have iparm[39] set to 3, everything works fine.

When I set it to 10, I get no errors, no warnings, and no output, even with msglvl set to one. I assume this is because the system is crashing really hard.

I am using the 64-bit interface of the solver, and not using the MPI-based dissection is not an option (my matrix has 50 billion non-zero elements and is 12 billion by 12 billion). I am using the latest version of the MKL cluster library.

I just spent two weeks modifying the code to remove overlaps in the matrix elements to use this feature.

What is going wrong?

Thread Topic:

Bug Report

↧

Small matrix speed optimization

June 11, 2017, 12:29 pm

Hello all,

Since I can now run the MKL 2017 library, I have a couple of follow-up questions that hopefully deserve a thread of their own. I am doing some mode matching, and consequently I need to invert matrices of order 10x10 up to 50x50 most of the time (the maximum size would be somewhere around 200x200, but very rarely; they will almost exclusively be in the 10-50 range). I have optimized the non-MKL parts of my code so that they take under 10% of the total simulation time, so any speedup of the MKL functions would be greatly beneficial, if possible. I have set mkl_num_threads to max, and I build in release mode, ia32, with O2 optimization, optimized for speed, and so forth, to make it as fast as I currently know how. I only have a couple of matrices to invert per frequency point (5 to 6), and the code must execute one frequency point at a time. My questions are as follows:

1) Is there a way to improve the performance of the MKL functions (by setting some flags in the program itself or in Visual Studio, by using functions better suited to this situation, or by something else entirely), either in MKL 2017 or in the old MKL 10.0.012 that I have? I am using cblas_zgemm, cblas_zdscal, zgetri, zgetrf, vzSqrt, and cblas_zaxpy, but most of the time is spent in matrix inversion, that is, in zgetri and zgetrf. Are there better functions than these, or can I set some additional flags to make them faster?

2) Since the optimization is for speed and not size, I of course expected the output (exe) to be bigger, but can someone explain and/or help me reduce the exe size with MKL 2017? It is 3x bigger than with the old MKL 10.0.012. Unfortunately, this is a small problem for me, and I would be very happy if it could be mitigated in any way (without optimizing for size).

3) Some of the matrices are symmetric, and I was hoping that with the symmetric counterparts of zgetrf and zgetri, that is, zsytrf and zsytri, I would theoretically get a 2x speedup, but for some reason the speed is the same. Is this expected? Are my matrices too small for any noticeable effect? Am I missing something? In both cases I feed the functions the full matrix to invert. While debugging, I can see that only half of the matrix elements are calculated by the symmetric functions, and I fill in the symmetric elements myself, but there is no speedup.

Any information, even if not good is welcome. Thank you all in advance.

Thread Topic:

How-To

↧

Benchmarking MKL GEMV

June 13, 2017, 1:16 am

Hello,

I am trying to compare my own implementation of GEMV with the MKL. For benchmarking I use the following code:

size_t M = 64; // rows
size_t N = 2; // columns

// allocate memory
float *matrix = (float*) mkl_malloc(M*N * sizeof(float), 64);
float *vector = (float*) mkl_malloc(N   * sizeof(float), 64);
float *result = (float*) mkl_malloc(M   * sizeof(float), 64);

// execute warm up calls
for (size_t i = 0; i < NUM_WARMUPS; ++i) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N,
                vector, 1,
                0.0f,
                result, 1);
}

// measure runtime
float avg_runtime = 0;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    auto start = dsecnd();
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N,
                vector, 1,
                0.0f,
                result, 1);
    auto end = dsecnd();
    float runtime = (end - start) * 1000;

    avg_runtime += runtime;
}
avg_runtime /= NUM_EVALUATIONS;
std::cout << "avg_runtime: "<< avg_runtime << std::endl;

// free buffers
mkl_free(matrix);
mkl_free(vector);
mkl_free(result);

On my system this gives me an average runtime of around 0.0003ms with the first evaluation taking around 0.002ms. Because the average seemed really fast, even for the small input size, I printed the runtimes of all 200 evaluations to make sure my calculation of the average value was correct. If I add a

std::cout << runtime << std::endl;

in line 29 (the blank line right after the runtime is computed), the measured runtimes are way higher, and every one of the 200 evaluations takes around 0.002ms. This seems more plausible compared to other libraries and my own implementation.

It seems like the compiler does some optimization to my code and notices that I call the routine with the exact same input multiple times. Can anyone confirm this? What is the suggested way of benchmarking MKL routines?
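
One way to make the measurement less sensitive to timer overhead and resolution is to time a whole batch of calls and divide, rather than bracketing every single call. The sketch below is a generic helper under that assumption; `average_ms` is a hypothetical name, it uses `std::chrono::steady_clock` instead of `dsecnd`, and it does not confirm or refute the compiler-optimization hypothesis. Another possible reading of the numbers is that printing between calls disturbs the caches, so each call is measured "cold".

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Average time per call in milliseconds, measured over one timed batch so that
// timer overhead and resolution are amortized over `reps` calls.
template <typename Kernel>
double average_ms(Kernel&& kernel, std::size_t warmups, std::size_t reps) {
    assert(reps > 0);
    for (std::size_t i = 0; i < warmups; ++i) kernel();  // untimed warm-up

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < reps; ++i) kernel();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / static_cast<double>(reps);
}
```

With the buffers from the listing above, the kernel would be a lambda wrapping the `cblas_sgemv` call, for example `average_ms([&] { cblas_sgemv(/* same arguments as above */); }, NUM_WARMUPS, NUM_EVALUATIONS)`.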

Thanks in advance!

Thread Topic:

Question

↧

TR solver question

June 14, 2017, 7:49 am

When the solver returns RCI_Request = 2 to calculate the Jacobian, can I assume that x has not changed since the previous calculation of the function value?

It would be a great performance boost if I could, because my calculations of the function value and the Jacobian are not separable, and I could just use my stored Jacobian.
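
Without relying on any guarantee from the solver, the stored Jacobian can be reused safely by remembering the x it was computed for and comparing. The following is a minimal sketch of that idea; `JointEvalCache` and the `Eval` signature are hypothetical and not part of MKL.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Caches the (value, Jacobian) pair for the most recent x, so the joint
// computation runs again only when x differs from the stored point.
class JointEvalCache {
public:
    using Vec = std::vector<double>;
    // Hypothetical user routine: fills f and jac for a given x in one pass.
    using Eval = std::function<void(const Vec& x, Vec& f, Vec& jac)>;

    explicit JointEvalCache(Eval eval) : eval_(std::move(eval)) {
        assert(static_cast<bool>(eval_));
    }

    const Vec& value(const Vec& x) { refresh(x); return f_; }
    const Vec& jacobian(const Vec& x) { refresh(x); return jac_; }
    int evaluations() const { return count_; }

private:
    void refresh(const Vec& x) {
        if (valid_ && x == x_) return;  // exactly the same point: reuse
        eval_(x, f_, jac_);
        x_ = x;
        valid_ = true;
        ++count_;
    }

    Eval eval_;
    Vec x_, f_, jac_;
    bool valid_ = false;
    int count_ = 0;
};
```

In the RCI loop, the function-value request would call `value(x)` and the Jacobian request would call `jacobian(x)`; if x is indeed unchanged between the two, the second call costs only a vector comparison.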

↧

Struggling to get Automatic Off load working with MIC/MKL 2017

June 14, 2017, 12:27 pm

I have a MIC card in a Microway Xeon workstation, which seems to be functioning as expected (see the micinfo debug output).

After updating to MKL 2017 Update 3, I am struggling to get Automatic Offload (AO) to work. I created a simple DGEMM test program and have been calling DGEMM with square matrix sizes up to 16384, but cannot get AO to "kick in".

In prior versions of MKL, I could see AO working at sizes of about 4096x4096 on this same machine.

The following env vars are set.

MKL_MIC_ENABLE=1
OFFLOAD_REPORT=2
MKL_MIC_DISABLE_HOST_FALLBACK=1
MIC_LD_LIBRARY_PATH=C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2017\windows\mkl\lib\intel64_win_mic

>micinfo
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.

Created Wed Jun 14 11:34:46 2017

System Info
HOST OS : Windows
OS Version : Microsoft Windows 7 Professi
Driver Version : 3.3.30726.0
MPSS Version : 3.3.30726.0
Host Physical Memory : 32709 MB

Device No: 0, Device Name: mic0

Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.3
Device Serial Number : ADKC32800563

Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : Enabled
SMC HW Revision : Product 300W Active CS

Cores
Total No of Active Cores : 57
Voltage : 1039000 uV
Frequency : 1100000 kHz

↧

how to deal with real and structurally symmetric matrix's parameters in PARDISO

June 18, 2017, 8:33 pm

I want to use PARDISO to solve a problem in which the matrix is real and structurally symmetric. I have read the PARDISO Version 5.0.0 Reference Sheet - Fortran, and the Parallel Sparse Direct Solver PARDISO | User Guide Version 5.0.0. I have also studied some code that solves problems with symmetric or non-symmetric matrices. I know the parameter mtype needs to be 1, since my matrix is real and structurally symmetric. But I do not know how to deal with the other parameters.

Any help would be appreciated.

Regards,
rf.qian

Thread Topic:

How-To

↧

Intel MKL DftiComputeForward how to get full transform matrix from CCE format in C

June 21, 2017, 1:23 am

I'm trying to implement a 2-dimensional Fourier transform using the MKL FFT functions.

I'm interested in transforming from the space domain (i.e., my input signal is a 2D MxN matrix of `double`s) to the frequency domain (i.e., a 2D MxN output matrix of double-precision complex values, `MKL_Complex16`) and then back to the space domain after some filtering.

Based on the examples provided with Intel's MKL (i.e., basic_dp_real_dft_2d.c etc.), I've created the following MATLAB-like function:

    bool fft2(double *in, int m, int n, MKL_Complex16 *out) {
      bool ret(false);
      DFTI_DESCRIPTOR_HANDLE hand(NULL);
      MKL_LONG dim[2] = {m, n};
      if(!DftiCreateDescriptor(&hand, DFTI_DOUBLE, DFTI_REAL, 2, dim)) {
        if(!DftiSetValue(hand, DFTI_PLACEMENT, DFTI_NOT_INPLACE)) {
          if(!DftiSetValue(hand, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)) {
            MKL_LONG rs[3] = {0, n, 1};
            if(!DftiSetValue(hand, DFTI_INPUT_STRIDES, rs)) {
              MKL_LONG cs[3] = {0, n / 2 + 1, 1};
              if(!DftiSetValue(hand, DFTI_OUTPUT_STRIDES, cs)) {
                if(!DftiCommitDescriptor(hand)) {
                  ret = !DftiComputeForward(hand, in, out);
                }
              }
            }
          }
        }
      }
      DftiFreeDescriptor(&hand);
      return ret;
    }

Because I want to do some DSP work (e.g., Gaussian filtering), I have to do matrix multiplications, so I want the full transformation matrix instead of the CCE-format matrix that DftiComputeForward outputs.

**How can I reconstruct the full transformation matrix of an arbitrarily sized 2D signal (i.e., matrix) from the CCE-format matrix that I get as output from the DftiComputeForward function?**

For example if I have the following 2D real signal:

0.1, 0.2, 0.3
0.4, 0.5, 0.6
0.7, 0.8, 0.9

Its full transformation matrix would be:

4.5 + 0j, -0.45 + 0.259808j, -0.45 - 0.259808j
-1.35 + 0.779423j, 0 - 0j, 0 - 0j
-1.35 - 0.779423j, 0 + 0j, 0 + 0j

However the result from `DftiComputeForward` in CCE is:

4.5 + 0j, -0.45 + 0.259808j, -1.35 + 0.779423j,
0 - 0j, -1.35 - 0.779423j, 0 + 0j,
0 + 0j, 0 + 0j, 0 + 0j
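
Since the DFT of a real signal is conjugate-even, X[i][j] = conj(X[(M-i) mod M][(N-j) mod N]), the missing columns can be filled in from the stored ones. The sketch below assumes the layout implied by the output strides in the function above: M rows of N/2+1 complex values each, stored contiguously in row-major order. It uses `std::complex<double>` rather than `MKL_Complex16`, and `expand_half_spectrum` is a hypothetical helper name.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Expand an m x (n/2+1) row-major half spectrum into the full m x n spectrum
// of a real 2D signal, using X[i][j] == conj(X[(m-i)%m][(n-j)%n]).
std::vector<cplx> expand_half_spectrum(const std::vector<cplx>& half,
                                       std::size_t m, std::size_t n) {
    const std::size_t nh = n / 2 + 1;  // stored columns per row
    assert(half.size() >= m * nh);
    std::vector<cplx> full(m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (j < nh) {
                full[i * n + j] = half[i * nh + j];  // stored directly
            } else {
                const std::size_t mi = (m - i) % m;  // mirrored row
                const std::size_t mj = n - j;        // mirrored column, < nh
                full[i * n + j] = std::conj(half[mi * nh + mj]);
            }
        }
    }
    return full;
}
```

Reading the first six values of the CCE output above as a 3 x 2 array and passing them with m = n = 3 gives back the 3 x 3 matrix listed as the expected full transform (up to the sign of zero in the imaginary parts).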

↧

syrk mkl armadillo wrong output

June 22, 2017, 2:48 pm

I have written the following simple program for syrk using armadillo (arma.sourceforge.net).

Environment : Rhea from OLCF. https://www.olcf.ornl.gov/computing-resources/rhea/

MKL : Tried with version 16 and 17. Problem occurs in both.

#define ARMA_DONT_USE_WRAPPER
#define ARMA_USE_BLAS

#include <iostream>
#include <armadillo>
using namespace std;
using namespace arma;
int main() {
  int m = 10000;
  int n = 50;
  fmat A;
  A.load("H_init.csv");
  cout << "A::"<< A.n_rows << "x"<< A.n_cols << endl;
  fmat AtA = arma::zeros<fmat>(n, n);
  AtA = A.t() * A;
  cout << "AtA "<< endl;
  cout << max(max(AtA)) << " " << min(min(AtA)) << " " << norm(AtA, "fro") << endl;
  return 0;
}

I have also attached the H_init.csv file. I compile using the following three procedures.

Compilation 1 (gcc with mkl): Based on the article https://software.intel.com/en-us/articles/a-new-linking-model-single-dyn.... I compile as "g++ hth.cpp -o hth -O2 -I ~/armadillo-7.800.1/include/ -fopenmp -lmkl_rt"

Compilation 2 (with icc and mkl): icc hth.cpp -o hth -O2 -I ~/armadillo-7.800.1/include/ -fopenmp -mkl

Compilation 3 (gcc with intel linker recommendation): g++ hth.cpp -o hth -O2 -I ~/armadillo-7.800.1/include/ -fopenmp -lmkl_rt -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

Compilation 1 produces wrong output for HtH. Compilations 2 and 3 work fine. In fact, in the case of compilation 3, even the ordering of the libraries appears to matter: if the ordering is changed a bit, it produces wrong output.

Output of compile 1: (Wrong)

A::10000x50

AtA

24044.6 3697.91 350951

Output out of compile 2: (Right)

A::10000x50

AtA

3436.05 2437.46 126222

Output out of compile 3: (Right)

A::10000x50

AtA

3436.05 2437.46 126222

Am I making any mistake on this? Can't I link w/ gcc using -lmkl_rt?

Attachment	Size
Download H_init.tar.gz	4.22 MB

Thread Topic:

Bug Report

↧

.NET Memory Usage - MKL under .NET

June 23, 2017, 12:28 am

As every .NET developer knows, memory usage is managed by the Garbage Collector. This layer determines when memory is released and how to reorganize it. It allocates space for each thread separately and avoids conflicts.

Because of this, we programmers often don’t know in detail what really happens at this level.

In general, this is enough, because the GC was built to let developers concentrate on higher levels.

But sometimes, especially when you pass pointers to memory blocks, such as arrays, to API functions, and specifically to MKL API functions, it is important to know what happens behind the scenes.

IntPtr x = new IntPtr(0);
double[] x_init = null;
x = mkl_malloc(sizeof(double) * n, 64);
Marshal.Copy(x_init, 0, x, n);
//use x pointer …
mkl_free(ref x);

This set of instructions defines a memory space for an array of doubles and assigns a memory pointer to x. This pointer x is then passed to a function like this:

[DllImport("mkl_rt.dll", CallingConvention = CallingConvention.Cdecl, ExactSpelling = true, SetLastError = false)]
        internal static extern int dtrnlspbc_init(
           ref IntPtr handle,
           ref int n,
           ref int m,
           IntPtr x,
           IntPtr LW,
           IntPtr UP,
           double[] eps,
           ref int iter1,
           ref int iter2,
           ref double rs
        );

Everything seems to work, but this is a very subtle feeling!

Yes, because memory is in the heap area managed by the GC and can be moved and reorganized. This means that your pointer x is not reliable.

We spent a lot of time fighting strange results, sometimes good, sometimes not. Where was the trick? Was it Intel’s fault or our fault? I don’t like the word “fault”, but the question was: where is the issue?

In the end, we got the solution!

It was: pin the pointer with the correct syntax and methods.

GCHandle x_handle = GCHandle.Alloc(x_init, GCHandleType.Pinned);
x = x_handle.AddrOfPinnedObject();
//use x pointer …
x_handle.Free();

Now the GC can’t move the array, so the memory used by the API function always stays where the pointer says it is.

I hope these notes are useful for the poor developer, always alone in an ocean of troubles.

Gianluca

↧

Direct Sparse Solver for Clusters - Pardiso Memory Allocation Error

June 25, 2017, 12:16 pm

Hi,

Once again having some trouble with the Direct Sparse Solver for clusters. I am getting the following error when running on a single process

entering matrix solver
*** Error in PARDISO  (     insufficient_memory) error_num= 1
*** Error in PARDISO memory allocation: MATCHING_REORDERING_DATA, allocation of 1 bytes failed
total memory wanted here: 142 kbyte

=== PARDISO: solving a real structurally symmetric system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000005 s
Time spent in reordering of the initial matrix (reorder)         : 0.000000 s
Time spent in symbolic factorization (symbfct)                   : 0.000000 s
Time spent in allocation of internal data structures (malloc)    : 0.000465 s
Time spent in additional calculations                            : 0.000080 s
Total time spent                                                 : 0.000550 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
             number of equations:           6
             number of non-zeros in A:      8
             number of non-zeros in A (%): 22.222222

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    0
             size of largest supernode:               0
             number of non-zeros in L:                0
             number of non-zeros in U:                0
             number of non-zeros in L+U:              0

ERROR during solution: 4294967294

It just hangs when running on a single process. Below is the CSR format of my matrix and the provided RHS to solve for:

CSR row values
0
2
6
9
12
16
18

CSR col values
0
1
0
1
2
3
1
2
4
1
3
4
2
3
4
5
4
5

Rank 0 rhs vector :
1
0
0
0
0
1
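
A quick consistency check of the index arrays can help separate input problems from solver problems. The sketch below is a hypothetical helper (`check_csr` is not an MKL function); it assumes that column indices must be strictly increasing within each row, and `base` selects zero- or one-based indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns an empty string if the CSR index arrays are consistent, otherwise a
// short description of the first problem found. `base` is 0 or 1.
std::string check_csr(const std::vector<long long>& row_ptr,
                      const std::vector<long long>& col_idx,
                      long long n_cols, long long base) {
    assert(base == 0 || base == 1);
    if (row_ptr.empty()) return "row pointer array is empty";
    if (row_ptr.front() != base)
        return "first row pointer does not equal the index base";
    if (row_ptr.back() - base != static_cast<long long>(col_idx.size()))
        return "last row pointer does not match the number of stored entries";
    for (std::size_t r = 0; r + 1 < row_ptr.size(); ++r) {
        if (row_ptr[r] > row_ptr[r + 1] || row_ptr[r + 1] > row_ptr.back())
            return "row pointers are not non-decreasing";
        for (long long k = row_ptr[r] - base; k < row_ptr[r + 1] - base; ++k) {
            const std::size_t idx = static_cast<std::size_t>(k);
            const long long c = col_idx[idx] - base;
            if (c < 0 || c >= n_cols) return "column index out of range";
            if (k > row_ptr[r] - base && col_idx[idx - 1] >= col_idx[idx])
                return "column indices not strictly increasing within a row";
        }
    }
    return "";
}
```

The arrays listed above pass this check when read as zero-based (`base = 0`) and fail it when read as one-based (`base = 1`), which may be worth comparing against the "1-based array indexing is turned ON" line in the solver log.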

Now my calling file looks like:

void SolveMatrixEquations(MKL_INT numRows, MatrixPointerStruct &cArrayStruct, const std::pair<MKL_INT,MKL_INT>& rowExtents)
{

	double pressureSolveTime = -omp_get_wtime();

	MKL_INT mtype = 1;  /* set matrix type to "real structurally symmetric" */
	MKL_INT nrhs = 1;  /* number of right hand sides. */

	void *pt[64] = { 0 }; //internal memory Pointer

						  /* Cluster Sparse Solver control parameters. */
	MKL_INT iparm[64] = { 0 };
	MKL_INT maxfct, mnum, phase=13, msglvl, error;

	/* Auxiliary variables. */
	float   ddum; /* float dummy   */
	MKL_INT idum; /* Integer dummy. */
	MKL_INT i, j;

	/* -------------------------------------------------------------------- */
	/* .. Init MPI.                                                         */
	/* -------------------------------------------------------------------- */

	int     mpi_stat = 0;
	int     comm, rank;
	mpi_stat = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	comm = MPI_Comm_c2f(MPI_COMM_WORLD);

	/* -------------------------------------------------------------------- */
	/* .. Setup Cluster Sparse Solver control parameters.                                 */
	/* -------------------------------------------------------------------- */
	iparm[0] = 0; /* Solver default parameters overridden with provided by iparm */
	iparm[1] =3; /* Use METIS for fill-in reordering */
	//iparm[1] = 10; /* Use parMETIS for fill-in reordering */
	iparm[5] = 0; /* Write solution into x */
	iparm[7] = 2; /* Max number of iterative refinement steps */
	iparm[9] = 8; /* Perturb the pivot elements with 1E-13 */
	iparm[10] = 0; /* Don't use non-symmetric permutation and scaling MPS */
	iparm[12] = 0; /* Switch on Maximum Weighted Matching algorithm (default for non-symmetric) */
	iparm[17] = 0; /* Output: Number of non-zeros in the factor LU */
	iparm[18] = 0; /* Output: Mflops for LU factorization */
	iparm[20] = 0; /*change pivoting for use in symmetric indefinite matrices*/
	iparm[26] = 1;
	iparm[27] = 0; /* Single precision mode of Cluster Sparse Solver */
	iparm[34] = 1; /* Cluster Sparse Solver use C-style indexing for ia and ja arrays */

	iparm[39] = 2; /* Input: matrix/rhs/solution stored on master */
	iparm[40] = rowExtents.first+1;
	iparm[41] = rowExtents.second+1;
	maxfct = 3; /* Maximum number of numerical factorizations. */
	mnum = 1; /* Which factorization to use. */
	msglvl = 1; /* Print statistical information in file */
	error = 0; /* Initialize error flag */
	//cout << "Rank "<< rank << ": "<< iparm[40] << ""<< iparm[41] << endl;
#ifdef UNIT_TESTS
	//msglvl = 0;
#endif




	phase = 11;
	#ifndef UNIT_TESTS
	if (rank == 0)printf("Restructuring system...\n");
	cout << "Restructuring system...\n"<<endl;;
	#endif

	cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase,&numRows, &ddum, cArrayStruct.rowIndexArray, cArrayStruct.colIndexArray, &idum, &nrhs, iparm, &msglvl,&ddum, &ddum, &comm, &error);
	if (error != 0)
	{
		cout << "\nERROR during solution: "<< error << endl;
		exit(error);
	}


	phase = 23;

#ifndef UNIT_TESTS
//	if (rank == 0) printf("\nSolving system...\n");
	printf("\nSolving system...\n");
#endif

	cluster_sparse_solver_64(pt, &maxfct, &mnum, &mtype, &phase,&numRows, cArrayStruct.valArray, cArrayStruct.rowIndexArray, cArrayStruct.colIndexArray, &idum, &nrhs, iparm, &msglvl,
		cArrayStruct.rhsVector, cArrayStruct.pressureSolutionVector, &comm, &error);
	if (error != 0)
	{
		cout << "\nERROR during solution: "<< error << endl;
		exit(error);
	}

	phase = -1; /* Release internal memory. */
	cluster_sparse_solver_64(pt, &maxfct, &mnum, &mtype, &phase,&numRows, &ddum, cArrayStruct.rowIndexArray, cArrayStruct.colIndexArray, &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &comm, &error);
	if (error != 0)
	{
		cout << "\nERROR during release memory: "<< error << endl;
		exit(error);
	}
	/* Check residual */

	pressureSolveTime += omp_get_wtime();


#ifndef UNIT_TESTS
	//cout << "Pressure Solve Time: "<< pressureSolveTime << endl;
#endif

	//TestPrintCsrMatrix(cArrayStruct,rowExtents.second-rowExtents.first +1);
}

This is based on the format of one of the examples. I am trying to use the ILP64 interface because my example system is very large (16 billion non-zeros). I am using the Intel C++ compiler 2017 as part of the Intel Composer XE Cluster Edition Update 1. I am using the following link lines in my CMake files:

TARGET_COMPILE_OPTIONS(${MY_TARGET_NAME} PUBLIC "-mkl:cluster" "-DMKL_ILP64" "-I$ENV{MKLROOT}/include")
TARGET_LINK_LIBRARIES(${MY_TARGET_NAME} "-Wl,--start-group $ENV{MKLROOT}/lib/intel64/libmkl_intel_ilp64.a $ENV{MKLROOT}/lib/intel64/libmkl_intel_thread.a $ENV{MKLROOT}/lib/intel64/libmkl_core.a $ENV{MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl")

What is interesting is that this same code runs perfectly fine on my Windows development machine. Porting it to my Linux cluster is causing issues. Any ideas?

I am currently awaiting the terribly long download for the update 4 Composer XE package. But I don't have much hope of that fixing it because this code used to run fine on this system.

↧

HyperThreading and CPU usage

June 25, 2017, 6:33 am

Hi everyone,

I tried LAPACKE_dgels and did not change any thread-number settings at all. I guess the default thread number (the same as the number of physical cores) is used. As I watch the CPU usage while the code runs, it peaks at 50%. I guess that means using 50% of the CPU makes the calculation run as fast as it can, and using more than 50% of the CPU through hyper-threading would only slow it down? Do I understand this correctly?
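
The 50% figure is what simple arithmetic predicts if one thread runs per physical core while the OS counts every logical processor. The sketch below only illustrates that arithmetic; the function name is made up, and the 4-core, 8-logical-processor machine in the usage note is an assumed example, not the poster's system.

```cpp
#include <cassert>

// Percentage an OS monitor would show if `busy_threads` threads each keep one
// logical processor fully busy and everything else is idle.
double expected_cpu_percent(unsigned busy_threads, unsigned logical_cpus) {
    assert(logical_cpus > 0);
    if (busy_threads > logical_cpus) busy_threads = logical_cpus;
    return 100.0 * busy_threads / logical_cpus;
}
```

For example, `expected_cpu_percent(4, 8)` gives 50. A 50% reading is therefore consistent with one busy thread per physical core; by itself it does not show whether additional threads would speed the computation up or slow it down.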

Thread Topic:

Question

↧

How can I interrupt / abort LAPACK and BLAS methods which do not support callback ?

June 26, 2017, 6:43 am

I'm computing some SVDs and other time-consuming things using the mkl C libraries.

I've found that some methods implement a progress callback (https://software.intel.com/en-us/mkl-developer-reference-c-mkl-progress), but that does not seem to be the case for the calls I'm interested in (_gesvd, _gesdd, _gemm, _imatcopy).

Is there a clean workaround for this issue? Ideally I would like to be able to post "progress", but at the very least I would like to be able to cleanly "abort" the computation through user interaction.
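
For operations that can be split into independent pieces (for example, a large product computed block by block), one possible workaround is cooperative cancellation between pieces. This is a sketch of that pattern only; `run_cancellable` is a hypothetical helper, it does not interrupt a call that is already running, and it does not apply to routines such as the SVD that cannot be split this way.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>

// Runs `chunk(i)` for i = 0..n_chunks-1, reporting progress after each chunk
// and stopping early once `cancel` is set. Returns the number of chunks done.
std::size_t run_cancellable(
    std::size_t n_chunks,
    const std::function<void(std::size_t)>& chunk,
    const std::function<void(std::size_t, std::size_t)>& progress,
    const std::atomic<bool>& cancel) {
    assert(static_cast<bool>(chunk));
    std::size_t done = 0;
    while (done < n_chunks && !cancel.load()) {
        chunk(done);  // e.g. one block of a larger product (hypothetical)
        ++done;
        if (progress) progress(done, n_chunks);
    }
    return done;
}
```

The UI thread would set the `cancel` flag, and the worker thread running `run_cancellable` stops after the current chunk finishes, so the latency of an abort is bounded by the cost of one chunk.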

Zone:

Windows*

Thread Topic:

Question

↧