I studied the example cl_solver_sym_sp_0_based_c.c in cluster_sparse_solverc/source. I compiled it with:
make libintel64 example=cl_solver_sym_sp_0_based_c
It runs fine. However, the matrix in the example is too small to look at performance, so I modified the example to read a 3,000,000 x 3,000,000 matrix from text files. When I run it with 24 CPUs (1 host), it factors the matrix in 30 seconds. When I run it with 48 CPUs (2 hosts), it factors it in 20 seconds. This is great! But when I run it with 72 or more CPUs, I keep getting this after the reordering stage:
Reordering completed ...
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 106418 RUNNING AT cforge201
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
==================================================================================
The command I am using is:
mpirun -np 3 -machinefile ./hostfile ./cl_solver_sym_sp_0_based_c.exe
Where hostfile contains:
cforge200:1
cforge201:1
cforge202:1
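For context, my changes are only on the data-input side; the MPI handling follows the stock example's pattern: initialize MPI, convert MPI_COMM_WORLD to a Fortran handle with MPI_Comm_c2f, and pass that handle to cluster_sparse_solver in every phase. A trimmed sketch of that skeleton (data setup and the phase loop are elided; my full modified source is linked below):

#include <mpi.h>
#include "mkl_types.h"
#include "mkl_cluster_sparse_solver.h"

int main(int argc, char *argv[])
{
    void   *pt[64]    = { 0 };     /* solver internal memory, must start zeroed   */
    MKL_INT iparm[64] = { 0 };
    MKL_INT maxfct = 1, mnum = 1, mtype = -2;   /* -2 = real symmetric indefinite */
    MKL_INT phase, n = 0, nrhs = 1, msglvl = 1, error = 0, idum = 0;
    int     comm, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    comm = MPI_Comm_c2f(MPI_COMM_WORLD);        /* Fortran handle expected by MKL */

    /* rank 0 reads ia/ja/a/b and sets n; iparm[34] = 1 selects 0-based indexing  */

    phase = 11;                                 /* reordering + symbolic phase    */
    /* cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
       &idum, &nrhs, iparm, &msglvl, b, x, &comm, &error);
       ... then phase = 22 (factorization), 33 (solve), -1 (release memory) ...   */

    MPI_Finalize();
    return 0;
}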
Here are my files, in case anyone wants to check whether the issue is reproducible:
cl_solver_sym_sp_0_based_c.c - edit all the occurrences of *.txt to the paths where the files are on your system:
https://www.dropbox.com/s/ndkzi9zojxuh1xo/cl_solver_sym_sp_0_based_c.c?dl=0
ia, ja, a, and b data in text files:
https://www.dropbox.com/s/3dkhbillyso03kc/ia_ja_a_b_data.tar.gz?dl=0
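If the file format matters for reproducing, this is roughly the kind of loader I'm using (a sketch, not a copy of my source): it assumes whitespace-separated ASCII values, one array per file, with n and nnz known up front. Values are shown as float to match the single-precision data in the sp example; switch to double if you are using double-precision data.

#include <stdio.h>
#include <stdlib.h>
#include "mkl_types.h"

/* Reads 'count' integers from a whitespace-separated text file. */
static MKL_INT *read_int_file(const char *path, long long count)
{
    FILE *f = fopen(path, "r");
    MKL_INT *v = (MKL_INT *) malloc(count * sizeof(MKL_INT));
    if (!f || !v) { fprintf(stderr, "cannot open/allocate %s\n", path); exit(1); }
    for (long long i = 0; i < count; i++) {
        long long tmp;
        if (fscanf(f, "%lld", &tmp) != 1) { fprintf(stderr, "short read in %s\n", path); exit(1); }
        v[i] = (MKL_INT) tmp;
    }
    fclose(f);
    return v;
}

/* Reads 'count' single-precision values from a whitespace-separated text file. */
static float *read_float_file(const char *path, long long count)
{
    FILE *f = fopen(path, "r");
    float *v = (float *) malloc(count * sizeof(float));
    if (!f || !v) { fprintf(stderr, "cannot open/allocate %s\n", path); exit(1); }
    for (long long i = 0; i < count; i++)
        if (fscanf(f, "%f", &v[i]) != 1) { fprintf(stderr, "short read in %s\n", path); exit(1); }
    fclose(f);
    return v;
}

/* Illustrative use on rank 0 (file names and sizes are placeholders):
     MKL_INT *ia = read_int_file("ia.txt", n + 1);     0-based row pointers
     MKL_INT *ja = read_int_file("ja.txt", nnz);       0-based column indices
     float   *a  = read_float_file("a.txt", nnz);      matrix values
     float   *b  = read_float_file("b.txt", n);        right-hand side        */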
I'm curious what kind of performance improvement you get when running with MPI on 12, 24, 48, and 72 CPUs!