Dear experts,
I'm trying to use the Intel cluster sparse solver to solve a large system of symmetric complex equations. My code behaves strangely: depending on the number of MPI processes, it either gives the correct solution or crashes.
My cluster consists of two nodes; each node has an Intel i7-5820K CPU with 6 cores and 128 GB of memory. The OS is openSUSE Leap 42.3. The compiler is Intel Parallel Studio, Cluster version 2018.3.222.
With this hardware configuration, I can run the code with 1 to 12 processes.
When running the code with 1 process, the code gives the correct solution.
When running the code with 2 to 8 processes, the code crashes with the following error messages.
***** < mpirun -n 2 ./test_psolver_v1.a > <Error Message>*****
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(1093)....................: MPI_Allgather(sbuf=0x7ffc336490e4, scount=1, MPI_INT, rbuf=0x8afcb80, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather_impl(908)................: fail failed
MPIR_Allgather(861).....................: fail failed
MPIR_Allgather_intra(681)...............: fail failed
MPIDI_CH3_PktHandler_EagerShortSend(457): Message from rank 0 and tag 7 truncated; 8 bytes received but buffer size is 4
***** < mpirun -n 3 ./test_psolver_v1.a > <Error Message>*****
Fatal error in PMPI_Allgather: Invalid count, error stack:
PMPI_Allgather(1093).....: MPI_Allgather(sbuf=0x8b53ea0, scount=1, MPI_LONG_LONG_INT, rbuf=0x8b54300, rcount=1, MPI_LONG_LONG_INT, MPI_COMM_WORLD) failed
MPIR_Allgather_impl(908).: fail failed
MPIR_Allgather(861)......: fail failed
MPIR_Allgather_intra(332): fail failed
MPIC_Send(335)...........: Negative count, value is -32766
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(1093)....................: MPI_Allgather(sbuf=0x7ffd59422ee4, scount=1, MPI_INT, rbuf=0x9cf6e00, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather_impl(908)................: fail failed
MPIR_Allgather(861).....................: fail failed
MPIR_Allgather_intra(267)...............: fail failed
MPIDI_CH3_PktHandler_EagerShortSend(457): Message from rank 0 and tag 7 truncated; 16 bytes received but buffer size is 12
***** < mpirun -n 7 ./test_psolver_v1.a > <Error Message>*****
Fatal error in PMPI_Allgather: Invalid count, error stack:
PMPI_Allgather(1093).....: MPI_Allgather(sbuf=0xa4781a0, scount=1, MPI_LONG_LONG_INT, rbuf=0xa476580, rcount=1, MPI_LONG_LONG_INT, MPI_COMM_WORLD) failed
MPIR_Allgather_impl(908).: fail failed
MPIR_Allgather(861)......: fail failed
MPIR_Allgather_intra(332): fail failed
MPIC_Send(335)...........: Negative count, value is -32766
MPIR_Allgather_intra(267): fail failed
MPIC_Sendrecv(547).......: Negative count, value is -32764
Fatal error in PMPI_Allgather: Message truncated, error stack:
PMPI_Allgather(1093)....................: MPI_Allgather(sbuf=0x7fff02c0b5e4, scount=1, MPI_INT, rbuf=0x86c7600, rcount=1, MPI_INT, MPI_COMM_WORLD) failed
MPIR_Allgather_impl(908)................: fail failed
MPIR_Allgather(861).....................: fail failed
MPIR_Allgather_intra(267)...............: fail failed
MPIDI_CH3_PktHandler_EagerShortSend(457): Message from rank 4 and tag 7 truncated; 16 bytes received but buffer size is 12
MPIR_Allgather_intra(267)...............: fail failed
MPIDI_CH3U_Receive_data_found(131)......: Message from rank 2 and tag 7 truncated; 32 bytes received but buffer size is 28
When running the code with 9 to 12 processes, the code gives the correct solution again.
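One pattern is visible in the error stacks above: some MPI_Allgather calls are issued with MPI_INT and others with MPI_LONG_LONG_INT, and every truncation message reports a received size that is a multiple of 8 bytes against a buffer size that is a multiple of 4 bytes only. The short Python sketch below just parses the truncation lines copied from the logs and checks that arithmetic. The helper names (`parse_truncation`, `summarize`) are made up for illustration, and the 4-byte and 8-byte element sizes are an assumption about a typical Linux x86-64 system. This would be consistent with the two sides of the call disagreeing on integer width, but the logs alone do not confirm the cause.

```python
import re

# Truncation messages copied verbatim from the error stacks above.
LOG_LINES = [
    "Message from rank 0 and tag 7 truncated; 8 bytes received but buffer size is 4",
    "Message from rank 0 and tag 7 truncated; 16 bytes received but buffer size is 12",
    "Message from rank 4 and tag 7 truncated; 16 bytes received but buffer size is 12",
    "Message from rank 2 and tag 7 truncated; 32 bytes received but buffer size is 28",
]

# Assumed element sizes in bytes (typical for Linux x86-64, not taken from the logs).
SIZE_MPI_INT = 4
SIZE_MPI_LONG_LONG_INT = 8

PATTERN = re.compile(r"(\d+) bytes received but buffer size is (\d+)")


def parse_truncation(line):
    """Return (received_bytes, buffer_bytes) for a truncation line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def summarize(lines):
    """Check each size against the assumed 4-byte and 8-byte element sizes."""
    rows = []
    for line in lines:
        parsed = parse_truncation(line)
        if parsed is None:
            continue  # not a truncation message
        received, buffer_size = parsed
        rows.append({
            "received": received,
            "buffer": buffer_size,
            "received_fits_8_byte": received % SIZE_MPI_LONG_LONG_INT == 0,
            "buffer_fits_8_byte": buffer_size % SIZE_MPI_LONG_LONG_INT == 0,
            "buffer_fits_4_byte": buffer_size % SIZE_MPI_INT == 0,
        })
    return rows


for row in summarize(LOG_LINES):
    print(row)
```

In every row the received size divides evenly by 8 while the buffer size divides evenly by 4 but not by 8, which is what one would expect if the sender packed 8-byte integers and the receiver allocated room for 4-byte integers.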
I searched the Internet but could not find any useful information to solve it. Is this an MKL problem, or did I do something wrong? Can you help me?
The following are my code and test input files.
Thank you very much.
Dan