Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Deadlock Problem when using the Cluster FFT

Hello,

I am running a large simulation with MPI on a distributed-memory supercomputer
(FUJITSU Server PRIMERGY CX2550 M4 × 880),
compiled with the Intel Fortran compiler (intel/2018.2.046).

I am getting a deadlock when using the Cluster FFT together with the auxiliary functions
MKL_CDFT_ScatterData and MKL_CDFT_GatherData, and the simulation also runs very slowly.

The simulation solves the Navier–Stokes equations and
works on 3D (X, Y, Z) arrays.
Since the boundary conditions in the Y and Z directions are periodic,
I apply a 2D Cluster FFT in those two directions and loop over the remaining direction, X, as shown below.

==============================================
STATUS = DftiCreateDescriptorDM(MKL_COMM,DESC,DFTI_DOUBLE,DFTI_COMPLEX,2,LENGTHS)

STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_SIZE,SIZE)
STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_NX,NXX)
STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_X_START,START_X)
STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_OUT_NX,NX_OUT)
STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_OUT_X_START,START_X_OUT)
ALLOCATE(LOCAL(SIZE), WORK(SIZE), STAT=STATUS)
STATUS = DftiSetValueDM(DESC,DFTI_PLACEMENT,DFTI_NOT_INPLACE)

DO I = 1, Nx-1

 ALLOCATE(X_IN(M,N))
 
 DO K = 1, N
  DO J = 1, M
  X_IN(J,K) = DCMPLX(A(I,J,K),0d0)
  END DO
 END DO

 STATUS = DftiCommitDescriptorDM(DESC)
 STATUS = MKL_CDFT_SCATTERDATA_D(COMM,ROOTRANK,ELEMENTSIZE,2,LENGTHS,X_IN,NXX,START_X,LOCAL) 
 STATUS = DftiComputeForwardDM(DESC,LOCAL,WORK)
 STATUS = MKL_CDFT_GATHERDATA_D(COMM,ROOTRANK,ELEMENTSIZE,2,LENGTHS,X_IN,NXX,START_X,WORK)

 DO K = 1, N
  DO J = 1, M
  T_F1(I,J,K) = X_IN(J,K)
  END DO
 END DO
 
 DEALLOCATE(X_IN)

END DO

DEALLOCATE(LOCAL, WORK)

~~~~~~~~~~~~~~~~~~~
<SOME CALCULATIONS>
~~~~~~~~~~~~~~~~~~~

STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_SIZE,SIZE)
STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_NX,NX_OUT)
STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_X_START,START_X_OUT)
STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_OUT_NX,NXX)
STATUS = DftiGetValueDM(DESC,CDFT_LOCAL_OUT_X_START,START_X)
ALLOCATE(LOCAL(SIZE), WORK(SIZE), STAT=STATUS)
SCALE = 1.0_8/(N*M)
STATUS = DftiSetValueDM(DESC,DFTI_BACKWARD_SCALE,SCALE)

DO I = 1, Nx-1

 ALLOCATE(X_IN(M,N))
 
 DO K = 1, N
  DO J = 1, M
  X_IN(J,K) = A(I,J,K)
  END DO
 END DO
 
 STATUS = DftiCommitDescriptorDM(DESC)
 STATUS = MKL_CDFT_SCATTERDATA_D(COMM,ROOTRANK,ELEMENTSIZE,2,LENGTHS,X_IN,NXX,START_X,WORK)
 STATUS = DftiComputeBackwardDM(DESC,WORK,LOCAL)
 STATUS = MKL_CDFT_GATHERDATA_D(COMM,ROOTRANK,ELEMENTSIZE,2,LENGTHS,X_IN,NXX,START_X,LOCAL)
 
 DO K = 1, N
  DO J = 1, M
  P(I,J,K) = REAL(X_IN(J,K))
  END DO
 END DO 
 
 DEALLOCATE(X_IN)

END DO

DEALLOCATE(LOCAL, WORK)

STATUS = DftiFreeDescriptorDM(DESC)
==============================================
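
To state the intent of the two loops precisely: for each fixed X index I, the forward and backward calls are meant to compute the 2D transforms below. This is only a sketch under assumptions not shown in the code above: that LENGTHS = (M, N), that the forward scale is left at its default of 1, and that the default MKL sign convention applies (negative exponent forward, positive backward). The symbols a_I and \hat{a}_I are just illustrative names for one Y-Z slab and its transform.

```latex
% Forward 2D DFT of one Y-Z slab (fixed X index I), unscaled:
\hat{a}_I(p,q) \;=\; \sum_{j=0}^{M-1} \sum_{k=0}^{N-1}
    a_I(j,k)\, e^{-2\pi \mathrm{i}\left(\frac{jp}{M} + \frac{kq}{N}\right)},
\qquad 0 \le p < M,\; 0 \le q < N .

% Backward 2D DFT with DFTI_BACKWARD_SCALE = 1/(N M):
a_I(j,k) \;=\; \frac{1}{N M} \sum_{p=0}^{M-1} \sum_{q=0}^{N-1}
    \hat{a}_I(p,q)\, e^{+2\pi \mathrm{i}\left(\frac{jp}{M} + \frac{kq}{N}\right)} .
```

With the 1/(N*M) factor applied on the backward side only, a forward transform followed by a backward transform returns the original slab.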

I wrote this code based on the 'cdft_example_support' and 'dm_complex_2d_double_ex2' examples provided with Intel MKL.
When I run it with -check_mpi, I get the following errors during the first DO loop.

==============================================
[0] ERROR: no progress observed in any process for over 11:12 minutes, aborting application
[0] WARNING: starting premature shutdown

[0] ERROR: GLOBAL:DEADLOCK:HARD: fatal error
[0] ERROR:    Application aborted because no progress was observed for over 11:12 minutes,
[0] ERROR:    check for real deadlock (cycle of processes waiting for data) or
[0] ERROR:    potential deadlock (processes sending data to each other and getting blocked
[0] ERROR:    because the MPI might wait for the corresponding receive).
[0] ERROR:    [0] no progress observed for over 11:12 minutes, process is currently in MPI call:
[0] ERROR:       mpi_gather_(*sendbuf=0x762610, sendcount=2, sendtype=MPI_INTEGER, *recvbuf=0x2b9c9acc4b80, recvcount=2, recvtype=MPI_INTEGER, root=0, comm=MPI_COMM_WORLD, *ierr=0x7fffca56ca50)
[0] ERROR:       module_mpi_mp_mkl_cdft_scatterdata_d_ (/home/~)
[0] ERROR:       press_ffttdma_ (/home/~)
[0] ERROR:       rk3_uvwpc_ (/home/~)
[0] ERROR:       MAIN__ (/home/~)
[0] ERROR:       main (/home/~)
[0] ERROR:       __libc_start_main (/usr/lib64/libc-2.17.so)
[0] ERROR:       (/home/~)
.
.
.
[0] INFO: GLOBAL:DEADLOCK:HARD: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.
==============================================

I have been trying to fix the deadlock and the slow performance for several weeks without success.
I would greatly appreciate any help or insight into these problems.

Best regards

YU,