Channel: Intel® Software - Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

F90 SIMD/Parallel FFT


I found some related posts and links such as https://software.intel.com/en-us/node/507041#EEACAA31-C805-484F-B4AE-D8F...

However, nothing jumped out.

Most of what I do results in executables that each run on a single core, and I have enough executables chained together that, between them, they keep most of the cores relatively evenly loaded. However, I also have cases where the machine is idle and I want to run a single executable across all the cores. Additionally, I now have access to a Xeon Phi. My previous efforts have mostly relied on structuring code so that the vectorization report guides me toward plenty of automatic optimization, without having to dig into OpenMP too extensively... But the times 'they are a-changin''.

My goals are:

1) To get the code running the FFTs across multiple cores

2) To port to a MIC (XEON-Phi) which I have available to me, and use that to then move some other code to the MIC.

Here is where I am with a simple library (since I am typing this in by hand, there may be typos!):

MODULE RHVECTMATH
  USE omp_lib
  IMPLICIT NONE
  PUBLIC :: RH$VSADD, RH$VSUM

  CONTAINS
...
...
  SUBROUTINE RH$RFFT(Sig, nFFT, FFT_Type)
  !DEC$ ATTRIBUTES DEFAULT, REFERENCE, ALIAS:'rh$rfft_'::rh$rfft
  USE MKL_DFTI
  IMPLICIT NONE
  INTEGER(KIND=4)              , INTENT(IN   ) :: nFFT
  REAL(KIND=4), DIMENSION(nFFT), INTENT(INOUT) :: Sig
  INTEGER(KIND=4)              , INTENT(IN   ) :: FFT_Type

  INTEGER(KIND=4)                    :: Entries = 0
  INTEGER(KIND=4)                    :: Status
  REAL(KIND=4),    DIMENSION(nFFT)   :: SigSF
  COMPLEX(KIND=4), DIMENSION(nFFT)   :: SigCF
  LOGICAL(KIND=4)                    :: Local_IO = .TRUE.

  TYPE(DFTI_DESCRIPTOR), POINTER :: MyDesc1_Handle
  Entries = Entries + 1
  Status = DftiCreateDescriptor(MyDesc1_Handle, DFTI_SINGLE, DFTI_REAL, 1, nFFT)
  IF (Local_IO .AND. Entries <= 2) WRITE(11,*) 'Threads:114  Status=', Status  ! <- sigdumps without this WRITE
  Status = DftiSetValue(MyDesc1_Handle, DFTI_PLACEMENT, DFTI_NOT_INPLACE)
!no  !$OMP PARALLEL DO SIMD PRIVATE(Entries) SHARED(Sig, SigSF)
  Status = DftiComputeForward(MyDesc1_Handle, Sig, SigSF)
!no  !$OMP END PARALLEL

  ! Copy the transform result back into the caller's array
  !$OMP PARALLEL WORKSHARE SHARED(Sig, SigSF)
  Sig = SigSF
  !$OMP END PARALLEL WORKSHARE

  RETURN
  END SUBROUTINE RH$RFFT

...
END MODULE RHVECTMATH

Without the WRITE at line 26 of the listing above, it sigdumps whether I compile with -O0 or -O3.

Question #1: What should I be doing differently with MKL to avoid the sigdump when the WRITE is removed?

The library build is as follows:

DO_RH_DBG = -nod-lines
MIC = -D__MIC__ -mmic
NOMIC = -D__NOMIC__
MICOFFLOAD = -D__MIC__ -D__INTEL_OFFLOAD -offload-attribute-target=mic
F90_NO_WARN = ifort -132 ${DO_RH_DBG} -ip -inline-min-size=10 -inline-max-size=40000 -inline-max-per-routine=100000 -inline-max-total-size=2000 -inline-max-per-compile=500000 -recursive -fPIC -gen-interfaces source
F90_LINK = ifort -w1 -g -debug all -check-all -recursive -static-libcxa

RHINCS = -I$(<a path>)/inc
RHINCS3 = -I$(<a path>)/lib/rhvec
RHINCS4 = -I$(<another path>)/inc

LIB = ../../librhvec.a

rhobjects = threads.o mkl_dfti.o

all: $(LIB)

mkl_dfti.o: mkl_dfti.f90
	$(F90_NO_WARN) -O3 $(RHINCS) $(RHINCS4) -align -cpp -openmp -openmp-report2 -axSSE,AVX $(NOMIC) -diag-file=mkl_dfti.compile_txt -opt-report 3 -opt-report-file=mkl_dfti.opt_rpt -Winline -vec-report3 -vec-guard-write $(RHINCS) -c $< -o $@
...
threads.o: threads.f90 mkl_dfti.o
	$(F90_NO_WARN) -O3 $(RHINCS) $(RHINCS4) -align -cpp -openmp -openmp-report2 -axSSE,AVX $(NOMIC) -diag-file=threads.compile_txt -opt-report 3 -opt-report-file=threads.opt_rpt -Winline -vec-report3 -vec-guard-write $(RHINCS) -c $< -o $@
...


$(LIB): $(rhobjects)
	ar sr $(LIB) $(rhobjects)

...

 

The library compiles with only the warning:

ifort: command line warning #10152: option '-axSSE,AVX' not supported.

(This machine is using composer_xe_2013 and ifort is pointing to 14.1.106)

Question #2: In which phase does -axSSE get used? I assume it is not the preprocessor or the link step, so it should be when the .o file is created? From the documentation it appears that -axSSE should work. However, none of the single -ax<> options seem to work either...

The calling code looks like:

!... reading in the data from file...
!buffer = Sig (N)
...
get = nFFT
CALL GRAB(<datasource>, Sig, get, got)
if (got == get) then
Sig1 = Sig(
...
!DIR$ prefetch sig1
CALL RH$RFFT(Sig1, nFFT, 3)
endif
...<more code>

The MKL implementation gives the same results, but runs at half the speed of what I was getting with the .f77 code, and appears to be saturating a single core.

Question #3:
I assume that to run across many cores I will need something like the following. Or are there MKL switches that enable many cores to be used?

!... reading in the data from file...
!buffer = Sig (N * N_Cores)
...
get = nFFT * <N_Cores>
CALL GRAB(<datasource>, Sig, get, got)
...
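! nFFT must stay shared (a PRIVATE copy would be undefined inside the loop);
! the per-iteration slice bounds are private to each thread.
!$OMP PARALLEL DO SHARED(Sig, nFFT) PRIVATE(Start, End)
DO I = 1, N_Cores
  Start = ((I-1)*nFFT) + 1
  End = (I*nFFT)
  CALL RH$RFFT(Sig(Start:End), nFFT, 3)
ENDDO
!$OMP END PARALLEL DO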
endif
...

Also it appears that there is no library or API for calling IPP FFT modules from Fortran??

Question #4:

What is the fastest way to run FFTs using the entire host (MKL, IPP)? (Are there any examples of .c++ interface wrappers?)

Am I correct in preferring OpenMP for spreading the processing across the cores? - Or are there MKL switches to achieve this?

Regardless of whether I am heading to the MIC/Xeon Phi, I want the non-MIC machines to run faster than they do with the .f77 code. Understanding this should also help me work through a totally different problem that I want to put onto the MIC.

Any insights appreciated.

