Hi,
I'm currently using the MKL to calculate some DCT on complex-type fortran arrays. My fortran module has a first initialization routine that looks like
subroutine initialize(this,n, ni, nd)
class(costf_odd_t) :: this
!-- Input variables
integer, intent(in) :: n
integer, intent(in) :: ni
integer, intent(in) :: nd
!-- Local variables:
integer :: stat
real(cp) :: fac
real(cp) :: work(n)
this%nRad = n
allocate(this%i_costf_init(128))
allocate(this%d_costf_init(1:5*(n-1)/2+2))
call d_init_trig_transform(n-1,MKL_COSINE_TRANSFORM,this%i_costf_init,this%d_costf_init,stat)
call d_commit_trig_transform(work,this%r2c_handle,this%i_costf_init,this%d_costf_init,stat)
!fac = sqrt(half*real(this%nRad-1,cp))
!stat = DftiSetValue(this%r2c_handle, DFTI_FORWARD_SCALE, fac)
!stat = DftiCommitDescriptor(this%r2c_handle)
end subroutine initializeThe actual DCT is then computed later on as follows
subroutine costf1_complex(this,f,n_f_max,n_f_start,n_f_stop)
class(costf_odd_t) :: this
!-- Input variables:
integer, intent(in) :: n_f_start,n_f_stop ! columns to be transformed
integer, intent(in) :: n_f_max
complex(cp), intent(inout) :: f(n_f_max,this%nRad)
!-- Local variables:
real(cp) :: work_real(this%nRad)
real(cp) :: work_imag(this%nRad)
integer :: stat, n_f
do n_f=n_f_start,n_f_stop
work_real(:) = real(f(n_f,:))
work_imag(:) = aimag(f(n_f,:))
call d_forward_trig_transform(work_real,this%r2c_handle,this%i_costf_init,this%d_costf_init,stat)
call d_forward_trig_transform(work_imag,this%r2c_handle,this%i_costf_init,this%d_costf_init,stat)
f(n_f,:)=sqrt(half*real(this%nRad-1,cp))*cmplx(work_real,work_imag,kind=cp)
end do
end subroutine costf1_complexIt basically does what I want but I have several performance limitations since: (i) there is a pre-factor multiplication which is computed for each DCT, (ii) there is a memory copy due to the unsupported complex-type input arrays in the MKL trig transforms.
I tried fixing the first issue using the 3 last commented lines in the initialize subroutine above, but it didn't work, any idea here? Concerning the second issue, is there a possible way to avoid the memory allocation of the work arrays in the costf1_complex subroutine (maybe using the C-type pointers)?
Thanks!