I'm working on a program that performs several batches of three 3D (N1 x N2 x N3) DFTs using MKL's DFT routines. Most of the program runs in parallel under OpenMP, and I'd like the DFT section to scale as well, since it accounts for a significant portion of the program's runtime. However, when I increase the number of threads, the performance improvement plateaus at 3 threads, i.e., the number of transforms in each call. If I instead break each call up into 3 x N1 2D transforms, the parallel performance continues to scale beyond 3 threads. This seems like a lot of extra work for a performance gain I would expect the library to handle internally. Is there a way of directing MKL's DFT to do this on its own?
In case it's relevant: I'm already passing the number of available threads to the DFT via the DFTI_NUMBER_OF_THREADS option of DftiSetValue, and the three 3D transforms are done in a single call by setting the DFTI_NUMBER_OF_TRANSFORMS option. I can provide some pseudocode if that would be helpful.