I am new to MPI. I am writing a program with the Intel Math Kernel Library, and I want to compute a matrix-matrix multiplication by blocks: I split the large matrix X into many small matrices along the columns, as shown below. The matrix is large, so each time I only compute an (N, M) x (M, N) product, where M can be set manually.
X X^T y = X_1 X_1^T y + X_2 X_2^T y + ... + X_n X_n^T y
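Conceptually, each block contributes two GEMV operations: t = X_i^T y followed by out += X_i t. A minimal sketch of that idea (the function and variable names here are placeholders, not my real code):

#include <mkl.h>

// Sketch: contribution of one column block Xk (nRows x nCols, column-major)
// to out = X X^T y. Placeholder names, not the real program.
void addBlockContribution(const double *Xk, MKL_INT nRows, MKL_INT nCols,
                          const double *y, double *tmp, double *out) {
  // tmp = Xk^T * y   (nCols entries)
  cblas_dgemv(CblasColMajor, CblasTrans, nRows, nCols,
              1.0, Xk, nRows, y, 1, 0.0, tmp, 1);
  // out += Xk * tmp  (nRows entries)
  cblas_dgemv(CblasColMajor, CblasNoTrans, nRows, nCols,
              1.0, Xk, nRows, tmp, 1, 1.0, out, 1);
}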
I first set the total number of threads to 16 and M to 1024, then run my program directly as follows. Checking the CPU state, I see a CPU usage of 1600%, which is what I expect.
./MMNET_MPI --block 1024 --numThreads 16
However, when I run my program through MPI as follows, the CPU usage drops to only 200-300%. Strangely, if I change the block size to 64, the CPU usage improves a little, to about 1200%.
mpirun -n 1 --bind-to none ./MMNET_MPI --block 1024 --numThreads 16
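To see what thread budget the process actually gets under mpirun, I think I can print the OpenMP and MKL limits at startup; a small diagnostic sketch (not part of my real program):

#include <cstdio>
#include <omp.h>
#include <mkl.h>

// Diagnostic sketch: print the thread limits the process actually sees.
// If mpirun changes the environment or the affinity, these values should show it.
int main() {
  printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
  printf("mkl_get_max_threads() = %d\n", mkl_get_max_threads());
  printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
  return 0;
}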
I do not know what the problem is. It seems that mpirun applies some default settings that affect my program. Below is part of my matrix multiplication code. The `#pragma omp parallel for` loop extracts the small N by M block from the compressed format in parallel; after that I use `cblas_dgemv` to compute the matrix-matrix multiplication.
void LMMCPU::multXXTTrace(double *out, const double *vec) const {
  // buffers for one block of SNPs and the per-thread work tables
  double *snpBlock = ALIGN_ALLOCATE_DOUBLES(Npad * snpsPerBlock);
  double (*workTable)[4] = (double (*)[4]) ALIGN_ALLOCATE_DOUBLES(omp_get_max_threads() * 256 * sizeof(*workTable));
  // store the temp result
  double *temp1 = ALIGN_ALLOCATE_DOUBLES(snpsPerBlock);

  for (uint64 m0 = 0; m0 < M; m0 += snpsPerBlock) {
    uint64 snpsPerBLockCrop = std::min(M, m0 + snpsPerBlock) - m0;

    // decompress this block into snpBlock, one column per SNP
#pragma omp parallel for
    for (uint64 mPlus = 0; mPlus < snpsPerBLockCrop; mPlus++) {
      uint64 m = m0 + mPlus;
      if (projMaskSnps[m])
        buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m,
                                 workTable + (omp_get_thread_num() << 8));
      else
        memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
    }

    // compute A = X^T V
    MKL_INT row = Npad;
    MKL_INT col = snpsPerBLockCrop;
    double alpha = 1.0;
    MKL_INT lda = Npad;
    MKL_INT incx = 1;
    double beta = 0.0;
    MKL_INT incy = 1;
    cblas_dgemv(CblasColMajor, CblasTrans, row, col, alpha, snpBlock, lda,
                vec, incx, beta, temp1, incy);

    // compute XA (accumulate into out)
    double beta1 = 1.0;
    cblas_dgemv(CblasColMajor, CblasNoTrans, row, col, alpha, snpBlock, lda,
                temp1, incx, beta1, out, incy);
  }

  ALIGN_FREE(snpBlock);
  ALIGN_FREE(workTable);
  ALIGN_FREE(temp1);
}
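One thing I may try is setting the thread counts explicitly before calling dgemv, in case the defaults are lowered when the program runs under mpirun. A hedged sketch (numThreads would be the value I pass via --numThreads):

#include <mkl.h>
#include <omp.h>

// Sketch: force the OpenMP and MKL thread counts instead of relying on
// defaults, in case mpirun lowers them. numThreads is the --numThreads value.
void setThreadCounts(int numThreads) {
  omp_set_num_threads(numThreads);  // threads for the #pragma omp regions
  mkl_set_num_threads(numThreads);  // threads MKL uses inside cblas_dgemv
  mkl_set_dynamic(0);               // keep MKL from reducing the count on its own
}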
I have actually checked that the following part can fully use the CPU resources on its own. The problem therefore seems to be with `cblas_dgemv`.
#pragma omp parallel for
for (uint64 mPlus = 0; mPlus < snpsPerBLockCrop; mPlus++) {
  uint64 m = m0 + mPlus;
  if (projMaskSnps[m])
    buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m,
                             workTable + (omp_get_thread_num() << 8));
  else
    memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
}
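To confirm whether cblas_dgemv itself is the part that stops scaling under mpirun, I suppose I could time one call of the same shape in isolation, with and without mpirun, and compare. A rough sketch of that check (the sizes and names are placeholders):

#include <cstdio>
#include <vector>
#include <omp.h>
#include <mkl.h>

// Sketch: time a single dgemv of the same shape as in multXXTTrace and
// print the MKL thread count it ran with. Npad and snpsPerBlock stand in
// for my real sizes.
void timeDgemv(MKL_INT Npad, MKL_INT snpsPerBlock) {
  std::vector<double> A(Npad * snpsPerBlock, 1.0), x(Npad, 1.0), y(snpsPerBlock, 0.0);
  double t0 = omp_get_wtime();
  cblas_dgemv(CblasColMajor, CblasTrans, Npad, snpsPerBlock,
              1.0, A.data(), Npad, x.data(), 1, 0.0, y.data(), 1);
  printf("dgemv took %.3f s with mkl_get_max_threads() = %d\n",
         omp_get_wtime() - t0, mkl_get_max_threads());
}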
My CPU information is as follows.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 44
On-line CPU(s) list: 0-43
Thread(s) per core: 1
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping: 4
CPU MHz: 1252.786
CPU max MHz: 2101.0000
CPU min MHz: 1000.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 30976K
NUMA node0 CPU(s): 0-21
NUMA node1 CPU(s): 22-43
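Since --bind-to none should leave all 44 logical CPUs available, I also plan to check the affinity mask the process actually ends up with. A Linux-only sketch:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <cstdio>
#include <sched.h>

// Linux-only sketch: count how many CPUs this process is allowed to run on.
// If mpirun still binds the process despite --bind-to none, this will be
// smaller than 44.
int main() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
    printf("CPUs in affinity mask: %d\n", CPU_COUNT(&mask));
  return 0;
}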