First, I'm not sure if this is the right forum for this question. I don't know which one would be, as the cause could be hardware, MKL, .NET, or some other hidden factor.
I have neural network code in C# which heavily uses MKL via P/Invoke. I set a fixed number of threads and disabled MKL's dynamic threading. The C# code runs mainly before and after training. During training (i.e. between iterations), MKL does most of the computation. No memory is allocated and there is no I/O during training.
I have observed unpredictable performance across iterations (example below) and would like to understand why. In some other runs, the number of connections processed per second dropped to ~600M for a few iterations (very strange). The run below took 6h to finish training (i.e. each iteration takes about 12 minutes on average). The performance degrades towards the end fairly consistently. The throughput is more consistent when I run a smaller job (e.g. one that finishes in 20 minutes).
The code is large and not sharable. If you can't pinpoint why, a hint to help me investigate further would also be appreciated.
Iterations:1/30, 1504.65M connections processed per second
Iterations:2/30, 1505.16M connections processed per second
Iterations:3/30, 1505.16M connections processed per second
Iterations:4/30, 1504.96M connections processed per second
Iterations:5/30, 1503.38M connections processed per second
Iterations:6/30, 1504.68M connections processed per second
Iterations:7/30, 1502.40M connections processed per second
Iterations:8/30, 1506.11M connections processed per second
Iterations:9/30, 1503.20M connections processed per second
Iterations:10/30, 1504.95M connections processed per second
Iterations:11/30, 1502.34M connections processed per second
Iterations:12/30, 1498.91M connections processed per second
Iterations:13/30, 1490.70M connections processed per second
Iterations:14/30, 1477.59M connections processed per second
Iterations:15/30, 1459.92M connections processed per second
Iterations:16/30, 1433.61M connections processed per second
Iterations:17/30, 1402.28M connections processed per second
Iterations:18/30, 1356.30M connections processed per second
Iterations:19/30, 1342.68M connections processed per second
Iterations:20/30, 1306.84M connections processed per second
Iterations:21/30, 1263.10M connections processed per second
Iterations:22/30, 1236.72M connections processed per second
Iterations:23/30, 1209.60M connections processed per second
Iterations:24/30, 1183.91M connections processed per second
Iterations:25/30, 1157.60M connections processed per second
Iterations:26/30, 1140.60M connections processed per second
Iterations:27/30, 1112.54M connections processed per second
Iterations:28/30, 1086.06M connections processed per second
Iterations:29/30, 1071.61M connections processed per second
Iterations:30/30, 1055.94M connections processed per second
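
In case it helps anyone compare runs, here is a minimal Python sketch for summarizing a log like this one. It is not part of my training code. The function names, the choice of the first 10 iterations as the baseline, and the 1% threshold are all assumptions made for illustration.

```python
import re

# Matches entries such as
# "Iterations:12/30, 1498.91M connections processed per second".
# finditer is used so the entries may be on separate lines or on one line.
LOG_PATTERN = re.compile(
    r"Iterations:(\d+)/(\d+),\s*([\d.]+)M connections processed per second"
)


def parse_throughput(log_text):
    """Return a list of (iteration, throughput in M connections/s) tuples."""
    return [
        (int(m.group(1)), float(m.group(3)))
        for m in LOG_PATTERN.finditer(log_text)
    ]


def relative_drop(samples):
    """Return the fractional drop from the first to the last sample."""
    first = samples[0][1]
    last = samples[-1][1]
    return (first - last) / first


def first_drop_below_baseline(samples, baseline_count=10, threshold=0.01):
    """Return the first iteration after the baseline window whose throughput
    is more than `threshold` (as a fraction) below the mean of the first
    `baseline_count` samples, or None if there is no such iteration.

    baseline_count and threshold are arbitrary illustrative defaults.
    """
    baseline_values = [value for _, value in samples[:baseline_count]]
    if not baseline_values:
        return None
    baseline = sum(baseline_values) / len(baseline_values)
    for iteration, value in samples[baseline_count:]:
        if value < baseline * (1.0 - threshold):
            return iteration
    return None
```

To use it, paste the log into a string, call `parse_throughput` on it, and pass the result to `relative_drop` and `first_drop_below_baseline`. Lining up the iteration where the drop starts against wall-clock time might show whether the slowdown tracks elapsed time (which would hint at something like thermal throttling) or the iteration count, but that is only a guess.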