Hi,
I am trying the Intel® Optimized LINPACK Benchmark for Linux* OS on a configuration with multiple Intel Xeon Phi cards.
My test environment:
- AIC Sandy Bridge EP-4S server system with 4 x Sandy Bridge EP-4S CPUs + 98 GB memory
- Intel Xeon Phi: 3 x 3110 and 4 x 3115
- OS: Red Hat Enterprise Linux 6.2 x64
- Xeon Phi MPSS: KNC_gold_update_2-2.1.5889-16-rhel-6.2.tar
- Intel Composer XE : l_ccompxe_2013.3.163.tgz
- Intel MPI : l_mpi_p_4.1.0.024.tgz or l_mpi_p_4.1.0.030.tgz
I ran the runme_xeon64_ao script, which enables acceleration by offloading computations to the Intel Xeon Phi coprocessors available on the system. I found that once I increase the HPL problem size (Ns) past a certain point, the Linpack process (xlinpack_xeon64) runs endlessly and never finishes, and related error messages appear in the host system log. For example, with 7 Phi cards, I hit this problem when I set the HPL problem size (Ns) to 46000. The threshold appears to depend on the number of Phi cards: with 1 Phi card, I can increase Ns to 100000 without any problem.
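As a rough sanity check on host memory for these problem sizes, here is a minimal sketch. It assumes the dominant cost is the Ns x Ns double-precision matrix (8 bytes per element) and ignores HPL workspace and any offload buffers, so it is only a lower bound. The function name is made up for illustration and is not part of the benchmark package.

```python
# Rough lower bound on the HPL matrix footprint: Ns x Ns doubles, 8 bytes each.
# Assumption: workspace and coprocessor offload buffers are NOT counted here.

def hpl_matrix_gib(n: int) -> float:
    """Return the size in GiB of an n x n double-precision matrix."""
    return n * n * 8 / 2**30

# Problem sizes mentioned in this post.
for n in (46000, 100000):
    print(f"Ns = {n}: about {hpl_matrix_gib(n):.1f} GiB")
```

By this estimate both problem sizes fit within the 98 GB of host memory, so host memory capacity alone does not obviously explain the hang.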
The error messages are below:
__scif_fence_wait 3041 err -16
dma_mark_wait 1080 TO chan 0x0
drain_dma_intr 1151 err -16
micscif_rma_destroy_temp_windows 2082 DMA channel 0 hung ep->state 2 window->dma_mark 0x1c0 channel_mark 0x1c2
------------[ cut here ]------------
WARNING: at /home/build/sandbox/mpss/MPSS_4982/k1om/rhel-6.2/mpss/.rpmbuild_4982/BUILD/intel-mic-kmod-2.1.4982/micscif_rma.c:2084 micscif_rma_destroy_temp_windows+0x314/0x540 [mic]() (Not tainted)
Hardware name: SB301-TO
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 mic(U) microcode sg ixgbe dca mdio sb_edac edac_core iTCO_wdt iTCO_vendor_support shpchp e1000e i2c_i801 i2c_core ext4 mbcache jbd2 sr_mod cdrom usb_storage sd_mod crc_t10dif ahci isci libsas scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 2812, comm: SCIF_MISC Not tainted 2.6.32-220.el6.x86_64 #1
Call Trace:
[<ffffffff81069b77>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81069bca>] ? warn_slowpath_null+0x1a/0x20
[<ffffffffa0235664>] ? micscif_rma_destroy_temp_windows+0x314/0x540 [mic]
[<ffffffffa02321b5>] ? micscif_rma_handle_remote_fences+0x155/0x380 [mic]
[<ffffffff814eca40>] ? thread_return+0x4e/0x77e
[<ffffffff8100bc0e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffffa022a0f0>] ? micscif_misc_handler+0x0/0xc0 [mic]
[<ffffffffa022a10a>] ? micscif_misc_handler+0x1a/0xc0 [mic]
[<ffffffffa022a0f0>] ? micscif_misc_handler+0x0/0xc0 [mic]
[<ffffffff8108b2b0>] ? worker_thread+0x170/0x2a0
[<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8108b140>] ? worker_thread+0x0/0x2a0
[<ffffffff81090886>] ? kthread+0x96/0xa0
[<ffffffff8100c14a>] ? child_rip+0xa/0x20
[<ffffffff810907f0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
---[ end trace e0d2c31584645743 ]---