Hi everyone, I'm a beginner facing a cluster problem and I don't have much experience, so I'm asking for your help. Thanks in advance.
Red Hat 5.4
The current environment is 22 blade servers, each with an InfiniBand card. 10 of the servers can run parallel jobs; the other 12 cannot. We are using openmpi-1.3.4.
Here is the error reported when running a parallel job:
[root@hpcn1 ~]# mpirun -machinefile /etc/hosts.equiv /hpchome/icpi
hpcn13.4787ipath_wait_for_device: The /dev/ipath device failed to appear after 30.0 seconds: Connection timed out
hpcn13.4787PSM Could not find an InfiniPath Unit on device /dev/ipath (30s elapsed) (err=21)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
Error: PSM Could not find an InfiniPath Unit
--------------------------------------------------------------------------
[hpcn13:04787] [[42661,1],1] selected pml cm, but peer [[42661,1],0] on hpcn1 selected pml ob1
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[42661,1],0]) is on host: hpcn1
Process 2 ([[42661,1],1]) is on host: inf13
BTLs attempted: self sm tcp
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[hpcn1:10312] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[hpcn13:4787] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 10312 on
node hpcn1 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[hpcn1:10310] 1 more process has sent help message help-mtl-psm.txt / unable to open endpoint
[hpcn1:10310] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[hpcn1:10310] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
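For reference, here is a minimal sketch of the checks I understand might narrow this down (the hostnames, /dev/ipath, and the ob1/self,sm,tcp settings come from the log above; the ib_ipath module name, the ibv_devinfo tool, and forcing these MCA parameters are my own assumptions, not something I have verified on this cluster):

# On one of the failing nodes (e.g. hpcn13): does the InfiniPath device exist
# and is its kernel module loaded?
ssh hpcn13 'ls -l /dev/ipath; lsmod | grep ib_ipath'

# Is the HCA port actually active on that node? (needs the OFED verbs utilities)
ssh hpcn13 'ibv_devinfo | grep -E "hca_id|state"'

# Temporary test: force every rank onto the ob1 PML over self/sm/tcp, bypassing
# PSM entirely, to confirm the job itself runs when InfiniBand is out of the picture
mpirun --mca pml ob1 --mca btl self,sm,tcp -machinefile /etc/hosts.equiv /hpchome/icpi

If the forced-TCP run works across all 22 nodes, then the problem is probably limited to the ipath driver/PSM on the 12 failing blades rather than to Open MPI itself.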