Chinaunix

Open MPI problem on a cluster, please help!

1#  Posted 2012-11-02 14:56
Last edited by peticure on 2012-11-02 14:58

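Hi all, I'm a beginner dealing with a cluster problem and I don't have much experience, so I'm asking for help. Thanks in advance.
OS: Red Hat 5.4
The environment is 22 blade servers, each with an InfiniBand card. Ten of the servers can run parallel jobs; the other 12 cannot. The MPI version is openmpi-1.3.4.
Here is the error printed by a parallel run: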
[root@hpcn1 ~]# mpirun -machinefile /etc/hosts.equiv /hpchome/icpi
hpcn13.4787ipath_wait_for_device: The /dev/ipath device failed to appear after 30.0 seconds: Connection timed out
hpcn13.4787PSM Could not find an InfiniPath Unit on device /dev/ipath (30s elapsed) (err=21)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: PSM Could not find an InfiniPath Unit
--------------------------------------------------------------------------
[hpcn13:04787] [[42661,1],1] selected pml cm, but peer [[42661,1],0] on hpcn1 selected pml ob1
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[42661,1],0]) is on host: hpcn1
  Process 2 ([[42661,1],1]) is on host: inf13
  BTLs attempted: self sm tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[hpcn1:10312] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[hpcn13:4787] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 10312 on
node hpcn1 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[hpcn1:10310] 1 more process has sent help message help-mtl-psm.txt / unable to open endpoint
[hpcn1:10310] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[hpcn1:10310] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure

2#  Posted 2012-11-02 16:00
Last edited by yjs_sh on 2012-11-02 16:01

Is the InfiniBand card driver installed?
ibhosts
ibstat
Run these to check whether InfiniBand is working properly.
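
For checking many nodes, reading ibstat output by eye gets tedious. Below is a minimal sketch of a filter that lists every port whose State is not Active. The function name ib_inactive_ports is made up for this example, and it assumes the ibstat layout with "CA '...'", "Port N:" and "State:" lines.

```shell
#!/bin/sh
# Hypothetical helper: reads `ibstat` output on stdin and prints
# "<CA name> <port number> <state>" for each port that is not Active.
ib_inactive_ports() {
    awk '
        # "CA 'mlx4_0'" starts a new adapter; strip the quotes from the name.
        /^CA / { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
        # "Port 1:" starts a port section ("Port GUID:" does not match).
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:$/, "", port) }
        # "State: Down" is the logical state ("Physical state:" does not match).
        /^[ \t]*State:/ { if ($2 != "Active") print ca, port, $2 }
    '
}
```

Usage on a node would be `ibstat | ib_inactive_ports`. An empty result means every listed port is Active; it does not by itself prove that the MPI library can use the adapter.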

3#  Posted 2012-11-02 18:06
Reply to 2# yjs_sh

Hi, the InfiniBand card driver is installed:

[root@hpcn1 ~]# ibhosts
Ca      : 0x14feb59097ff86fc ports 2 "hpcn20 HCA-1"
Ca      : 0x14feb59097ff6b9c ports 2 "hpcn19 HCA-1"
Ca      : 0x14feb59097ff86d4 ports 2 "hpcn18 HCA-1"
Ca      : 0x14feb59097ff6bac ports 2 "hpcn16 HCA-1"
Ca      : 0x14feb59097ff8748 ports 2 "hpcn12 HCA-1"
Ca      : 0x14feb59097ff86e4 ports 2 "hpcn14 HCA-1"
Ca      : 0x14feb59097ff6bc0 ports 2 "hpcn13 HCA-1"
Ca      : 0xf04da290977971f4 ports 2 "hpcn8 HCA-1"
Ca      : 0x14feb59097ff3018 ports 2 "hpcn7 HCA-1"
Ca      : 0x14feb59097ff2a58 ports 2 "hpcn12 HCA-1"
Ca      : 0x14feb59097ff3024 ports 2 "hpcn11 HCA-1"
Ca      : 0x14feb59097ff304c ports 2 "hpcn10 HCA-1"
Ca      : 0xf04da29097797234 ports 2 "hpcn9 HCA-1"
Ca      : 0xf04da29097797204 ports 2 "hpcn6 HCA-1"
Ca      : 0xf04da290977971f0 ports 2 "hpcn5 HCA-1"
Ca      : 0xf04da2909779720c ports 2 "hpcn4 HCA-1"
Ca      : 0x14feb59097ff2a4c ports 2 "hpcn2 HCA-1"
Ca      : 0x14feb59097ff2a44 ports 2 "hpcn3 HCA-1"
Ca      : 0xf04da290977971e8 ports 2 "hpcn1 HCA-1"


[root@hpcn1 ~]# ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.8.600
        Hardware version: b0
        Node GUID: 0xf04da290977971e8
        System image GUID: 0xf04da290977971eb
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 17
                LMC: 0
                SM lid: 17
                Capability mask: 0x0251086a
                Port GUID: 0xf04da290977971e9
                Link layer: IB
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 70
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0xf04da290977971ea
                Link layer: IB

4#  Posted 2012-11-02 19:06
As I recall, Open MPI uses the IP network by default, and you need to add parameters to make it run over InfiniBand. Have a look at the Open MPI command-line options.
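
Before adding parameters it can help to see which transports the local Open MPI build contains at all. A rough sketch follows: it filters ompi_info output for a given framework and component. The function name ompi_has_component is invented here, and the sketch assumes ompi_info prints component lines of the form "MCA btl: openib (...)".

```shell
#!/bin/sh
# Hypothetical helper: reads `ompi_info` output on stdin and succeeds
# (exit status 0) if framework $1 lists component $2.
# Assumes lines such as:  MCA btl: openib (MCA v2.0, ...)
ompi_has_component() {
    grep -Eq "MCA[[:space:]]+$1:[[:space:]]+$2([[:space:]]|\$)"
}
```

For example, `ompi_info | ompi_has_component mtl psm && echo "psm MTL present"`. In the log from 1#, hpcn13 selected pml cm while hpcn1 selected pml ob1, and the PSM messages come from the psm MTL, so comparing this check across a working and a failing node may be informative. Forcing one PML for all ranks with `--mca pml ob1` is a commonly suggested thing to try, but it is untested on this cluster.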

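5#  Posted 2012-11-07 15:32
It looks like a problem with the IB devices. Why does your ibhosts output list node 12 twice, with 15 and 17 missing?
Using /etc/hosts.equiv as the hostfile is not a good idea. You also didn't specify the number of processes (-np).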
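
The duplicate and missing entries can also be spotted mechanically. Here is a small sketch that reads ibhosts output and reports node numbers that appear more than once or not at all. The name ibhosts_audit is hypothetical, and it assumes node descriptions of the form "hpcnN HCA-1" as in the output from 3#.

```shell
#!/bin/sh
# Hypothetical helper: reads `ibhosts` output on stdin.
# $1 = highest node number expected on the fabric.
# Assumes each line contains a quoted description starting with hpcnN.
ibhosts_audit() {
    awk -v max="$1" '
        match($0, /"hpcn[0-9]+ /) {
            # Skip the quote and "hpcn" (5 chars); drop the trailing space.
            n = substr($0, RSTART + 5, RLENGTH - 6) + 0
            seen[n]++
        }
        END {
            for (i = 1; i <= max; i++) {
                if (seen[i] == 0) print "missing hpcn" i
                else if (seen[i] > 1) print "duplicate hpcn" i
            }
        }
    '
}
```

Usage would be `ibhosts | ibhosts_audit 22` for a 22-node cluster. A node reported as missing may simply be powered off, so this only narrows down where to look.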

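A fairly typical Open MPI command is shown below. Also, Open MPI launches over ssh by default, and the network it uses by default should be IB.

By running with different hostfiles you can work out which node is causing the problem.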

mpirun --mca btl openib,self  -hostfile ./ma -np 2 /path/to/mpi/app.exe
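
One way to apply the "different hostfiles" idea is to pair a node that is known to work with each suspect node in turn. The sketch below only writes the hostfiles and prints the matching commands; it does not run mpirun itself. The function name make_pair_hostfiles, the pair_ file prefix and the output directory are assumptions made for this example, and /path/to/mpi/app.exe is the same placeholder as in the command above.

```shell
#!/bin/sh
# Hypothetical helper: make_pair_hostfiles GOOD_NODE OUTDIR < suspects.txt
# Reads one suspect hostname per line on stdin. For each one, writes a
# two-line hostfile (GOOD_NODE plus the suspect) and prints the mpirun
# command that would test that pair.
make_pair_hostfiles() {
    good=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    while IFS= read -r node; do
        [ -n "$node" ] || continue          # skip empty lines
        hf="$outdir/pair_$node"
        printf '%s\n%s\n' "$good" "$node" > "$hf"
        echo "mpirun --mca btl openib,self -hostfile $hf -np 2 /path/to/mpi/app.exe"
    done
}
```

For example, `printf 'hpcn13\nhpcn14\n' | make_pair_hostfiles hpcn1 ./hf` writes ./hf/pair_hpcn13 and ./hf/pair_hpcn14. Running the printed commands one at a time should show which pairs fail.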

6#  Posted 2012-11-07 16:08
Reply to 5# blues083

Nodes 15 and 17 are powered off. I don't know why 12 shows up twice. I can't get the command you posted to run. Is there any documentation for Open MPI?

7#  Posted 2012-11-07 16:59
That command was only an example.
The documentation is all on the Open MPI website: http://www.open-mpi.org/doc/