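Problem solved: it was probably a Lustre version issue after all. With lustre.1.4.6.2 the problem no longer occurs.
I am experimenting with Lustre on VMware GSX. The OS is Red Hat AS4 U2, the kernel is 2.6.9-22.ELsmp, and the installed Lustre packages are: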
kernel-smp-2.6.9-22.0.2.EL_lustre.1.4.6
lustre-1.4.6-2.6.9_22.0.2.EL_lustre.1.4.6smp
lustre-debuginfo-1.4.6-2.6.9_22.0.2.EL_lustre.1.4.6smp
lustre-modules-1.4.6-2.6.9_22.0.2.EL_lustre.1.4.6smp
The layout is 1 OST, 1 MDS, and 1 client. The hosts file is identical on every machine and contains:
127.0.0.1 localhost.localdomain localhost
192.168.0.162 n01
192.168.0.164 n03
192.168.0.165 n04
Note: n01 is the OST, n03 is the MDS, and n04 is the client. On the OST and MDS machines I added one extra disk each (sdb1, 1 GB) to serve as the Lustre partition, as shown here:
[root@n01 ~]# fdisk -l
Disk /dev/sda: 4294 MB, 4294967296 bytes
255 heads, 63 sectors/track, 522 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 16 128488+ 83 Linux
/dev/sda2 17 143 1020127+ 82 Linux swap
/dev/sda3 144 522 3044317+ 83 Linux
Disk /dev/sdb: 4294 MB, 4294967296 bytes
255 heads, 63 sectors/track, 522 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 1 123 987966 83 Linux
The modprobe.conf on every machine is as follows:
[root@n01 ~]# cat /etc/modprobe.conf
alias eth0 pcnet32
alias scsi_hostadapter mptbase
alias scsi_hostadapter1 mptscsih
install kptlrouter modprobe portals ; modprobe --ignore-install kptlrouter
#install ptlrpc modprobe ksocknal ; modprobe --ignore-install ptlrpc
install llite modprobe lov osc ; modprobe --ignore-install llite
alias lustre llite
options lnet networks=tcp0
The Lustre configuration script is as follows:
[root@n01 ~]# cat newconfig.sh
#!/bin/sh
#config.sh
#Create nodes
rm -f newconfig.xml
lmc -m newconfig.xml --add net --node n03 --nid n03 --nettype lnet
lmc -m newconfig.xml --add net --node n01 --nid n01 --nettype lnet
lmc -m newconfig.xml --add net --node generic-client --nid '*' --nettype lnet
#Configure mds
lmc -m newconfig.xml --add mds --node n03 --mds n03-mds1 --fstype ldiskfs --dev /dev/sdb1 --journal_size 400
#Configure ost
lmc -m newconfig.xml --add lov --lov lov1 --mds n03-mds1 --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m newconfig.xml --add ost --node n01 --lov lov1 --ost n01-ost1 --fstype ldiskfs --dev /dev/sdb1
#Configure client
lmc -m newconfig.xml --add mtpt --node generic-client --path /mnt/lustre --mds n03-mds1 --lov lov1
I ran sh newconfig.sh to generate newconfig.xml and distributed the file to n03 and n04. On the OST, running lconf --reformat --node n01 newconfig.xml started the OST successfully, with no errors.
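For reference, the steps up to this point can be written as a dry-run sketch that only prints the commands. Using scp, the root login, and the remote path are my assumptions (the post does not say how the file was copied), and plan_ost_startup is just an illustrative name.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# Assumption: scp as root is used to distribute the file; adjust to your setup.
CONFIG=newconfig.xml

plan_ost_startup() {
    # 1. Generate the XML configuration with the lmc script shown earlier.
    echo "sh newconfig.sh"
    # 2. Copy the generated file to the MDS (n03) and the client (n04).
    for host in n03 n04; do
        echo "scp $CONFIG root@$host:$CONFIG"
    done
    # 3. Format and start the OST on n01.
    echo "lconf --reformat --node n01 $CONFIG"
}

plan_ost_startup
```

Printing the plan first makes it easy to confirm that every node receives the same newconfig.xml before anything is reformatted.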
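On the MDS, running lconf --reformat --node n03 newconfig.xml to start the MDS produced the error kernel: <0>Fatal exception: panic in 5 seconds and the machine hung. The OST hung at the same time. After rebooting, the log shows the following: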
- Apr 14 13:30:14 n03 sshd(pam_unix)[2681]: session opened for user root by (uid=0)
- Apr 14 13:30:50 n03 kernel: Lustre: 2701:0:(module.c:381:init_libcfs_module()) maximum lustre stack 8192
- Apr 14 13:30:52 n03 kernel: Lustre: OBD class driver Build Version: 1.4.6-19691231190000-PRISTINE-.tmp.lbuild.lbuild-v1_4_6_RC3-2.6-rhel4-i686.lbuild.BUILD.lustre-kernel-2.6.9.lustre.linux-2.6.9-22.0.2.EL_lustre.1.4.6smp, info@clusterfs.com
- Apr 14 13:30:53 n03 kernel: Lustre: Added LNI 192.168.0.164@tcp [8/256]
- Apr 14 13:30:54 n03 kernel: Lustre: Accept secure, port 988
- Apr 14 13:31:02 n03 kernel: kjournald starting. Commit interval 5 seconds
- Apr 14 13:31:02 n03 kernel: LDISKFS FS on sdb1, internal journal
- Apr 14 13:31:02 n03 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
- Apr 14 13:31:03 n03 kernel: Lustre: 2762:0:(mds_fs.c:239:mds_init_server_data()) n03-mds1: initializing new last_rcvd
- Apr 14 13:31:03 n03 kernel: Lustre: MDT n03-mds1 now serving /dev/sdb1 (c4d604e8-2506-42cc-97c3-c0c436f1440e) with recovery enabled
- Apr 14 13:31:16 n03 kernel: Lustre: MDT n03-mds1 has stopped.
- Apr 14 13:31:23 n03 kernel: loop: loaded (max 8 devices)
- Apr 14 13:31:23 n03 hald[2237]: Timed out waiting for hotplug event 269. Rebasing to 273
- Apr 14 13:31:29 n03 kernel: kjournald starting. Commit interval 5 seconds
- Apr 14 13:31:29 n03 kernel: LDISKFS FS on sdb1, internal journal
- Apr 14 13:31:29 n03 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
- Apr 14 13:31:30 n03 kernel: eip: c8a1e7aa
- Apr 14 13:31:30 n03 kernel: ------------[ cut here ]------------
- Apr 14 13:31:30 n03 kernel: kernel BUG at include/asm/spinlock.h:146!
- Apr 14 13:31:30 n03 kernel: invalid operand: 0000 [#1]
- Apr 14 13:31:30 n03 kernel: SMP
- Apr 14 13:31:30 n03 kernel: Modules linked in: loop(U) fsfilt_ldiskfs(U) ldiskfs(U) mds(U) lov(U) osc(U) mdc(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) md5(U) ipv6(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) iptable_filter(U) ip_tables(U) dm_mirror(U) dm_mod(U) button(U) battery(U) ac(U) shpchp(U) pcnet32(U) mii(U) floppy(U) ext3(U) jbd(U) mptscsih(U) mptbase(U) sd_mod(U) scsi_mod(U)
- Apr 14 13:31:30 n03 kernel: CPU: 0
- Apr 14 13:31:30 n03 kernel: EIP: 0060:[<c02d1a41>] Not tainted VLI
- Apr 14 13:31:30 n03 kernel: EFLAGS: 00010016 (2.6.9-22.0.2.EL_lustre.1.4.6smp)
- Apr 14 13:31:30 n03 kernel: EIP is at _spin_lock_irqsave+0x20/0x45
- Apr 14 13:31:30 n03 kernel: eax: c8a1e7aa ebx: 00000002 ecx: c02e8bea edx: c02e8bea
- Apr 14 13:31:30 n03 kernel: esi: c5c44960 edi: c0a800a4 ebp: c7a1a000 esp: c415be98
- Apr 14 13:31:30 n03 kernel: ds: 007b es: 007b ss: 0068
- Apr 14 13:31:30 n03 kernel: Process socknal_cd00 (pid: 2736, threadinfo=c415a000 task=c4d4ebb0)
- Apr 14 13:31:30 n03 kernel: Stack: c6d57f40 c5c44960 c8a1e7aa c7edb8a8 c7a1a000 c0a800a4 00000001 c8a17dab
- Apr 14 13:31:30 n03 kernel: 0000bc84 00000000 c0a800a4 c0a800a2 00000000 00000000 000000b1 00000246
- Apr 14 13:31:30 n03 kernel: c5422100 c6d57f00 c5bb1c80 c7edb880 4a9b692d 0004115d c415bef0 c415bef0
- Apr 14 13:31:30 n03 kernel: Call Trace:
- Apr 14 13:31:30 n03 kernel: [<c8a1e7aa>] ksocknal_queue_tx_locked+0x11e/0x1f7 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c8a17dab>] ksocknal_create_conn+0xd7c/0x1454 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c8dab11c>] lnet_connect+0x277/0x2c7 [lnet]
- Apr 14 13:31:30 n03 kernel: [<c8a232e3>] ksocknal_connect+0xd8/0x254 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c8a23638>] ksocknal_connd+0x1d9/0x326 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c011eb7c>] autoremove_wake_function+0x0/0x2d
- Apr 14 13:31:30 n03 kernel: [<c011eb7c>] autoremove_wake_function+0x0/0x2d
- Apr 14 13:31:30 n03 kernel: [<c02d2d5a>] ret_from_fork+0x6/0x14
- Apr 14 13:31:30 n03 kernel: [<c8a2345f>] ksocknal_connd+0x0/0x326 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c8a2345f>] ksocknal_connd+0x0/0x326 [ksocklnd]
- Apr 14 13:31:30 n03 kernel: [<c01041f1>] kernel_thread_helper+0x5/0xb
- Apr 14 13:31:30 n03 kernel: Code: 81 00 00 00 00 01 c3 f0 ff 00 c3 56 89 c6 53 9c 5b fa 81 78 04 ad 4e ad de 74 18 ff 74 24 08 68 ea 8b 2e c0 e8 0f f5 e4 ff 59 58 <0f> 0b 92 00 a4 7c 2e c0 f0 fe 0e 79 13 f7 c3 00 02 00 00 74 01
- Apr 14 13:31:30 n03 kernel: <0>Fatal exception: panic in 5 seconds
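To make the call trace easier to read, here is a minimal sketch that pulls out just the symbol and module of each frame from a log like the one above. The function name oops_frames and the "[kernel]" label for frames without a module are my own conventions, not anything Lustre or syslog provides.

```shell
#!/bin/sh
# Sketch: read syslog text on stdin and print "symbol [module]" for each
# call-trace frame of the form "[<address>] symbol+0xoff/0xlen [module]".
oops_frames() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # A frame starts with a field that is exactly [<hex address>].
            if ($i ~ /^\[<[0-9a-f]+>\]$/ && (i + 1) <= NF) {
                sym = $(i + 1)
                sub(/\+0x.*/, "", sym)   # drop the +0xoffset/0xlength suffix
                # Frames with no trailing [module] are labelled [kernel] here.
                mod = (i + 2 <= NF) ? $(i + 2) : "[kernel]"
                print sym, mod
            }
        }
    }'
}
```

It could be run as oops_frames < /var/log/messages, assuming that is where syslog writes kernel messages on this system. On the trace above, the top frames are ksocknal_queue_tx_locked and ksocknal_create_conn in ksocklnd, which at least shows that the BUG is hit in the socket LND connection thread (socknal_cd00) rather than in the MDS code itself.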
Could someone please take a look and tell me what the problem is? Thanks in advance.
[ Last edited by suran007 on 2006-4-25 17:48 ]