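Sun Flash Accelerator F20 PCIe card causes SUN T5240 reboot
I have a SUN T5240 midrange server with a Sun Flash Accelerator F20 PCIe card installed. Two days ago the server rebooted unexpectedly. I was not on site, so I asked someone to collect an explorer from the operating system, and it showed the following problems:
1. messages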
Aug 30 17:40:41 scradius2 MQSeries: FFST record created in /var/mqm/errors/AMQ2066.0.FDC
Aug 30 17:40:46 scradius2 sshd: Did not receive identification string from 222.211.95.55
Aug 30 17:44:51 scradius2 genunix: NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
Aug 30 17:44:51 scradius2 unix:
Aug 30 17:44:51 scradius2 ^Mpanic/thread=2a1025f7ca0:
Aug 30 17:44:51 scradius2 unix: Fatal error has occured in: PCIe fabric.(0x0)(0x41)
Aug 30 17:44:51 scradius2 unix:
Aug 30 17:44:51 scradius2 genunix: 000002a10265fbc0 px:px_err_panic+1ac (1947400, 135a400, 41, 2a10265fc70, 0, 0)
Aug 30 17:44:51 scradius2 genunix: %l0-3: 000000000000d801 0000000001947400 0000000000000000 0000000000000001
Aug 30 17:44:51 scradius2 %l4-7: 0000000000000000 0000000001875c00 0000000000000001 0000000000000000
Aug 30 17:44:51 scradius2 genunix: 000002a10265fcd0 px:px_err_fabric_intr+1b4 (300044ef8c0, 0, 368000000000000, 1, 41, 368)
Aug 30 17:44:51 scradius2 genunix: %l0-3: 0000000000000000 00000000019477b8 0000000001947400 0000000000000054
Aug 30 17:44:51 scradius2 %l4-7: 00000000019477a0 0000000001947400 0000000001947798 0000000001947400
Aug 30 17:44:51 scradius2 genunix: 000002a10265fe40 px:px_msiq_intr+1e8 (6002147bc90, 30002c271e0, 134cc04, 0, 1, 300014d9f28)
Aug 30 17:44:51 scradius2 genunix: %l0-3: 00000600214f3e60 00000300044f3850 0000030002c271e0 0000000000000000
Aug 30 17:44:51 scradius2 %l4-7: 0000000000000000 00000000034c4000 000002a10265ff40 0000000000000033
Aug 30 17:44:51 scradius2 genunix: 000002a10265ff50 unix:current_thread+164 (16, 36, ffffffffffffffff, 0, 100, 12)
Aug 30 17:44:51 scradius2 genunix: %l0-3: 0000000001009904 000002a1025f6fe1 000000000000000e 00000000700101c0
Aug 30 17:44:51 scradius2 %l4-7: 0000000000000002 0000000000000010 0000000000000000 000002a1025f7890
Aug 30 17:44:51 scradius2 genunix: 000002a1025f7930 unix:cpu_halt+104 (30005378000, 36, 187c3e0, 187c2b0, 30005378000, 0)
Aug 30 17:44:51 scradius2 genunix: %l0-3: 0000060022d04b64 0000000000000001 0000000000000016 0000000000000000
Aug 30 17:44:51 scradius2 %l4-7: 0000000001000000 0000000000000002 00000000018f4000 0000000000000001
Aug 30 17:44:51 scradius2 genunix: 000002a1025f79e0 unix:idle+128 (182a800, 0, 30005378000, ffffffffffffffff, 37, 1829400)
Aug 30 17:44:51 scradius2 genunix: %l0-3: 0000060022d04b40 000000000000001b 0000000000000000 ffffffffffffffff
Aug 30 17:44:51 scradius2 %l4-7: 0000060022d04b40 ffffffffffffffff 000000000187c2b0 00000000010409e0
Aug 30 17:44:51 scradius2 unix:
Aug 30 17:44:51 scradius2 genunix: syncing file systems...
Aug 30 17:44:52 scradius2 scsi: /pci@400/pci@0/pci@d/LSILogic,sas@0 (mpt1):
Aug 30 17:44:52 scradius2 Log info 31120200 received for target 2.
Aug 30 17:44:52 scradius2 scsi_status=0, ioc_status=804b, scsi_state=c
Aug 30 17:44:52 scradius2 md_stripe: WARNING: md: d90: write error on /dev/dsk/c2t2d0s2
Aug 30 17:44:53 scradius2 genunix: 103
Aug 30 17:44:55 scradius2 genunix: 95
Aug 30 17:44:57 scradius2 genunix: 93
Aug 30 17:45:43 scradius2 last message repeated 20 times
Aug 30 17:45:44 scradius2 genunix: done (not all i/o completed)
Aug 30 17:45:45 scradius2 genunix: dumping to /dev/dsk/c1t0d0s1, offset 65536, content: kernel
Aug 30 17:49:31 scradius2 genunix: ^M100% done: 358960 pages dumped, compression ratio 2.42,
Aug 30 17:49:31 scradius2 genunix: dump succeeded
Aug 30 17:50:51 scradius2 genunix: ^MSunOS Release 5.10 Version Generic_142900-03 64-bit
Aug 30 17:50:51 scradius2 genunix: Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Aug 30 17:50:51 scradius2 Use is subject to license terms.
Aug 30 17:50:51 scradius2 genunix: Ethernet address = 0:21:28:76:ed:e
Aug 30 17:50:51 scradius2 unix: NOTICE: Kernel Cage is ENABLED
Aug 30 17:50:51 scradius2 unix: mem = 16547840K (0x3f2000000)
Aug 30 17:50:51 scradius2 unix: avail mem = 16339820544
Aug 30 17:50:51 scradius2 rootnex: root nexus = T5240
Aug 30 17:50:51 scradius2 rootnex: pseudo0 at root
Aug 30 17:50:51 scradius2 genunix: pseudo0 is /pseudo
Aug 30 17:50:51 scradius2 rootnex: scsi_vhci0 at root
Aug 30 17:50:51 scradius2 genunix: scsi_vhci0 is /scsi_vhci
Aug 30 17:50:51 scradius2 rootnex: px0 at root: 0x400 0x0
Aug 30 17:50:51 scradius2 genunix: px0 is /pci@400
Aug 30 17:50:51 scradius2 px: PCI Express-device: pci@0, pxb_plx0
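The log above shows that at 17:44:51 the system reported NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error. The kernel then panicked (^Mpanic/thread=2a1025f7ca0:) in the px interrupt path (px_msiq_intr, px_err_fabric_intr, px_err_panic), with the message Fatal error has occured in: PCIe fabric.(0x0)(0x41).
2. showfaults -v shows the following: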
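For anyone who wants to pull these three facts out of a large messages file rather than reading it by eye, here is a minimal Python sketch. The patterns are taken only from the excerpt above; the function name `summarize_panic` and the `SAMPLE` list are my own illustration, not part of explorer or any Sun tool.

```python
import re

# Patterns based only on the lines quoted in this thread.
# "occured" is the spelling used in the kernel message itself.
MSG_ID = re.compile(r"SUNW-MSG-ID: ([\w-]+), TYPE: (\w+), VER: \d+, SEVERITY: (\w+)")
PANIC = re.compile(r"panic/thread=([0-9a-f]+)")
FATAL = re.compile(r"Fatal error has occured in: (.*)")

def summarize_panic(lines):
    """Return the first panic-related facts found in syslog-style lines."""
    summary = {}
    for raw in lines:
        # Drop carriage-return residue, whether literal "^M" text or a real CR.
        line = raw.replace("^M", "").replace("\r", "").rstrip()
        stamp = line[:15]  # "Mon DD HH:MM:SS"
        m = MSG_ID.search(line)
        if m and "msg_id" not in summary:
            summary.update(time=stamp, msg_id=m.group(1), severity=m.group(3))
        m = PANIC.search(line)
        if m and "thread" not in summary:
            summary["thread"] = m.group(1)
        m = FATAL.search(line)
        if m and "fatal" not in summary:
            summary["fatal"] = m.group(1)
    return summary

SAMPLE = [
    "Aug 30 17:44:51 scradius2 genunix: NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major",
    "Aug 30 17:44:51 scradius2 ^Mpanic/thread=2a1025f7ca0:",
    "Aug 30 17:44:51 scradius2 unix: Fatal error has occured in: PCIe fabric.(0x0)(0x41)",
]
print(summarize_panic(SAMPLE))
```

In practice the lines would come from the messages file in the explorer output instead of `SAMPLE`.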
sc> showfaults -v
Last POST Run: Tue Dec 18 21:05:27 2012
Post Status: Passed all devices
ID Time            FRU                   Class                Fault
1  Aug 30 09:22:17 /SYS/MB/RISER0/PCIE3  Host detected fault  MSGID: PCIEX-8000-3S UUID: 8a197e17-3fe7-6c37-d105-b8bcd58872af
2  Aug 27 03:44:56 /SYS/MB/RISER0/PCIE3  Host detected fault  MSGID: FMD-8000-11 UUID: 058785ba-f343-c980-9c47-cbd7a59bbe4f
3  Aug 30 09:22:17 /SYS/MB               Host detected fault  MSGID: PCIEX-8000-3S UUID: 8a197e17-3fe7-6c37-d105-b8bcd58872af
4  Aug 27 03:44:56 /SYS/MB               Host detected fault  MSGID: FMD-8000-11 UUID: 058785ba-f343-c980-9c47-cbd7a59bbe4f
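The four rows above share only two UUIDs, so they look like two diagnoses that each list two candidate FRUs rather than four separate faults. A small Python sketch that groups the rows by UUID makes this easier to see. The regular expression is an assumption fitted to the rows pasted here (it tolerates the MSGID and UUID fields running together); `group_by_uuid` is a hypothetical helper, not an ALOM feature.

```python
import re
from collections import defaultdict

# Fitted to the showfaults rows quoted in this thread; other firmware
# versions may lay the columns out differently.
ROW = re.compile(
    r"^\s*\d+\s+(\w{3} \d+ [\d:]+)\s+(\S+)\s+.*?MSGID:\s*(\S+?)\s*UUID:\s*(\S+)"
)

def group_by_uuid(rows):
    """Map each fault UUID to its MSGID, time, and the FRUs it names."""
    groups = defaultdict(lambda: {"msgid": None, "time": None, "frus": []})
    for row in rows:
        m = ROW.match(row)
        if not m:
            continue  # header or unrelated line
        time, fru, msgid, uuid = m.groups()
        entry = groups[uuid]
        entry["msgid"], entry["time"] = msgid, time
        entry["frus"].append(fru)
    return dict(groups)

ROWS = [
    "ID Time FRU Class Fault",
    "1 Aug 30 09:22:17 /SYS/MB/RISER0/PCIE3 Host detected fault MSGID: PCIEX-8000-3SUUID: 8a197e17-3fe7-6c37-d105-b8bcd58872af",
    "2 Aug 27 03:44:56 /SYS/MB/RISER0/PCIE3 Host detected fault MSGID: FMD-8000-11UUID: 058785ba-f343-c980-9c47-cbd7a59bbe4f",
    "3 Aug 30 09:22:17 /SYS/MB Host detected fault MSGID: PCIEX-8000-3SUUID: 8a197e17-3fe7-6c37-d105-b8bcd58872af",
    "4 Aug 27 03:44:56 /SYS/MB Host detected fault MSGID: FMD-8000-11UUID: 058785ba-f343-c980-9c47-cbd7a59bbe4f",
]

for uuid, info in group_by_uuid(ROWS).items():
    print(info["time"], info["msgid"], uuid, info["frus"])
```

Both groups name the same pair of FRUs, /SYS/MB/RISER0/PCIE3 and /SYS/MB, which is why the output alone cannot say whether the card or the motherboard is at fault.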
3. showlogs -v shows the following:
Aug 27 03:45:00: Chassis |major : "Host detected fault, MSGID: FMD-8000-11"
Aug 30 09:22:20: Chassis |major : "Host detected fault, MSGID: PCIEX-8000-3S"
4. The fmadm -faulty-a.out file contains the following:
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Aug 30 17:53:39 8a197e17-3fe7-6c37-d105-b8bcd58872af  PCIEX-8000-3S  Critical
Host : scradius2
Platform : SUNW,T5240 Chassis_id:
Fault class : fault.io.pciex.device-interr max 40%
fault.io.pciex.bus-linkerr 20%
Affects : dev:////pci@400/pci@0/pci@d/LSILogic,sas@0
dev:////pci@400/pci@0/pci@d
faulted but still in service
FRU : "MB/RISER0/PCIE3" (hc://:product-id=SUNW,T5240:chassis-id=FML1017023:server-id=scradius2/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=2/pciexdev=0/pciexfn=0/pciexbus=3/pciexdev=13/pciexfn=0/pciexbus=9/pciexdev=0) max 40%
"MB" (hc://:product-id=SUNW,T5240:chassis-id=FML1017023:server-id=scradius2:serial=0328MSL-10099K04CJ:part=540793402/motherboard=0) 40%
faulty
Description : A problem has been detected on one of the specified devices or on
one of the specified connecting buses.
Refer to http://sun.com/msg/PCIEX-8000-3S for more information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : If a plug-in card is involved check for badly-seated cards or
bent pins. Otherwise schedule a repair procedure to replace the
affected device(s). Use fmadm faulty to identify the devices or
contact Sun for support.
Based on the PCIEX-8000-3S information above, I concluded that the Sun Flash Accelerator F20 PCIe card is faulty and caused the server to reboot.
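One cross-check that supports this reading: the device path under "Affects" in the fmadm output is the same path that logged a SCSI error (mpt1) one second after the panic in messages. The Python sketch below does that comparison mechanically. The helper names `affected_paths` and `lines_mentioning` are made up for illustration, and a textual match like this only shows the two records refer to the same device; it does not by itself separate a bad card from a bad slot or motherboard.

```python
import re

def affected_paths(fmadm_text):
    """Return device paths from the 'dev:////...' entries of fmadm faulty output."""
    return ["/" + path for path in re.findall(r"dev:/+(\S+)", fmadm_text)]

def lines_mentioning(paths, log_lines):
    """Return log lines that contain any of the given device paths."""
    return [line for line in log_lines if any(p in line for p in paths)]

# Excerpts copied from the fmadm and messages output quoted in this thread.
FMADM = """Affects : dev:////pci@400/pci@0/pci@d/LSILogic,sas@0
dev:////pci@400/pci@0/pci@d
"""
MESSAGES = [
    "Aug 30 17:44:52 scradius2 scsi: /pci@400/pci@0/pci@d/LSILogic,sas@0 (mpt1):",
    "Aug 30 17:44:52 scradius2 md_stripe: WARNING: md: d90: write error on /dev/dsk/c2t2d0s2",
    "Aug 30 17:50:51 scradius2 genunix: px0 is /pci@400",
]

paths = affected_paths(FMADM)
for line in lines_mentioning(paths, MESSAGES):
    print(line)
```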
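Later, when I got on site, I saw that the server's fault LED was lit. I ran fmadm repair uuid in the operating system to clear the fault, and the LED went off. After two days of observation, fmadm faulty shows no output.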
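My question: is the card at fault, or the motherboard? I lean toward the card.
Reply to 1# 七杀书生
It is hard to say for certain. As a rule, replace the part that is easiest to replace first, and make sure to update the card's firmware as well.
Reply to 1# 七杀书生
The FRU is /SYS/MB/RISER0/PCIE3, which means the card in PCIe slot 3 has the problem. The most common cause is a driver mismatch, the next is firmware, and the last resort is replacing the card with a new one.
Reply to 2# nimysun
Thank you, moderator, for the help. I have three questions: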
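1. If the card is replaced, should only the PCIe card be replaced, or the FMods and ESM along with it?
2. For the firmware upgrade, should I upgrade the PCIe card's firmware or the FMod firmware?
3. The manual does not describe a firmware upgrade procedure for the PCIe card itself; it only covers FMod firmware and SAS/SATA Controller firmware. Is the SAS/SATA Controller firmware the firmware for this card?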
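Reply to 4# 七杀书生
1. If the card is replaced, should only the PCIe card be replaced, or the FMods and ESM along with it?
Just the card. The ESM is generally quite stable.
2. For the firmware upgrade, should I upgrade the PCIe card's firmware or the FMod firmware?
The card's firmware. Upgrading the FMod firmware is also recommended, if possible.
3. The manual does not describe a firmware upgrade procedure for the PCIe card itself; it only covers FMod firmware and SAS/SATA Controller firmware. Is the SAS/SATA Controller firmware the firmware for this card?
I don't know. This card is mostly used in Exadata, and I don't know where to download the firmware for an ordinary server.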