七杀书生 posted on 2014-09-02 16:40

Sun Flash Accelerator F20 PCIe card causing a SUN T5240 to reboot

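I have a SUN T5240 server with a Sun Flash Accelerator F20 PCIe card installed. The day before yesterday the server rebooted unexpectedly. I was not on site, so I asked someone to collect an explorer from the operating system, and it showed the following problems:
1. messages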
Aug 30 17:40:41 scradius2 MQSeries: FFST record created in /var/mqm/errors/AMQ2066.0.FDC
Aug 30 17:40:46 scradius2 sshd: Did not receive identification string from 222.211.95.55
Aug 30 17:44:51 scradius2 genunix: NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
Aug 30 17:44:51 scradius2 unix:
Aug 30 17:44:51 scradius2 ^Mpanic/thread=2a1025f7ca0:
Aug 30 17:44:51 scradius2 unix: Fatal error has occured in: PCIe fabric.(0x0)(0x41)
Aug 30 17:44:51 scradius2 unix:
Aug 30 17:44:51 scradius2 genunix: 000002a10265fbc0 px:px_err_panic+1ac (1947400, 135a400, 41, 2a10265fc70, 0, 0)
Aug 30 17:44:51 scradius2 genunix:    %l0-3: 000000000000d801 0000000001947400 0000000000000000 0000000000000001
Aug 30 17:44:51 scradius2   %l4-7: 0000000000000000 0000000001875c00 0000000000000001 0000000000000000
Aug 30 17:44:51 scradius2 genunix: 000002a10265fcd0 px:px_err_fabric_intr+1b4 (300044ef8c0, 0, 368000000000000, 1, 41, 368)
Aug 30 17:44:51 scradius2 genunix:    %l0-3: 0000000000000000 00000000019477b8 0000000001947400 0000000000000054
Aug 30 17:44:51 scradius2   %l4-7: 00000000019477a0 0000000001947400 0000000001947798 0000000001947400
Aug 30 17:44:51 scradius2 genunix: 000002a10265fe40 px:px_msiq_intr+1e8 (6002147bc90, 30002c271e0, 134cc04, 0, 1, 300014d9f28)
Aug 30 17:44:51 scradius2 genunix:    %l0-3: 00000600214f3e60 00000300044f3850 0000030002c271e0 0000000000000000
Aug 30 17:44:51 scradius2   %l4-7: 0000000000000000 00000000034c4000 000002a10265ff40 0000000000000033
Aug 30 17:44:51 scradius2 genunix: 000002a10265ff50 unix:current_thread+164 (16, 36, ffffffffffffffff, 0, 100, 12)
Aug 30 17:44:51 scradius2 genunix:    %l0-3: 0000000001009904 000002a1025f6fe1 000000000000000e 00000000700101c0
Aug 30 17:44:51 scradius2   %l4-7: 0000000000000002 0000000000000010 0000000000000000 000002a1025f7890
Aug 30 17:44:51 scradius2 genunix: 000002a1025f7930 unix:cpu_halt+104 (30005378000, 36, 187c3e0, 187c2b0, 30005378000, 0)
Aug 30 17:44:51 scradius2 genunix:    %l0-3: 0000060022d04b64 0000000000000001 0000000000000016 0000000000000000
Aug 30 17:44:51 scradius2   %l4-7: 0000000001000000 0000000000000002 00000000018f4000 0000000000000001
Aug 30 17:44:51 scradius2 genunix: 000002a1025f79e0 unix:idle+128 (182a800, 0, 30005378000, ffffffffffffffff, 37, 1829400)
Aug 30 17:44:51 scradius2 genunix:    %l0-3: 0000060022d04b40 000000000000001b 0000000000000000 ffffffffffffffff
Aug 30 17:44:51 scradius2   %l4-7: 0000060022d04b40 ffffffffffffffff 000000000187c2b0 00000000010409e0
Aug 30 17:44:51 scradius2 unix:
Aug 30 17:44:51 scradius2 genunix: syncing file systems...
Aug 30 17:44:52 scradius2 scsi: /pci@400/pci@0/pci@d/LSILogic,sas@0 (mpt1):
Aug 30 17:44:52 scradius2         Log info 31120200 received for target 2.
Aug 30 17:44:52 scradius2         scsi_status=0, ioc_status=804b, scsi_state=c
Aug 30 17:44:52 scradius2 md_stripe: WARNING: md: d90: write error on /dev/dsk/c2t2d0s2
Aug 30 17:44:53 scradius2 genunix: 103
Aug 30 17:44:55 scradius2 genunix: 95
Aug 30 17:44:57 scradius2 genunix: 93
Aug 30 17:45:43 scradius2 last message repeated 20 times
Aug 30 17:45:44 scradius2 genunix: done (not all i/o completed)
Aug 30 17:45:45 scradius2 genunix: dumping to /dev/dsk/c1t0d0s1, offset 65536, content: kernel
Aug 30 17:49:31 scradius2 genunix: ^M100% done: 358960 pages dumped, compression ratio 2.42,
Aug 30 17:49:31 scradius2 genunix: dump succeeded
Aug 30 17:50:51 scradius2 genunix: ^MSunOS Release 5.10 Version Generic_142900-03 64-bit
Aug 30 17:50:51 scradius2 genunix: Copyright 1983-2009 Sun Microsystems, Inc.All rights reserved.
Aug 30 17:50:51 scradius2 Use is subject to license terms.
Aug 30 17:50:51 scradius2 genunix: Ethernet address = 0:21:28:76:ed:e
Aug 30 17:50:51 scradius2 unix: NOTICE: Kernel Cage is ENABLED
Aug 30 17:50:51 scradius2 unix: mem = 16547840K (0x3f2000000)
Aug 30 17:50:51 scradius2 unix: avail mem = 16339820544
Aug 30 17:50:51 scradius2 rootnex: root nexus = T5240
Aug 30 17:50:51 scradius2 rootnex: pseudo0 at root
Aug 30 17:50:51 scradius2 genunix: pseudo0 is /pseudo
Aug 30 17:50:51 scradius2 rootnex: scsi_vhci0 at root
Aug 30 17:50:51 scradius2 genunix: scsi_vhci0 is /scsi_vhci
Aug 30 17:50:51 scradius2 rootnex: px0 at root: 0x400 0x0
Aug 30 17:50:51 scradius2 genunix: px0 is /pci@400
Aug 30 17:50:51 scradius2 px: PCI Express-device: pci@0, pxb_plx0

The log above shows that at 17:44:51 the system reported the message NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, which was followed by panic/thread=2a1025f7ca0: (a CPU interrupt) and by the error Fatal error has occured in: PCIe fabric.(0x0)(0x41).
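
To pull the same three lines out of a larger messages file, a minimal Python sketch follows. The sample text is copied from the excerpt above; the function name summarize and the regular expressions are illustrative assumptions, not part of any Solaris tool. The pattern keeps the spelling "occured" because that is how the line appears in this log.

```python
import re

# Sample lines copied from the messages excerpt in this post.
SAMPLE = """\
Aug 30 17:44:51 scradius2 genunix: NOTICE: SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
Aug 30 17:44:51 scradius2 ^Mpanic/thread=2a1025f7ca0:
Aug 30 17:44:51 scradius2 unix: Fatal error has occured in: PCIe fabric.(0x0)(0x41)
Aug 30 17:44:52 scradius2 md_stripe: WARNING: md: d90: write error on /dev/dsk/c2t2d0s2
"""

# Hypothetical patterns written for this excerpt only.
PATTERNS = (
    ("msg-id", re.compile(r"SUNW-MSG-ID: ([A-Z0-9]+-\d+-[A-Z0-9]+)")),
    ("panic", re.compile(r"panic/thread=([0-9a-f]+)")),
    ("fatal", re.compile(r"Fatal error has occured in: (.+)$")),
)


def summarize(text):
    """Return (timestamp, kind, detail) tuples for panic-related lines."""
    events = []
    for line in text.splitlines():
        stamp = line[:15]  # syslog timestamp, e.g. "Aug 30 17:44:51"
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                events.append((stamp, kind, match.group(1)))
    return events


for event in summarize(SAMPLE):
    print(event)
```

Lines that match none of the patterns, such as the md_stripe warning, are skipped.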

2. showfaults -v shows the following:
sc> showfaults -v

Last POST Run: Tue Dec 18 21:05:27 2012


Post Status: Passed all devices

ID Time                           FRU               Class             Fault
   1 Aug 30 09:22:17                /SYS/MB/RISER0/PCIE3                   Host detected fault MSGID: PCIEX-8000-3S UUID: 8a197e17-3fe7-6c37-d105-b8bcd58872af
   2 Aug 27 03:44:56                /SYS/MB/RISER0/PCIE3                   Host detected fault MSGID: FMD-8000-11 UUID: 058785ba-f343-c980-9c47-cbd7a59bbe4f
   3 Aug 30 09:22:17                /SYS/MB                           Host detected fault MSGID: PCIEX-8000-3S UUID: 8a197e17-3fe7-6c37-d105-b8bcd58872af
   4 Aug 27 03:44:56                /SYS/MB                           Host detected fault MSGID: FMD-8000-11 UUID: 058785ba-f343-c980-9c47-cbd7a59bbe4f

3. showlogs -v shows the following:
Aug 27 03:45:00: Chassis |major   : "Host detected fault, MSGID: FMD-8000-11"
Aug 30 09:22:20: Chassis |major   : "Host detected fault, MSGID: PCIEX-8000-3S"
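
The four showfaults rows are really two diagnoses, each reported once against /SYS/MB/RISER0/PCIE3 and once against /SYS/MB. A minimal Python sketch of grouping the rows by UUID makes that visible. The rows are copied from the output above with whitespace collapsed; the names ROW and group_by_uuid are hypothetical, and the pattern is written only for this excerpt (it also tolerates MSGID and UUID being run together).

```python
import re
from collections import defaultdict

# Rows copied from the showfaults -v output in this post (whitespace collapsed).
ROWS = """\
1 Aug 30 09:22:17 /SYS/MB/RISER0/PCIE3 Host detected fault MSGID: PCIEX-8000-3S UUID: 8a197e17-3fe7-6c37-d105-b8bcd58872af
2 Aug 27 03:44:56 /SYS/MB/RISER0/PCIE3 Host detected fault MSGID: FMD-8000-11 UUID: 058785ba-f343-c980-9c47-cbd7a59bbe4f
3 Aug 30 09:22:17 /SYS/MB Host detected fault MSGID: PCIEX-8000-3S UUID: 8a197e17-3fe7-6c37-d105-b8bcd58872af
4 Aug 27 03:44:56 /SYS/MB Host detected fault MSGID: FMD-8000-11 UUID: 058785ba-f343-c980-9c47-cbd7a59bbe4f
"""

# Hypothetical pattern: ID, timestamp, FRU path, then MSGID and UUID.
ROW = re.compile(
    r"^\s*\d+\s+(\w{3}\s+\d+\s+[\d:]+)\s+(/SYS\S*)\s+.*?"
    r"MSGID:\s*(\S+?)\s*UUID:\s*(\S+)"
)


def group_by_uuid(text):
    """Map each fault UUID to its message ID, SC timestamp, and suspect FRUs."""
    groups = defaultdict(lambda: {"msgid": None, "time": None, "frus": []})
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header or blank line
        when, fru, msgid, uuid = match.groups()
        entry = groups[uuid]
        entry["msgid"] = msgid
        entry["time"] = when
        entry["frus"].append(fru)
    return dict(groups)


for uuid, info in group_by_uuid(ROWS).items():
    print(info["time"], info["msgid"], uuid, info["frus"])
```

Both UUIDs end up with the same two suspect FRUs, so the service processor output alone does not separate the card from the motherboard.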

4. The fmadm -faulty-a.out file contains the following:
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Aug 30 17:53:39 8a197e17-3fe7-6c37-d105-b8bcd58872af  PCIEX-8000-3S  Critical

Host      : scradius2
Platform    : SUNW,T5240        Chassis_id:

Fault class : fault.io.pciex.device-interr max 40%
            fault.io.pciex.bus-linkerr 20%
Affects   : dev:////pci@400/pci@0/pci@d/LSILogic,sas@0
            dev:////pci@400/pci@0/pci@d
                  faulted but still in service
FRU         : "MB/RISER0/PCIE3" (hc://:product-id=SUNW,T5240:chassis-id=FML1017023:server-id=scradius2/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=2/pciexdev=0/pciexfn=0/pciexbus=3/pciexdev=13/pciexfn=0/pciexbus=9/pciexdev=0) max 40%
            "MB" (hc://:product-id=SUNW,T5240:chassis-id=FML1017023:server-id=scradius2:serial=0328MSL-10099K04CJ:part=540793402/motherboard=0) 40%
                  faulty

Description : A problem has been detected on one of the specified devices or on
            one of the specified connecting buses.
            Refer to http://sun.com/msg/PCIEX-8000-3S for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
            this fault

Action      : If a plug-in card is involved check for badly-seated cards or
            bent pins. Otherwise schedule a repair procedure to replace the
            affected device(s). Use fmadm faulty to identify the devices or
            contact Sun for support.
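
One cross-check on this output: the first Affects entry names the same device path as the scsi (mpt1) warning logged in messages at 17:44:52, and the second entry is its parent bridge. A minimal Python sketch of that comparison follows. The strings are copied from this post; the helper names device_path and relation are hypothetical.

```python
import re

# Strings copied from the messages excerpt and the fmadm output in this post.
MESSAGES_LINE = (
    "Aug 30 17:44:52 scradius2 scsi: "
    "/pci@400/pci@0/pci@d/LSILogic,sas@0 (mpt1):"
)
FMADM_AFFECTS = [
    "dev:////pci@400/pci@0/pci@d/LSILogic,sas@0",
    "dev:////pci@400/pci@0/pci@d",
]


def device_path(text):
    """Extract the first /pci@... device path from a string, or None."""
    match = re.search(r"(/pci@[^\s()]+)", text)
    return match.group(1) if match else None


def relation(logged, affected):
    """Describe how an affected path relates to the path seen in the log."""
    if affected == logged:
        return "same device"
    if logged.startswith(affected + "/"):
        return "parent bridge"
    return "unrelated"


logged = device_path(MESSAGES_LINE)
for entry in FMADM_AFFECTS:
    path = device_path(entry)
    print(path, "->", relation(logged, path))
```

A match here only shows that both logs refer to the same PCIe path. It does not show whether the fault lies in the card or in the slot and motherboard.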

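Based on the PCIEX-8000-3S information above, I concluded that the Sun Flash Accelerator F20 PCIe card is faulty and caused the server to reboot.
When I later went on site, I found the server's fault LED lit. I ran fmadm repair uuid in the operating system to clear the fault, and the LED went out. I have watched the server for two days since, and fmadm faulty produces no output.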

My question: is the card faulty, or the motherboard? I lean toward the card.

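nimysun posted on 2014-09-03 08:48

Reply to 1# 七杀书生

It is hard to say for certain. As a rule, replace the part that is easiest to swap first, and remember to flash the card's firmware as well.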

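znnnz posted on 2014-09-03 10:04

Reply to 1# 七杀书生


   It is /SYS/MB/RISER0/PCIE3, that is, the card in PCIe slot 3 has a problem. The usual cause is an incompatible driver, the next most likely is firmware, and the last resort is replacing the card with a new one.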

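七杀书生 posted on 2014-09-03 14:06

Reply to 2# nimysun


    Thanks for the help, moderator. I have three questions:
    1. If the card is replaced, should only the PCIe card be replaced, or the FMods and the ESM along with it?
    2. For the firmware upgrade, should I upgrade the PCIe card's firmware or the FMod firmware?
    3. The manual does not describe a firmware upgrade procedure for the PCIe card itself; it covers FMod firmware and SAS/SATA Controller firmware. Is the SAS/SATA Controller firmware the firmware for this card?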

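nimysun posted on 2014-09-04 08:36

Reply to 4# 七杀书生


    1. If the card is replaced, should only the PCIe card be replaced, or the FMods and the ESM along with it?
Just the card. The ESM is generally quite stable.

    2. For the firmware upgrade, should I upgrade the PCIe card's firmware or the FMod firmware?
The card's firmware. Upgrading the FMod firmware is also recommended, if possible.

    3. The manual does not describe a firmware upgrade procedure for the PCIe card itself; it covers FMod firmware and SAS/SATA Controller firmware. Is the SAS/SATA Controller firmware the firmware for this card?
I don't know. This card is mostly used in Exadata, and I don't know where to download the firmware for an ordinary server.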