Chinaunix

E4900 error report

#1 | Posted 2012-10-16 21:26
osssvr-sc0:SC> showerrorbuffer
ErrorData[0]
  Date: Fri May 18 20:50:20 CST 2012
  Device: /partition0/domain0/SB2/dx2
  ErrorID: 0x32081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x2
  Transid: 0x4
ErrorData[1]
  Date: Sat May 19 08:20:18 CST 2012
  Device: /partition0/domain0/SB0/dx3
  ErrorID: 0x33081ff0
  Port: 0
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x0
  Transid: 0x4
ErrorData[2]
  Date: Sat May 19 08:20:18 CST 2012
  Device: /partition0/domain0/SB2/dx3
  ErrorID: 0x33081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x0
  Transid: 0x4
ErrorData[3]
  Date: Sat May 19 08:27:48 CST 2012
  Device: /partition0/domain0/SB0/dx3
  ErrorID: 0x33081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x2
  Transid: 0x7
ErrorData[4]
  Date: Sat May 19 08:27:48 CST 2012
  Device: /partition0/domain0/SB2/dx3
  ErrorID: 0x33081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x2
  Transid: 0x7
ErrorData[5]
  Date: Sat May 19 08:50:19 CST 2012
  Device: /partition0/domain0/SB0/dx2
  ErrorID: 0x32081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x2
  Transid: 0x5
ErrorData[6]
  Date: Sat May 19 08:50:19 CST 2012
  Device: /partition0/domain0/SB2/dx2
  ErrorID: 0x32081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x2
  Transid: 0x5
ErrorData[7]
  Date: Sat May 19 20:27:48 CST 2012
  Device: /partition0/domain0/SB0/dx3
  ErrorID: 0x33081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x2
  Transid: 0x7
ErrorData[8]
  Date: Sat May 19 20:27:48 CST 2012
  Device: /partition0/domain0/SB2/dx3
  ErrorID: 0x33081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x2
  Transid: 0x7
ErrorData[9]
  Date: Sat May 19 20:50:19 CST 2012
  Device: /partition0/domain0/SB0/dx2
  ErrorID: 0x32081ff0
  Port: 0
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x0
  Transid: 0x4
ErrorData[10]
  Date: Sat May 19 20:50:19 CST 2012
  Device: /partition0/domain0/SB2/dx2
  ErrorID: 0x32081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x0
  Transid: 0x4
ErrorData[11]
  Date: Sun May 20 08:27:48 CST 2012
  Device: /partition0/domain0/SB0/dx3
  ErrorID: 0x33081ff1
  Port: 1
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x1
  Transid: 0xb
ErrorData[12]
  Date: Sun May 20 08:27:48 CST 2012
  Device: /partition0/domain0/SB2/dx3
  ErrorID: 0x33081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x1
  Transid: 0xb
ErrorData[13]
  Date: Sun May 20 08:50:18 CST 2012
  Device: /partition0/domain0/SB0/dx2
  ErrorID: 0x32081ff1
  Port: 1
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x1
  Transid: 0x2
ErrorData[14]
  Date: Sun May 20 08:50:18 CST 2012
  Device: /partition0/domain0/SB2/dx2
  ErrorID: 0x32081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x1
  Transid: 0x2
ErrorData[15]
  Date: Sun May 20 20:20:18 CST 2012
  Device: /partition0/domain0/SB0/dx3
  ErrorID: 0x33081ff3
  Port: 3
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x3
  Transid: 0xc
ErrorData[16]
  Date: Sun May 20 20:20:18 CST 2012
  Device: /partition0/domain0/SB2/dx3
  ErrorID: 0x33081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x3
  Transid: 0xc
ErrorData[17]
  Date: Sun May 20 20:27:48 CST 2012
  Device: /partition0/domain0/SB0/dx3
  ErrorID: 0x33081ff0
  Port: 0
  Syndrome: 0x122(CE bit 82)
  Direction: outgoing read
  TargetAid: 0x0
  Transid: 0x7
ErrorData[18]
  Date: Sun May 20 20:27:48 CST 2012
  Device: /partition0/domain0/SB2/dx3
  ErrorID: 0x33081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x0
  Transid: 0x7
ErrorData[19]
  Date: Sun May 20 20:50:19 CST 2012
  Device: /partition0/domain0/SB2/dx2
  ErrorID: 0x32081ff2
  Port: 2
  Syndrome: 0x122(CE bit 82)
  Direction: incoming read
  First error: true
  TargetAid: 0x10
  Transid: 0x5
From showlogs:
Oct 16 15:28:00 osssvr-sc0 Domain-A.POST: [ID 290417 local0.warning] /N0/SB2/P0/B0/D2 is CHS disabled.
Oct 16 15:28:00 osssvr-sc0 Domain-A.POST: [ID 531535 local0.warning] /N0/SB2/P0/B1/D2 is CHS disabled.

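Does this mean that two DIMMs in SB2 have failed?
Which entries in the showerrorbuffer output are the actual errors? I can't make sense of it. I'm new to this and would really appreciate some help.

Also, when I rebooted, I got: panic[cpu0]/thread=180e000: vfs_mountroot: cannot remount root

How should I handle this: repair the file system, or boot single-user and mount / directly? I don't know how to do that either, so detailed steps would be much appreciated.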

#2 | Posted 2012-10-17 10:52
showboards

showcom  sb2

showchs -b

#3 | Posted 2012-10-18 10:45
osssvr-sc1:SC> showchs -b
Component           Status   
---------------     --------  
SB2/P0/B0/D2        Faulty   
SB2/P0/B1/D2        Suspect

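Reply to #2 znnnz: Hello, how do I read the errors in the showerrorbuffer output, for example memory errors? Does an entry showing "incoming read" mean that it is an error?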


   

#4 | Posted 2012-10-18 17:37
Reply to #3 justin_fl:


      Sun System Handbook - ISO 4.0 August 2012 Internal/Partner Edition


Asset ID:         1-71-1002710.1
Update Date:        2012-06-04
Keywords:       

Solution Type  Technical Instruction Sure

Solution  1002710.1 :   Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra[TM] 1280, and 1290 systems: Incoming versus Outgoing errors.   
Related Items

Sun Fire 4810 Server

Sun Fire 3800 Server

Sun Netra 1290 Server

Sun Fire E6900 Server

Sun Fire 6800 Server

Sun Fire V1280 Server

Sun Fire 4800 Server

Sun Fire E2900 Server

Sun Fire E4900 Server

Sun Netra 1280 Server


Related Categories

PLA-Support>Sun Systems>SPARC>Enterprise>SN-SPARC: Exx00

.Old GCS Categories>Sun Microsystems>Servers>Midrange Servers

.Old GCS Categories>Sun Microsystems>Servers>Midrange V and Netra Servers




PreviouslyPublishedAs
203717

Applies to:
Sun Fire E6900 Server - Version Not Applicable and later
Sun Netra 1290 Server - Version Not Applicable and later
Sun Fire E4900 Server - Version Not Applicable and later
Sun Fire 4810 Server - Version Not Applicable and later
Sun Fire V1280 Server - Version Not Applicable and later
All Platforms
Goal

Description
This document applies to Sun Fire[TM] v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra[TM] 1280 and 1290 systems.

This document covers the diagnosis of error events that are logged to a file called the error buffer on the System Controller (SC) of the systems listed above. The error buffer data is collected with the showerrorbuffer command when Explorer is run with the scextended or 1280extended option. Alternatively, a user can display this information directly on the System Controller by running the command as follows (this example is from the lom prompt on an E2900 server):
lom> showerrorbuffer

ErrorData[0]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: outgoing read
TargetAid: 0x3
Transid: 0x1
ErrorData[1]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: incoming read
First error: true
TargetAid: 0x3
Transid: 0x1

The error example above will be used in the remainder of this article to explain the relation of Incoming to Outgoing as it relates to error message diagnosis.
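The record layout in the example is regular enough to be read by a short script. The following is a minimal sketch, not an official tool: it assumes every record starts with an `ErrorData[n]` line followed by `Key: value` lines, exactly as shown above. The function name `parse_errorbuffer` and the `Index` key are made up for illustration.

```python
import re

def parse_errorbuffer(text):
    """Split showerrorbuffer output into one dict per ErrorData[n] record."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"ErrorData\[(\d+)\]", line)
        if header:
            # "Index" is an illustrative key, not a field printed by the SC.
            current = {"Index": int(header.group(1))}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only; the Date value contains colons.
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return records

# The two records from the article's example.
SAMPLE = """\
ErrorData[0]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: outgoing read
TargetAid: 0x3
Transid: 0x1
ErrorData[1]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: incoming read
First error: true
TargetAid: 0x3
Transid: 0x1
"""

records = parse_errorbuffer(SAMPLE)
for rec in records:
    print(rec["Index"], rec["Device"], rec["Direction"], rec.get("First error", "-"))
```

Only some records carry a "First error" line, so the sketch reads that field with `get` rather than indexing it directly.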
Fix
Diagnosing incoming versus outgoing errors in the showerrorbuffer file.
What is the relation of the terms Incoming and Outgoing?

The terms describe the direction of a data transaction. An error event can "travel" in one of two directions, and the direction is always stated relative to the dx ASIC, on the data path between the DX and the DCDS:
Outgoing - An error that is moving away from the dx ASIC (ultimately to a DCDS/CPU/memory on the board, or off to some other board).
Incoming - An error that is moving towards the dx ASIC (from a DCDS/CPU/memory on the reporting dx ASIC's board).

Why do we care about what direction the error "travels"?

The short answer is: because this is an error.

The longer answer is that the event(s) may mean that there is defective hardware involved if the errors are uncorrectable or excessive (exceeding Oracle's Memory Error Best Practice) in nature.  Knowing the direction of the event allows a user to identify the source of the error which is crucial to resolving the event and stopping the errors.

The direction of the transaction identifies the source, and thus the Root Cause, of the event.
Now, how do we identify the direction in which an event is "traveling" and identify its source? Using the same error example as before:
ErrorData[0]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3            <--- This dx is reporting the event.
ErrorID: 0x33071ff3
Port: 3                     <--- This is the CPU number implicated.
Syndrome: 0xd(CE bit 41)    <--- This is the error syndrome.
Direction: outgoing read    <--- This is the direction of the event as it relates to the dx.
TargetAid: 0x3
Transid: 0x1

Outgoing means that the error travelled from the dx ASIC (SB0/dx3) to the CPU (SB0/P3) or its memory (through the DCDS). This is called a "Victim" event, because the error came from somewhere else and the dx ASIC "passed it along".



The next error from the example error log file shows a "Source" event.  Source events are root cause events.
ErrorData[1]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3            <--- This dx is reporting the event.
ErrorID: 0x33071ff3
Port: 3                     <--- This is the CPU number implicated.
Syndrome: 0xd(CE bit 41)    <--- This is the error syndrome.
Direction: incoming read    <--- This is the direction of the event as it relates to the dx.
First error: true
TargetAid: 0x3
Transid: 0x1

Incoming means that the error travelled from the CPU (SB2/P3) or its memory, via the DCDS, to the dx ASIC (SB2/dx3). This means that the error was sourced from the DCDS, the CPU, or its memory (the CPU is a memory controller). The dx is simply reporting that a CPU it monitors has seen the error, and it forwards the error along to become a different dx ASIC's Outgoing event.
In the above example, the Root Cause suspects would be the SB2 DIMM pair J16500/J16501, because data bit 41 (ESYN 0xd) translates to that DIMM pair.
If there were correlating ECC errors in the domain's /var/adm/messages file that showed only one DIMM bank in error, then the error would be further isolated to a single DIMM (either Bank 0 or Bank 1).
The suspect(s) should be replaced ONLY IF they meet the Best Practice rules defined in Document 1010905.1, Oracle Enhanced Memory DIMM Replacement Policy.
NOTE: Syndrome translations to DIMM pairs can be done using internal tools.
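The source/victim reading described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a diagnostic tool: it assumes that records sharing the same Date and Transid belong to one transaction (as the two example records do), and that the Device path always ends in board/asic. The helper names `role`, `board` and `pair_events` are made up for this sketch.

```python
from collections import defaultdict

# Fields hand-copied from the two example records in the article.
records = [
    {"Date": "Sat Aug 18 09:50:39 EDT 2007", "Device": "/SB0/dx3", "Port": "3",
     "Direction": "outgoing read", "Transid": "0x1"},
    {"Date": "Sat Aug 18 09:50:39 EDT 2007", "Device": "/SB2/dx3", "Port": "3",
     "Direction": "incoming read", "Transid": "0x1"},
]

def role(record):
    """Label a record in the article's terms: incoming = source, outgoing = victim."""
    direction = record["Direction"].split()[0]
    return {"incoming": "source", "outgoing": "victim"}.get(direction, "unknown")

def board(record):
    """Return the board from a Device path such as /SB2/dx3.

    Assumption: the path always ends in <board>/<asic>, possibly with a
    prefix such as /partition0/domain0.
    """
    return record["Device"].strip("/").split("/")[-2]

def pair_events(records):
    """Group records by (Date, Transid); treating that as one transaction is an assumption."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["Date"], rec["Transid"])].append(rec)
    return groups

for (date, transid), group in pair_events(records).items():
    sources = [f"{board(r)}/P{r['Port']}" for r in group if role(r) == "source"]
    victims = [f"{board(r)}/P{r['Port']}" for r in group if role(r) == "victim"]
    print(date, transid, "source:", sources, "victim:", victims)
```

For the example, the sketch lists SB2/P3 as the source and SB0/P3 as the victim, matching the article's reading. Translating the syndrome to a DIMM pair still requires the tools mentioned in the note.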





NOTES:
It is worth mentioning that this document discusses one of the easiest error examples to diagnose as it relates to Incoming/Outgoing directions.  It showed "read" transactions.  
A read is almost always sourced to a memory DIMM.
If you see an "incoming write" from a single CPU location together with many different "outgoing reads", suspect the CPU associated with the "incoming write" transaction as the Root Cause.
Big rule:  CPUs "write" and DIMMs "read" so, when only "read's.
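The two rules of thumb in these notes can be written down as a rough heuristic. The sketch below is only a hint generator under assumptions: it treats the Direction strings as the literal values "incoming read", "outgoing read" and "incoming write", it fires the CPU rule on any outgoing read rather than on "many", and the function name `suspect_kind` and the second data set are hypothetical.

```python
def suspect_kind(records):
    """Apply the notes' rules of thumb; returns a hint, not a diagnosis."""
    kinds = {rec["Direction"] for rec in records}
    # CPU locations (reporting dx, port) that logged an incoming write.
    writers = {(rec["Device"], rec["Port"]) for rec in records
               if rec["Direction"] == "incoming write"}
    if len(writers) == 1 and "outgoing read" in kinds:
        device, port = next(iter(writers))
        return f"CPU at port {port} reported by {device}"
    if kinds and all(kind.endswith("read") for kind in kinds):
        return "memory DIMM"
    return "undetermined"

# The article's example: only read transactions.
example = [
    {"Device": "/SB0/dx3", "Port": "3", "Direction": "outgoing read"},
    {"Device": "/SB2/dx3", "Port": "3", "Direction": "incoming read"},
]

# Hypothetical data, invented to exercise the "incoming write" rule.
hypothetical = [
    {"Device": "/SB0/dx2", "Port": "1", "Direction": "incoming write"},
    {"Device": "/SB2/dx2", "Port": "0", "Direction": "outgoing read"},
    {"Device": "/SB4/dx2", "Port": "2", "Direction": "outgoing read"},
]

print(suspect_kind(example))
print(suspect_kind(hypothetical))
```

Any hint produced this way still has to be checked against the replacement policy cited in the article before hardware is swapped.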


Internal Comments
There is an ESYN Translator located at http://panacea.uk.oracle.com/twi ... EsynDecoderUniboard
which can be used to translate ECC syndromes as shown in this article's example.

Previously Published As 90269

Attachments
This solution has no attachment
                       
   Copyright © 2012 Sun Microsystems, Inc.  All rights reserved.

#5 | Posted 2012-10-23 16:43
Reply to #4 znnnz: Thanks. It would be even better if it were in Chinese, haha.


   