E4900 error report
osssvr-sc0:SC> showerrorbuffer
ErrorData
Date: Fri May 18 20:50:20 CST 2012
Device: /partition0/domain0/SB2/dx2
ErrorID: 0x32081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x2
Transid: 0x4
ErrorData
Date: Sat May 19 08:20:18 CST 2012
Device: /partition0/domain0/SB0/dx3
ErrorID: 0x33081ff0
Port: 0
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x0
Transid: 0x4
ErrorData
Date: Sat May 19 08:20:18 CST 2012
Device: /partition0/domain0/SB2/dx3
ErrorID: 0x33081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x0
Transid: 0x4
ErrorData
Date: Sat May 19 08:27:48 CST 2012
Device: /partition0/domain0/SB0/dx3
ErrorID: 0x33081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x2
Transid: 0x7
ErrorData
Date: Sat May 19 08:27:48 CST 2012
Device: /partition0/domain0/SB2/dx3
ErrorID: 0x33081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x2
Transid: 0x7
ErrorData
Date: Sat May 19 08:50:19 CST 2012
Device: /partition0/domain0/SB0/dx2
ErrorID: 0x32081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x2
Transid: 0x5
ErrorData
Date: Sat May 19 08:50:19 CST 2012
Device: /partition0/domain0/SB2/dx2
ErrorID: 0x32081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x2
Transid: 0x5
ErrorData
Date: Sat May 19 20:27:48 CST 2012
Device: /partition0/domain0/SB0/dx3
ErrorID: 0x33081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x2
Transid: 0x7
ErrorData
Date: Sat May 19 20:27:48 CST 2012
Device: /partition0/domain0/SB2/dx3
ErrorID: 0x33081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x2
Transid: 0x7
ErrorData
Date: Sat May 19 20:50:19 CST 2012
Device: /partition0/domain0/SB0/dx2
ErrorID: 0x32081ff0
Port: 0
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x0
Transid: 0x4
ErrorData
Date: Sat May 19 20:50:19 CST 2012
Device: /partition0/domain0/SB2/dx2
ErrorID: 0x32081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x0
Transid: 0x4
ErrorData
Date: Sun May 20 08:27:48 CST 2012
Device: /partition0/domain0/SB0/dx3
ErrorID: 0x33081ff1
Port: 1
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x1
Transid: 0xb
ErrorData
Date: Sun May 20 08:27:48 CST 2012
Device: /partition0/domain0/SB2/dx3
ErrorID: 0x33081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x1
Transid: 0xb
ErrorData
Date: Sun May 20 08:50:18 CST 2012
Device: /partition0/domain0/SB0/dx2
ErrorID: 0x32081ff1
Port: 1
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x1
Transid: 0x2
ErrorData
Date: Sun May 20 08:50:18 CST 2012
Device: /partition0/domain0/SB2/dx2
ErrorID: 0x32081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x1
Transid: 0x2
ErrorData
Date: Sun May 20 20:20:18 CST 2012
Device: /partition0/domain0/SB0/dx3
ErrorID: 0x33081ff3
Port: 3
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x3
Transid: 0xc
ErrorData
Date: Sun May 20 20:20:18 CST 2012
Device: /partition0/domain0/SB2/dx3
ErrorID: 0x33081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x3
Transid: 0xc
ErrorData
Date: Sun May 20 20:27:48 CST 2012
Device: /partition0/domain0/SB0/dx3
ErrorID: 0x33081ff0
Port: 0
Syndrome: 0x122(CE bit 82)
Direction: outgoing read
TargetAid: 0x0
Transid: 0x7
ErrorData
Date: Sun May 20 20:27:48 CST 2012
Device: /partition0/domain0/SB2/dx3
ErrorID: 0x33081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x0
Transid: 0x7
ErrorData
Date: Sun May 20 20:50:19 CST 2012
Device: /partition0/domain0/SB2/dx2
ErrorID: 0x32081ff2
Port: 2
Syndrome: 0x122(CE bit 82)
Direction: incoming read
First error: true
TargetAid: 0x10
Transid: 0x5
From showlogs:
Oct 16 15:28:00 osssvr-sc0 Domain-A.POST: /N0/SB2/P0/B0/D2 is CHS disabled.
Oct 16 15:28:00 osssvr-sc0 Domain-A.POST: /N0/SB2/P0/B1/D2 is CHS disabled.
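Does this mean that two DIMMs in SB2 have failed?
Which entries in the showerrorbuffer output are actual errors? I can't make sense of it. I'm new to this and would really appreciate help from the experts.
Also, when I rebooted, I saw: panic/thread=180e000: vfs_mountroot: cannot remount root
How should I handle this: repair the file system, or boot single-user and mount / directly? I don't know how to do either, so detailed steps would be appreciated.
showboards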
showcomsb2
showchs -b
osssvr-sc1:SC> showchs -b
Component Status
--------------- --------
SB2/P0/B0/D2 Faulty
SB2/P0/B1/D2 Suspect
Hello, how do I read errors from showerrorbuffer, memory errors for example? Does an entry showing "incoming read" mean there is an error?
Reply to 2# znnnz
Reply to 3# justin_fl
Sun System Handbook - ISO 4.0 August 2012 Internal/Partner Edition
Asset ID: 1-71-1002710.1
Update Date: 2012-06-04
Solution Type: Technical Instruction Sure
Solution 1002710.1: Sun Fire v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra 1280, and 1290 systems: Incoming versus Outgoing errors.
Related Items
Sun Fire 4810 Server
Sun Fire 3800 Server
Sun Netra 1290 Server
Sun Fire E6900 Server
Sun Fire 6800 Server
Sun Fire V1280 Server
Sun Fire 4800 Server
Sun Fire E2900 Server
Sun Fire E4900 Server
Sun Netra 1280 Server
Previously Published As: 203717
Applies to:
Sun Fire E6900 Server - Version Not Applicable and later
Sun Netra 1290 Server - Version Not Applicable and later
Sun Fire E4900 Server - Version Not Applicable and later
Sun Fire 4810 Server - Version Not Applicable and later
Sun Fire V1280 Server - Version Not Applicable and later
All Platforms
Goal
Description
This document applies to Sun Fire v1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra 1280, and 1290 systems.
This document relates to the diagnosis of error events that get logged to a file called the error buffer on the System Controller (SC) on the systems shown above. The error buffer log file data is collected by the command showerrorbuffer when running an Explorer using the scextended or 1280extended option. Alternatively, a user can display this information directly on the System Controller by executing the command as follows (this example is from the lom prompt on an E2900 server):
lom> showerrorbuffer
ErrorData
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: outgoing read
TargetAid: 0x3
Transid: 0x1
ErrorData
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: incoming read
First error: true
TargetAid: 0x3
Transid: 0x1
The error example above will be used in the remainder of this article to explain the relation of Incoming to Outgoing as it relates to error message diagnosis.
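For readers who would rather go through a long error buffer with a script than by eye, the record layout shown above (an ErrorData marker followed by "Key: value" lines) is straightforward to parse. The Python below is only a rough sketch: the names parse_errorbuffer and ce_bit are made up for illustration and are not part of any Sun or Oracle tool, and it assumes the output looks exactly like the example above.

```python
import re

SAMPLE = """\
lom> showerrorbuffer
ErrorData
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: outgoing read
TargetAid: 0x3
Transid: 0x1
ErrorData
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: incoming read
First error: true
TargetAid: 0x3
Transid: 0x1
"""


def parse_errorbuffer(text):
    """Return one dict per ErrorData record, keyed by the field names."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "ErrorData":
            # A new record starts; lines before the first marker are ignored.
            current = {}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so the time in "Date:" survives.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def ce_bit(record):
    """Extract the data bit from a field such as '0xd(CE bit 41)', else None."""
    match = re.search(r"CE bit (\d+)", record.get("Syndrome", ""))
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for rec in parse_errorbuffer(SAMPLE):
        print(rec["Device"], rec["Direction"], ce_bit(rec))
```

Run as a script, this prints the reporting device, the direction, and the CE data bit of each record. Any "Key: value" text that follows the last record in a pasted log would be absorbed into that record, so trim the input to the showerrorbuffer output first.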
Fix
Diagnosing incoming versus outgoing errors in the showerrorbuffer file.
What is the relation of the terms Incoming and Outgoing?
The answer is actually kind of easy, because the terms are related to the direction of a data transaction. There are two possible directions for an error event to "travel", and the direction is "as it relates to the dx asic" (picture below illustrates the data path in question here between DX and DCDS):
Outgoing - An error that is moving away from the dx asic (Ultimately to a DCDS/CPU/Memory on the board or off to some other board).
Incoming - An error that is moving towards the dx asic (From a DCDS/CPU/Memory on the reporting dx asic's board).
Why do we care about what direction the error "travels"?
The short answer is: because this is an error.
The longer answer is that the event(s) may mean that there is defective hardware involved if the errors are uncorrectable or excessive (exceeding Oracle's Memory Error Best Practice) in nature. Knowing the direction of the event allows a user to identify the source of the error, which is crucial to resolving the event and stopping the errors.
The direction of the transaction identifies the source, and thus the Root Cause, of the event.
Now, how do we identify the direction that an event is "traveling" and identify the source? Using the same error example as before:
ErrorData
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3 <--- This dx is reporting the event.
ErrorID: 0x33071ff3
Port: 3 <--- This is the CPU number implicated.
Syndrome: 0xd(CE bit 41) <--- This is the error syndrome.
Direction: outgoing read <--- This is the direction of the event as it relates to the dx.
TargetAid: 0x3
Transid: 0x1
Outgoing means that the error's direction went from the dx asic (SB0/dx3) to the CPU (SB0/P3) or its memory (through the DCDS). This is what is called a "Victim" event because the error came from somewhere else and the dx asic "passed it along".
The next error from the example error log file shows a "Source" event. Source events are root cause events.
ErrorData
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3 <--- This dx is reporting the event.
ErrorID: 0x33071ff3
Port: 3 <--- This is the CPU number implicated.
Syndrome: 0xd(CE bit 41) <--- This is the error syndrome.
Direction: incoming read <--- This is the direction of the event as it relates to the dx.
First error: true
TargetAid: 0x3
Transid: 0x1
Incoming means that the error's direction went from the CPU (SB2/P3) or memory via the DCDS to the dx asic (SB2/dx3). This means that the error was sourced from the DCDS, the CPU, or its memory (the CPU is a memory controller). The dx is simply reporting that a CPU it monitors has seen the error and forwards it along, where it becomes a different dx asic's Outgoing event.
In the above example, the Root Cause suspects would be SB2 DIMM pair J16500/J16501 because data bit 41 (ESYN 0xd) translates to that DIMM pair.
If there were correlating ECC errors in the domain's /var/adm/messages file that showed only one DIMM bank in error, then the error would be further isolated to a single DIMM (either Bank 0 or Bank 1).
The suspect(s) should be replaced ONLY IF meeting the Best Practice rules as defined in Document 1010905.1 Oracle Enhanced Memory DIMM Replacement Policy.
NOTE: Syndrome translations to DIMM pairs can be done using internal tools.
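The Victim/Source reading described above can be sketched in a few lines of Python. This is an illustration under assumptions, not a diagnostic tool: the helper names cpu_location and classify are invented here, and pairing records by identical Date, Transid and Syndrome is an assumption taken from how the two example records line up, not a documented rule. It does not translate syndromes to DIMM pairs.

```python
from collections import defaultdict

# The two records from the example, reduced to the fields used here.
RECORDS = [
    {"Date": "Sat Aug 18 09:50:39 EDT 2007", "Device": "/SB0/dx3", "Port": "3",
     "Syndrome": "0xd(CE bit 41)", "Direction": "outgoing read", "Transid": "0x1"},
    {"Date": "Sat Aug 18 09:50:39 EDT 2007", "Device": "/SB2/dx3", "Port": "3",
     "Syndrome": "0xd(CE bit 41)", "Direction": "incoming read", "Transid": "0x1"},
]


def cpu_location(record):
    """'/SB2/dx3' with Port 3 -> 'SB2/P3' (board of the reporting dx, CPU port)."""
    parts = [p for p in record["Device"].split("/") if p]
    board = parts[-2]  # path component just before the dx ASIC name
    return f"{board}/P{record['Port']}"


def classify(records):
    """Group records into events and label each record as source or victim."""
    events = defaultdict(lambda: {"source": [], "victim": []})
    for rec in records:
        # Assumption: both halves of one event share timestamp, Transid, syndrome.
        key = (rec["Date"], rec["Transid"], rec["Syndrome"])
        # Incoming = moving towards the reporting dx = Source; otherwise Victim.
        role = "source" if rec["Direction"].startswith("incoming") else "victim"
        events[key][role].append(cpu_location(rec))
    return dict(events)


if __name__ == "__main__":
    for key, roles in classify(RECORDS).items():
        print(key[0], "source:", roles["source"], "victim:", roles["victim"])
```

For the example records this labels SB2/P3 as the source and SB0/P3 as the victim, matching the explanation above. The same function accepts longer Device paths such as /partition0/domain0/SB2/dx3, because only the component before the dx name is used.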
NOTES:
It is worth mentioning that this document discusses one of the easiest error examples to diagnose as it relates to Incoming/Outgoing directions. It showed "read" transactions.
A read is almost always sourced to a memory DIMM.
If you see an "incoming write" from a single CPU location with many different "outgoing reads", suspect the CPU that is related to the "incoming write" transaction as Root Cause.
Big rule: CPUs "write" and DIMMs "read" so, when only "read's.
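The "incoming write" note above could be checked mechanically along the following lines. This is a hedged sketch only: the records are hypothetical and invented purely to exercise the rule (the example in this article contains no write transactions), the function names are made up, and the min_readers threshold standing in for "many different" is an assumption.

```python
# Hypothetical records, invented only to exercise the rule; not real log data.
RECORDS = [
    {"Device": "/SB0/dx3", "Port": "1", "Direction": "incoming write"},
    {"Device": "/SB2/dx3", "Port": "0", "Direction": "outgoing read"},
    {"Device": "/SB2/dx3", "Port": "2", "Direction": "outgoing read"},
    {"Device": "/SB4/dx3", "Port": "3", "Direction": "outgoing read"},
]


def cpu_location(record):
    """'/SB0/dx3' with Port 1 -> 'SB0/P1' (board of the reporting dx, CPU port)."""
    parts = [p for p in record["Device"].split("/") if p]
    return f"{parts[-2]}/P{record['Port']}"


def suspect_cpu(records, min_readers=2):
    """Return the single 'incoming write' CPU location if the rule applies, else None.

    min_readers is an assumed stand-in for "many different outgoing reads".
    """
    writers = {cpu_location(r) for r in records if r["Direction"] == "incoming write"}
    readers = {cpu_location(r) for r in records if r["Direction"] == "outgoing read"}
    if len(writers) == 1 and len(readers) >= min_readers:
        return next(iter(writers))
    return None


if __name__ == "__main__":
    print(suspect_cpu(RECORDS))
```

With the made-up records above, the only "incoming write" comes from SB0/P1 while three different locations report "outgoing read", so SB0/P1 is returned as the suspect. A result of None only means this one rule did not match; it says nothing about whether a DIMM is at fault.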
Internal Comments
There is an ESYN Translator located at http://panacea.uk.oracle.com/twiki/bin/view/Tools/ToolPageEsynDecoderUniboard
which can be used to translate ECC syndromes as shown in this article's example.
Previously Published As 90269
Copyright © 2012 Sun Microsystems, Inc. All rights reserved.
Thanks a lot. It would be even better if it were in Chinese, haha.
Reply to 4# znnnz