freebug posted on 2013-08-01 21:10

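An rx6600 keeps reporting "TEMPERATURE_HIGH_WARNING", please take a look

One of our rx6600 servers frequently reports "TEMPERATURE_HIGH_WARNING". Please help take a look. It should not be a machine-room problem, because another RX6600 in the same location has never reported this error.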


#   Location |Alert| Encoded Field      | Data Field       | Keyword / Timestamp
-------------------------------------------------------------------------------
25  BMC        2    0x2051F9FC8F0201E0   625E5781DA010300   TEMPERATURE_NORMAL_FROM_WARNING
                                                            01 Aug 2013 06:13:35
24  BMC        2    0x2051F9E7FF0201D0   625E5781D9010300   TEMPERATURE_NORMAL_FROM_WARNING
                                                            01 Aug 2013 04:45:51
23  BMC       *3    0x2051F8DE210201C0   62625701DA010300   TEMPERATURE_HIGH_WARNING
                                                            31 Jul 2013 09:51:29
22  BMC       *3    0x2051F8DB4C0201B0   62625701D9010300   TEMPERATURE_HIGH_WARNING
                                                            31 Jul 2013 09:39:24
21  BMC        2    0x2051F8CB180201A0   625E5781DA010300   TEMPERATURE_NORMAL_FROM_WARNING
                                                            31 Jul 2013 08:30:16
20  BMC        2    0x2051F554E6020190   625E5781D9010300   TEMPERATURE_NORMAL_FROM_WARNING
                                                            28 Jul 2013 17:29:10
19  BMC       *3    0x2051F2B34F020180   62625701DA010300   TEMPERATURE_HIGH_WARNING
                                                            26 Jul 2013 17:35:11
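
The hex fields in the entries above can be partly read by hand. The following is a minimal sketch, assuming two things that are only observations from this thread and not a documented format: that hex digits 2 to 9 of the encoded field hold a Unix timestamp (the decoded values happen to equal the timestamps printed under each entry), and that the fifth byte of the data field is the sensor number (0xD9 and 0xDA are the sensor numbers that event.log later names Processor 0 and Processor 1). The names `SAMPLE`, `SENSORS`, and `decode_entries` are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Entries copied from the log above, with the wrapped keywords rejoined.
SAMPLE = """\
25 BMC  2 0x2051F9FC8F0201E0 625E5781DA010300 TEMPERATURE_NORMAL_FROM_WARNING
23 BMC *3 0x2051F8DE210201C0 62625701DA010300 TEMPERATURE_HIGH_WARNING
22 BMC *3 0x2051F8DB4C0201B0 62625701D9010300 TEMPERATURE_HIGH_WARNING
"""

# Sensor numbers and names as they appear in the event.log details in this thread.
SENSORS = {0xD9: "Processor 0", 0xDA: "Processor 1"}

# Entry number, location, anything up to "0x", then the two 16-digit hex fields.
ENTRY = re.compile(
    r"^(\d+)\s+\S+.*?0x([0-9A-F]{16})\s+([0-9A-F]{16})\s+(\S+)", re.M
)


def decode_entries(text):
    """Return (entry number, time, sensor name, keyword) for each log entry."""
    rows = []
    for number, encoded, data, keyword in ENTRY.findall(text):
        # Assumption: hex digits 2..9 of the encoded field are Unix seconds.
        when = datetime.fromtimestamp(int(encoded[2:10], 16), tz=timezone.utc)
        # Assumption: the fifth byte of the data field is the sensor number.
        sensor = int(data[8:10], 16)
        rows.append((int(number), when, SENSORS.get(sensor, hex(sensor)), keyword))
    return rows


for number, when, sensor, keyword in decode_entries(SAMPLE):
    print(number, when.strftime("%Y-%m-%d %H:%M:%S"), sensor, keyword)
```

Read this way, entries 22 and 23 suggest that both processor sensors crossed their warning threshold within about twelve minutes of each other, which would fit an airflow cause better than a single faulty CPU.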

haizdl posted on 2013-08-01 22:04

Check the fans and the airflow around the machine. Please also post the following:


MP LOG:
============
cm > ps

SYSTEM LOG:
============
event.log

tangye posted on 2013-08-01 22:19

Poor ventilation. Check the air filter.

freebug posted on 2013-08-03 01:01

The PS output is normal:

For System Processor Status see the SS command.
System Power state: On
System Power usage: 658 Watts
Temperature       : Normal


Power supplies                State
-----------------------------------------------------------
Power Supply 0                Normal
Power Supply 1                Normal

Fans                        State
-----------------------------------------------------------
System Fan 1                  Normal
System Fan 2                  Normal
System Fan 3                  Normal
System Fan 4                  Normal
System Fan 5                  Normal
System Fan 6                  Normal


event.log contains these errors:

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Thu Aug1 21:18:14 2013

xxdb sent Event Monitor notification information:

/system/events/ia64_corehw/core_hw is >= 1.
Its current value is MAJORWARNING(3).



Event data from monitor:

Event Time..........: Thu Aug1 21:18:14 2013
Severity............: MAJORWARNING
Monitor.............: ia64_corehw
Event #.............: 101011            
System..............: xxdb

Summary:
   System temperature is out of normal range.


Description of Error:

   The system temperature is not within normal operating range. It is higher
   than required operating range.

Probable Cause / Recommended Action:

   Something may be blocking the cooling intakes of the fans. Check for
   obstruction.
   One or more fans may be operating at lower speed than normal. Check the
   fan performance.

   Check for problems with the room air conditioning.

   If the problem is not fixed, the operating temperature may become
   non-recoverable, in which case there are chances that the hardware may be
   damaged.At that temperature level, on Integrity servers, the firmware
   will shutdown the system automatically. However on HP 9000 servers, the
   action specified in the envd config file will be taken - which may be to
   shutdown the system automatically.

   For information on the sensor that generated this event, refer to FRU ID
   in Event Details section.

Additional Event Data:
   System IP Address...: 137.1.100.12
   Event Id............: 0x51fa601600000000
   Monitor Version.....: B.01.00
   Event Class.........: System
   Client Configuration File...........:
   /var/stm/config/tools/monitor/default_ia64_corehw.clcfg
   Client Configuration File Version...: A.01.00
          Qualification criteria met.
               Number of events..: 1
   Associated OS error log entry id(s):
          None
   Additional System Data:
          System Model Number.............: ia64 hp server rx6600
          EMS Version.....................: A.04.20
          STM Version.....................: C.58.00
          System Serial Number............: SGH49016E9
   Latest information on this event:
          http://docs.hp.com/hpux/content/hardware/ems/ia64_corehw.htm#101011

v-v-v-v-v-v-v-v-v-v-v-v-v    DETAILS    v-v-v-v-v-v-v-v-v-v-v-v-v


Event Details :

   Event Date .............: Thu Aug1 21:18:13 2013
   Sensor Number ..........: 0xd9
   Sensor Type ............: Temperature
   Sensor Class ...........: Threshold based
   Sensor Reading/Offset...: 0x07 (Offset)
   EventType.............: Assertion
   Entity ID ..............: 3
   Generic Message.........:
       Temperature :Upper non-critical - going high
   Entity FRU Id Info......:
       processor (Sensor ID: Processor 0)



>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Thu Aug1 23:11:43 2013

xxdb sent Event Monitor notification information:

/system/events/ia64_corehw/core_hw is >= 1.
Its current value is MAJORWARNING(3).



Event data from monitor:

Event Time..........: Thu Aug1 23:11:43 2013
Severity............: MAJORWARNING
Monitor.............: ia64_corehw
Event #.............: 101011            
System..............: xxdb

Summary:
   System temperature is out of normal range.


Description of Error:

   The system temperature is not within normal operating range. It is higher
   than required operating range.

Probable Cause / Recommended Action:

   Something may be blocking the cooling intakes of the fans. Check for
   obstruction.
   One or more fans may be operating at lower speed than normal. Check the
   fan performance.

   Check for problems with the room air conditioning.

   If the problem is not fixed, the operating temperature may become
   non-recoverable, in which case there are chances that the hardware may be
   damaged.At that temperature level, on Integrity servers, the firmware
   will shutdown the system automatically. However on HP 9000 servers, the
   action specified in the envd config file will be taken - which may be to
   shutdown the system automatically.

   For information on the sensor that generated this event, refer to FRU ID
   in Event Details section.

Additional Event Data:
   System IP Address...: 137.1.100.12
   Event Id............: 0x51fa7aaf00000000
   Monitor Version.....: B.01.00
   Event Class.........: System
   Client Configuration File...........:
   /var/stm/config/tools/monitor/default_ia64_corehw.clcfg
   Client Configuration File Version...: A.01.00
          Qualification criteria met.
               Number of events..: 1
   Associated OS error log entry id(s):
          None
   Additional System Data:
          System Model Number.............: ia64 hp server rx6600
          EMS Version.....................: A.04.20
          STM Version.....................: C.58.00
          System Serial Number............: SGH49016E9
   Latest information on this event:
          http://docs.hp.com/hpux/content/hardware/ems/ia64_corehw.htm#101011

v-v-v-v-v-v-v-v-v-v-v-v-v    DETAILS    v-v-v-v-v-v-v-v-v-v-v-v-v


Event Details :

   Event Date .............: Thu Aug1 23:11:34 2013
   Sensor Number ..........: 0xda
   Sensor Type ............: Temperature
   Sensor Class ...........: Threshold based
   Sensor Reading/Offset...: 0x07 (Offset)
   EventType.............: Assertion
   Entity ID ..............: 3
   Generic Message.........:
       Temperature :Upper non-critical - going high
   Entity FRU Id Info......:
       processor (Sensor ID: Processor 1)



>---------- End Event Monitoring Service Event Notification ----------<
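
The two notifications above differ only in time, Event Id, sensor number, and FRU, so a small script can pull those fields out of event.log. This is a minimal sketch, not an HP tool: `SAMPLE` holds a few lines copied from the first notification, `summarize` is a made-up helper, and the idea that the top 32 bits of the Event Id are Unix seconds is an assumption that happens to hold for both notifications here when a UTC+8 local time zone is assumed for the server.

```python
import re
from datetime import datetime, timedelta, timezone

# A few lines copied from the first notification above.
SAMPLE = """\
Event Time..........: Thu Aug1 21:18:14 2013
   Event Id............: 0x51fa601600000000
   Sensor Number ..........: 0xd9
   Sensor Reading/Offset...: 0x07 (Offset)
       processor (Sensor ID: Processor 0)
"""


def summarize(block, utc_offset_hours=8):
    """Extract Event Id time, sensor number, and FRU name from one notification."""
    event_id = re.search(r"Event Id\.+: 0x([0-9a-f]{16})", block).group(1)
    sensor = re.search(r"Sensor Number \.+: (0x[0-9a-f]+)", block).group(1)
    fru = re.search(r"\(Sensor ID: ([^)]+)\)", block).group(1)
    # Assumption: the first 8 hex digits of the Event Id are Unix seconds,
    # and the server's local time is UTC+8.
    local = timezone(timedelta(hours=utc_offset_hours))
    when = datetime.fromtimestamp(int(event_id[:8], 16), tz=local)
    return {"when": when, "sensor": sensor, "fru": fru}


info = summarize(SAMPLE)
print(info["when"].strftime("%Y-%m-%d %H:%M:%S"), info["sensor"], info["fru"])
```

Running the same helper on each notification in event.log gives a compact list of which sensor fired and when, which is easier to compare against the MP log than the full text.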

lbseraph posted on 2013-08-04 12:43

As I said before, that alarm is being reported by the CPU.

http://bbs.chinaunix.net/forum.php?mod=redirect&goto=findpost&ptid=4092553&pid=23932415&fromuid=21562621

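freebug posted on 2013-08-04 12:51

lbseraph posted on 2013-08-04 12:43
As I said before, that alarm is being reported by the CPU.

http://bbs.chinaunix.net/forum.php?mod=redirect&goto=findpost&pti ...

lbseraph posted on 2013-07-27 10:12
Reply to 1# freebug

That is a CPU temperature alarm: it went above 98 degrees (100 degrees). How many physical CPUs are there? The alarm points to Processor 0 and ...


Thanks, moderator, but it is still strange. I swapped the CPU drawer of another, healthy RX6600 directly with the one in this problem machine. The healthy machine is still fine, and this one still raises the alarm.

In the MP card all fans show as running normally; the machine just reports high temperature from time to time. I also checked inside the CPU drawer and there is not much dust. I really cannot figure it out.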

lbseraph posted on 2013-08-04 17:46

If you can, swap the board that carries the BMC (the MP card) first and see whether the problem follows the board.

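freebug posted on 2013-08-06 00:13

Reply to 7# lbseraph


    I redirected one air inlet so that it blows directly at this RX6600 and watched it for three days, and the error has not come back. One more question for the moderator: how did you tell that the CPU temperature went above 100 degrees? Is there a dedicated decoding tool?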

television posted on 2013-08-06 11:31

Call the 800 support line and ask.

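lbseraph posted on 2013-08-06 12:54

Reply to 8# freebug

That was decoded with an HP-internal tool. What you can do yourself is view the log with sl in text mode and read the brief description; you should be able to see that the alarm was reported by the CPU, and that is enough. There is no need to worry much about how many degrees over it went. Knowing which component's internal over-temperature alarm was triggered is sufficient, since different components have different over-temperature thresholds. If I remember correctly, the system board forces a shutdown at roughly 42 degrees (to keep the heat from burning out parts).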
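
Even without the internal tool, the "Sensor Reading/Offset...: 0x07 (Offset)" field in the event.log entries earlier in the thread says which threshold was crossed. The table below is a sketch of the threshold-based event offsets as I recall them from the IPMI specification's generic event/reading type codes; only the 0x07 entry is confirmed by this thread, where event.log prints it as "Upper non-critical - going high". The name `describe_offset` is made up for illustration.

```python
# Threshold event offsets (IPMI generic event/reading type 01h), as recalled
# from the specification. Only 0x07 is confirmed by the logs in this thread.
THRESHOLD_OFFSETS = {
    0x00: "Lower non-critical - going low",
    0x01: "Lower non-critical - going high",
    0x02: "Lower critical - going low",
    0x03: "Lower critical - going high",
    0x04: "Lower non-recoverable - going low",
    0x05: "Lower non-recoverable - going high",
    0x06: "Upper non-critical - going low",
    0x07: "Upper non-critical - going high",
    0x08: "Upper critical - going low",
    0x09: "Upper critical - going high",
    0x0A: "Upper non-recoverable - going low",
    0x0B: "Upper non-recoverable - going high",
}


def describe_offset(offset):
    """Return the threshold name for an event offset, or a fallback string."""
    return THRESHOLD_OFFSETS.get(offset, "unknown offset 0x%02x" % offset)


print(describe_offset(0x07))
```

An offset of 0x07 is the lowest of the three upper thresholds, which matches the MAJORWARNING severity in event.log: the CPU sensors crossed their first warning level, not a critical or non-recoverable one.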