SRDB ID: 49982
Synopsis: Sun Fire[TM] 12K/15K: esmd incorrectly reports which CP it powered off
Date: 14 Jan 2003

Status: Issued

Description

- Problem Statement/Title:

esmd incorrectly reports which CP it powered off

- Symptoms:

esmd detects an overlimit temperature event on a CP ASIC and reports that it is powering off CP0/CSB0, but it actually powers off CP1/CSB1. It therefore appears that esmd powered off the wrong CP. From the platform messages:

Jan 6 15:48:20 2003 EAIprod1-sc0 esmd[5746]: [1909 5363364972458233 ERR DetectorT.cc 234] An unsafe high temperature has been detected on DMX5, located on CP at CP0. CSB at CS0 is being powered off. The temperature detected is 96.13c; should be 5.00c to 70.00c

Jan 6 15:48:20 2003 EAIprod1-sc0 esmd[5746]: [0 5363365098622035 NOTICE SysControl.cc 3358] Component CSB at CS1 has been blacklisted

Jan 6 15:48:21 2003 EAIprod1-sc0 esmd[5746]: [6221 5363365848396031 ERR PowerControl.cc 344] Failed to send a power event with event code 3 for CSB at CS1

Jan 6 15:48:21 2003 EAIprod1-sc0 esmd[5746]: [1930 5363365881079720 NOTICE SysControl.cc 3914] CSB at CS1 has been powered off: ecode=0

SOLUTION SUMMARY:

- Troubleshooting:

Earlier Sun[TM] Explorer output, captured before the event, shows that DMX5 (in this example) on CP1 was already running at a temperature well above the other sensors.

showenvironment reports:

CP at CP0 dmx0 DMX0 Temp 33.12 C 34.0 sec OK
CP at CP0 dmx1 DMX1 Temp 32.88 C 34.0 sec OK
CP at CP0 dmx3 DMX3 Temp 31.00 C 34.0 sec OK
CP at CP0 dmx5 DMX5 Temp 30.84 C 34.0 sec OK
CP at CP0 amx0 AMX0 Temp 36.56 C 34.0 sec OK
CP at CP0 amx1 AMX1 Temp 34.60 C 34.0 sec OK
CP at CP0 rmx RMX Temp 34.49 C 34.0 sec OK
CP at CP0 darb DARB Temp 31.00 C 34.0 sec OK
CP at CP1 dmx0 DMX0 Temp 30.84 C 24.6 sec OK
CP at CP1 dmx1 DMX1 Temp 30.79 C 24.6 sec OK
CP at CP1 dmx3 DMX3 Temp 28.89 C 24.6 sec OK

CP at CP1 dmx5 DMX5 Temp 62.00 C 24.6 sec OK <= **

CP at CP1 amx0 AMX0 Temp 32.65 C 24.6 sec OK
CP at CP1 amx1 AMX1 Temp 32.53 C 24.6 sec OK
CP at CP1 rmx RMX Temp 32.50 C 24.6 sec OK
CP at CP1 darb DARB Temp 28.91 C 24.6 sec OK      
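A listing like the one above can be scanned for outliers before esmd reacts. The following is a minimal sketch, not part of SMS: the function name hot_sensors and the 20-degree margin are assumptions made for illustration, and the margin is not an esmd threshold. It assumes the row layout shown in this document.

```python
import re
from statistics import median

# Matches rows such as "CP at CP1 dmx5 DMX5 Temp 62.00 C 24.6 sec OK".
# Rows for a powered-off CP ("CP at CP1 -- -- -- -- -- OFF") do not match.
ROW = re.compile(r"^CP at (CP\d+) \S+ (\S+) Temp (\d+(?:\.\d+)?) C\b")


def hot_sensors(text, margin=20.0):
    """Return (cp, sensor, temp) rows more than `margin` degrees C above
    the median of all parsed CP ASIC readings.

    `margin` is an illustrative value only, not an esmd tuning parameter.
    """
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((m.group(1), m.group(2), float(m.group(3))))
    if not rows:
        return []
    mid = median(temp for _, _, temp in rows)
    return [row for row in rows if row[2] - mid > margin]
```

Fed the showenvironment rows from this example, the function would be expected to single out DMX5 on CP1, the sensor that later triggered the overlimit event.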

After esmd detects the unsafe temperature and responds, confirm that the correct CP was deconfigured by using the showenvironment, showbus, and showcomponent commands.

showenvironment reports:

CP at CP1 -- -- -- -- -- OFF      

showbus reports:

Location Data Address Response SOCX
----------------------------------------------------
EX0 CSB0 CSB0 CSB0 0x0007f
EX1 CSB0 CSB0 CSB0 0x0007f
EX2 CSB0 CSB0 CSB0 0x0007f
EX3 CSB0 CSB0 CSB0 0x0007f
EX4 CSB0 CSB0 CSB0 0x0007f
EX5 CSB0 CSB0 CSB0 0x0007f
EX6 CSB0 CSB0 CSB0 0x0007f
EX7 CSB0 CSB0 CSB0 0x00180
EX8 CSB0 CSB0 CSB0 0x00180
EX9 CSB0 CSB0 CSB0 0x00600
EX10 CSB0 CSB0 CSB0 0x00600
EX11 UNCONF UNCONF UNCONF UNCONF
EX12 UNCONF UNCONF UNCONF UNCONF
EX13 UNCONF UNCONF UNCONF UNCONF
EX14 CSB0 CSB0 CSB0 0x0c000
EX15 CSB0 CSB0 CSB0 0x0c000
EX16 CSB0 CSB0 CSB0 0x30000
EX17 CSB0 CSB0 CSB0 0x30000      

showcomponent -a or the $SMSETC/config/asr blacklist file reports:

cplane 1 # ESMD Overlimit Temperature 0106.1548.20
      

DMX5 (in this example) on CP1 overheated, but the warning message is accidentally hard-coded to say CP0/CSB0 regardless of which CP overheated:

Jan 6 15:48:20 2003 EAIprod1-sc0 esmd[5746]: [1909 5363364972458233 ERR DetectorT.cc 234] An unsafe high temperature has been detected on DMX5, located on CP at CP0. CSB at CS0 is being powered off. The temperature detected is 96.13c; should be 5.00c to 70.00c

However, esmd does take the correct action: it blacklists and powers off CP1/CSB1, the CP that actually overheated:

Jan 6 15:48:20 2003 EAIprod1-sc0 esmd[5746]: [0 5363365098622035 NOTICE SysControl.cc 3358] Component CSB at CS1 has been blacklisted
Jan 6 15:48:21 2003 EAIprod1-sc0 esmd[5746]: [1930 5363365881079720 NOTICE SysControl.cc 3914] CSB at CS1 has been powered off: ecode=0      

- Resolution:

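Ignore the CP/CSB location named in the temperature warning message. Rely instead on the blacklist and power-off messages, which report the action esmd actually performed.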

The bug is that esmd hard-codes CP0/CSB0 in the message that reports the high, extremely high, or unsafe temperature condition and the action it is about to take. The messages that follow correctly report which CP/CSB was blacklisted and powered off.
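The comparison described here can be automated when reviewing platform logs. The sketch below is an illustration only, not an SMS tool: the function name summarize and the regular expressions are assumptions based solely on the message wording quoted in this document.

```python
import re

# Action esmd says it is about to take (location may be wrong, per this bug):
#   "... located on CP at CP0. CSB at CS0 is being powered off."
ANNOUNCED = re.compile(r"(\w+) at (\w+) is being powered off")

# Actions esmd actually performed:
#   "Component CSB at CS1 has been blacklisted"
#   "CSB at CS1 has been powered off: ecode=0"
ACTION = re.compile(r"(\w+) at (\w+) has been (blacklisted|powered off)")


def summarize(log_text):
    """Return the announced and the performed actions as lists of
    (component, location) tuples, based on the quoted message wording."""
    # Platform messages may be wrapped across lines; collapse whitespace first.
    text = " ".join(log_text.split())
    performed = {"blacklisted": [], "powered off": []}
    for comp, loc, what in ACTION.findall(text):
        performed[what].append((comp, loc))
    return {
        "announced": ANNOUNCED.findall(text),
        "blacklisted": performed["blacklisted"],
        "powered_off": performed["powered off"],
    }
```

When the "announced" and "powered_off" entries differ, as they do for the messages in this document, the "powered_off" and "blacklisted" entries are the ones to trust.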

- Summary of part number and patch IDs:

TBD

- References and bug IDs:

esc543156: 4799899: Synopsis: esmd incorrectly reports which CP it powered off

The code in src/sms/cmd/esmd/DetectorT.cc (the changed line is marked with !):

 // Set up centerplane SMSComponent object for the purpose of messaging.
cpbID = csb->pos;
cp.loc = SMSCL_CP;
cp.type = SMSCT_CP;
! cp.pos = 0;      

will be changed to:

 // Set up centerplane SMSComponent object for the purpose of messaging.
cpbID = csb->pos;
cp.loc = SMSCL_CP;
cp.type = SMSCT_CP;
! cp.pos = cpbID;      

- Additional background information:

From the $SMSETC/config/esmd_tuning.txt file:

cpb_asic_high_warn = 70.0
cpb_asic_high_crit = 80.0
cpb_asic_overlimit = 85.0      

When these conditions are detected, esmd reports:

high_warn = "A high temperature..."
high_crit = "An extremely high temperature..."
overlimit = "An unsafe high temperature..."      

Actions taken by esmd for a CP or CSB high temperature:

high_warn: Set the fan speed to high until the temperature drops at least 5 degrees below high_warn.

high_crit: Use Dynamic Bus Configuration to reconfigure the domains so they don't use that CP half. Then blacklist and power off the CSB. The domain operations should not be interrupted, but the CP bandwidth is decreased.

overlimit: In this case there is no grace time available to use Dynamic Bus Configuration. Just blacklist and power off the CSB; the domains will crash. esmd uses a "0 second timeout" in this case to indicate that it is responsible for crashing the domains.
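The thresholds, messages, and actions above can be summarized as a small lookup. This is a sketch for reading logs, not esmd's implementation: the function name classify is hypothetical, and the source does not say whether a reading exactly equal to a threshold triggers the condition, so the use of >= is an assumption.

```python
# Values quoted from $SMSETC/config/esmd_tuning.txt in this document.
CPB_ASIC_HIGH_WARN = 70.0
CPB_ASIC_HIGH_CRIT = 80.0
CPB_ASIC_OVERLIMIT = 85.0


def classify(temp_c):
    """Map a CP ASIC temperature to (level, message prefix, action).

    Assumption: a reading equal to a threshold counts as reaching it.
    """
    if temp_c >= CPB_ASIC_OVERLIMIT:
        return ("overlimit", "An unsafe high temperature",
                "blacklist and power off the CSB; the domains crash")
    if temp_c >= CPB_ASIC_HIGH_CRIT:
        return ("high_crit", "An extremely high temperature",
                "reconfigure the bus away from that CP half, "
                "then blacklist and power off the CSB")
    if temp_c >= CPB_ASIC_HIGH_WARN:
        return ("high_warn", "A high temperature",
                "set the fan speed to high")
    return ("ok", None, None)
```

For example, the 96.13 reading in the platform messages falls in the overlimit band, which matches the "An unsafe high temperature" wording and the immediate power-off seen in the log.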

- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K

Category:

- Keywords:

15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K, esmd, CP, CSB, temperature, asr, blacklist, overlimit, unsafe

INTERNAL SUMMARY:

SUBMITTER: Gino Valencia
BUG REPORT ID: 4799899
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.