SRDB ID | Synopsis | Date |
49982 | Sun Fire[TM] 12K/15K: esmd incorrectly reports which CP it powered off | 14 Jan 2003 |
Status | Issued |
Description |
- Problem Statement/Title:
esmd incorrectly reports which CP it powered off
- Symptoms:
esmd detects an overlimit temperature event on a CP ASIC and reports that it is powering off the CSB at CS0, but it then powers off CP1/CSB1. It therefore appears that esmd powered off the wrong CP. From the platform messages:
Jan 6 15:48:20 2003 EAIprod1-sc0 esmd[5746]: [1909 5363364972458233 ERR DetectorT.cc 234] An unsafe high temperature has been detected on DMX5, located on CP at CP0. CSB at CS0 is being powered off. The temperature detected is 96.13c; should be 5.00c to 70.00c
Jan 6 15:48:20 2003 EAIprod1-sc0 esmd[5746]: [0 5363365098622035 NOTICE SysControl.cc 3358] Component CSB at CS1 has been blacklisted
Jan 6 15:48:21 2003 EAIprod1-sc0 esmd[5746]: [6221 5363365848396031 ERR PowerControl.cc 344] Failed to send a power event with event code 3 for CSB at CS1
Jan 6 15:48:21 2003 EAIprod1-sc0 esmd[5746]: [1930 5363365881079720 NOTICE SysControl.cc 3914] CSB at CS1 has been powered off: ecode=0
SOLUTION SUMMARY:
- Troubleshooting:
From previous Sun[TM] Explorer output, we can see that DMX5 (in this example) on CP1 is in fact showing a temperature well above the others.
showenvironment reports:
CP at CP0  dmx0  DMX0  Temp  33.12  C  34.0 sec  OK
CP at CP0  dmx1  DMX1  Temp  32.88  C  34.0 sec  OK
CP at CP0  dmx3  DMX3  Temp  31.00  C  34.0 sec  OK
CP at CP0  dmx5  DMX5  Temp  30.84  C  34.0 sec  OK
CP at CP0  amx0  AMX0  Temp  36.56  C  34.0 sec  OK
CP at CP0  amx1  AMX1  Temp  34.60  C  34.0 sec  OK
CP at CP0  rmx   RMX   Temp  34.49  C  34.0 sec  OK
CP at CP0  darb  DARB  Temp  31.00  C  34.0 sec  OK
CP at CP1  dmx0  DMX0  Temp  30.84  C  24.6 sec  OK
CP at CP1  dmx1  DMX1  Temp  30.79  C  24.6 sec  OK
CP at CP1  dmx3  DMX3  Temp  28.89  C  24.6 sec  OK
CP at CP1  dmx5  DMX5  Temp  62.00  C  24.6 sec  OK  <= **
CP at CP1  amx0  AMX0  Temp  32.65  C  24.6 sec  OK
CP at CP1  amx1  AMX1  Temp  32.53  C  24.6 sec  OK
CP at CP1  rmx   RMX   Temp  32.50  C  24.6 sec  OK
CP at CP1  darb  DARB  Temp  28.91  C  24.6 sec  OK
After the unsafe temperature is detected and esmd has responded, use the showenvironment, showbus, and showcomponent commands to confirm that the correct CP was deconfigured.
showenvironment reports:
CP at CP1 -- -- -- -- -- OFF
showbus reports:
Location  Data    Address  Response  SOCX
----------------------------------------------------
EX0       CSB0    CSB0     CSB0      0x0007f
EX1       CSB0    CSB0     CSB0      0x0007f
EX2       CSB0    CSB0     CSB0      0x0007f
EX3       CSB0    CSB0     CSB0      0x0007f
EX4       CSB0    CSB0     CSB0      0x0007f
EX5       CSB0    CSB0     CSB0      0x0007f
EX6       CSB0    CSB0     CSB0      0x0007f
EX7       CSB0    CSB0     CSB0      0x00180
EX8       CSB0    CSB0     CSB0      0x00180
EX9       CSB0    CSB0     CSB0      0x00600
EX10      CSB0    CSB0     CSB0      0x00600
EX11      UNCONF  UNCONF   UNCONF    UNCONF
EX12      UNCONF  UNCONF   UNCONF    UNCONF
EX13      UNCONF  UNCONF   UNCONF    UNCONF
EX14      CSB0    CSB0     CSB0      0x0c000
EX15      CSB0    CSB0     CSB0      0x0c000
EX16      CSB0    CSB0     CSB0      0x30000
EX17      CSB0    CSB0     CSB0      0x30000
showcomponent -a or the $SMSETC/config/asr blacklist file reports:
cplane 1 # ESMD Overlimit Temperature 0106.1548.20
DMX5 (in this example) on CP1 overheated, but the warning message is accidentally hard-coded to say CP0/CSB0 regardless of which CP overheated:
Jan 6 15:48:20 2003 EAIprod1-sc0 esmd[5746]: [1909 5363364972458233 ERR DetectorT.cc 234] An unsafe high temperature has been detected on DMX5, located on CP at CP0. CSB at CS0 is being powered off. The temperature detected is 96.13c; should be 5.00c to 70.00c
However, esmd does take the correct action by powering off and blacklisting the correct CP1/CSB1:
Jan 6 15:48:20 2003 EAIprod1-sc0 esmd[5746]: [0 5363365098622035 NOTICE SysControl.cc 3358] Component CSB at CS1 has been blacklisted
Jan 6 15:48:21 2003 EAIprod1-sc0 esmd[5746]: [1930 5363365881079720 NOTICE SysControl.cc 3914] CSB at CS1 has been powered off: ecode=0
- Resolution:
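Disregard the CP/CSB location named in the temperature warning message. Rely instead on the subsequent "has been blacklisted" and "has been powered off" messages, which identify the component esmd actually acted on.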
The bug is that esmd is hard-coded to report CP0/CSB0 in the reporting of the high/extremely high/unsafe temperature condition and the action it is going to take, but it correctly reports which CP/CSB is blacklisted and powered off.
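As an aid when reading platform logs affected by this bug, the following is a minimal sketch of pulling the CSB slot out of the trustworthy "has been powered off" notice instead of the warning text. The function name poweredOffCsb and the regular expression are illustrative assumptions based only on the message wording quoted above; they are not part of SMS.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of SMS): returns the CSB slot number named in
// a "CSB at CSn has been powered off" notice, or -1 if the line is not such a
// notice. The pattern is taken from the log wording quoted in this article.
int poweredOffCsb(const std::string& line) {
    static const std::regex re(R"(CSB at CS(\d+) has been powered off)");
    std::smatch m;
    if (std::regex_search(line, m, re)) {
        return std::stoi(m[1].str());
    }
    return -1;  // e.g. the misleading "is being powered off" warning
}
```

Because the pattern matches only the completed-action wording, the hard-coded "CSB at CS0 is being powered off" warning is deliberately not matched.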
- Summary of part number and patch IDs:
TBD
- References and bug IDs:
esc543156:
The code, src/sms/cmd/esmd/DetectorT.cc:
  // Set up centerplane SMSComponent object for the purpose of messaging.
  cpbID = csb->pos;
  cp.loc = SMSCL_CP;
  cp.type = SMSCT_CP;
! cp.pos = 0;
will be changed to:
  // Set up centerplane SMSComponent object for the purpose of messaging.
  cpbID = csb->pos;
  cp.loc = SMSCL_CP;
  cp.type = SMSCT_CP;
! cp.pos = cpbID;
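To illustrate the effect of this one-line change in isolation, here is a self-contained sketch. The type definitions, the SMSCL_CS and SMSCT_CSB enumerators, and the helper names centerplaneFor and describe are hypothetical stand-ins, since the real SMS declarations are not shown here; only the four assignments mirror the excerpt.

```cpp
#include <string>

// Hypothetical stand-ins for the SMS types; the real declarations are not
// shown in this document.
enum SMSLoc { SMSCL_CP, SMSCL_CS };
enum SMSType { SMSCT_CP, SMSCT_CSB };

struct SMSComponent {
    SMSLoc loc;
    SMSType type;
    int pos;
};

// Build the centerplane object used only for messaging, taking its position
// from the CSB that reported the temperature (the corrected behaviour).
SMSComponent centerplaneFor(const SMSComponent* csb) {
    int cpbID = csb->pos;
    SMSComponent cp;
    cp.loc = SMSCL_CP;
    cp.type = SMSCT_CP;
    cp.pos = cpbID;  // was: cp.pos = 0;
    return cp;
}

// Hypothetical formatter: renders the location in the style the log uses.
std::string describe(const SMSComponent& cp) {
    return "CP at CP" + std::to_string(cp.pos);
}
```

With the original assignment, describe would return "CP at CP0" for either CSB, which matches the misleading message seen in the log.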
- Additional background information:
From the $SMSETC/config/esmd_tuning.txt file:
cpb_asic_high_warn = 70.0
cpb_asic_high_crit = 80.0
cpb_asic_overlimit = 85.0
When these conditions are detected, esmd reports:
high_warn = "A high temperature..."
high_crit = "An extremely high temperature..."
overlimit = "An unsafe high temperature..."
Actions taken by esmd for a CP or CSB high temperature:
high_warn: Turn fan speed to high until temp drops at least 5 deg below high_warn.
high_crit: Use Dynamic Bus Configuration to reconfigure the domains so they don't use that CP half. Then blacklist and power off the CSB. The domain operations should not be interrupted, but the CP bandwidth is decreased.
overlimit: In this case there is no grace time available to use Dynamic Bus Configuration. Just blacklist and power off the CSB; the domains will crash. esmd uses a "0 second timeout" in this case to indicate that it is responsible for crashing the domains.
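The threshold-to-action mapping described above can be sketched as follows. This is an illustration, not esmd's code: the names TempLevel, CpbAsicLimits, classify, and actionFor are hypothetical, and treating a reading equal to a threshold as having reached that level is an assumption, since the exact comparison esmd uses is not stated here. The default limits are the cpb_asic values from the esmd_tuning.txt excerpt.

```cpp
#include <string>

// Hypothetical names; only the numeric limits come from esmd_tuning.txt.
enum class TempLevel { Normal, HighWarn, HighCrit, Overlimit };

struct CpbAsicLimits {
    double high_warn = 70.0;  // cpb_asic_high_warn
    double high_crit = 80.0;  // cpb_asic_high_crit
    double overlimit = 85.0;  // cpb_asic_overlimit
};

// Classify a CP ASIC temperature in degrees C.
// Assumption: a reading equal to a threshold counts as reaching that level.
TempLevel classify(double tempC, const CpbAsicLimits& lim = CpbAsicLimits()) {
    if (tempC >= lim.overlimit) return TempLevel::Overlimit;
    if (tempC >= lim.high_crit) return TempLevel::HighCrit;
    if (tempC >= lim.high_warn) return TempLevel::HighWarn;
    return TempLevel::Normal;
}

// Short summary of the action described in this article for each level.
std::string actionFor(TempLevel level) {
    switch (level) {
        case TempLevel::HighWarn:
            return "set fan speed to high";
        case TempLevel::HighCrit:
            return "reconfigure buses, then blacklist and power off CSB";
        case TempLevel::Overlimit:
            return "blacklist and power off CSB immediately";
        default:
            return "none";
    }
}
```

Under these assumptions, the 96.13 C reading from the log classifies as overlimit, while the 62.00 C reading from the earlier Explorer output is still below the high_warn threshold.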
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords:
15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K, esmd, CP, CSB, temperature, asr, blacklist, overlimit, unsafe
INTERNAL SUMMARY:
SUBMITTER: Gino Valencia
BUG REPORT ID: 4799899
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: