SRDB ID | Synopsis | Date |
48223 | Sun Fire[TM] 12K/15K: Dstop: CP arbiter lockstep consistency check error | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Dstop: CP arbiter lockstep consistency check error

- Symptoms: One or more domains suffer a Dstop with a CP arbiter lockstep consistency check error. A typical Dstop signature is:

    redxl> wfail
    SDI EX00/S0 Master_Stop_Status0[31:0] = 7004000F
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
    SDI EX00/S0 Dstop0[31:0] = 00418040
        Dstop0[16]: D DARB texp requests all Dstop (M)
        Dstop0[22]: D 1E SDI internal CP port requested Dstop
    SDI EX00/S0 CP_Error0[31:0] = 2000A000 Mask = 580067FF
        CPErr0[29]: D 1E CP arbiter lockstep consistency check error (M)
        cp0_{dembusp,texp,unload,demand[1:0]} = 14
        cp1_{dembusp,texp,unload,demand[1:0]} = 00
    FAIL EXB EX0: Dstop/Rstop detected by SDI EX0/S0.
    Primary service FRU is EXB EX0.
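The register values in a signature like the one above can be checked mechanically. The following is a minimal Python sketch, not part of SMS or any Sun tooling. The function and constant names are hypothetical, and the only decoding it relies on is what the wfail output itself states: bit 29 of CP_Error0 is the CP arbiter lockstep consistency check error, and bits 16 and 22 of Dstop0 are the two Dstop requests shown.

```python
# Bit positions are taken from the wfail output in this article.
# All names below are illustrative, not an existing API.
CPERR0_LOCKSTEP_BIT = 29   # CPErr0[29]: CP arbiter lockstep consistency check error
DSTOP0_DARB_TEXP_BIT = 16  # Dstop0[16]: DARB texp requests all Dstop
DSTOP0_CP_PORT_BIT = 22    # Dstop0[22]: SDI internal CP port requested Dstop


def bit_is_set(hex_value: str, bit: int) -> bool:
    """Return True if `bit` is set in a hex register value such as '2000A000'."""
    return bool((int(hex_value, 16) >> bit) & 1)


def is_lockstep_signature(cp_error0: str) -> bool:
    """Return True if CP_Error0 reports the CP arbiter lockstep error bit."""
    return bit_is_set(cp_error0, CPERR0_LOCKSTEP_BIT)


if __name__ == "__main__":
    # Values copied from the example signature.
    print(is_lockstep_signature("2000A000"))
    print(bit_is_set("00418040", DSTOP0_DARB_TEXP_BIT))
    print(bit_is_set("00418040", DSTOP0_CP_PORT_BIT))
```

This only confirms that the signature matches the one described here. It says nothing about which trigger caused the Dstop.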
SOLUTION SUMMARY:
- Troubleshooting: Cases of CP arbiter lockstep Dstops have been observed when one or more domains were executing a 'setkeyswitch standby' operation or a 'setkeyswitch on' operation, or when an expander was being inserted into the chassis. The domain in the setkeyswitch operation is NOT the domain that experiences the Dstop.

1. Note the time stamp of the Dstop. For example:

    % cd EXPLORER_DIR/sf15k
    % ls [A-R]/adm/dump/dsmd.dstop.020615.20*
    A/adm/dump/dsmd.dstop.020615.2006.33 C/adm/dump/dsmd.dstop.020615.2009.19
    B/adm/dump/dsmd.dstop.020615.2008.03 D/adm/dump/dsmd.dstop.020615.2010.22

If multiple Dstops are present, use the time stamp of the earliest Dstop: in this case, the stop for domain A. The time stamp is of the form YYMMDD.HHMM.SS. Domain A's stop occurred at Jun 15 20:06:33 2002.

2. Examine the other logs on the system for evidence of a setkeyswitch operation close to the Dstop time. 'setkeyswitch standby' operations are not logged, so administrator testimony must be used. 'setkeyswitch on' operations are also not logged, but since a POST is launched, the POST log can be used to infer the time of a 'setkeyswitch on'. From explorer data:

    % cd EXPLORER_DIR/sf15k
    % ls [A-R]/adm/post/post020615.20*
    A/adm/post/post020615.2006.34.log C/adm/post/post020615.2019.57.log
    A/adm/post/post020615.2009.36.log C/adm/post/post020615.2027.45.log
    B/adm/post/post020615.2008.03.log C/adm/post/post020615.2036.06.log
    B/adm/post/post020615.2010.18.log C/adm/post/post020615.2050.55.log
    B/adm/post/post020615.2019.51.log D/adm/post/post020615.2010.23.log
    B/adm/post/post020615.2027.48.log D/adm/post/post020615.2012.01.log
    B/adm/post/post020615.2036.15.log F/adm/post/post020615.2006.52.log
    B/adm/post/post020615.2050.56.log F/adm/post/post020615.2024.42.log
    C/adm/post/post020615.2009.19.log F/adm/post/post020615.2038.34.log
    C/adm/post/post020615.2011.02.log F/adm/post/post020615.2051.56.log

'A/adm/post/post020615.2006.34.log' and 'F/adm/post/post020615.2006.52.log' are both very close to the Dstop time.
Checking the contents of these logs, we can infer a 'setkeyswitch on' operation:

    % grep Cmdline A/adm/post/post020615.2006.34.log
    # Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d A -D/var/opt/SUNWSMS/SMS1.2/adm/A/dump/dsmd.dstop.020615.2006.33 -y "DSMD DomainStop Dump"
    % grep Cmdline F/adm/post/post020615.2006.52.log
    # Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d F

Domain A's hpost run is the processing of the Dstop. Domain F's is most likely a 'setkeyswitch on' operation, because 'setkeyswitch on' spawns an 'hpost -d X' process. Also note that the POST log for Domain F is time stamped after the Dstop. This indicates that the Dstop occurred during initial processing of the 'setkeyswitch on', before the hpost was spawned. It is possible that a user executed hpost manually, but this is unlikely.

So, in this case, the evidence aligns with the known causes of CP arbiter lockstep. It may be possible to further narrow the cause to a specific bug. Refer to Additional background information for more details.

3. Identify any missing patches that address CP arbiter lockstep causes. Relevant patches are listed below.

4. In the vast majority of cases, a re-POST (either automatically issued by SMS or manually executed) of the stopped domain(s) clears the CP arbiter lockstep condition and the domain recovers. In rare cases, the condition is persistent. In such cases, use the following workaround:

    a) Set the keyswitch position for ALL affected domains to OFF.
       % setkeyswitch -d X off
    b) Degrade the centerplane to half capacity.
       % setbus -c cs0
    c) Return the centerplane to full capacity.
       % setbus -c cs0,cs1
    d) Set the keyswitch position for ALL affected domains to ON.

Only use this workaround if domains do not recover via normal operations.

- Resolution: Ensure the following patches are applied to SMS 1.2 systems:

    112481-06 (or higher)
    112488-06 (or higher)

    ** Make sure the special install instructions for 112481 are **
    ** performed. If omitted, the system is still vulnerable.    **

There are currently no patches for SMS 1.1.
Upgrading to SMS 1.2 or higher is strongly recommended.

- Summary of part number and patch IDs

    112481-06 (or higher)
    112488-06 (or higher)

- References and bug IDs

    4671526 - libPower needs to clear board test status when boards are reset
    4671531 - libKeyswitch needs to deconfigure L1 boards before the expander
    4699827 - Deconfigure L1 boards should reset Darb ports if necessary
    4712287 - EXB asic LBIST needs to be skipped when CP is in use
    4724771 - LibPower should send events synchronously

- Additional background information:

Note: "Dstop time stamp" in this discussion refers to the time stamp of the _first_ Dstop occurrence, if multiples are present.

Each DARB has 18 individual ports, each port serving a given expander. At start-of-day for a port, if an error is present on that port, the error must be handled and cleared first, which delays the start of the arbitration cycle on that port. The remaining error-free ports are not affected. Because of the delay, the ports do not all start arbitration cycling at the same time. Hence, they fall out of lockstep.

There are three known triggers for CP arbiter lockstep errors:

    1. Incomplete deconfiguration of centerplane ports when a domain goes to keyswitch position STANDBY/OFF (4671526, 4671531, and 4699827)
    2. Failure to deconfigure the centerplane ports (4724771)
    3. LBIST test execution on the SDI ASIC (4712287)

By interpreting the data, it is often possible to determine precisely which trigger and bug was encountered. The first step is to obtain the 'cp-ports' utility from:

    http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/cp-ports-dl.html

This utility examines either a live system or a hardware state dump and outputs the state of the centerplane ports in the relevant centerplane ASICs. For a given port, the state of the DARB masks and the ASIC ports should be consistent (either all "Enabled" or all "Disabled"). For example:

    % cp-ports dsmd.dstop.020615.2010.22
    Using dump file dsmd.dstop.020615.2010.22...
    Collecting DARB 0 information...mask is 01E00...done.
    Collecting DARB 1 information...mask is 01E00...done.
    Collecting AMX 0.0 information...done.
    Collecting AMX 0.1 information...done.
    Collecting AMX 1.0 information...done.
    Collecting AMX 1.1 information...done.
    Collecting RMX 0 information...done.
    Collecting RMX 1 information...done.

         Domain Mask       Port Status       Port Status       Port Status       Port Status
    Port DARB 0   DARB 1   DARB 0   DARB 1   AMX 0.0  AMX 0.1  AMX 1.0  AMX 1.1  RMX 0    RMX 1
    ---- -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
       0 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
       1 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
       2 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
       3 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
       4 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
       5 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
       6 Disabled Disabled Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled
       7 Disabled Disabled Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled
       8 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
       9 Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled
      10 Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled
      11 Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled
      12 Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled  Enabled
      13 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
      14 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
      15 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
      16 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled
      17 Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled Disabled

From this output, we see that ports 6 and 7 are not completely deconfigured: their domain masks are Disabled while their ASIC ports are still Enabled. Thus, trigger #1 is the cause. As further verification, either a nearby POST log starts after the Dstop time stamp (i.e. 'setkeyswitch on') or there is no nearby POST log (i.e. 'setkeyswitch standby'). To determine specifically which bug was encountered, examine the patch level of the system. Perusing the patch README for details on which patch level corrected which bug is left as an exercise for the reader.

If the cp-ports output shows all centerplane ports completely deconfigured, the timing of the Dstop relative to a nearby POST log can be used to distinguish between triggers #2 and #3. As above, if the POST log starts after the Dstop time stamp, this is trigger #2. If the POST log starts before the Dstop time stamp, the Dstop occurred during POST. This is trigger #3.

- Meta-Data/Problem categorization:
    Product/Platform: SF12K/SF15K
    Category:

- Keywords dstop, 15K, 12K, SF15K, SF12K, starcat, CP arbiter lockstep consistency check error
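The trigger-selection rules in the background information can be sketched in code. The following Python sketch is illustrative only and is not part of cp-ports or SMS. All function names are hypothetical. It assumes that the caller has already transcribed the cp-ports table into a mapping from port number to that row's list of "Enabled"/"Disabled" states, and that file names carry a YYMMDD.HHMM.SS stamp as in the listings above. Cases that this article does not cover (consistent ports with no nearby POST log, or identical time stamps) are returned as undetermined rather than guessed.

```python
import re
from datetime import datetime
from typing import Dict, List, Optional

# Matches the YYMMDD.HHMM.SS stamp embedded in names such as
# dsmd.dstop.020615.2006.33 or post020615.2006.34.log.
_STAMP = re.compile(r"(\d{6})\.(\d{4})\.(\d{2})")


def parse_stamp(filename: str) -> datetime:
    """Extract the YYMMDD.HHMM.SS stamp from a dump or POST log file name."""
    match = _STAMP.search(filename)
    if match is None:
        raise ValueError(f"no YYMMDD.HHMM.SS stamp in {filename!r}")
    return datetime.strptime("".join(match.groups()), "%y%m%d%H%M%S")


def inconsistent_ports(port_states: Dict[int, List[str]]) -> List[int]:
    """Return ports whose columns are not all 'Enabled' or all 'Disabled'."""
    return sorted(p for p, states in port_states.items() if len(set(states)) > 1)


def classify_trigger(
    port_states: Dict[int, List[str]],
    dstop_file: str,
    nearby_post_file: Optional[str] = None,
) -> Optional[int]:
    """Return trigger 1, 2, or 3 per the rules above, or None if undetermined."""
    if inconsistent_ports(port_states):
        return 1  # ports not completely deconfigured
    if nearby_post_file is None:
        return None  # not covered by the rules in this article
    dstop_time = parse_stamp(dstop_file)
    post_time = parse_stamp(nearby_post_file)
    if post_time > dstop_time:
        return 2  # POST log starts after the Dstop
    if post_time < dstop_time:
        return 3  # Dstop occurred during POST
    return None  # identical stamps: ordering cannot be inferred


if __name__ == "__main__":
    # Rows 6 and 9 transcribed from the example cp-ports output.
    example = {
        6: ["Disabled"] * 2 + ["Enabled"] * 8,
        9: ["Enabled"] * 10,
    }
    print(inconsistent_ports(example))
    print(classify_trigger(example, "dsmd.dstop.020615.2010.22"))
```

The result is only a starting point. Administrator testimony and the patch level of the system are still needed to confirm the trigger and identify the specific bug.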
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
BUG REPORT ID: 4671526, 4671531, 4699827, 4712287, 4724771
PATCH ID: 112481-06, 112488-06
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: