SRDB ID | Synopsis | Date |
48224 | Sun Fire[TM] 12K/15K: POST: Dstop detected, no reason found | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: POST: Dstop detected, no reason found.

- Symptoms: During a POST, when entering the cpu_lpost or pci_lpost stage, a "no reason found" Dstop is reported. For example:

    % cat post020815.0910.22.log
    ...
    stage cpu_lpost: Test all L1 CPU boards...
    Performing ASIC config with bus config a/d/r = 333...
    Slot0 in domain: 02000
    Slot1 in domain: 02000
    EXBs in use: 00004
    DSTOP Detected for Slot SB13
    Dumping system state
    Boards in dump: master SC
      CPs/CSBs[1:0]: 3 EXB[17:0]: 02000 Slot0[17:0]: 02000 Slot1[17:0]: 00000
    SDI EX13/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
    System state dumped to /var/opt/SUNWSMS/SMS1.2/adm/F/dump/xcstate.020815.0910.40
    Boards in dump: master SC
      CPs/CSBs[1:0]: 3 EXB[17:0]: 02000 Slot0[17:0]: 02000 Slot1[17:0]: 00000
    FAIL Slot SB13: Dstop detected, no reason found.
    Primary service FRU is Slot SB13.

  wfail's analysis is also clean:

    redxl> dumpf load xcstate.020815.0910.40
    Created Thu Aug 15 09:10:40 2002
    By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15 executing as pid=28522
    On ssc name = xc92p2-sc0.SD_Lab.West.Sun.COM
    Domain = 5=F Platform = sun15
    Boards in dump: master SC
      CPs/CSBs[1:0]: 3 EXB[17:0]: 02000 Slot0[17:0]: 02000 Slot1[17:0]: 00000
    Stop on EXB EX13 during stage cpu_lpost
    Created in a Sun Microsystems Inc. internal environment.
    0 errors occurred while creating this dump.
    redxl> wfail
    SDI EX13/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
    No components would be failed based on this state.
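As an illustration of the symptom, the following is a minimal sketch of how a saved POST log could be scanned for this signature. It assumes the log text looks like the excerpt above; the function name find_no_reason_dstops and the two regular expressions are hypothetical helpers written for this example and are not part of SMS or hpost.

```python
import re

# Signature strings copied from the POST log excerpt in this article.
FAIL_RE = re.compile(r"FAIL Slot (\S+): Dstop detected, no reason found")
STAGE_RE = re.compile(r"stage (\w+):")


def find_no_reason_dstops(log_text):
    """Return (stage, slot) pairs for each 'no reason found' Dstop in the log."""
    hits = []
    stage = None  # most recent POST stage seen so far
    for line in log_text.splitlines():
        stage_match = STAGE_RE.search(line)
        if stage_match:
            stage = stage_match.group(1)
        fail_match = FAIL_RE.search(line)
        if fail_match:
            hits.append((stage, fail_match.group(1)))
    return hits


sample = """\
stage cpu_lpost: Test all L1 CPU boards...
DSTOP Detected for Slot SB13
FAIL Slot SB13: Dstop detected, no reason found.
"""
print(find_no_reason_dstops(sample))
```

A hit only shows that the symptom is present. The register checks in the Troubleshooting section are still needed to confirm a match to the bug.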
SOLUTION SUMMARY:
- Troubleshooting: This error has only been seen on the first POST performed after a power on of the domain. To confirm a match to the issue covered here, first examine the SDI indicated by wfail. The only "errors" present are the stop signals issued by the DARB (lines 14, 17):

    01 redxl> shsdi -e 13
    02 Note: Data is displayed from the currently loaded dump file.
    03 SDI EX13/S0 Component ID = 54317049
    04 Master_Stop_Status0[31:0] = 2000000A
    05   MStop0[3,1]: Slot 0 port is DStopped, SDI is Recordstopped.
    06 Master_Stop_Status1[31:0] = EDED0002
    07   0x0D CP1StopExp[4:0] MSS1[20:16]
    08   3 CP1StopSlot[0:1] MSS1[22:21] Dstop is 1st stop
    09   1 CP1StopInfoValid MSS1[23]
    10   0x0D CP0StopExp[4:0] MSS1[28:24]
    11   3 CP0StopSlot[0:1] MSS1[30:29] Dstop is 1st stop
    12   1 CP0StopInfoValid MSS1[31]
    13 Dstop0[31:0] = 00028002
    14   Dstop0[17]: D 1E DARB texp requests Slot0 Dstop (M)
    15 Dstop1[31:0] = 00000000
    16 Recordstop0[31:0] = 00018001
    17   Rstop0[16]: R 1E DARB texp request Recordstop (M)
    18 Recordstop1[31:0] = 00000000
    19 Core_Error0[31:0] = 00000000 Mask = 0051FFFF
    20 Core_Error1[31:0] = 00000000 Mask = FFFFFFFF
    21 Sysreg_Error[31:0] = 00000000 Mask = 780377FF
    22 STB_Error[31:0] = 00000000 Mask = 7F00FFFF
    23 CP_Error0[31:0] = 00000000 Mask = 580067FF
    24 CP_Error1[31:0] = 00000000 Mask = 7FFCFFFF
    25 Slot0_Error0[31:0] = 00000000 Mask = 7000FFFF
    26 Slot0_Error1[31:0] = 00000000 Mask = 31444EBF
    27 Slot0_Error2[31:0] = 00000000 Mask = 7FFCFFFF
    28 Slot1_Error0[31:0] = 00000000 Mask = FFFFFFFF
    29 Slot1_Error1[31:0] = 08000000 Mask = FFFFFFFF
    30 Slot1_Error2[31:0] = 00000000 Mask = FFFFFFFF

  No other SDI errors should be present. Next, look at the DARBs; the only error indication there is generic (line 35):

    31 redxl> shdarb -e 0
    32 Note: Data is displayed from the currently loaded dump file.
    33 DARB C0 (0) Component ID = 34303049
    34 MasterStatus[31:0] = C0000000
    35   MStat[30,31]: Domainstop + Recordstop has occurred.
    36   MStat[28:25]: ErrorCount[3:0] = 0
    37 DARB C0 has ports (expanders) enabled [17:0]: 02004

  DARB 1 contains the same information. Finally, look at the DMX 0.0 port associated with the SDI called out by wfail and check for a Recordstop error (line 44):

    38 redxl> shdmx -e 0 0 13
    39 Note: Data is displayed from the currently loaded dump file.
    40 DMX C0/D0 Component ID = 44413049
    41 1stErr Reg[31:0] = 1A000000 RStop Reg[31:0] = 00000000
    42 EXB EX13 Error_Control[31:0] = 555587FF
    43 EXB EX13 Error = 00400000 DataErr = 00000000
    44   Error[22]: Recordstop
    45 EXB EX13 Fifo_Test_Ctl[31:0]= 00000000 FifoWrPatt[31:0]= 00000000

  If the above three items are true, this is a match to bug 4724771.

- Resolution: Apply patch 112481-06 (or higher).

- Summary of part number and patch IDs: 112481-06 (or higher)

- References and bug IDs:
    4724771 - LibPower should send events synchronously
    4696868 - asic_config_dmx() doesn't clear 1stErr bits correctly
    Escalation 539406

- Additional background information: The root cause of this error is that the centerplane ports for a set of domain resources are not deconfigured when the domain is powered off with a 'setkeyswitch off' from the ON position. As each board is powered off, its corresponding test state in the PCD is cleared. When the last board for a given expander is powered off, the centerplane is deconfigured. The logic used by SMS to deconfigure the centerplane at last-board power off used the PCD as a sanity check: if both boards in the expander were clear in the PCD, the centerplane deconfiguration proceeded. Prior to patch 112481-06, the events sent to the PCD to clear the board test state were asynchronous. A small window of time existed in which the PCD events had not completed processing while SMS was already referencing the board test state to determine whether the centerplane port should be deconfigured. When this happened, the deconfiguration did not occur. CPRE testing showed this window was hit in 9% of all keyswitch ON-->OFF transitions.
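The timing window described above can be illustrated with a small deterministic model. This is only a sketch under stated assumptions: the PCD class, post_clear, drain, and power_off_expander are hypothetical names invented for this example and do not reflect the actual SMS or LibPower code. The asynchronous case here always models the worst case, in which the sanity check runs before any pending event has been processed; on real systems the window was hit only some of the time.

```python
from collections import deque


class PCD:
    """Toy stand-in for the PCD board test-state store (hypothetical API)."""

    def __init__(self, boards):
        self.test_state = {board: "tested" for board in boards}
        self.pending = deque()  # clear events not yet processed

    def post_clear(self, board, synchronous):
        if synchronous:
            # Synchronous event: state is cleared before the call returns.
            self.test_state[board] = None
        else:
            # Asynchronous event: queued, applied at some later time.
            self.pending.append(board)

    def drain(self):
        """Process all queued clear events."""
        while self.pending:
            self.test_state[self.pending.popleft()] = None


def power_off_expander(boards, synchronous):
    """Power off every board; return True if the centerplane port is deconfigured."""
    pcd = PCD(boards)
    for board in boards:
        pcd.post_clear(board, synchronous)
    # Sanity check: deconfigure only if every board is already clear in the PCD.
    deconfigured = all(pcd.test_state[board] is None for board in boards)
    pcd.drain()  # asynchronous events complete, but too late for the check
    return deconfigured


print(power_off_expander(["slot0", "slot1"], synchronous=False))  # False
print(power_off_expander(["slot0", "slot1"], synchronous=True))   # True
```

In this model, sending the events synchronously closes the window because the sanity check can no longer observe stale board test state.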
  The second piece to this puzzle is what happens when power is returned to the components. Whenever a component first powers up, signals driven out of a given ASIC are undefined. Testing showed that when the SDI was powered up, it would assert a "Header (Tslot) parity error" in DARB 1. This error constitutes a Dstop. Occasionally, the SDIs could also induce an Rstop error in DMX 0.0 and DMX 0.5.

  If a centerplane port is not deconfigured AND the errors described above are asserted at expander power on, a "no reason found" Dstop occurs. Since the DMX ports are not deconfigured, the Rstop signal is raised to the DARBs, so the DARB has both an Rstop and a Dstop flagged. When POST begins, one of its first tasks is to clear any errors present for the domain resources being configured. However, because of bug 4696868, the errors in the DMX are not cleared. As a result, the DARB maintains the record of both Dstop and Rstop, and finally a Dstop is noted by POST. Since all other errors are cleared, the Dstop appears to have "no reason".

  For completeness, the Dstop is noted in cpu_lpost or pci_lpost because it is in the lpost stages that Dstop detection is enabled. In cpu_lpost, detection is enabled for all expanders that contain a CPU board that is part of the domain being configured. Similarly, in pci_lpost, detection is enabled for all expanders that contain an I/O board for the domain being configured. Precisely which phase the Dstop is detected in is a function of the domain configuration.

- Meta-Data/Problem categorization:
    Product/Platform: SF12K/15K
    Category: -
    Keywords: dstop, starcat, 15K, 12K, SF15K, SF12K, xcstate, Dstop detected, no reason found
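As a closing illustration of the three Troubleshooting checks, the bit tests can be sketched as follows. The register values are the ones shown in the shsdi, shdarb, and shdmx output above; the function names bit_set and matches_signature are hypothetical and written only for this example. The sketch does not verify that no other SDI errors are present, which still has to be checked by hand.

```python
def bit_set(value, bit):
    """Return True if the given bit position is set in a register value."""
    return (value >> bit) & 1 == 1


def matches_signature(dstop0, recordstop0, darb_master_status, dmx_error):
    """Apply the three register checks from the Troubleshooting section."""
    # SDI: Dstop0[17] and Rstop0[16] are the DARB-requested stop signals.
    sdi_ok = bit_set(dstop0, 17) and bit_set(recordstop0, 16)
    # DARB: MStat[30,31] report the generic Domainstop + Recordstop.
    darb_ok = bit_set(darb_master_status, 30) and bit_set(darb_master_status, 31)
    # DMX: Error[22] is the Recordstop on the expander port.
    dmx_ok = bit_set(dmx_error, 22)
    return sdi_ok and darb_ok and dmx_ok


# Values taken from the dump shown in this article.
print(matches_signature(0x00028002, 0x00018001, 0xC0000000, 0x00400000))
```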
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
BUG REPORT ID: 4724771, 4696868
PATCH ID: 112481-06
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: