SRDB ID   Synopsis   Date
48224   Sun Fire[TM] 12K/15K: POST: Dstop detected, no reason found   31 Oct 2002

Status Issued

Description
- Problem Statement: 

POST: Dstop detected, no reason found.

- Symptoms:

	During a POST, when entering the cpu_lpost or pci_lpost stage, a "no 
	reason found" Dstop is reported. For example:

	   % cat post020815.0910.22.log
	   ...
	   stage cpu_lpost: Test all L1 CPU boards...
	   Performing ASIC config with bus config a/d/r = 333...
	   	Slot0 in domain: 02000
	   	Slot1 in domain: 02000
	   	    EXBs in use: 00004
	   DSTOP Detected for Slot SB13
	   Dumping system state
	   Boards in dump: master SC    CPs/CSBs[1:0]: 3
	   	  EXB[17:0]: 02000
	   	Slot0[17:0]: 02000
	   	Slot1[17:0]: 00000
	   SDI EX13/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
	   System state dumped to /var/opt/SUNWSMS/SMS1.2/adm/F/dump/xcstate.020815.0910.40
	   Boards in dump: master SC    CPs/CSBs[1:0]: 3
		  EXB[17:0]: 02000
		Slot0[17:0]: 02000
		Slot1[17:0]: 00000
  	   FAIL Slot SB13: Dstop detected, no reason found.
	   Primary service FRU is Slot SB13.

	wfail's analysis is also clean:

	   redxl> dumpf load xcstate.020815.0910.40
	   Created Thu Aug 15 09:10:40 2002
	   By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15  executing as pid=28522
	   On ssc name =  xc92p2-sc0.SD_Lab.West.Sun.COM
	   Domain =  5=F    Platform = sun15
	   Boards in dump: master SC    CPs/CSBs[1:0]: 3
	             EXB[17:0]: 02000
	           Slot0[17:0]: 02000
	           Slot1[17:0]: 00000
	   Stop on EXB EX13 during stage cpu_lpost
	   Created in a Sun Microsystems Inc. internal environment.
	   0 errors occurred while creating this dump.
	   redxl> wfail
	   SDI EX13/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
	   	   No components would be failed based on this state.
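
	The hex board masks in the logs above (for example "EXB[17:0]: 02000")
	can be read as one bit per expander. The following is a minimal sketch,
	assuming bit n of the mask corresponds to expander or slot number n,
	which is consistent with the log naming SB13 and EX13. The function
	name is hypothetical and is not part of SMS or redx.

```python
def slots_in_mask(mask_hex, width=18):
    """Return the bit positions set in a hex mask such as 'EXB[17:0]: 02000'.

    Assumption: bit n stands for expander/slot n (not confirmed by the SRDB text).
    """
    value = int(mask_hex, 16)
    return [bit for bit in range(width) if value & (1 << bit)]


print(slots_in_mask("02000"))  # [13]
print(slots_in_mask("00004"))  # [2]
```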
            

SOLUTION SUMMARY:
- Troubleshooting:

	This error has only been seen on the first POST performed after a power
	on of the domain. To confirm a match to the issue covered here, first 
	examine the SDI indicated by the wfail. The only "errors" present are 
	the stop signals issued by the DARB (lines 14, 17):

	   01 redxl> shsdi -e 13
	   02 Note: Data is displayed from the currently loaded dump file.
	   03 SDI EX13/S0    Component ID = 54317049
	   04          Master_Stop_Status0[31:0] = 2000000A
	   05         MStop0[3,1]: Slot 0 port is DStopped, SDI is Recordstopped.
	   06          Master_Stop_Status1[31:0] = EDED0002
	   07         0x0D   CP1StopExp[4:0]        MSS1[20:16]    
	   08            3   CP1StopSlot[0:1]       MSS1[22:21]    Dstop is 1st stop
	   09            1   CP1StopInfoValid       MSS1[23]       
	   10         0x0D   CP0StopExp[4:0]        MSS1[28:24]    
	   11            3   CP0StopSlot[0:1]       MSS1[30:29]    Dstop is 1st stop
	   12            1   CP0StopInfoValid       MSS1[31]       
	   13          Dstop0[31:0] = 00028002
	   14         Dstop0[17]: D 1E DARB texp requests Slot0 Dstop (M)
	   15          Dstop1[31:0] = 00000000
	   16          Recordstop0[31:0]  = 00018001
	   17         Rstop0[16]: R 1E DARB texp request Recordstop (M)
	   18          Recordstop1[31:0]  = 00000000
	   19          Core_Error0[31:0]  = 00000000  Mask = 0051FFFF
	   20          Core_Error1[31:0]  = 00000000  Mask = FFFFFFFF
	   21          Sysreg_Error[31:0] = 00000000  Mask = 780377FF
	   22          STB_Error[31:0]    = 00000000  Mask = 7F00FFFF
	   23          CP_Error0[31:0]    = 00000000  Mask = 580067FF
	   24          CP_Error1[31:0]    = 00000000  Mask = 7FFCFFFF
	   25          Slot0_Error0[31:0] = 00000000  Mask = 7000FFFF
	   26          Slot0_Error1[31:0] = 00000000  Mask = 31444EBF
	   27          Slot0_Error2[31:0] = 00000000  Mask = 7FFCFFFF
	   28          Slot1_Error0[31:0] = 00000000  Mask = FFFFFFFF
	   29          Slot1_Error1[31:0] = 08000000  Mask = FFFFFFFF
	   30          Slot1_Error2[31:0] = 00000000  Mask = FFFFFFFF
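
	As a cross-check of the decode on lines 07 through 12, the fields of
	Master_Stop_Status1 can be extracted by hand from the raw value
	EDED0002. This is a sketch only; the bit ranges are the ones printed by
	shsdi above, and the helper name "field" is hypothetical.

```python
def field(value, hi, lo):
    """Extract bits hi..lo (inclusive) from a 32-bit register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


mss1 = 0xEDED0002  # Master_Stop_Status1 from line 06 of the shsdi output

decoded = {
    "CP1StopExp": field(mss1, 20, 16),
    "CP1StopSlot": field(mss1, 22, 21),
    "CP1StopInfoValid": field(mss1, 23, 23),
    "CP0StopExp": field(mss1, 28, 24),
    "CP0StopSlot": field(mss1, 30, 29),
    "CP0StopInfoValid": field(mss1, 31, 31),
}

for name, val in decoded.items():
    print(f"{name:18s} = {val:#x}")
```

	Both StopExp fields come out as 0xd (decimal 13), matching expander
	EX13 in the wfail output.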

	No other SDI errors should be present. Next, look at the DARBs; the
	only error indication should be the generic one (line 35):

	   31 redxl> shdarb -e 0
	   32 Note: Data is displayed from the currently loaded dump file.
	   33 DARB C0 (0)  Component ID = 34303049
	   34       MasterStatus[31:0] = C0000000
	   35         MStat[30,31]: Domainstop + Recordstop has occurred.
	   36         MStat[28:25]: ErrorCount[3:0] = 0
	   37       DARB C0 has ports (expanders) enabled  [17:0]: 02004

	DARB 1 contains the same information. Finally, look at DMX 0.0's port
	associated with the SDI called out in wfail for a Recordstop error
	(line 44):

	   38 redxl> shdmx -e 0 0 13
	   39 Note: Data is displayed from the currently loaded dump file.
	   40 DMX C0/D0    Component ID = 44413049
	   41             1stErr Reg[31:0] = 1A000000  RStop Reg[31:0] = 00000000
	   42             EXB EX13  Error_Control[31:0] = 555587FF
	   43             EXB EX13  Error = 00400000  DataErr = 00000000
	   44         Error[22]: Recordstop
	   45             EXB EX13  Fifo_Test_Ctl[31:0]= 00000000  FifoWrPatt[31:0]= 00000000

	If the above three items are true, this is a match to bug 4724771.
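
	The three checks above can be sketched as a single bit test on the
	register values taken from the dump. This is an illustration under
	stated assumptions, not a redx feature: the function names are
	hypothetical, the bit positions are the ones shown in the shsdi,
	shdarb and shdmx output, and the sketch only tests the positive
	indications. Confirming that no other errors are present still has
	to be done by reading the dump.

```python
def bit(value, n):
    """Return bit n of a register value (0 or 1)."""
    return (value >> n) & 1


def matches_signature(sdi_dstop0, sdi_rstop0, darb_master_status, dmx_port_error):
    """Test the positive indications described in the troubleshooting steps."""
    # SDI: DARB-requested Dstop (Dstop0[17]) and Recordstop (Rstop0[16]).
    sdi_ok = bit(sdi_dstop0, 17) and bit(sdi_rstop0, 16)
    # DARB: generic "Domainstop + Recordstop has occurred" (MStat[30,31]).
    darb_ok = bit(darb_master_status, 30) and bit(darb_master_status, 31)
    # DMX port for the expander: Error[22] Recordstop.
    dmx_ok = bit(dmx_port_error, 22)
    return bool(sdi_ok and darb_ok and dmx_ok)


# Values copied from the example dump in this document.
print(matches_signature(0x00028002, 0x00018001, 0xC0000000, 0x00400000))
```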

- Resolution:

	Apply patch 112481-06 (or higher).

- Summary of part numbers and patch IDs

	112481-06 (or higher)

- References and bug IDs

	4724771 - LibPower should send events synchronously
	4696868 - asic_config_dmx() doesn't clear 1stErr bits correctly
	Escalation 539406

- Additional background information:

	The root cause of this error is that the centerplane ports for a set
	of domain resources are not deconfigured when the domain is powered
	off with a 'setkeyswitch off' from the ON position. In the intended
	sequence, as each board is powered off, its corresponding test state
	in the PCD is cleared. When the last board for a given expander is
	powered off, the centerplane is deconfigured.

	The logic used by SMS to deconfigure the centerplane at last-board
	power off used the PCD as a sanity check. If both boards in the
	expander were clear in the PCD, the centerplane deconfiguration 
	proceeded.

	Prior to patch 112481-06, the events sent to the PCD to clear the
	board test state were asynchronous. A small window of time existed
	in which the PCD events had not completed processing while SMS was
	already referencing the board test state to determine whether the
	centerplane port should be deconfigured. When this happened, the
	deconfiguration was skipped. CPRE testing showed this window was
	hit in 9% of all keyswitch ON-->OFF transitions.
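
	The race described above can be illustrated with a toy model. This is
	a sketch only: the class, the method names and the board name IO13 are
	hypothetical and do not reflect the actual SMS or LibPower code. It
	forces the worst-case ordering (the check runs before any queued event
	is applied), whereas in the field the window was hit only part of the
	time.

```python
from collections import deque


class PCD:
    """Toy stand-in for the PCD board test state (hypothetical API)."""

    def __init__(self, boards):
        self.tested = {b: True for b in boards}
        self.pending = deque()

    def send_async(self, board):
        # Event is queued; the caller continues before it is applied.
        self.pending.append(board)

    def send_sync(self, board):
        # Event is applied before the call returns.
        self.tested[board] = False

    def drain(self):
        # Apply all queued clear events.
        while self.pending:
            self.tested[self.pending.popleft()] = False


def power_off_expander(pcd, boards, send):
    """Clear each board, then deconfigure only if the PCD shows all clear."""
    for board in boards:
        send(board)
    deconfigured = not any(pcd.tested[b] for b in boards)
    pcd.drain()  # in the async case, events land only after the check
    return deconfigured


boards = ["SB13", "IO13"]
print(power_off_expander(PCD(boards), boards, None) if False else "")

pcd_async = PCD(boards)
print(power_off_expander(pcd_async, boards, pcd_async.send_async))  # False

pcd_sync = PCD(boards)
print(power_off_expander(pcd_sync, boards, pcd_sync.send_sync))  # True
```

	In the asynchronous case the sanity check still sees the boards as
	tested, so the centerplane port is left configured. Sending the events
	synchronously closes the window.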

	The second piece of this puzzle is what happens when power is returned
	to the components. Whenever a component first powers up, signals driven
	out of a given ASIC are undefined. Testing showed that when the SDI
	was powered up, it would assert a "Header (Tslot) parity error" in
	DARB 1. This error constitutes a Dstop. Occasionally, the SDIs could
	also induce an Rstop error in DMX 0.0 and DMX 0.5.

	If a centerplane port is not deconfigured AND the errors described
	above are asserted at expander power on, a "no reason found" Dstop
	occurs. Since the DMX ports are not deconfigured, the Rstop signal
	is raised to the DARBs. So the DARB has both a Rstop and Dstop flagged.
	When POST begins, one of its first tasks is to clear any errors 
	present for the domain resources being configured. However, because
	of Bug 4696868, the errors in the DMX are not cleared. As a result,
	the DARB maintains the record of both Dstop and Rstop, and finally
	a Dstop is noted by POST. Since all other errors are cleared, the
	Dstop appears to have "no reason".
	
	For completeness, the Dstop is noted in cpu_lpost or pci_lpost because
	Dstop detection is enabled in the lpost stages. In cpu_lpost,
	detection is enabled for all expanders that contain a CPU board that
	is part of the domain being configured. Similarly, for pci_lpost,
	detection is enabled for all expanders that contain an I/O board for
	the domain being configured. Precisely which phase the Dstop is 
	detected in is a function of the domain configuration.

- Meta-Data/Problem categorization:

Product/Platform: SF12K/15K
Category:

- Keywords

dstop, starcat, 15K, 12K, SF15K, SF12K, xcstate, Dstop detected, no reason found            

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
BUG REPORT ID: 4724771, 4696868
PATCH ID: 112481-06
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.