SRDB ID   Synopsis   Date
48207   Sun Fire[TM] 12K/15K: Dstop: AMX 0-3 flow control didn't arrive simultaneously   31 Oct 2002

Status Issued

Description
- Problem Statement/Title:

	Dstop: AMX 0-3 hio flow control didn't arrive simultaneously

- Symptoms:

	'wfail' output reports something similar to the following:

	   01  redxl> dumpf load dsmd.dstop.020204.1016.58
	   02  Created Mon Feb  4 10:43:33 2002
	   03  By hpost v. 1.1 Generic 112099-05 Nov 27 2001 12:41:09  executing as pid=2464
	   04  On ssc name =  mayflower-sc0.
	   05  Domain =  1=B = pilgrim2    Platform = mayflower
	   06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
	   07            EXB[17:0]: 001E4
	   08          Slot0[17:0]: 001E4
	   09          Slot1[17:0]: 00100
	   10  -D option, -d
	   11  "DSMD DomainStop Dump"
	   12  0 errors occurred while creating this dump.
	   13  redxl> wfail
	   14  SDI EX02/S0  Master_Stop_Status0[31:0] = A000004F
	   15          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
	   16  SDI EX02/S0  Dstop0[31:0] = 12028200
	   17          Dstop0[17]: D    DARB texp requests Slot0 Dstop (M)
	   18          Dstop0[25]: D 1E AXQ requests all Dstop (M)
	   19          Dstop0[28]: D    Slot0 asserted Error, enabled to cause Dstop (M)
	   20  AXQ EX02 ( 2) Error_Flag_03[31:0] = 10009000  Mask = 21005EFF
	   21          Err3[28]: D 1E AMX data ECC uncorrectable error
	   22  FAIL EXB EX2:  Dstop/Rstop detected by AXQ.
	   23  Primary service FRU is EXB EX2.
	   24  SDI EX05/S0  Master_Stop_Status0[31:0] = 8004004F
	   25          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
	   26  SDI EX05/S0  Dstop0[31:0] = 12028200
	   27          Dstop0[17]: D    DARB texp requests Slot0 Dstop (M)
	   28          Dstop0[25]: D 1E AXQ requests all Dstop (M)
	   29          Dstop0[28]: D    Slot0 asserted Error, enabled to cause Dstop (M)
	   30  AXQ EX05 ( 5) Error_Flag_02[31:0] = 04008400  Mask = 0000FFFF
	   31          Err2[26]: D 1E AMX 0-3 hs flow control didn't arrive simultaneously
	   32  FAIL EXB EX5:  Dstop/Rstop detected by AXQ.
	   33  Primary service FRU is EXB EX5.
	   34  SDI EX06/S0  Master_Stop_Status0[31:0] = A004000F
	   35          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
	   36  SDI EX06/S0  Dstop0[31:0] = 02028200
	   37          Dstop0[17]: D    DARB texp requests Slot0 Dstop (M)
	   38          Dstop0[25]: D 1E AXQ requests all Dstop (M)
	   39  AXQ EX06 ( 6) Error_Flag_03[31:0] = 10009000  Mask = 21005EFF
	   40          Err3[28]: D 1E AMX data ECC uncorrectable error
	   41  FAIL EXB EX6:  Dstop/Rstop detected by AXQ.
	   42  Primary service FRU is EXB EX6.
	   43  SDI EX07/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
	   44  SDI EX08/S0: All SDI is DStopped and RStopped,         requested by DARB.
            

SOLUTION SUMMARY:
- Troubleshooting:

	The dump header tells us that this Dstop was generated by dsmd (lines 10,11)
	while a domain was active. This is also evident from the dump file name:
	dsmd.dstop files are created by dsmd as part of an ASR. Walking the
	error chain:

	 - The SDI on EX2 calls for Dstop as directed by AXQ2 (line 18).
	 - The AXQ on EX2 detected an uncorrectable ECC error (line 21).
	 - The SDI on EX5 calls for Dstop as directed by AXQ5 (line 28).
	 - The AXQ on EX5 detected a flow control issue from the centerplane (line 31).
	 - The SDI on EX6 calls for Dstop as directed by AXQ6 (line 38).
	 - The AXQ on EX6 detected an uncorrectable ECC error (line 40).
	 - Expanders 2, 5 and 6 are FAILed from the configuration (lines 22,32,41).
	 - Expanders 2, 5 and 6 are named as FRUs (lines 23,33,42).

	That is a large amount of hardware to replace, so more analysis is needed.
	The key signature is the AMX flow control error (line 31). Several bugs
	exist in SMS that result in this type of Dstop. For all of these bugs (listed
	below), a key factor was that a POST on another domain was in progress at
	the time of the Dstop.
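
	The bit positions that wfail annotates can be cross-checked against the raw
	register values in the dump. The following is a minimal Python sketch, not
	part of redx or SMS; the helper name set_bits is hypothetical. It lists
	which bits are set in a 32-bit register value. Bits that wfail does not
	annotate are left uninterpreted here.

```python
def set_bits(value):
    """Return the positions of all bits set in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


# Register values copied from the wfail output above.
dstop0_ex05 = 0x12028200        # SDI EX05/S0 Dstop0 (line 26)
error_flag_02_ex05 = 0x04008400  # AXQ EX05 Error_Flag_02 (line 30)

# Bits 17, 25 and 28 are the ones wfail annotates for Dstop0.
print(set_bits(dstop0_ex05))

# Bit 26 is the "AMX 0-3 flow control" signature reported as Err2[26].
print(26 in set_bits(error_flag_02_ex05))
```

	The same check applied to Error_Flag_03 on EX2 and EX6 (10009000) shows
	bit 28 set, matching the Err3[28] ECC annotation on lines 21 and 40.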

	From the name of the dump file, we know the Dstop was detected at 10:16:58
	on Feb 4 2002 (line 01). Note that this is different from the time when the
	dump is created (line 02). Creation of the dump file (i.e., hpost -D runs
	and collects the dump) can be significantly delayed because of either
	POST throttling by tmd or split expander configurations (per-expander locks).
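
	The detection time can be extracted from the dump file name mechanically.
	The sketch below assumes the last three dot-separated fields of the name
	are YYMMDD, HHMM and SS, which is inferred from this one example (020204
	matching the Feb 4 2002 creation date on line 02) rather than from any
	SMS specification. The function name is hypothetical.

```python
from datetime import datetime


def dstop_time_from_name(dump_name):
    """Parse the Dstop detection time out of a dsmd.dstop dump file name.

    Assumed layout: <prefix>.YYMMDD.HHMM.SS
    """
    fields = dump_name.split(".")
    if len(fields) < 3:
        raise ValueError("unexpected dump file name: %r" % dump_name)
    stamp = "".join(fields[-3:])  # e.g. "020204" + "1016" + "58"
    return datetime.strptime(stamp, "%y%m%d%H%M%S")


detected = dstop_time_from_name("dsmd.dstop.020204.1016.58")
print(detected.isoformat())
```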

	Next, look through explorer to determine whether another POST was started
	prior to the Dstop. In this case, a POST was initiated at 10:15:01 Feb 4 2002.
	Also check the POST patch level from the dump header (line 03) or explorer.
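
	The timing comparison is simple arithmetic, sketched below in Python. The
	30-minute window is an arbitrary illustration, not a documented SMS value;
	what matters is whether the other domain's POST was still running when the
	Dstop was detected, which has to be confirmed from the logs.

```python
from datetime import datetime, timedelta


def post_preceded_dstop(post_start, dstop_time, window=timedelta(minutes=30)):
    """Return True if a POST started at or before the Dstop, within window.

    The window is a hypothetical cut-off for "recent enough to suspect".
    """
    gap = dstop_time - post_start
    return timedelta(0) <= gap <= window


# Dstop time as encoded in the dump file name (dsmd.dstop.020204.1016.58).
dstop_time = datetime(2002, 2, 4, 10, 16, 58)
# POST start found in explorer: 10:15:01 on the same day.
post_start = dstop_time.replace(hour=10, minute=15, second=1)

print((dstop_time - post_start).total_seconds())  # 117.0
print(post_preceded_dstop(post_start, dstop_time))
```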

	In this example, bug 4712287 was encountered. 

- Resolution:

	Ensure patches and workarounds (if applicable) are up to date.

	If all patches/workarounds were in place, use the Mstop registers in the SDI
	to identify the first SDI calling for stop (see below). Initiate hardware
	actions with that boardset.

	Note that known hardware causes of AMX flow control have had the following
	characteristics:
	   o All first errors are isolated to a single boardset (expander)
	   o The reporting AXQ also logs an AMX flow control/nack parity error

- Summary of part number and patch IDs:

	Bugs
	  4505473 - when 2 domains share an expander, hpost -Q on one causes the other to DStop
	  4644905 - POST of one domain Dstops another - AMXs lose lockstep
	  4712287 - EXB asic LBIST needs to be skipped when CP is in use.

	SMS 1.1
	  112080-05 (or higher)
	  Use "no_asic_lbist	axq	# BugiD 4712287" .postrc workaround

	SMS 1.2
	  112488-07 (or higher)
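
	Checking "(or higher)" amounts to comparing the revision suffix of the same
	base patch. A minimal Python sketch follows; the function names are
	hypothetical, and the installed patch strings would have to come from the
	SC itself (for example from explorer output).

```python
def parse_patch(patch_id):
    """Split a patch ID such as '112080-05' into (base, revision)."""
    base, rev = patch_id.strip().split("-")
    return base, int(rev)


def meets_minimum(installed, minimum):
    """Return True if installed is the same base patch at >= the revision."""
    installed_base, installed_rev = parse_patch(installed)
    minimum_base, minimum_rev = parse_patch(minimum)
    if installed_base != minimum_base:
        # Different base IDs are different patches; revisions do not compare.
        raise ValueError("cannot compare %s with %s" % (installed, minimum))
    return installed_rev >= minimum_rev


print(meets_minimum("112080-05", "112080-05"))
print(meets_minimum("112488-06", "112488-07"))
```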

- References and bug IDs

	SunSolve Article 48122	

- Additional background information:

	As seen in this dump, it is possible for multiple SDIs to claim first error
	system wide. This is because the interval between one component calling for
	Dstop and the SDIs receiving the broadcast from the DARBs is long enough
	for other errors to be recorded in the SDIs.

	When multiple SDIs all report first errors, the Mstop0 registers in the 
	master SDI can be used to determine the first DARB stop message sent. The
	message indicates which expander sourced the request. Thus, it implies
	which expander was the first to call for stop.

	For example:

	   45  redxl> shsdi -e 2
	   46  Note: Data is displayed from the currently loaded dump file.
	   47  SDI EX02/S0    Component ID = 54317049
	   48           Master_Stop_Status0[31:0] = A000004F
	   49          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
	   50           Master_Stop_Status1[31:0] = E2E2000E
	   51          0x02   CP1StopExp[4:0]        MSS1[20:16]    
	   52             3   CP1StopSlot[0:1]       MSS1[22:21]    Dstop is 1st stop
	   53             1   CP1StopInfoValid       MSS1[23]       
	   54          0x02   CP0StopExp[4:0]        MSS1[28:24]    
	   55             3   CP0StopSlot[0:1]       MSS1[30:29]    Dstop is 1st stop
	   56             1   CP0StopInfoValid       MSS1[31]       

	redx parses the stop message. Lines 51 and 54 indicate which expander is
	calling for the stop. Here, it is EX2. The other SDIs in the domain
	configuration can be examined in the same way. Typically they all agree,
	and attention can then focus on the requesting boardset.
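
	The field breakdown that redx prints can be reproduced by hand from the raw
	Master_Stop_Status1 value. The Python sketch below uses only the bit ranges
	shown in the shsdi output above (lines 51 to 56); the function name is
	hypothetical and the low 16 bits are left undecoded because the listing
	does not describe them.

```python
def decode_mss1(value):
    """Extract the CP0/CP1 stop fields from a Master_Stop_Status1 value."""

    def field(high, low):
        width = high - low + 1
        return (value >> low) & ((1 << width) - 1)

    return {
        "CP0StopInfoValid": field(31, 31),
        "CP0StopSlot": field(30, 29),
        "CP0StopExp": field(28, 24),
        "CP1StopInfoValid": field(23, 23),
        "CP1StopSlot": field(22, 21),
        "CP1StopExp": field(20, 16),
    }


fields = decode_mss1(0xE2E2000E)  # value from line 50
for name, val in fields.items():
    print("%-18s %d" % (name, val))
```

	For E2E2000E both centerplanes report expander 2 with the info-valid bit
	set, which agrees with lines 51 and 54 of the listing.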

- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, starcat, dstop, AMX 0-3 flow control didn't arrive simultaneously

            

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
BUG REPORT ID: 4712287, 4505473, 4644905
PATCH ID: 112080-05, 112488-07
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.