SRDB ID   Synopsis   Date
48233   Sun Fire[TM] 12K/15K: Rstop: Mtag CEs Corrected by SDI (M)   31 Oct 2002

Status Issued

Description
- Problem Statement: 

	Rstop: Mtag CEs Corrected by SDI (M).

- Symptoms: 

	redx wfail, shsdc and shdx command output reports the following 
	failure signature: 

	   01  redxl> dumpf load dsmd.rstop.020927.0951.39
	   02  Created Fri Sep 27 09:51:39 2002
	   03  By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15  executing as pid=18215
	   04  On ssc name =  swmtftk01.
	   05  Domain =  0=A = MES    Platform = f15k4
	   06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
	   07            EXB[17:0]: 00007
	   08          Slot0[17:0]: 00007
	   09          Slot1[17:0]: 00007
	   10  -D option, -d
	   11  "DSMD RecordStop Dump"
	   12  0 errors occurred while creating this dump.
	   13  redxl> wfail
	   14  SDI EX00/S0: SDI is RStopped, requested by DARB.
	   15  SDI EX01/S0  Master_Stop_Status0[31:0] = 70000308
	   16          MStop0[3]: SDI is Recordstopped
	   17  SDI EX01/S0  Recordstop0[31:0]  = 04218020
	   18          Rstop0[16]: R    DARB texp request Recordstop (M)
	   19          Rstop0[21]: R 1E SDI internal Slot0 port request Recordstop
	   20          Rstop0[26]: R    Slot0 asserted EccErr, enabled to cause Rstop (M)
	   21  SDI EX01/S0  Slot0_Error1[31:0] = 00408040  Mask = 31444EBF
	   22          S0Err1[22]: R 1E Slot0 quadword 0 Mtag correctable ECC error (M)
	   23              {q0_mtag[2:0],q0_mtag_ecc[3:0]} = 02. Calc'd Syndrome[3:0] = 2
	   24                  CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
	   25              {q1_mtag[2:0],q1_mtag_ecc[3:0]} = 00. Calc'd Syndrome[3:0] = 0
	   26  DX SB1/DX2 reports from-port MTAG ECC syndrome 2 (port 0/1) matching
	   27          one of the MTAG syndromes recorded by the SDI.
	   28  SDI detected an MTAG ECC error from Slot SB1, and a non-1st EccErr asserted
	   29          from the same slot. There is also a from-port MTAG ecc error
	   30          detected by a DX on this slot, with the same syndrome.
	   31          Analysis will assume the actual first error is that recorded
	   32          by the DX, for better fault identification.
	   33  EPLD SB01  Ecc_Err:   Mask= F7  Err= 08  SDC reports EccErr
	   34  SDC SB01  EccStatus[31:0] = 0000E041
	   35          EccSt[15]: Safari port 0/1 Ecc error logged.
	   36            Received by DXs from local Safari port 1, read operation.
	   37  DX SB01/DX2  Ecc_Syndrome[31:0] = 00000800
	   38    Syndr[13:10]: P01 Mtag:   2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
	   39    Syndr[   15]: P01 Direction: 0: Safari port to DX (Incoming)
	   40  ECC correctable errors detected from Processor Port SB1/P1, no corresponding
	   41          parity error in DXs or DCDSs.
	   42          Assuming the error originated in memory on this port.
	   43          Mtag syndrome 2 is CE check bit 1 (bit 4 of [6:0], 141 of [143:0]).
	   44          This bit is in one of Dimm SB1/P1/B0/D2 or Dimm SB1/P1/B1/D2.
	   45  MTag CEs should be corrected by the SDIs, so there will be no resulting
	   46          processor trap with address information that would permit
	   47          bank attribution. Error isolation to the bank is not possible.
	   48          We must therefore FAIL all memory on this port.
	   49  FAIL All memory on Port SB1/P1:  Rstop detected by DXs/SDC.
	   50  Primary service FRU is All memory on Port SB1/P1.
	   51  Secondary service FRU is Slot SB1.
	   52  SDI EX02/S0: SDI is RStopped, requested by DARB.
	   53  DARB C0: enabled ports (expanders)          [17:0]: 03E3F
	   54  DARB C0: exps request Rstop                 [17:0]: 00002
	   55  DARB C0: other darb req Rstop for exps      [17:0]: 00002
	   56  DARB C1: enabled ports (expanders)          [17:0]: 03E3F
	   57  DARB C1: exps request Rstop                 [17:0]: 00002
	   58  DARB C1: other darb req Rstop for exps      [17:0]: 00002
	   59  redxl> shsdc -e 1 0
	   60  Note: Data is displayed from the currently loaded dump file.
	   61  SDC SB01   Component ID = 416C107D
	   62          Lockstep_Err[19:0] = 00000
	   63          L2_Check__Err[23:0] = 000000
	   64          EccStatus[31:0] = 0000E041
	   65          EccSt[15]: Safari port 0/1 Ecc error logged.
	   66            Received by DXs from local Safari port 1, read operation.
	   67          No Safari port 2/3 Ecc error logged.
	   68          Enabled SDC ports  [9:0] = 01F
	   69          Non-0 port errregs [9:0] = 000
	   70  redxl> shdx -e 1 0 2
	   71  Note: Data is displayed from the currently loaded dump file.
	   72  DX SB1/DX2  Component ID = 416C307D
	   73          Gen_Err[31:0] = 00000000
	   74          Ecc_Syndrome[31:0] = 00000800
	   75    Syndr[13:10]: P01 Mtag:   2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
	   76    Syndr[   15]: P01 Direction: 0: Safari port to DX (Incoming)
	   77          Enabled DX ports[9:0] = 03F
	   78          PortErr is non-0[9:0] = 000
	   79  redxl> shsdi -e 1
	   80  Note: Data is displayed from the currently loaded dump file.
	   81  SDI EX01/S0    Component ID = 64317049
	   82           Master_Stop_Status0[31:0] = 70000308
	   83          MStop0[3]: SDI is Recordstopped
	   84           Master_Stop_Status1[31:0] = 8181000D
	   85          0x01   CP1StopExp[4:0]        MSS1[20:16]    
	   86             0   CP1StopSlot[0:1]       MSS1[22:21]    Rstop is 1st stop
	   87             1   CP1StopInfoValid       MSS1[23]       
	   88          0x01   CP0StopExp[4:0]        MSS1[28:24]    
	   89             0   CP0StopSlot[0:1]       MSS1[30:29]    Rstop is 1st stop
	   90             1   CP0StopInfoValid       MSS1[31]       
	   91           Dstop0[31:0] = 00000000
	   92           Dstop1[31:0] = 00000000
	   93           Recordstop0[31:0]  = 04218020
	   94          Rstop0[16]: R    DARB texp request Recordstop (M)
	   95          Rstop0[21]: R 1E SDI internal Slot0 port request Recordstop
	   96          Rstop0[26]: R    Slot0 asserted EccErr, enabled to cause Rstop (M)
	   97           Recordstop1[31:0]  = 00000000
	   98           Core_Error0[31:0]  = 00000000  Mask = 0051FFFF
	   99           Core_Error1[31:0]  = 00000000  Mask = FFFFFFFF
	  100           Sysreg_Error[31:0] = 00000000  Mask = 780377FF
	  101           STB_Error[31:0]    = 00000000  Mask = 7F00FFFF
	  102           CP_Error0[31:0]    = 00000000  Mask = 580067FF
	  103           CP_Error1[31:0]    = 00000000  Mask = 7FFCFFFF
	  104           Slot0_Error0[31:0] = 00000000  Mask = 7000FFFF
	  105           Slot0_Error1[31:0] = 00408040  Mask = 31444EBF
	  106          S0Err1[22]: R 1E Slot0 quadword 0 Mtag correctable ECC error (M)
	  107              {q0_mtag[2:0],q0_mtag_ecc[3:0]} = 02. Calc'd Syndrome[3:0] = 2
	  108                  CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
	  109              {q1_mtag[2:0],q1_mtag_ecc[3:0]} = 00. Calc'd Syndrome[3:0] = 0
	  110              Slot0_ErrData[4:2][31:0] = 001FBC00 000C0000 00000000
	  111              Slot0_ErrData[1:0][31:0] = 00080000 00000000
	  112           Slot0_Error2[31:0] = 00000000  Mask = 7FFCFFFF
	  113           Slot1_Error0[31:0] = 00000000  Mask = 3000FFFF
	  114           Slot1_Error1[31:0] = 00000000  Mask = 31404EBF
	  115           Slot1_Error2[31:0] = 00000000  Mask = 7FFCFFFF
            

SOLUTION SUMMARY:
- Troubleshooting: 

	The dump header tells us this Rstop dumpfile was generated by dsmd while
	the domain was running (lines 10,11). Walking the error chain:

	 - Master SDI on EX1 reports a Slot0 Mtag correctable ECC error
	   (lines 22-25)
	 - The SDC reports an error on Safari port 0/1 (lines 34-36)
	 - The DX reports an ECC error from Safari port 1 (lines 37-39)
	 - Suspect DIMMs are identified (lines 42-44)
	 - Since an individual DIMM cannot be identified by 'wfail', all
	   memory on that processor port is FAILed (lines 48-49)
	 - FRUs are named as memory of SB1/P1 (primary) and the system
	   board itself (secondary) (lines 50-51)
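
	As a cross-check on which expander to start from, the [17:0] masks
	in the dump (EXB, "exps request Rstop", and so on) can be expanded
	into expander numbers. The Python sketch below assumes that bit n of
	such a mask corresponds to expander n, which is consistent with this
	dump but is not stated in it. The function name is illustrative only.

```python
def expanders_in_mask(mask_hex, width=18):
    """List the expander numbers whose bit is set in a [17:0] hex mask.

    Assumption: bit n of the mask stands for expander n.
    """
    mask = int(mask_hex, 16)
    return [n for n in range(width) if mask & (1 << n)]


# "DARB C0: exps request Rstop [17:0]: 00002" (line 54 of the dump)
print(expanders_in_mask("00002"))   # [1]
# "EXB[17:0]: 00007" (line 07 of the dump)
print(expanders_in_mask("00007"))   # [0, 1, 2]
```

	Under that assumption, the DARB value 00002 points at expander 1,
	which matches the SDI EX01 entries that 'wfail' decodes.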

	The DXs work in pairs (0&2, 1&3) to compute the ECC of each of the 
	two quadwords. DX2 shows that the data was incoming from the Safari 
	port and that the computed Mtag ECC syndrome is 2, which is CE check 
	bit 1 (bit 141 of [143:0]).
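
	The two fields that redx decodes from the DX Ecc_Syndrome register
	can be extracted by hand as a sanity check. The Python sketch below
	uses only the bit positions printed in the dump (Syndr[13:10] and
	Syndr[15]). The function and key names are illustrative, and the
	other bits of the register are not interpreted.

```python
def decode_dx_ecc_syndrome(reg):
    """Extract the port 0/1 Mtag fields shown by 'shdx' from Ecc_Syndrome."""
    return {
        # Syndr[13:10]: P01 Mtag syndrome
        "mtag_syndrome": (reg >> 10) & 0xF,
        # Syndr[15]: P01 direction, 0 = Safari port to DX (incoming)
        "direction": (reg >> 15) & 0x1,
    }


fields = decode_dx_ecc_syndrome(0x00000800)
print(fields["mtag_syndrome"], fields["direction"])   # 2 0
```

	For the value 00000800 this gives syndrome 2 and direction 0, the
	same values shown on lines 38-39 of the dump.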

	With no corresponding parity error in DXs or DCDSs, we can assume the 
	bit error originates from the memory of that Processor Port (SB1/P1). 
	This bit would be one of Dimm SB1/P1/B0/D2 or Dimm SB1/P1/B1/D2.

	There is no bank attribution because the Mtag CE was corrected by 
	the SDI (M), so no processor ever sees it. As a result, no processor 
	trap is taken with address information that would permit bank 
	attribution. The exception is an access to local memory, where the 
	data would not pass through the SDI (M).

	Apart from this Rstop dump, no console or domain messages would 
	indicate that the Mtag CE event occurred.

- Resolution:

	Handle Mtag CEs according to the same best practices as memory CEs. 
	Do not replace DIMM(s) on the first occurrence. 

- Summary of part number and patch IDs 

- References and bug IDs 

	Safari Specification/Starcat Architecture.

- Additional background information: 

	Safari Data Structure
	--------------------

	<127--Data--0><8--ECC--0><2--Mtag--0><3--MtagECC--0>

	<-------------------------------------------------->
	                        144-bit
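
	The field widths above (128 + 9 + 3 + 4 = 144 bits) can be turned
	into a small lookup that maps a bit of [143:0] back to its field.
	The Python sketch below assumes the bit positions implied by the
	Mtag syndrome table in this document (Mtag at bits 137-139, Mtag ECC
	at bits 140-143), with Data and ECC occupying the bits beneath them.
	The Data and ECC positions are an inference, not taken from the
	Safari specification, and all names are illustrative.

```python
# (field name, lowest bit, highest bit) within the 144-bit word.
# Mtag and MtagECC positions follow the syndrome table in this document;
# the Data and ECC positions are assumed to fill the bits below them.
SAFARI_FIELDS = [
    ("Data", 0, 127),
    ("ECC", 128, 136),
    ("Mtag", 137, 139),
    ("MtagECC", 140, 143),
]


def locate_bit(bit):
    """Return (field name, bit index within that field) for a bit of [143:0]."""
    for name, low, high in SAFARI_FIELDS:
        if low <= bit <= high:
            return name, bit - low
    raise ValueError("bit %d is outside [143:0]" % bit)


print(locate_bit(141))   # ('MtagECC', 1)
```

	Bit 141, the bit reported in this dump, falls in the Mtag ECC field
	as check bit 1.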

	Mtag Syndrome Table
	-------------------

	Mtag ECC syndrome 7: CE bit 0 (bit 137 of [143:0])
	Mtag ECC syndrome B: CE bit 1 (bit 138 of [143:0])
	Mtag ECC syndrome D: CE bit 2 (bit 139 of [143:0])
	Mtag ECC syndrome 1: CE check bit 0 (bit 3 of [6:0], 140 of [143:0])
	Mtag ECC syndrome 2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
	Mtag ECC syndrome 4: CE check bit 2 (bit 5 of [6:0], 142 of [143:0])
	Mtag ECC syndrome 8: CE check bit 3 (bit 6 of [6:0], 143 of [143:0])
	Multiple-bit syndromes are anything else except 0 (since 0 indicates no error)
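
	The table can be expressed as a lookup for checking a syndrome value
	by hand. The Python sketch below copies the seven single-bit entries
	from the table and treats every other non-zero value as a
	multiple-bit error. The names are illustrative and are not part of
	redx.

```python
# Single-bit Mtag ECC syndromes, copied from the table above:
# syndrome -> (description, bit position in [143:0])
MTAG_SYNDROMES = {
    0x7: ("bit 0", 137),
    0xB: ("bit 1", 138),
    0xD: ("bit 2", 139),
    0x1: ("check bit 0", 140),
    0x2: ("check bit 1", 141),
    0x4: ("check bit 2", 142),
    0x8: ("check bit 3", 143),
}


def describe_mtag_syndrome(syndrome):
    """Classify a 4-bit Mtag ECC syndrome according to the table."""
    if syndrome == 0:
        return "no error"
    if syndrome in MTAG_SYNDROMES:
        name, position = MTAG_SYNDROMES[syndrome]
        return "CE %s (bit %d of [143:0])" % (name, position)
    return "multiple-bit error"


print(describe_mtag_syndrome(0x2))   # CE check bit 1 (bit 141 of [143:0])
```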

- Meta-Data/Problem categorization: 

Product/Platform: SF12K/SF15K 
Category: 


- Keywords

15K, 12K, SF15K, SF12K, starcat, rstop, SDI(M), Mtag correctable ECC error.            

INTERNAL SUMMARY:

SUBMITTER: Tong-Pheng Koh
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.