SRDB ID | Synopsis | Date |
48233 | Sun Fire[TM] 12K/15K: Rstop: Mtag CEs Corrected by SDI (M) | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Rstop: Mtag CEs Corrected by SDI (M).
- Symptoms: redx wfail, shsdc and shdx command output reports the following failure signature:

01 redxl> dumpf load dsmd.rstop.020927.0951.39
02 Created Fri Sep 27 09:51:39 2002
03 By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15 executing as pid=18215
04 On ssc name = swmtftk01.
05 Domain = 0=A = MES Platform = f15k4
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 00007
08 Slot0[17:0]: 00007
09 Slot1[17:0]: 00007
10 -D option, -d
11 "DSMD RecordStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX00/S0: SDI is RStopped, requested by DARB.
15 SDI EX01/S0 Master_Stop_Status0[31:0] = 70000308
16 MStop0[3]: SDI is Recordstopped
17 SDI EX01/S0 Recordstop0[31:0] = 04218020
18 Rstop0[16]: R DARB texp request Recordstop (M)
19 Rstop0[21]: R 1E SDI internal Slot0 port request Recordstop
20 Rstop0[26]: R Slot0 asserted EccErr, enabled to cause Rstop (M)
21 SDI EX01/S0 Slot0_Error1[31:0] = 00408040 Mask = 31444EBF
22 S0Err1[22]: R 1E Slot0 quadword 0 Mtag correctable ECC error (M)
23 {q0_mtag[2:0],q0_mtag_ecc[3:0]} = 02. Calc'd Syndrome[3:0] = 2
24 CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
25 {q1_mtag[2:0],q1_mtag_ecc[3:0]} = 00. Calc'd Syndrome[3:0] = 0
26 DX SB1/DX2 reports from-port MTAG ECC syndrome 2 (port 0/1) matching
27 one of the MTAG syndromes recorded by the SDI.
28 SDI detected an MTAG ECC error from Slot SB1, and a non-1st EccErr asserted
29 from the same slot. There is also a from-port MTAG ecc error
30 detected by a DX on this slot, with the same syndrome.
31 Analysis will assume the actual first error is that recorded
32 by the DX, for better fault identification.
33 EPLD SB01 Ecc_Err: Mask= F7 Err= 08 SDC reports EccErr
34 SDC SB01 EccStatus[31:0] = 0000E041
35 EccSt[15]: Safari port 0/1 Ecc error logged.
36 Received by DXs from local Safari port 1, read operation.
37 DX SB01/DX2 Ecc_Syndrome[31:0] = 00000800
38 Syndr[13:10]: P01 Mtag: 2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
39 Syndr[ 15]: P01 Direction: 0: Safari port to DX (Incoming)
40 ECC correctable errors detected from Processor Port SB1/P1, no corresponding
41 parity error in DXs or DCDSs.
42 Assuming the error originated in memory on this port.
43 Mtag syndrome 2 is CE check bit 1 (bit 4 of [6:0], 141 of [143:0]).
44 This bit is in one of Dimm SB1/P1/B0/D2 or Dimm SB1/P1/B1/D2.
45 MTag CEs should be corrected by the SDIs, so there will be no resulting
46 processor trap with address information that would permit
47 bank attribution. Error isolation to the bank is not possible.
48 We must therefore FAIL all memory on this port.
49 FAIL All memory on Port SB1/P1: Rstop detected by DXs/SDC.
50 Primary service FRU is All memory on Port SB1/P1.
51 Secondary service FRU is Slot SB1.
52 SDI EX02/S0: SDI is RStopped, requested by DARB.
53 DARB C0: enabled ports (expanders) [17:0]: 03E3F
54 DARB C0: exps request Rstop [17:0]: 00002
55 DARB C0: other darb req Rstop for exps [17:0]: 00002
56 DARB C1: enabled ports (expanders) [17:0]: 03E3F
57 DARB C1: exps request Rstop [17:0]: 00002
58 DARB C1: other darb req Rstop for exps [17:0]: 00002
59 redxl> shsdc -e 1 0
60 Note: Data is displayed from the currently loaded dump file.
61 SDC SB01 Component ID = 416C107D
62 Lockstep_Err[19:0] = 00000
63 L2_Check__Err[23:0] = 000000
64 EccStatus[31:0] = 0000E041
65 EccSt[15]: Safari port 0/1 Ecc error logged.
66 Received by DXs from local Safari port 1, read operation.
67 No Safari port 2/3 Ecc error logged.
68 Enabled SDC ports [9:0] = 01F
69 Non-0 port errregs [9:0] = 000
70 redxl> shdx -e 1 0 2
71 Note: Data is displayed from the currently loaded dump file.
72 DX SB1/DX2 Component ID = 416C307D
73 Gen_Err[31:0] = 00000000
74 Ecc_Syndrome[31:0] = 00000800
75 Syndr[13:10]: P01 Mtag: 2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
76 Syndr[ 15]: P01 Direction: 0: Safari port to DX (Incoming)
77 Enabled DX ports[9:0] = 03F
78 PortErr is non-0[9:0] = 000
79 redxl> shsdi -e 1
80 Note: Data is displayed from the currently loaded dump file.
81 SDI EX01/S0 Component ID = 64317049
82 Master_Stop_Status0[31:0] = 70000308
83 MStop0[3]: SDI is Recordstopped
84 Master_Stop_Status1[31:0] = 8181000D
85 0x01 CP1StopExp[4:0] MSS1[20:16]
86 0 CP1StopSlot[0:1] MSS1[22:21] Rstop is 1st stop
87 1 CP1StopInfoValid MSS1[23]
88 0x01 CP0StopExp[4:0] MSS1[28:24]
89 0 CP0StopSlot[0:1] MSS1[30:29] Rstop is 1st stop
90 1 CP0StopInfoValid MSS1[31]
91 Dstop0[31:0] = 00000000
92 Dstop1[31:0] = 00000000
93 Recordstop0[31:0] = 04218020
94 Rstop0[16]: R DARB texp request Recordstop (M)
95 Rstop0[21]: R 1E SDI internal Slot0 port request Recordstop
96 Rstop0[26]: R Slot0 asserted EccErr, enabled to cause Rstop (M)
97 Recordstop1[31:0] = 00000000
98 Core_Error0[31:0] = 00000000 Mask = 0051FFFF
99 Core_Error1[31:0] = 00000000 Mask = FFFFFFFF
100 Sysreg_Error[31:0] = 00000000 Mask = 780377FF
101 STB_Error[31:0] = 00000000 Mask = 7F00FFFF
102 CP_Error0[31:0] = 00000000 Mask = 580067FF
103 CP_Error1[31:0] = 00000000 Mask = 7FFCFFFF
104 Slot0_Error0[31:0] = 00000000 Mask = 7000FFFF
105 Slot0_Error1[31:0] = 00408040 Mask = 31444EBF
106 S0Err1[22]: R 1E Slot0 quadword 0 Mtag correctable ECC error (M)
107 {q0_mtag[2:0],q0_mtag_ecc[3:0]} = 02. Calc'd Syndrome[3:0] = 2
108 CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
109 {q1_mtag[2:0],q1_mtag_ecc[3:0]} = 00. Calc'd Syndrome[3:0] = 0
110 Slot0_ErrData[4:2][31:0] = 001FBC00 000C0000 00000000
111 Slot0_ErrData[1:0][31:0] = 00080000 00000000
112 Slot0_Error2[31:0] = 00000000 Mask = 7FFCFFFF
113 Slot1_Error0[31:0] = 00000000 Mask = 3000FFFF
114 Slot1_Error1[31:0] = 00000000 Mask = 31404EBF
115 Slot1_Error2[31:0] = 00000000 Mask = 7FFCFFFF
SOLUTION SUMMARY:
- Troubleshooting:

The dump header shows that this Rstop dump file was generated by dsmd while the domain was running (lines 10-11). Walking the error chain:
  - The master SDI on EX1 reports a Slot 0 Mtag ECC error (lines 22-25)
  - The SDC reports an error on Safari port 0/1 (lines 34-36)
  - The DX reports an ECC error from Safari port 1 (lines 37-39)
  - Suspect DIMMs are identified (lines 42-44)
  - Since 'wfail' cannot identify an individual DIMM, all memory on that processor port is FAILed (line 49)
  - FRUs are named as the memory of SB1/P1 (primary) and the system board itself (secondary) (lines 50-51)

The DXs work in pairs (0&2, 1&3) to compute the ECC of each of the two quadwords. On DX2, the data is incoming from the Safari port and the computed Mtag ECC syndrome is 2, which is CE check bit 1, or bit 141 of [143:0]. With no corresponding parity error in the DXs or DCDSs, we can assume the bit error originates from the memory of that processor port (SB1/P1). This bit would be in one of Dimm SB1/P1/B0/D2 or Dimm SB1/P1/B1/D2.

There is no bank attribution because the Mtag CE has been corrected by the SDI (M), so no processor ever sees it. As a result, no processor trap is taken with address information that would permit bank attribution. The exception is an access to local memory, where the data would not pass through the SDI (M). No console or domain messages indicate that this Mtag CE event occurred; the Rstop dump is the only record.

- Resolution: Handle Mtag CEs according to the same best practices as memory CEs. Do not replace DIMM(s) on the first occurrence.
- Summary of part number and patch IDs
- References and bug IDs: Safari Specification/Starcat Architecture.
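The register fields cited in the troubleshooting walk-through can be checked by hand. The following is a minimal Python sketch, not part of redx: the helper name `extract_field` is hypothetical, and the only inputs are register values and bit ranges that appear in the dump above (Ecc_Syndrome on line 37, Master_Stop_Status1 on lines 84-90).

```python
def extract_field(value: int, hi: int, lo: int) -> int:
    """Return bits [hi:lo] (inclusive) of a 32-bit register value."""
    if not (0 <= lo <= hi <= 31):
        raise ValueError("expected 0 <= lo <= hi <= 31")
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# DX SB01/DX2 Ecc_Syndrome[31:0] = 00000800 (dump line 37)
ecc_syndrome = 0x00000800
mtag_syndrome = extract_field(ecc_syndrome, 13, 10)  # Syndr[13:10], P01 Mtag
direction = extract_field(ecc_syndrome, 15, 15)      # Syndr[15], P01 Direction

# SDI EX01/S0 Master_Stop_Status1[31:0] = 8181000D (dump line 84)
mss1 = 0x8181000D
cp1_stop_exp = extract_field(mss1, 20, 16)    # CP1StopExp, MSS1[20:16]
cp1_info_valid = extract_field(mss1, 23, 23)  # CP1StopInfoValid, MSS1[23]
cp0_stop_exp = extract_field(mss1, 28, 24)    # CP0StopExp, MSS1[28:24]
cp0_info_valid = extract_field(mss1, 31, 31)  # CP0StopInfoValid, MSS1[31]

print(f"Mtag syndrome = {mtag_syndrome:X}, direction = {direction}")
print(f"CP1StopExp = {cp1_stop_exp:#04x}, valid = {cp1_info_valid}")
print(f"CP0StopExp = {cp0_stop_exp:#04x}, valid = {cp0_info_valid}")
```

The extracted values agree with what redx prints: Mtag syndrome 2 with direction 0 (Safari port to DX), and both CP stop fields pointing at expander 0x01 with the info-valid bit set.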
- Additional background information:

Safari Data Structure
---------------------
<127--Data--0><8--ECC--0><2--Mtag--0><3--MtagECC--0>
<-------------------------------------------------->
                      144-bit

Mtag Syndrome Table
-------------------
Mtag ECC syndrome 7: CE bit 0 (bit 137 of [143:0])
Mtag ECC syndrome B: CE bit 1 (bit 138 of [143:0])
Mtag ECC syndrome D: CE bit 2 (bit 139 of [143:0])
Mtag ECC syndrome 1: CE check bit 0 (bit 3 of [6:0], 140 of [143:0])
Mtag ECC syndrome 2: CE check bit 1 (bit 4 of [6:0], 141 of [143:0])
Mtag ECC syndrome 4: CE check bit 2 (bit 5 of [6:0], 142 of [143:0])
Mtag ECC syndrome 8: CE check bit 3 (bit 6 of [6:0], 143 of [143:0])
Any other non-zero syndrome indicates a multiple-bit error (a syndrome of 0 indicates no error).

- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords: 15K, 12K, SF15K, SF12K, starcat, rstop, SDI(M), Mtag correctable ECC error.
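The Mtag syndrome table in the background information can be expressed as a small lookup. The following is a minimal Python sketch under stated assumptions: the names `MTAG_SYNDROMES` and `decode_mtag_syndrome` are hypothetical, and the bit positions are taken only from the table (Mtag bits at 137-139, check bits at 140-143 of [143:0]). It is not the decoding logic used by redx itself.

```python
# Single-bit Mtag ECC syndromes, copied from the Mtag Syndrome Table.
# Each entry maps a syndrome to (kind, bit index within that kind).
MTAG_SYNDROMES = {
    0x7: ("data", 0),
    0xB: ("data", 1),
    0xD: ("data", 2),
    0x1: ("check", 0),
    0x2: ("check", 1),
    0x4: ("check", 2),
    0x8: ("check", 3),
}


def decode_mtag_syndrome(syndrome: int) -> str:
    """Describe a 4-bit Mtag ECC syndrome using the table's wording."""
    if not 0 <= syndrome <= 0xF:
        raise ValueError("Mtag ECC syndrome is a 4-bit value")
    if syndrome == 0:
        return "no error"
    entry = MTAG_SYNDROMES.get(syndrome)
    if entry is None:
        # Any non-zero syndrome missing from the table is a multiple-bit error.
        return "multiple-bit error"
    kind, n = entry
    if kind == "data":
        return f"CE bit {n} (bit {137 + n} of [143:0])"
    return f"CE check bit {n} (bit {3 + n} of [6:0], {140 + n} of [143:0])"


print(decode_mtag_syndrome(0x2))
```

For the syndrome in this dump (2), the result matches the text redx prints on dump lines 24 and 38.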
INTERNAL SUMMARY:
SUBMITTER: Tong-Pheng Koh
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: