SRDB ID | Synopsis | Date |
48125 | Sun Fire[TM] 12K/15K: Rstop: Slot0 asserted EccErr, enabled to cause Rstop | 29 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Rstop: Slot0 asserted EccErr, enabled to cause Rstop
- Symptoms: 'wfail' output reports something similar to the following:

01 redxl> dumpf load xcstate.020213.2331.22
02 Created Wed Feb 13 23:31:22 2002
03 By hpost v. 1.1 Generic 112099-05 Nov 27 2001 12:41:09 executing as pid=892
04 On ssc name = n017.new-sc1.
05 Domain = 0=A Platform = n017.new
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 0007F
08 Slot0[17:0]: 0007F
09 Slot1[17:0]: 0003F
10 Stop on EXB EX0 during stage cpu_lpost_II
11 0 errors occurred while creating this dump.
12 redxl> wfail
13 SDI EX00/S0: SDI is RStopped, requested by DARB.
14 SDI EX01/S0: SDI is RStopped, requested by DARB.
15 SDI EX02/S0: SDI is RStopped, requested by DARB.
16 SDI EX03/S0: SDI is RStopped, requested by DARB.
17 SDI EX04/S0: SDI is RStopped, requested by DARB.
18 SDI EX05/S0 Master_Stop_Status0[31:0] = 80040308
19 MStop0[3]: SDI is Recordstopped
20 SDI EX05/S0 Recordstop0[31:0] = 04018400
21 Rstop0[16]: R DARB texp request Recordstop (M)
22 Rstop0[26]: R 1E Slot0 asserted EccErr, enabled to cause Rstop (M)
23 EPLD SB05 Ecc_Err: Mask= F7 Err= 08 SDC reports EccErr
24 SDC SB05 EccStatus[31:0] = 0000E073
25 EccSt[15]: Safari port 0/1 Ecc error logged.
26 Received by DXs from local Safari port 1, read operation.
27 DX SB05/DX2 Ecc_Syndrome[31:0] = 00000121
28 Syndr[ 8: 0]: P01 Data: 121: CE bit 97
29 Syndr[ 15]: P01 Direction: 0: Safari port to DX (Incoming)
30 ECC correctable errors detected from Processor Port SB5/P1, no corresponding
31 parity error in DXs or DCDSs.
32 Assuming the error originated in memory on this port.
33 Data syndrome 121 is CE bit 97.
34 This bit is in one of Dimm SB5/P1/B0/D3 or Dimm SB5/P1/B1/D3.
35 Bank/Dimm fault attribution for data CEs is the responsibility of
36 lpost or domain software which has address information that
37 allows error attribution to a bank. No action taken here.
38 SDI EX06/S0: SDI is RStopped, requested by DARB.
39 DARB C0: enabled ports (expanders) [17:0]: 0007F
40 DARB C0: exps request Rstop [17:0]: 00020
41 DARB C0: other darb req Rstop for exps [17:0]: 00020
42 DARB C1: enabled ports (expanders) [17:0]: 0007F
43 DARB C1: exps request Rstop [17:0]: 00020
44 DARB C1: other darb req Rstop for exps [17:0]: 00020
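The register values in the dump above are hexadecimal bit vectors, and wfail's annotations such as Rstop0[26] or Syndr[ 8: 0] name bit positions within them. The following is a minimal sketch of how those values can be checked by hand. The helper names set_bits and field are hypothetical and are not part of wfail or redx; only the register values themselves come from the dump.

```python
def set_bits(value, width=32):
    """Return the positions of all bits set in value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


def field(value, hi, lo):
    """Extract the bit field value[hi:lo], inclusive on both ends."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


# Dump line 20: Recordstop0[31:0] = 04018400
# wfail annotates bits 16 and 26; the meaning of the other set bits
# is not described in this article.
recordstop0 = 0x04018400
print("Recordstop0 bits:", set_bits(recordstop0))

# Dump lines 39-40: DARB enabled expanders and expanders requesting Rstop
enabled_exps = set_bits(0x0007F, width=18)
rstop_exps = set_bits(0x00020, width=18)
print("Enabled expanders:", enabled_exps)
print("Expanders requesting Rstop:", rstop_exps)

# Dump lines 27-29: Ecc_Syndrome[31:0] = 00000121
ecc_syndrome = 0x00000121
print("Syndrome: %03X" % field(ecc_syndrome, 8, 0))
print("Direction bit:", field(ecc_syndrome, 15, 15))
```

Decoding the DARB vector 00020 gives expander 5 only, which matches the single SDI (EX05) that reports a first error rather than a DARB-requested Rstop.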
SOLUTION SUMMARY:
- Troubleshooting: The dump header shows that this error was encountered during the cpu_lpost_II stage of POST (line 10). The dump file name also indicates this: xcstate files are created when POST detects a stop condition. In fact, the output in the POST log will look like the above, minus the dump header.

  Walking the error chain:
  - SDI5 reports a first error of its Slot 0 board asserting an ECC error (line 22). This equates to SB5.
  - Next, the EPLD on SB5 is examined; it reports an error from the SDC (line 23).
  - Continuing, the SDC's error registers reveal that it had an ECC error on a DX (lines 24-26). Note also that the SDC can distinguish that the erring operation was a read involving Safari Port 1 (line 26). For the SDC, Safari Port 1 is SB5/P1.
  - Finally, the DX shows the syndrome and errored bit (lines 27, 28). The direction is also trapped by the DX (line 29).

  wfail then examines the DXs and DCDSs for any corresponding parity errors that are pertinent to the failure and reports its findings (lines 30-32). In this case, no pertinent parity errors are found, so the error is assumed to have originated in memory. Finally, wfail narrows the fault down to one of two DIMMs, but states that either Solaris[TM] or LPOST must identify the exact DIMM (lines 34-37).

- Resolution: If the stop was encountered during POST (as above), examine the POST log for an error that identifies the exact DIMM.
  Example message:
  Primary service FRU is Dimm SB5/P1/B0/D3

  If the stop was encountered while Solaris was running, consult the /var/adm/messages logs on the domain for the faulty DIMM.
  Example message:
  SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event on CPU96 at TL=0, errID 0x000122cf.377e4dcb
  AFSR 0x00000002<CE>.00000052 AFAR 0x00000060.64a9b870
  Fault_PC 0x10024d20 Esynd 0x0052 SB3/P1/B0/D1 J14400
  SUNW,UltraSPARC-III+: [AFT0] errID 0x000122cf.377e4dcb Corrected Memory Error on SB3/P1/B0/D1 J14400 is Persistent
  SUNW,UltraSPARC-III+: [AFT0] errID 0x000122cf.377e4dcb Data Bit 28 was in error and corrected

  Note that wfail assumes that the error originated in memory. However, it is possible that the data was written into memory with bad ECC. The failure history of the system can be taken into account to troubleshoot further. A brief discussion of a "bad writer" is in the Sun Fire 15K Architecture document.

- Summary of part number and patch ID's
- References and bug IDs:
  SunSolve Article 48122
  http://webhome.eng.sun.com/alanc/sunfire/arch/sf15Karch.pdf
- Additional background information: On occasion, Rstops have been encountered during LPOST where LPOST does not identify a DIMM. One theory is that a memory access beyond the visibility of LPOST encountered the ECC error; for example, when the AXQ accesses memory to fetch MTags.
- Meta-Data/Problem categorization:
  Product/Platform: SF12K/SF15K
  Category:
- Keywords: 15K, 12K, SF15K, SF12K, starcat, rstop, Slot0 asserted EccErr, enabled to cause Rstop
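For the Resolution step, the DIMM location has to be picked out of a POST log or a /var/adm/messages entry. The following is a minimal sketch of doing that with a regular expression. The pattern is inferred only from the SBn/Pn/Bn/Dn locations that appear in this article, not from any official format specification, and the function name find_dimms is hypothetical.

```python
import re

# Assumed location format, taken from the examples in this article:
# system board / processor port / bank / DIMM, e.g. SB5/P1/B0/D3
DIMM_RE = re.compile(r"\bSB(\d+)/P(\d+)/B(\d+)/D(\d+)\b")


def find_dimms(text):
    """Return the distinct DIMM locations in text, in order of appearance."""
    found = []
    for match in DIMM_RE.finditer(text):
        location = match.group(0)
        if location not in found:
            found.append(location)
    return found


# The two candidates reported by wfail (dump line 34)
wfail_line = "This bit is in one of Dimm SB5/P1/B0/D3 or Dimm SB5/P1/B1/D3."
candidates = find_dimms(wfail_line)

# The POST log message that names the exact DIMM
post_line = "Primary service FRU is Dimm SB5/P1/B0/D3"
identified = find_dimms(post_line)

# Keep only DIMMs that POST identified and that wfail also listed
confirmed = [dimm for dimm in identified if dimm in candidates]
print("wfail candidates:", candidates)
print("Confirmed by POST:", confirmed)
```

The same function can be run over the Solaris messages shown above, where the location (SB3/P1/B0/D1 in that example) repeats across several lines and is returned once.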
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: