SRDB ID   Synopsis   Date
48125   Sun Fire[TM] 12K/15K: Rstop: Slot0 asserted EccErr, enabled to cause Rstop   29 Oct 2002

Status Issued

Description
- Problem Statement:

    Rstop: Slot0 asserted EccErr, enabled to cause Rstop 

- Symptoms:

    'wfail' output reports something similar to the following:

       01  redxl> dumpf load xcstate.020213.2331.22
       02  Created Wed Feb 13 23:31:22 2002
       03  By hpost v. 1.1 Generic 112099-05 Nov 27 2001 12:41:09  executing as pid=892
       04  On ssc name =  n017.new-sc1.
       05  Domain =  0=A    Platform = n017.new
       06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
       07            EXB[17:0]: 0007F
       08          Slot0[17:0]: 0007F
       09          Slot1[17:0]: 0003F
       10  Stop on EXB EX0 during stage cpu_lpost_II
       11  0 errors occurred while creating this dump.
       12  redxl> wfail
       13  SDI EX00/S0: SDI is RStopped, requested by DARB.
       14  SDI EX01/S0: SDI is RStopped, requested by DARB.
       15  SDI EX02/S0: SDI is RStopped, requested by DARB.
       16  SDI EX03/S0: SDI is RStopped, requested by DARB.
       17  SDI EX04/S0: SDI is RStopped, requested by DARB.
       18  SDI EX05/S0  Master_Stop_Status0[31:0] = 80040308
       19         MStop0[3]: SDI is Recordstopped
       20  SDI EX05/S0  Recordstop0[31:0]  = 04018400
       21         Rstop0[16]: R    DARB texp request Recordstop (M)
       22         Rstop0[26]: R 1E Slot0 asserted EccErr, enabled to cause Rstop (M)
       23  EPLD SB05  Ecc_Err:   Mask= F7  Err= 08  SDC reports EccErr
       24  SDC SB05  EccStatus[31:0] = 0000E073
       25         EccSt[15]: Safari port 0/1 Ecc error logged.
       26            Received by DXs from local Safari port 1, read operation.
       27  DX SB05/DX2  Ecc_Syndrome[31:0] = 00000121
       28    Syndr[ 8: 0]: P01 Data: 121: CE bit 97
       29    Syndr[   15]: P01 Direction: 0: Safari port to DX (Incoming)
       30  ECC correctable errors detected from Processor Port SB5/P1, no corresponding
       31          parity error in DXs or DCDSs.
       32          Assuming the error originated in memory on this port.
       33          Data syndrome 121 is CE bit 97.
       34          This bit is in one of Dimm SB5/P1/B0/D3 or Dimm SB5/P1/B1/D3.
       35  Bank/Dimm fault attribution for data CEs is the responsibility of
       36          lpost or domain software which has address information that
       37          allows error attribution to a bank. No action taken here.
       38  SDI EX06/S0: SDI is RStopped, requested by DARB.
       39  DARB C0: enabled ports (expanders)          [17:0]: 0007F
       40  DARB C0: exps request Rstop                 [17:0]: 00020
       41  DARB C0: other darb req Rstop for exps      [17:0]: 00020
       42  DARB C1: enabled ports (expanders)          [17:0]: 0007F
       43  DARB C1: exps request Rstop                 [17:0]: 00020
       44  DARB C1: other darb req Rstop for exps      [17:0]: 00020
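
    The dump file name on line 01 appears to encode the creation time shown on
    line 02. The sketch below decodes it, assuming the layout is
    xcstate.YYMMDD.HHMM.SS. That layout is inferred from this one example and is
    not a documented format; the function name is hypothetical.

```python
from datetime import datetime


def parse_xcstate_name(name):
    """Return the creation time encoded in an xcstate dump file name.

    Assumes the layout xcstate.YYMMDD.HHMM.SS, inferred from the single
    example in this article; it is not a documented format.
    """
    prefix, date_part, time_part, seconds = name.split(".")
    if prefix != "xcstate":
        raise ValueError("not an xcstate dump name: %r" % name)
    # Join the three numeric parts and parse them as one timestamp.
    return datetime.strptime(date_part + time_part + seconds, "%y%m%d%H%M%S")


created = parse_xcstate_name("xcstate.020213.2331.22")
print(created.isoformat())
```

    For the file in the listing, the decoded time agrees with the "Created"
    line of the dump header.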

SOLUTION SUMMARY:
- Troubleshooting:

    The dump header shows that this error was encountered during the cpu_lpost_II 
    stage of POST (line 10). The dump file name points to POST as well: xcstate 
    files are created when POST detects a stop condition. In fact, the output in 
    the POST log will look like the above, minus the dump header. Walking the error 
    chain:

     - SDI5 reports, as its first error, that its Slot 0 board asserted an ECC 
       error (line 22). The Slot 0 board on expander 5 is SB5. 
     - Next, the EPLD on SB5 is examined, and it reports an error from the 
       SDC (line 23). 
     - Continuing, the SDC's error registers reveal that an ECC error was logged 
       on a DX (lines 24-26). Note also that the SDC identifies the erring 
       operation as a read involving Safari Port 1 (line 26). For this SDC, 
       Safari Port 1 is SB5/P1. 
     - Finally, the DX shows the syndrome and the bit in error (lines 27-28). The 
       DX also captures the direction (line 29).
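
    The register values in the chain above can be checked by hand. The following
    is a minimal sketch that lists the set bits of a register and extracts a bit
    field. It assumes that each bit of the [17:0] DARB masks corresponds to the
    expander of the same number, which is consistent with the listing (EX00-EX06
    enabled, EX05 requesting the Rstop) but is not stated in it. The helper names
    are hypothetical and are not part of redx.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def field(value, high, low):
    """Extract the bit field value[high:low], inclusive at both ends."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


recordstop0 = 0x04018400    # listing line 20
ecc_syndrome = 0x00000121   # listing line 27
enabled_exps = 0x0007F      # listing line 39
rstop_exps = 0x00020        # listing line 40

# wfail reports bits 16 and 26 (lines 21-22); the other set bits are
# not interpreted in this sketch.
print("Recordstop0 bits set:", set_bits(recordstop0))

# Syndr[8:0] is the data syndrome, Syndr[15] the direction (lines 28-29).
print("Syndrome: %X" % field(ecc_syndrome, 8, 0))
print("Direction bit:", field(ecc_syndrome, 15, 15))

# Assumed mapping: bit N of a DARB mask is expander N.
print("Enabled expanders:", set_bits(enabled_exps))
print("Expanders requesting Rstop:", set_bits(rstop_exps))
```

    The mapping from syndrome 121 to "CE bit 97" is a lookup that wfail
    performs; the sketch does not attempt to reproduce it.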

    wfail then examines the DXs and DCDSs for any corresponding parity errors that
    are pertinent to the failure and reports its findings (lines 30-32). In this case,
    no pertinent parity errors are found, so the error is assumed to have originated
    in memory. Finally, wfail narrows the fault down to one of two DIMMs, but states
    that either Solaris[TM] or LPOST must identify the exact DIMM (lines 34-37).
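
    When many dumps have to be reviewed, the candidate DIMMs can be pulled out of
    saved wfail output with a short script. This is a sketch only: it assumes the
    DIMM names always follow the SBn/Pn/Bn/Dn pattern seen on line 34, and the
    function name is hypothetical.

```python
import re

# Pattern taken from listing line 34, e.g. "Dimm SB5/P1/B0/D3".
DIMM_RE = re.compile(r"Dimm (SB\d+/P\d+/B\d+/D\d+)")


def candidate_dimms(wfail_text):
    """Return the DIMM names mentioned in wfail output, without duplicates."""
    seen = []
    for name in DIMM_RE.findall(wfail_text):
        if name not in seen:
            seen.append(name)
    return seen


sample = "This bit is in one of Dimm SB5/P1/B0/D3 or Dimm SB5/P1/B1/D3."
print(candidate_dimms(sample))
```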

- Resolution:

    If the stop was encountered during POST (as above), examine the POST log for 
    an error that identifies the exact DIMM. Example message:

       Primary service FRU is Dimm SB5/P1/B0/D3
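
    A POST log can be scanned for this message and the result compared with the
    two DIMMs that wfail named. The sketch below assumes the message wording is
    exactly as in the example above; the function name and the in-line sample
    data are illustrative only.

```python
import re

# Wording taken from the example message above.
FRU_RE = re.compile(r"Primary service FRU is Dimm (\S+)")


def primary_fru_dimms(post_log_lines):
    """Return every DIMM that the POST log names as primary service FRU."""
    found = []
    for line in post_log_lines:
        match = FRU_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


wfail_candidates = ["SB5/P1/B0/D3", "SB5/P1/B1/D3"]   # from wfail line 34
post_log = ["Primary service FRU is Dimm SB5/P1/B0/D3"]

for dimm in primary_fru_dimms(post_log):
    if dimm in wfail_candidates:
        print(dimm, "is one of the wfail candidates")
    else:
        print(dimm, "is not among the wfail candidates")
```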

    If the stop was encountered while Solaris was running, consult the /var/adm/messages
    logs on the domain for the faulty DIMM. Example message:

    SUNW,UltraSPARC-III+: NOTICE: [AFT0] Corrected system bus (CE) Event on CPU96 at TL=0, errID 0x000122cf.377e4dcb
              AFSR 0x00000002<CE>.00000052 AFAR 0x00000060.64a9b870
              Fault_PC 0x10024d20 Esynd 0x0052 SB3/P1/B0/D1 J14400
    SUNW,UltraSPARC-III+: [AFT0] errID 0x000122cf.377e4dcb Corrected Memory Error on SB3/P1/B0/D1 J14400 is Persistent
    SUNW,UltraSPARC-III+: [AFT0] errID 0x000122cf.377e4dcb Data Bit 28 was in error and corrected
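
    To find such entries in a large messages file, the "Corrected Memory Error"
    lines can be matched with a regular expression. This sketch assumes the
    message layout shown in the example above; treating the J-number as a
    separate field is a guess at its role, and the function name is hypothetical.

```python
import re

# Layout taken from the example [AFT0] message above.
CE_RE = re.compile(
    r"\[AFT0\] errID (\S+) Corrected Memory Error on (\S+) (J\d+) is (\w+)"
)


def corrected_memory_errors(lines):
    """Return (errID, DIMM, J-number, classification) for each matching line."""
    events = []
    for line in lines:
        match = CE_RE.search(line)
        if match:
            events.append(match.groups())
    return events


sample = [
    "SUNW,UltraSPARC-III+: [AFT0] errID 0x000122cf.377e4dcb "
    "Corrected Memory Error on SB3/P1/B0/D1 J14400 is Persistent",
    "SUNW,UltraSPARC-III+: [AFT0] errID 0x000122cf.377e4dcb "
    "Data Bit 28 was in error and corrected",
]

for err_id, dimm, jnum, kind in corrected_memory_errors(sample):
    print(dimm, jnum, kind, err_id)
```

    On a domain, the lines would come from reading /var/adm/messages rather than
    from an in-line sample.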

    Note that wfail assumes that the error originated in memory. However, it is 
    possible that the data was written into memory with bad ECC. The failure history 
    of the system can be taken into account to troubleshoot further. A brief 
    discussion of a "bad writer" is in the Sun Fire 15K Architecture document.

- Summary of part number and patch IDs

    
- References and bug IDs

    SunSolve Article 48122 
    http://webhome.eng.sun.com/alanc/sunfire/arch/sf15Karch.pdf

- Additional background information:

    On occasion, Rstops have been encountered during LPOST where LPOST does not 
    identify a DIMM. One theory is that a memory access beyond the visibility of 
    LPOST encountered the ECC error; for example, when the AXQ accesses memory to 
    fetch MTags.

- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, starcat, rstop, Slot0 asserted EccErr, enabled to cause Rstop  
                           

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.