SRDB ID   Synopsis   Date
48124   Sun Fire[TM] 12K/15K: Rstop: Slot1 data parity bit 1 error   30 Oct 2002

Status Issued

Description
- Problem Statement:

	Rstop:  Slot1 data parity bit 1 error

- Symptoms:

    'wfail' output reports something similar to the following:

       01 redxl> dumpf load dsmd.rstop.020915.1751.52
       02 Created Sun Sep 15 17:51:53 2002
       03 By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15  executing as pid=13269
       04 On ssc name =  starcat1-sc0.bestbuy.com
       05 Domain =  5=F = ds01ux    Platform = starcat1
       06 Boards in dump: master SC    CPs/CSBs[1:0]: 3
       07          EXB[17:0]: 0E000
       08        Slot0[17:0]: 0E000
       09        Slot1[17:0]: 0E000
       10 -D option, -d
       11 "DSMD RecordStop Dump"
       12 0 errors occurred while creating this dump.
       13 redxl> wfail
       14 SDI EX13/S0: SDI is RStopped, requested by DARB.
       15 SDI EX14/S0: SDI is RStopped, requested by DARB.
       16 SDI EX14/S0  Recordstop0[31:0]  = 04018001
       17         Rstop0[16]: R 1E DARB texp request Recordstop (M)
       18         Rstop0[26]: R    Slot0 asserted EccErr, enabled to cause Rstop (M)
       19 SDI EX15/S0  Master_Stop_Status0[31:0] = 80040008
       20         MStop0[3]: SDI is Recordstopped
       21 SDI EX15/S0  Recordstop0[31:0]  = 00010001
       22         Rstop0[16]: R    DARB texp request Recordstop (M)
       23 SDI EX15/S0  Recordstop1[31:0]  = 00408040
       24         Rstop1[22]: R 1E SDI Slave 1 requested all Recordstop
       25 SDI EX15/S1  Master_Stop_Status0[31:0] = 00000008
       26         MStop0[3]: SDI is Recordstopped
       27 SDI EX15/S1  Recordstop0[31:0]  = 00408040
       28         Rstop0[22]: R 1E SDI internal Slot1 port request Recordstop
       29 SDI EX15/S1  Slot1_Error1[31:0] = 2000A000  Mask = FFFF4FFF
       30         S1Err1[29]: R 1E Slot1 data parity bit 1 error
       31             slt1_datap[1:0], slt1_data[23:0] = 3 000020
       32 FAIL Slot IO15:  Dstop/Rstop detected by SDI EX15/S1
       33 Primary service FRU is Slot IO15.
       34 Secondary service FRU is EXB EX15.
       35 DARB C0: enabled ports (expanders)          [17:0]: 0EDFF
       36 DARB C0: exps request Rstop                 [17:0]: 08000
       37 DARB C0: other darb req Rstop for exps      [17:0]: 08000
       38 DARB C1: enabled ports (expanders)          [17:0]: 0EDFF
       39 DARB C1: exps request Rstop                 [17:0]: 08000
       40 DARB C1: other darb req Rstop for exps      [17:0]: 08000
                              

SOLUTION SUMMARY:
- Troubleshooting:

    The dump header shows that this Rstop was generated by dsmd (lines 10,11) while
    a domain was active. The dump file name indicates the same thing: dsmd.rstop files
    are created by dsmd as part of error capturing. Walking the error chain:

     - EX14 shows errors, but the first error is the DARB requesting the stop (lines 16-18).
     - EX15/S0 (SDI0) reports a first error from slave SDI1 (line 24).
     - EX15/S1 (SDI1) reports a first error of a parity bit 1 error from slot 1 (line 30).
     - 'wfail' then recommends failing IO15 (line 32) to avoid the error.
     - IO15 and EX15 are identified as the primary and secondary FRUs (lines 33,34).
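
    The bracketed numbers that 'wfail' prints (for example S1Err1[29]) are bit
    positions in the 32-bit register value, and the [17:0] fields are bit masks.
    The following is a minimal sketch for checking those values by hand. It is not
    part of redx or SMS; the helper names are hypothetical, and the mapping of mask
    bit n to expander EXn is an assumption that happens to agree with the expanders
    named in this dump.

```python
def set_bits(value):
    """Return the positions of the 1 bits in value (bit 0 = least significant)."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def expanders(mask):
    """Name the expanders selected by an [17:0] mask.

    Assumption: bit n of the mask corresponds to expander EXn.
    """
    return [f"EX{bit}" for bit in set_bits(mask)]


# Values copied from the dump above.
print(set_bits(0x2000A000))  # Slot1_Error1 (line 29) -> [13, 15, 29]
print(expanders(0x0E000))    # EXB[17:0] (line 07)    -> ['EX13', 'EX14', 'EX15']
print(expanders(0x08000))    # exps request Rstop (line 36) -> ['EX15']
```

    Only bit 29 of Slot1_Error1 is described by 'wfail' in this dump; the sketch
    makes no claim about the meaning of the other set bits.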

    A parity error detected by EX15/SDI1 implies that a bit error occurred in data
    in transit between the DXs on IO15 and SDI1 on EX15. The error therefore happened
    across an interconnect, and both IO15 and EX15 are suspect.
    
    Data path parity for the SDIs is used only as an interconnect diagnostic tool.
    Because the underlying data is protected by ECC, and each SDI has information
    only on its own data slice (i.e., multi-bit errors are unknown to a single SDI),
    data is allowed to pass despite parity errors. The SDI still records the parity
    error in case further diagnosis is needed.
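
    To illustrate why parity serves only as a detection aid while ECC protects the
    data, here is a generic sketch of a single parity bit over one data slice. The
    actual SDI parity scheme (polarity, and which data bits each parity bit covers)
    is not described in this document, so even parity and the 12-bit sample value
    are assumptions chosen purely for illustration.

```python
def parity(word):
    """Return 1 if word contains an odd number of 1 bits, else 0 (even parity)."""
    return bin(word).count("1") & 1


def parity_ok(data, stored_parity):
    """True when the data still agrees with the parity bit sent alongside it."""
    return parity(data) == stored_parity


data = 0b101100100001                     # hypothetical 12-bit data slice
sent_parity = parity(data)                # parity bit computed by the sender

one_flip = data ^ (1 << 5)                # single bit corrupted in transit
two_flips = data ^ (1 << 5) ^ (1 << 9)    # two bits corrupted in transit

print(parity_ok(data, sent_parity))       # clean transfer passes the check
print(parity_ok(one_flip, sent_parity))   # a single flip is detected
print(parity_ok(two_flips, sent_parity))  # two flips cancel out and go unnoticed
```

    Parity can flag that a transfer was damaged but cannot say which bit changed
    or repair it, which is why correction is left to ECC.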

    Since the data is allowed to pass, Solaris[TM] will also record an ECC event. For example:

       41 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 729643 kern.notice] NOTICE: 
       42   [AFT0] Corrected system bus (CE) Event on CPU448 at TL=0, errID 0x0000593b.a4ebc76b
       43 Sep 15 17:51:52 ds01ux     AFSR 0x00000002<CE>.000001b8 AFAR 0x00000400.fef171d0
       44 Sep 15 17:51:52 ds01ux     Fault_PC 0x103484dc Esynd 0x01b8 Not memory
       45 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 317326 kern.notice] [AFT0] errID 0x0000593b.a4ebc76b 
       46   Data Bit 127 was in error and corrected
       47 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 370641 kern.info] [AFT2] errID 0x0000593b.a4ebc76b
       48   E$tag PA=0x000001e1.9af171c0 does not match AFAR=0x00000400.fef171c0
       49 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 684557 kern.info] [AFT2] errID 0x0000593b.a4ebc76b 
       50   PA=0x000001e1.9af171c0
       51 Sep 15 17:51:52 ds01ux     E$tag 0x00000786.6b000001 E$state_7 Invalid

    In this example, Solaris corrects the single-bit error. The Rstop file can pinpoint
    where in the interconnect the bit error occurred.
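
    The Solaris event and the Rstop dump can be matched by time. The sketch below
    assumes that the dump file name encodes its creation time as YYMMDD.HHMM.SS,
    which is inferred from this one example (dsmd.rstop.020915.1751.52 against a
    header time of Sep 15 17:51:53 2002) rather than from SMS documentation. The
    function names and the five-second window are hypothetical.

```python
from datetime import datetime, timedelta


def rstop_time(filename):
    """Parse the timestamp from a dsmd.rstop file name.

    Assumption: the name ends in YYMMDD.HHMM.SS, as the example in this
    article suggests.
    """
    stamp = filename.split("dsmd.rstop.", 1)[1]
    return datetime.strptime(stamp, "%y%m%d.%H%M.%S")


def syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line; syslog omits the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def near(filename, line, window_s=5):
    """True if the syslog line falls within window_s seconds of the dump time."""
    dump = rstop_time(filename)
    return abs(syslog_time(line, dump.year) - dump) <= timedelta(seconds=window_s)


msg = "Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 729643 kern.notice] NOTICE:"
print(near("dsmd.rstop.020915.1751.52", msg))
```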

- Resolution:

    First, judge the severity of the error. If the error was uncorrectable and
    resulted in a domain interruption, a component replacement is in order. If the
    error is correctable but recurs relatively often, a replacement may still be
    best to avoid a future interruption.
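
    One way to gauge how often a correctable error recurs is to count the CE events
    in the domain's system log. The following is a minimal sketch that matches only
    the message text shown in the example above; other Solaris releases may word
    the message differently. The function name is hypothetical, and no threshold
    for "relatively often" is implied.

```python
import re
from collections import Counter

# Matches the "[AFT0] Corrected system bus (CE) Event on CPUnnn" text shown above.
CE_EVENT = re.compile(r"Corrected system bus \(CE\) Event on (CPU\d+)")


def count_ce_events(log_text):
    """Return a Counter mapping each CPU name to its number of CE events."""
    return Counter(CE_EVENT.findall(log_text))


sample = (
    "Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 729643 kern.notice] NOTICE: "
    "[AFT0] Corrected system bus (CE) Event on CPU448 at TL=0, "
    "errID 0x0000593b.a4ebc76b\n"
)
print(count_ce_events(sample))
```

    On a live domain the text would typically be read from the system log file
    (for example /var/adm/messages) rather than from a string.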

    If a replacement is warranted, follow the suggestion of 'wfail'. Start with the IO
    board. During replacement, examine the interconnect for pin damage. If exchanging
    the IO board does not correct the problem, replace the Expander, the secondary FRU.

- Summary of part numbers and patch IDs

    501-5179 Expander
    http://infoserver.central.sun.com/data/sshandbook/Devices/I_O/IO_SunFire_15K_hsPCI_IO_Board.html
    
- References and bug IDs

    SunSolve Article 48122 

- Additional background information:

- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, starcat, rstop, Slot1 data parity bit 1 error                              

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.