SRDB ID   Synopsis   Date
47420   Sun Fire[TM] 12K/15K: Rstop: CP1 half data parity error   22 Nov 2002

Status Issued

Description
- Problem Statement/Title: SF15K Troubleshooting Article:

        Rstop: CP1 half data parity error

- Symptoms:

        'wfail' output reports something similar to the following:

           01  redxl> dumpf load dsmd.rstop.020804.1527.51
           02  Created Sun Aug  4 15:27:51 2002
           03  By hpost v. 1.2 Generic 112488-05 May  8 2002 17:05:18  executing as pid=14770
           04  On ssc name =  sms01-sc0.
           05  Domain =  1=B = cris01    Platform = ocesf15k1
           06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
           07            EXB[17:0]: 047E0
           08          Slot0[17:0]: 007E0
           09          Slot1[17:0]: 041E0
           10  -D option, -d
           11  "DSMD RecordStop Dump"
           12  0 errors occurred while creating this dump.
           13  redxl> wfail
           14  SDI EX05/S0: SDI is RStopped, requested by DARB.
           15  SDI EX06/S0: SDI is RStopped, requested by DARB.
           16  SDI EX07/S0: SDI is RStopped, requested by DARB.
           17  SDI EX08/S0: SDI is RStopped, requested by DARB.
           18  SDI EX09/S0  Master_Stop_Status0[31:0] = F0040008
           19          MStop0[3]: SDI is Recordstopped
           20  SDI EX09/S0  Recordstop0[31:0]  = 00010001
           21          Rstop0[16]: R    DARB texp request Recordstop (M)
           22  SDI EX09/S0  Recordstop1[31:0]  = 00018001
           23          Rstop1[16]: R 1E SDI Slave 3 requested all Recordstop
           24  SDI EX09/S3  Master_Stop_Status0[31:0] = 00000008
           25          MStop0[3]: SDI is Recordstopped
           26  SDI EX09/S3  Recordstop0[31:0]  = 00108010
           27          Rstop0[20]: R 1E SDI internal CP port request Recordstop
           28  SDI EX09/S3  CP_Error0[31:0]    = 10009000  Mask = 7F3F67FF
           29          CPErr0[28]: R 1E CP1 half data parity error
           30              {cp1_datap,cp1_data[24:0]} = 126001C
           31  FAIL EXB EX9 with Data Bus C1:  Dstop/Rstop detected by SDI EX9/S3.
           32  Primary service FRU is EXB EX9.
           33  Secondary service FRU is CSB C1 or the logic centerplane.
           34  SDI EX10/S0: SDI is RStopped, requested by DARB.
           35  SDI EX14/S0: SDI is RStopped, requested by DARB.
           36  DARB C0: enabled ports (expanders)          [17:0]: 047E3
           37  DARB C0: exps request Rstop                 [17:0]: 00200
           38  DARB C0: other darb req Rstop for exps      [17:0]: 00200
           39  DARB C1: enabled ports (expanders)          [17:0]: 047E3
           40  DARB C1: exps request Rstop                 [17:0]: 00200
           41  DARB C1: other darb req Rstop for exps      [17:0]: 00200
           42  DARB C1 Port  9 InterAsicStatus[31:0] = 00201015
           43          IAStat[12]: DMX D0 requests Recordstop for this exp
            

SOLUTION SUMMARY:
- Troubleshooting:

        The dump header shows that this Rstop dump was generated by dsmd (lines 10,11)
        while a domain was active. The dumpf file name confirms this: dsmd.rstop
        files are created by dsmd to capture the error state. Walking the
        error chain:

         - The master SDI on EX9 is directed to Rstop by slave SDI3 (line 23).
         - Slave SDI3 on EX9 reports a parity error on data received from the
           high half of the centerplane (lines 27-30).
         - EX9 using data bus 1 is FAILed from the configuration (line 31).
         - EX9 is named as the primary FRU, and CSB C1 or the logic centerplane
           as the secondary FRU (lines 32,33).

        A parity error detected by EX9/SDI3 implies that a bit was corrupted while the
        data was in transit between the DMXs on the high half of the centerplane and
        SDI3 on EX9. Since the error happened while crossing an interconnect, both EX9
        and CP1's data bus are suspect.
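
        The [17:0] expander masks in the dump header and the DARB lines can be
        decoded by hand, but a short script makes the cross-check against the
        'wfail' output quicker. The following is a minimal sketch in Python; the
        helper name set_bits is hypothetical and is not part of redx. It assumes
        only that bit N of each mask corresponds to expander N, which agrees with
        the EX05-EX10 and EX14 entries reported by 'wfail'.

```python
def set_bits(hex_mask, width=18):
    """Return the bit positions set in a hex mask such as '047E0'."""
    value = int(hex_mask, 16)
    return [bit for bit in range(width) if value & (1 << bit)]


# Values copied from the listing above.
exb_in_dump = set_bits("047E0")      # EXB[17:0], line 07
rstop_request = set_bits("00200")    # DARB "exps request Rstop", lines 37 and 40

print("Expanders in dump:", exb_in_dump)
print("Expanders requesting Rstop:", rstop_request)
```

        Run against this dump, the second mask resolves to expander 9 alone,
        which matches the FRU named on line 31.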

        Data path parity for the SDIs is used only as an interconnect diagnostic tool.
        Because the underlying data is protected by ECC, and each SDI only has
        information on its own data slice (i.e. multi-bit errors are unknown to a
        single SDI), data is allowed to pass despite parity errors. However, the SDI
        records the parity error in case further diagnosis is needed.

- Resolution:

        Judge how often the error occurs. A single occurrence does not warrant a
        component replacement. If the error repeats relatively often, a replacement
        may be the best way to avoid a future interruption. Also consider whether
        the implicated expander was recently installed or serviced.
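
        One way to gauge frequency is to count the dsmd.rstop dump files by the
        timestamp embedded in their names. The sketch below is in Python and
        assumes the name layout dsmd.rstop.YYMMDD.HHMM.SS, inferred from the
        file name on line 01 and the creation time on line 02. Only the first
        file name comes from this article; the other two names, the helper
        names, and the 30-day window are hypothetical and chosen purely for
        illustration.

```python
from datetime import datetime, timedelta


def dump_time(filename):
    """Parse the timestamp from a name like dsmd.rstop.020804.1527.51."""
    date_part, hhmm, ss = filename.split(".")[-3:]
    return datetime.strptime(date_part + hhmm + ss, "%y%m%d%H%M%S")


def count_recent(filenames, now, window_days=30):
    """Count rstop dumps created within window_days before 'now'."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for name in filenames
               if ".rstop." in name and dump_time(name) >= cutoff)


names = [
    "dsmd.rstop.020804.1527.51",   # from this article
    "dsmd.rstop.020720.0910.03",   # hypothetical
    "dsmd.rstop.020512.2201.40",   # hypothetical
]
recent = count_recent(names, now=datetime(2002, 8, 5))
print("Rstop dumps in the last 30 days:", recent)
```

        A file count only shows how often Recordstops occurred. Each dump still
        has to be loaded and checked with 'wfail' to confirm that it reports the
        same error against the same expander.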

        If a replacement is warranted, follow the suggestion of 'wfail'. Start with the
        expander board. During replacement, examine the interconnect for pin damage. If
        exchanging the expander board does not correct the problem, move on to the
        secondary FRU named by 'wfail' (CSB C1 or the logic centerplane).

- Summary of part numbers and patch IDs

        http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html

- References and bug IDs

        SRDB 48122

- Additional background information:

        Using the DMX history and SDI error capture information, it may be possible to
        determine which bit was in error. In this example, DMX 1.0 (C1/D0) also flagged
        a Recordstop (line 43). Its history register for expander 9 shows:

           44  redxl> shdmx -s 1 0 h 0x00200
           45  Note: Data is displayed from the currently loaded dump file.
           46  DMX C1/D0   ECC-compressed output history[8:0] to SDIs.
           47     11        10         9         8         7         6     
           48  OE P ECC  OE P ECC  OE P ECC  OE P ECC  OE P ECC  OE P ECC  entry
           49                       0 1  52                                 0   old
           50                       0 1  52                                 1
           51                       0 1  52                                 2
           52                       0 1  52                                 3
           53                       0 1  52                                 4
           54                       0 1  52                                 5
           55                       0 1  52                                 6
           56                       0 1  52                                 7
           57                       0 1  52                                 8
           58                       0 1  52                                 9
           59                       0 1  52                                 10
           60                       0 1  52                                 11
           61                       0 1  52                                 12
           62                       0 1  52                                 13
           63                       0 1  52                                 14
           64                       1 1  52                                 15
           65                       1 0  1B                                 16
           66                       1 0  50                                 17
           67                       1 1  2A                                 18
           68                       0 0  6A                                 19
           69                       0 0  6A                                 20
           70                       0 0  6A                                 21
           71                       0 0  6A                                 22
           72                       0 0  6A                                 23
           73                       0 0  6A                                 24
           74                       0 0  6A                                 25
           75                       0 0  6A                                 26<
           76                       0 0  6A                                 27   new
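
        The history listing above can also be scanned with a short script to
        find the marked entry and the entries where the ECC value changes. The
        following Python sketch assumes each row carries the fields OE, P, ECC,
        and entry number in that order, with "<" appended to the marked entry.
        The rows are copied from lines 63-68, 75, and 76 with the column padding
        removed, and the function name parse_rows is hypothetical.

```python
ROWS = """\
0 1  52    14
1 1  52    15
1 0  1B    16
1 0  50    17
1 1  2A    18
0 0  6A    19
0 0  6A    26<
0 0  6A    27   new
"""


def parse_rows(text):
    """Parse 'OE P ECC entry' rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 4:
            continue
        oe, parity, ecc, entry = tokens[:4]
        rows.append({
            "oe": int(oe),
            "p": int(parity),
            "ecc": int(ecc, 16),
            "entry": int(entry.rstrip("<")),
            "marked": entry.endswith("<"),
        })
    return rows


rows = parse_rows(ROWS)
marked = [r for r in rows if r["marked"]]
# Entries whose ECC differs from the row before them in this excerpt.
changes = [cur["entry"] for prev, cur in zip(rows, rows[1:])
           if cur["ecc"] != prev["ecc"]]

print("Marked entry:", marked[0]["entry"], "ECC =", format(marked[0]["ecc"], "02X"))
print("ECC changes at entries:", changes)
```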

        The ECC history entry indicated by "<" (line 75) can be compared to the 
        SDI data capture (line 30) using 'parse dmxoh'.

           77  redxl> parse dmxoh 126001C 6a
           78  SDI capture[24:0] = 126001C. Computed ecc = 52.  DMX hist ecc = 6A.
           79  Could be a 1-bit error in bit 5 (as used to compute DMX oh ecc).
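
        The two ECC values reported on line 78 can be compared directly. In a
        conventional ECC scheme the XOR of the computed and recorded check bits
        forms a syndrome that identifies the failing bit. The Python sketch
        below shows only that XOR step. It is an assumption that 'parse dmxoh'
        works this way, and the table that maps a syndrome to a data bit (bit 5
        here, per line 79) is internal to redx and is not reproduced.

```python
computed_ecc = 0x52   # "Computed ecc" from line 78
history_ecc = 0x6A    # "DMX hist ecc" from line 78

# XOR leaves a 1 in every check-bit position where the two values disagree.
difference = computed_ecc ^ history_ecc
differing_bits = [bit for bit in range(8) if difference & (1 << bit)]

print(f"difference = {difference:02X}")
print("check bits that differ:", differing_bits)
```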

- Keywords

        15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K,
        Dstop, CP1 half data parity error
            

INTERNAL SUMMARY:

SUBMITTER: Joshua Freeman
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.