SRDB ID   Synopsis   Date
48197   Sun Fire[TM] 12K/15K: Dstop: Select command parity error   31 Oct 2002

Status Issued

Description
- Problem Statement: 

    Dstop: Select command parity error

- Symptoms:

    'wfail' output reports something similar to the following:

       01  redxl> dumpf load dsmd.dstop.020429.0840.40
       02  Created Mon Apr 29 08:40:41 2002
       03  By hpost v. 1.2 Generic 112488-03 Feb 15 2002 13:40:50  executing as pid=6794
       04  On ssc name =  rasputin-sc0.SD_RASCAL.West.Sun.COM
       05  Domain =  0=A    Platform = rasputin
       06  Boards in dump: master SC    CPs/CSBs[1:0]: 1     Requested/not enabled: 2
       07            EXB[17:0]: 12100
       08          Slot0[17:0]: 12100
       09          Slot1[17:0]: 12100
       10  'Not enabled' refers to the Console Bus master port on the parent board.
       11  -D option, -d
       12  "DSMD DomainStop Dump"
       13  0 errors occurred while creating this dump.
       14  redxl> wfail
       15  SDI EX08/S0  Master_Stop_Status0[31:0] = C00400CF
       16          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
       17  SDI EX08/S0  Dstop0[31:0] = 30019000
       18          Dstop0[16]: D    DARB texp requests all Dstop (M)
       19          Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
       20          Dstop0[29]: D    Slot1 asserted Error, enabled to cause Dstop (M)
       21  EPLD SB08  Err1_Dom0: Mask= 00  Err= 41  1stErr= 40
       22          Err1[0]:      Error reported by AR
       23          Err1[6]:  1E+ Error reported by BBC0
       24  BBC SB08/BB0   Device_Err_Stat[31:0] = 80008010
       25          DevErr[    4]:   1E  DCDS asserted error
       26  DCDSs SB08/DG0  slice 5  CPU[1:0]_Cmd_Err[22:0] = 008008  008008
       27          C0CE[    3]:   1E  C0 Select command parity error
       28          C1CE[    3]:   1E  C1 Select command parity error
       29  FAIL Port SB8/P0:  Dstop detected by DCDS.
       30  Primary service FRU is Slot SB8.
       31  FAIL Port SB8/P1:  Dstop detected by DCDS.
       32  Primary service FRU is Slot SB8.
       33  SDI EX13/S0: All SDI is DStopped and RStopped,         requested by DARB.
       34  SDI EX16/S0: All SDI is DStopped and RStopped,         requested by DARB.
            

SOLUTION SUMMARY:
- Troubleshooting:

    The dump header tells us that this Dstop was generated by dsmd (lines 11,12) 
    while a domain was active. This is also evident from the dumpf file name - 
    dsmd.dstop files are created by dsmd as part of an ASR. Walking the
    error chain:

     - The SDI on EX8 calls for Dstop as directed by its Slot 0 board, SB8 (line 19).
       There is also a Slot 1 error asserted, but it is not the first error (line 20).
     - The EPLD on SB8 indicates that BBC0 asserted the first error (line 23).
     - BBC0 indicates that the DCDS asserted error (line 25).
     - DCDS slice 5 reports select command parity errors (lines 26-28).
     - The DCDSs off of BBC0 serve processors 0 and 1. Hence, 'wfail' FAILs
       SB8/P0 and SB8/P1 (lines 29,31).
     - The FRU called out is SB8 (lines 30,32).
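
    The bit positions that 'wfail' annotates in the chain above can be
    cross-checked against the raw register values. The following is a minimal
    sketch, not part of redx: the helper name set_bits is hypothetical, and the
    register values are copied from the dump (lines 17, 21, 24, 26).

```python
def set_bits(value: int) -> list[int]:
    """Return the positions of all bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Register values copied from the wfail output.
dstop0 = 0x30019000        # SDI EX08/S0 Dstop0[31:0]
epld_err = 0x41            # EPLD SB08 Err1_Dom0, Err
epld_first_err = 0x40      # EPLD SB08 Err1_Dom0, 1stErr
bbc_dev_err = 0x80008010   # BBC SB08/BB0 Device_Err_Stat[31:0]
dcds_cmd_err = 0x008008    # DCDS slice 5, CPU0 and CPU1 Cmd_Err[22:0]

for name, value in [
    ("Dstop0", dstop0),
    ("EPLD Err", epld_err),
    ("EPLD 1stErr", epld_first_err),
    ("BBC Device_Err_Stat", bbc_dev_err),
    ("DCDS Cmd_Err", dcds_cmd_err),
]:
    print(f"{name:20s} {value:#010x} bits {set_bits(value)}")
```

    The bits called out in the walk-through (Dstop0[16], [28], [29]; Err1[0],
    [6]; DevErr[4]; CxCE[3]) are all present in the decoded values. 'wfail'
    does not annotate every set bit, and this sketch assigns no meaning to
    the bits it leaves unannotated.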

    The DCDSs are slave ASICs, and all transactions are controlled via select
    commands sourced by processors. The select lines are parity protected. DCDS
    slice 5 is configured as the parity checker, hence its detection of the error.
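
    As an illustration of how a parity checker detects a corrupted command,
    consider the sketch below. The source does not state the select command
    width or the parity polarity, so the 8-bit command, the even-parity scheme,
    and the function names are assumptions chosen only for illustration.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit for word: 1 when word contains an odd number of 1s."""
    return bin(word).count("1") & 1


def parity_ok(word: int, received_parity: int) -> bool:
    """True when the parity recomputed from word matches the received bit."""
    return parity_bit(word) == received_parity


command = 0b10110010                 # hypothetical select command
sent_parity = parity_bit(command)    # parity generated at the source

# A clean transfer passes the check.
assert parity_ok(command, sent_parity)

# A single bit flipped on the way to the checker is detected.
corrupted = command ^ 0b00000100
assert not parity_ok(corrupted, sent_parity)
```

    A single parity bit detects any odd number of flipped bits but cannot
    identify which line failed, which is consistent with the board, rather
    than an individual component, being called out as the FRU.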

    The select line pathways between the processors and DCDSs are entirely contained 
    within the system board, so the board is the FRU. In the general case, this error 
    could also occur on a MaxCPU board.

- Resolution:

    Repair/replace SB8.

    In general, repair/replace the board reporting the error.

- Summary of part numbers and patch IDs

    http://infoserver.central.sun.com/data/syshbk/Devices/System_Board/SYSBD_SunFire_USIIICu.html
    http://infoserver.central.sun.com/data/sshandbook/Devices/CPU_Module/UltraSPARC_MaxCPU.html

- References and bug IDs

    SunSolve Article 48122
    15K System Controller Specification

- Additional background information:

    The dump header indicates a communication problem with CSB 1: it was
    requested but not enabled (line 06). "Not enabled" means that console bus
    access to this component was disabled (line 10). Console bus fans out from
    the SCM ASICs. The SCMs are physically located on the CSBs, but are part
    of the SC's power domain. So even if a CSB is powered off, the SCMs still
    have power.

    Examining the SCMs, we do not see CSB 1 enabled (lines 40, 56):

       35  redxl> shscm 0 
       36  Note: Data is displayed from the currently loaded dump file.
       37  scm  0   Component ID = 215C007D
       38          DevTemp[8:0] = 041:  Valid  43.83 DegC
       39          CBus_Config[31:0] = 3FFF8103
       40         0x103   MasterPortEnbl[9:0]    CbCnf[9:0]     EXBs 06100
       41             0   CBH_SlavePortEnbl      CbCnf[13]      
       42             0   ShortTimeout           CbCnf[14]      
       43        0x7FFF   PortErrMask[14:0]      CbCnf[29:15]   
       44             0   DisableArb             CbCnf[30]      With other SC's CBH
       45             0   ForceBusy              CbCnf[31]      To other SC's CBH
       46          ResetStat[26:0]   = 01000000
       47          SCM_Mapping_Reg[5:0] = 10
       48          CBus_PortErr[ 0][25:0] = 0000000     (EXB 14 (master))
       49          CBus_PortErr[ 1][25:0] = 0000000     (EXB 13 (master))
       50          CBus_PortErr[ 8][25:0] = 0000000     (EXB  8 (master))
       51  redxl> shscm 1
       52  Note: Data is displayed from the currently loaded dump file.
       53  scm  1   Component ID = 215C007D
       54          DevTemp[8:0] = 03F:  Valid  42.50 DegC
       55          CBus_Config[31:0] = 3FFF8030
       56         0x030   MasterPortEnbl[9:0]    CbCnf[9:0]     EXBs 10000  CSBs 1
       57             0   CBH_SlavePortEnbl      CbCnf[13]      
       58             0   ShortTimeout           CbCnf[14]      
       59        0x7FFF   PortErrMask[14:0]      CbCnf[29:15]   
       60             0   DisableArb             CbCnf[30]      With other SC's CBH
       61             0   ForceBusy              CbCnf[31]      To other SC's CBH
       62          ResetStat[26:0]   = 01000000
       63          SCM_Mapping_Reg[5:0] = 11
       64          CBus_PortErr[ 4][25:0] = 0000000     (CSB  0 (master))
       65          CBus_PortErr[ 5][25:0] = 0000000     (EXB 16 (master))
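
    The CBus_Config values can be split into their fields to confirm which
    master ports are enabled. This is a minimal sketch, not a redx feature: the
    field positions are taken from the CbCnf[...] annotations in the output
    above, while the names FIELDS, decode and enabled_ports are hypothetical.

```python
# (least significant bit, width) for each field, from the CbCnf[...] columns.
FIELDS = {
    "MasterPortEnbl": (0, 10),
    "CBH_SlavePortEnbl": (13, 1),
    "ShortTimeout": (14, 1),
    "PortErrMask": (15, 15),
    "DisableArb": (30, 1),
    "ForceBusy": (31, 1),
}


def decode(reg: int) -> dict[str, int]:
    """Extract each named field from a 32-bit CBus_Config value."""
    return {
        name: (reg >> lsb) & ((1 << width) - 1)
        for name, (lsb, width) in FIELDS.items()
    }


def enabled_ports(mask: int) -> list[int]:
    """List the master port numbers whose enable bit is set."""
    return [port for port in range(10) if (mask >> port) & 1]


scm0 = decode(0x3FFF8103)   # CBus_Config from 'shscm 0'
scm1 = decode(0x3FFF8030)   # CBus_Config from 'shscm 1'

print("scm 0 ports:", enabled_ports(scm0["MasterPortEnbl"]))
print("scm 1 ports:", enabled_ports(scm1["MasterPortEnbl"]))
```

    According to the CBus_PortErr lines, ports 0, 1 and 8 on scm 0 are EXB 14,
    EXB 13 and EXB 8, and ports 4 and 5 on scm 1 are CSB 0 and EXB 16. No
    enabled port corresponds to CSB 1.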

    The platform logs for this system should be examined to determine why
    CSB 1 was deconfigured.

- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, starcat, dstop, Select command parity error            

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.