SRDB ID   Synopsis   Date
50095   Sun Fire[TM] 12K/15K: Dstop: Port 0 Safari device asserted error   23 Jan 2003

Status Issued

Description
- Problem Statement/Title: SF15K Troubleshooting Article:

        Dstop: Port 0 Safari device asserted error 

- Symptoms:

        'wfail' output reports something similar to the following:

           01  redxl> dumpf load dsmd.dstop.021220.1126.07
           02  Created Fri Dec 20 11:26:08 2002
           03  By hpost v. 1.2 Generic 112488-09 Sep 17 2002 13:34:28  executing as pid=15485
           04  On ssc name:  p015-sc0.
           05  Domain =  0=A    Platform = p015
           06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
           07            EXB[17:0]: 03FFF
           08          Slot0[17:0]: 03FFF
           09          Slot1[17:0]: 0003F     Requested/not enabled: 001C0
           10  'Not enabled' refers to the Console Bus master port on the parent board.
           11  -D option, -d
           12  "DSMD DomainStop Dump"
           13  0 errors occurred while creating this dump.
           14  redxl> wfail
           15  SDI EX12/S0  Master_Stop_Status0[31:0] = A000004F
           16          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
           17  SDI EX12/S0  Dstop0[31:0] = 10029000
           18          Dstop0[17]: D    DARB texp requests Slot0 Dstop (M)
           19          Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
           20  EPLD SB12  Err1_Dom0: Mask= 00  Err= 82  1stErr= 80
           21          Err1[1]:      Error reported by SDC
           22          Err1[7]:  1E+ Error reported by BBC1
           23  BBC SB12/BB1   Device_Err_Stat[31:0] = 80018100
           24          DevErr[   16]:       Port 0 bootbus command or protocol error
           25          DevErr[    8]:   1E  Port 0 Safari device asserted error
           26  Proc SB12/P2 (12.0.2) EmuShad[0:78] = 0000 00000000 00000000   (Note rev order)
           27  FAIL Port SB12/P2:  Dstop detected by BBC SB12/BB1.
           28  Primary service FRU is Slot SB12.
           29  DARB C0: enabled ports (expanders)          [17:0]: 03FFF
           30  DARB C0: other darb req Dstop+Rstop for exps[17:0]: 01000
           31  DARB C1: enabled ports (expanders)          [17:0]: 03FFF
           32  DARB C1: other darb req Dstop+Rstop for exps[17:0]: 01000                  

SOLUTION SUMMARY:
- Troubleshooting:

        The dump header tells us that this Dstop was generated by dsmd (lines 11,12) while
        a domain was active. This is also evident from the dump file name: dsmd.dstop files are
        created by dsmd as part of error capture. Walking the error chain:

         - Not all the Slot1 boards were collected in the dump file (lines 09,10).
           See Additional Background Information for more on this.
         - EX12/S0 (SDI0) reports a first error from its Slot0 board SB12 (line 19).
         - SB12's EPLD reports a first error from BBC1 (line 22).
         - SB12/BBC1 indicates that Port 0 asserted an error (line 25). Note that for
           BBC1, port 0 is processor 2.
         - SB12/P2 is displayed, but no errors are logged in the CPU (line 26).
         - 'wfail' then recommends failing SB12/P2 (line 27) to avoid the error.
         - SB12 is identified as the primary service FRU (line 28).
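
        The bit positions that 'wfail' annotates in the chain above can be cross-checked
        against the raw register values. The following is a minimal sketch, not part of
        redxl: the helper name set_bits is hypothetical, and the register values are
        copied from lines 17, 20 and 23 of the dump. It only lists which bits are set;
        the meaning of any bit that 'wfail' does not annotate is not given in this article.

```python
def set_bits(value: int) -> list[int]:
    """Return the positions of all 1 bits in value, lowest bit first."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]


# Register values copied from the 'wfail' output (dump lines 17, 20, 23).
registers = [
    ("Dstop0", 0x10029000),           # SDI EX12/S0 Dstop0[31:0]
    ("Err1_Dom0 Err", 0x82),          # EPLD SB12 accumulated errors
    ("Err1_Dom0 1stErr", 0x80),       # EPLD SB12 first error
    ("Device_Err_Stat", 0x80018100),  # BBC SB12/BB1
]

for name, value in registers:
    print(f"{name:<18} {value:#010x} -> bits {set_bits(value)}")
```

        Bits 17 and 28 of Dstop0, bits 1 and 7 of the EPLD Err field, and bits 8 and 16
        of Device_Err_Stat are the ones 'wfail' reports on lines 18-25.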

        It is clear that SB12/BBC1 logged an error asserted by SB12/P2. But
        'wfail' does not present any information on the processor's error because
        the EmuShad register is zeroed. Looking at the processor:

           33  redxl> shproc 12 0 2
           34  Note: Data is displayed from the currently loaded dump file.
           35  Proc SB12/P2 (12.0.2)  ComponentID= 4919D07D  MaskID= 422  US3+_2.2  EPIC6cu
           36         PC[63:6],2b'0 = 00000000.0000000_       ([5:0] not available)
           37         tPC[63:0]     = FFFFFFFF.F0000040
           38         tnPC[63:0]    = FFFFFFFF.F0000044
           39         tl[2:0] = 2    tt[8:0] = 003: XIR
           40         tstate[39:0]  = 44.F0003505
           41         ASI[7:0] = F0   CWP[2:0] = 5   PState[11:0] = 035   CCR[7:0] = 44
           42         EmuShad[0:78] = 0000 00000000 00000000   (Note rev order)
           43         AFSR [63:0] = 00000000.00000000   AFAR [42:4] = 000.0000000_
           44         AFSR2[63:0] = 00000000.00000000   AFAR2[42:4] = 000.0000000_
           45          Data Ecc synd[8:0]  = 000: No Error.
           46          Mtag Ecc synd[3:0]  =   0: No Error.

        The tPC and tnPC (lines 37, 38) look odd. The upper word is FFFFFFFF. Also
        note the trap type is an XIR (line 39). Some external action issued a reset
        to this processor - either from the SC or a spurious signal within the SB.
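
        The trap state on lines 39-41 of the 'shproc' output can be checked by hand. The
        sketch below is an illustration only: it assumes the tstate value printed by
        redxl follows the SPARC V9 TSTATE layout (CCR in bits 39:32, ASI in 31:24,
        PSTATE in 19:8, CWP in 4:0), and the function and table names are hypothetical.
        The reset trap type names are the SPARC V9 ones; trap type 0x003 is the
        externally initiated reset that redxl abbreviates as XIR.

```python
# SPARC V9 reset trap types (tt values 0x001 through 0x005).
RESET_TRAP_TYPES = {
    0x001: "power_on_reset",
    0x002: "watchdog_reset",
    0x003: "externally_initiated_reset",
    0x004: "software_initiated_reset",
    0x005: "RED_state_exception",
}


def decode_tstate(tstate: int) -> dict[str, int]:
    """Split a 40-bit TSTATE value into its SPARC V9 fields (assumed layout)."""
    return {
        "CCR": (tstate >> 32) & 0xFF,    # bits 39:32
        "ASI": (tstate >> 24) & 0xFF,    # bits 31:24
        "PSTATE": (tstate >> 8) & 0xFFF,  # bits 19:8
        "CWP": tstate & 0x1F,            # bits 4:0 (the dump shows only CWP[2:0])
    }


# tstate[39:0] = 44.F0003505 from dump line 40, tt[8:0] = 003 from line 39.
fields = decode_tstate(0x44F0003505)
print({name: hex(value) for name, value in fields.items()})
print(RESET_TRAP_TYPES.get(0x003, "not a reset trap"))
```

        The decoded fields agree with the ASI, CWP, PState and CCR values that 'shproc'
        prints on line 41.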
	
        In this case, an XIR trap type is present. However, even if this information
        were absent, the dump would still point to a fault within the system board. The
        BBC clearly logs that the processor asserted an error, but the processor does
        not record a reason. Either the processor did not assert an error, in which case
        the BBC is at fault, or the processor failed to record the error, in which case
        the CPU is at fault. In either case, both components and their communication
        pathways are entirely contained within the SB.

- Resolution:

        Correlating with the platform/domain messages to determine whether an XIR
        was issued is a good step, since it would corroborate the trap type listed
        for the processor. But even without such evidence, replacement of SB12
        is still the best action.

- Summary of part numbers and patch IDs

        http://infoserver.central.sun.com/data/syshbk/Devices/System_Board/SYSBD_SunFire_USIIICu.html

- References and bug IDs

        SRDB 48122

- Additional background information:

        The dump header in this dump also noted that not all Slot1 boards were
        collected in the dump file. This means that POST believed these boards to
        be part of the domain (per the PCD), but was unable to collect the ASIC
        information. In general, when such messages are displayed, it is an indicator 
        to examine the platform/domain logs for events that may explain why the
        board(s) are unavailable.
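
        The 18-bit masks in the dump header and the DARB lines identify the boards
        involved. The sketch below is illustrative only: it assumes that bit n of each
        [17:0] mask corresponds to expander n, which is consistent with the DARB mask
        01000 (lines 30 and 32) in a dump where EX12 reported the Dstop. The helper name
        expanders is hypothetical.

```python
def expanders(mask: int, width: int = 18) -> list[int]:
    """Return the expander numbers whose bit is set in an 18-bit board mask."""
    return [n for n in range(width) if (mask >> n) & 1]


# Mask values copied from the dump header and DARB lines.
slot1_collected = 0x0003F    # Slot1[17:0] (line 09)
slot1_not_enabled = 0x001C0  # Requested/not enabled (line 09)
darb_dstop_req = 0x01000     # other darb req Dstop+Rstop (lines 30, 32)

print("Slot1 boards in dump:     ", expanders(slot1_collected))
print("Slot1 requested, missing: ", expanders(slot1_not_enabled))
print("Expanders requesting stop:", expanders(darb_dstop_req))
```

        Under that assumption, the Slot1 boards on expanders 6, 7 and 8 are the ones
        requested but not collected, and expander 12 is the one requesting the stop.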

        From the dump, it is clear that the console bus access to the missing Slot1
        boards is disabled. For example, examining the console bus repeater portion 
        of EX6/SDI4:

           47  redxl> shcbr exb 6
           48  Note: Data is displayed from the currently loaded dump file.
           49  EXBCBR EX6      Component ID = 00000000
           50          Port_Config_Stat[31:0] = 068683EF
           51             3   SC_PortEnbl[1:0]       CnfSt[1:0]     
           52             1   BBC_PortEnbl           CnfSt[2]       
           53             1   Slot_PortEnbl[1:0]     CnfSt[4:3]     
           54             3   SC_MaskErrs[1:0]       CnfSt[6:5]     
           55             1   BBC_MaskErrs           CnfSt[7]       
           56             3   Slot_MaskErrs[1:0]     CnfSt[9:8]     
           57             0   SC_PortBusy[1:0]       CnfSt[11:10]   
           58             0   BBC_PortBusy           CnfSt[12]      
           59             0   Slot_PortBusy[1:0]     CnfSt[14:13]   
           60             1   AXQ_CB_MaskErrs        CnfSt[15]      Rev 4+
           61          0x06   BoardId[4:0]           CnfSt[20:16]   
           62             4   Mode[2:0]              CnfSt[23:21]   EXB SDI4
           63             6   Chip Version[3:0]      CnfSt[27:24]   Rev 4+
           64             0   CBErr=>SCInt           CnfSt[28]      
           65             0   CBErr=>DStop           CnfSt[29]      
           66             0   Domain[1:0]            CnfSt[31:30]   
           67          AXQ_CBus_&_ForceErr[31:0]  = 01800000
           68      0x000000   ForceError[22:0]       AXQCB[22:0]    
           69             1   MaskAXQCBErr           AXQCB[23]      
           70             1   AXQ_CBusEnbl           AXQCB[24]      
           ...

        The port enable to IO6 is not active (line 53), and the SDI domain masks
        for IO6 are also clear (lines 85,86):

           71  redxl> shsdi -s 6
           72  Note: Data is displayed from the currently loaded dump file.
           73  SDI EX06/S0    Component ID = 64317049
           74          Master_Reset_Config[31:0] = 06000018
           75          Master_Stop_Config[31:0]  = 01000897
           76          Core_Config[21:0]   = 0DA3C2
           77          Sysreg_Config[23:0] = 200001
           78          STB_Config[23:0]    = 20010F
           79          Bogon_Config[63:0]  = 00000003 C03C0010
           80          CP_Config[20:0]     = 0F0F70
           81          Slot0Config[1:0][31:0,30:0] = 20000000 3CA2A150
           82          Slot1Config[1:0][31:0,31:0] = 0000E000 00000000
           83          Slot0_Domain_Mask[17:0]: Slot1 = 0003F  Slot0 = 03FFF
           84          Slot0_Expand_Mask[17:0]: Slot1 = 0003F  Slot0 = 03FFF
           85          Slot1_Domain_Mask[17:0]: Slot1 = 00000  Slot0 = 00000
           86          Slot1_Expand_Mask[17:0]: Slot1 = 00000  Slot0 = 00000
           ...
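
        The field breakdown that 'shcbr' prints on lines 51-66 can be reproduced from
        the raw Port_Config_Stat value on line 50. This is a minimal sketch, not redxl
        code: the field names and bit ranges are taken from the CnfSt column of the
        listing, while the helper names and the reading of Slot_PortEnbl bit 1 as the
        Slot1 port are assumptions made for illustration.

```python
# (field name, high bit, low bit), taken from the CnfSt column of 'shcbr'.
PORT_CONFIG_FIELDS = [
    ("SC_PortEnbl", 1, 0),
    ("BBC_PortEnbl", 2, 2),
    ("Slot_PortEnbl", 4, 3),
    ("SC_MaskErrs", 6, 5),
    ("BBC_MaskErrs", 7, 7),
    ("Slot_MaskErrs", 9, 8),
    ("SC_PortBusy", 11, 10),
    ("BBC_PortBusy", 12, 12),
    ("Slot_PortBusy", 14, 13),
    ("AXQ_CB_MaskErrs", 15, 15),
    ("BoardId", 20, 16),
    ("Mode", 23, 21),
    ("ChipVersion", 27, 24),
    ("CBErr_SCInt", 28, 28),
    ("CBErr_DStop", 29, 29),
    ("Domain", 31, 30),
]


def field(value: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) from value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_port_config(value: int) -> dict[str, int]:
    """Decode a Port_Config_Stat register into named fields."""
    return {name: field(value, hi, lo) for name, hi, lo in PORT_CONFIG_FIELDS}


cfg = decode_port_config(0x068683EF)  # Port_Config_Stat from line 50
for name, value in cfg.items():
    print(f"{name:<16} {value:#x}")

# Assumption: bit 0 of Slot_PortEnbl is the Slot0 port, bit 1 the Slot1 port.
print("Slot1 port enabled:", bool(cfg["Slot_PortEnbl"] & 0b10))
```

        With Slot_PortEnbl equal to 1, only the low bit is set, which matches the
        observation that the console bus port to IO6 is not enabled.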

        The most common cause of this is a board being powered off. The power
        libraries deconfigure the console bus repeater port to the board being
        powered off and ensure that the AXQ/SDI domain masks are reprogrammed
        to disallow communication with that board.

- Keywords Section

        15K, 12K, SF15K, SF12K, starcat, dstop, Port 0 Safari device asserted error
                  

SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.