SRDB ID   Synopsis   Date
48192   Sun Fire[TM] 12K/15K: Dstop: Slot0 target slot transgression error   31 Oct 2002

Status Issued

Description
- Problem Statement: 

    Dstop: Slot0 target slot transgression error

- Symptoms:

    'wfail' output reports something similar to the following:

       01  redxl> dumpf load dsmd.dstop.020207.0007.29
       02  Created Thu Feb  7 00:07:30 2002
       03  By hpost v. 1.1 Generic 112099-05 Nov 27 2001 12:41:09  executing as pid=14740
       04  On ssc name =  xc46-sc0.SD_Lab.West.Sun.COM
       05  Domain =  0=A    Platform = sun15
       06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
       07            EXB[17:0]: 00011
       08          Slot0[17:0]: 00010
       09          Slot1[17:0]: 00000     Requested/not enabled: 00001
       10  'Not enabled' refers to the Console Bus master port on the parent board.
       11  -D option, -d
       12  "DSMD DomainStop Dump"
       13  Created in a Sun Microsystems Inc. internal environment.
       14  0 errors occurred while creating this dump.
       15  redxl> wfail
       16  SDI EX04/S0  Master_Stop_Status0[31:0] = D004000A
       17          MStop0[3,1]: Slot 0 port is DStopped, SDI is Recordstopped.
       18  SDI EX04/S0  Dstop0[31:0] = 00828080
       19          Dstop0[17]: D    DARB texp requests Slot0 Dstop (M)
       20          Dstop0[23]: D 1E SDI internal Slot0 port requested Dstop
       21  SDI EX04/S0  Slot0_Error1[31:0] = 00088008  Mask = 31444EBF
       22          S0Err1[19]: D 1E Slot0 target slot transgression error (M)
       23              {texp[4:0],targ_dev[2:0],s0dtarg,s0dstat[1:0],
       24              s0dtransid[8:0]} = 04844
       25  FAIL Slot SB4:  Dstop/Rstop detected by SDI
       26  Primary service FRU is Slot SB4.
       27  Secondary service FRU is EXB EX4.
            

SOLUTION SUMMARY:
- Troubleshooting:

    The dump header tells us that this Dstop was generated by dsmd (lines 11,12) 
    while a domain was active. This is also evident from the dumpf file name: 
    dsmd.dstop files are created by dsmd as part of an ASR. 
                                 
    The header also reports that hardware state for IO0 was not collected in the 
    dump (lines 09,10). As the header notes (line 10), the reason for this is that
    IO0's console bus master was not enabled. The console bus master for IO0 is
    SDI4 on EX0. EX0 therefore warrants further investigation. Let's finish with
    wfail first. Walking the error chain:

     - The SDI on EX4 calls for Dstop with an internal error with respect to 
       its Slot 0 port (lines 18-20). 
     - The SDI register flagged a target slot transgression error (lines 21-24). 
     - wfail calls out SB4 as what FAILed and also the primary FRU (lines 25,26). 
       EX4 is marked as the secondary FRU (line 27).
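
    The register values in the chain above can be cross-checked by hand. The 
    following is a minimal sketch, not part of redx: set_bits() and unpack() 
    are hypothetical helper names, and the field order is an assumption 
    (the braces on lines 23,24 are read as a most-significant-first 
    concatenation).

```python
# Values copied from the wfail output (lines 18 and 23-24 of the listing).
DSTOP0 = 0x00828080
S0ERR1_PAYLOAD = 0x04844

# Field names and widths as printed by wfail, most significant first.
# ASSUMPTION: the braces denote Verilog-style concatenation order.
FIELDS = [("texp", 5), ("targ_dev", 3), ("s0dtarg", 1),
          ("s0dstat", 2), ("s0dtransid", 9)]


def set_bits(value):
    """Return the positions of all bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def unpack(value, fields):
    """Split value into named fields, consuming from the most significant end."""
    remaining = sum(width for _, width in fields)
    out = {}
    for name, width in fields:
        remaining -= width
        out[name] = (value >> remaining) & ((1 << width) - 1)
    return out


print(set_bits(DSTOP0))                  # includes 17 and 23, as wfail reports
print(unpack(S0ERR1_PAYLOAD, FIELDS))
```

    Under that ordering assumption texp decodes to 0, which would be 
    consistent with a transaction aimed at a board on expander 0.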

    A transgression is an attempt, successful or unsuccessful, to communicate with 
    a board that is not participating in your domain. Judging by the errors 
    present, SB4 attempted such an operation. The master SDI maintains bit vectors 
    for its slot boards that list the other boards each is permitted to talk to. 
    Looking at the master SDI on EX4: 

       28  redxl> shsdi 4
       29  Note: Data is displayed from the currently loaded dump file.
       30  SDI EX04/S0    Component ID = 64317049
       31          Master_Reset_Config[31:0] = 04000000
       32          Master_Stop_Config[31:0]  = 41001997
       33          Core_Config[21:0]   = 0DB3E2
       34          Sysreg_Config[23:0] = 200001
       35          STB_Config[23:0]    = 20010F
       36          Bogon_Config[63:0]  = 00000003 C03C0010
       37          CP_Config[20:0]     = 0F0F70
       38          Slot0Config[1:0][31:0,30:0] = 20000000 3CA2A150
       39          Slot1Config[1:0][31:0,31:0] = 2000E0F0 28A83880
       40          Slot0_Domain_Mask[17:0]: Slot1 = 00000  Slot0 = 00010
       41          Slot0_Expand_Mask[17:0]: Slot1 = 00000  Slot0 = 00010
       42          Slot1_Domain_Mask[17:0]: Slot1 = 00010  Slot0 = 00001
       43          Slot1_Expand_Mask[17:0]: Slot1 = 00010  Slot0 = 00001

    Looking at the Slot 0 domain and expander masks (lines 40,41), we see that 
    this SDI's Slot 0 board (SB4) is permitted to communicate with Slot 0 on EX4 - 
    in other words, only itself. This is not much of a domain, to say the least. 
    But it does explain why the transgression error was detected: as soon as SB4 
    attempted any data transmission outside itself, the SDI considered that 
    transmission invalid.
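
    Reading the masks is a matter of finding which bits are set. The sketch 
    below is illustrative only: boards_in_mask() is a hypothetical helper, it 
    assumes bit n of each 18-bit mask stands for expander n (the reading used 
    above), and the SBn/IOn names are used only because those are the board 
    types that appear in this dump.

```python
def boards_in_mask(mask_hex, prefix):
    """Translate an 18-bit expander mask (hex string) into board names.

    ASSUMPTION: bit n set means the board in that slot on expander n.
    """
    mask = int(mask_hex, 16)
    return [f"{prefix}{n}" for n in range(18) if (mask >> n) & 1]


# EX4 Slot0_Domain_Mask (line 40): Slot1 = 00000, Slot0 = 00010
permitted = boards_in_mask("00010", "SB") + boards_in_mask("00000", "IO")
print(permitted)

# Dump header (lines 08,09): Slot0 = 00010, Slot1 requested/not enabled = 00001
requested = boards_in_mask("00010", "SB") + boards_in_mask("00001", "IO")
print(requested)
```

    The first list holds only SB4, while the second holds SB4 and IO0, which 
    is the mismatch the text goes on to investigate.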

    The question now is why is EX4's master SDI programmed this way? It does not 
    reflect a valid domain configuration. But, based on boards requested in the 
    dump header (lines 8,9), we can infer that the valid domain contained SB4 and 
    IO0. We previously noted that EX0 warranted a deeper look. Let's do that now: 

       44  redxl> shsdi 0
       45  Note: Data is displayed from the currently loaded dump file.
       46  SDI EX00/S0    Component ID = 64317049
       47          Master_Reset_Config[31:0] = 00000018
       48          Master_Stop_Config[31:0]  = 41000897
       49          Core_Config[21:0]   = 0DA3C2
       50          Sysreg_Config[23:0] = 200001
       51          STB_Config[23:0]    = 20010F
       52          Bogon_Config[63:0]  = 00000003 C03C0010
       53          CP_Config[20:0]     = 0F0F70
       54          Slot0Config[1:0][31:0,30:0] = 20000000 3CA2A150
       55          Slot1Config[1:0][31:0,31:0] = 0000E000 00000000
       56          Slot0_Domain_Mask[17:0]: Slot1 = 00010  Slot0 = 00001
       57          Slot0_Expand_Mask[17:0]: Slot1 = 00010  Slot0 = 00001
       58          Slot1_Domain_Mask[17:0]: Slot1 = 00000  Slot0 = 00000
       59          Slot1_Expand_Mask[17:0]: Slot1 = 00000  Slot0 = 00000 
       60          Force_Error[1:0][31:0] = 0000E000 00000000
       61          Csr2Conf[28:0] = 01000000
       62          IBIST_Enbl[1][3:0],[0][29:0] = 0 00000000
       63           Master_Stop_Status0[31:0] = F0000000
       64           Master_Stop_Status1[31:0] = 7F7F0000
       65           Dstop0[31:0] = 00000000
       66           Dstop1[31:0] = 00000000
       67           Recordstop0[31:0]  = 00000000
       68           Recordstop1[31:0]  = 00000000
       69           Core_Error0[31:0]  = 00000000  Mask = 0051FFFF
       70           Core_Error1[31:0]  = 00000000  Mask = FFFFFFFF
       71           Sysreg_Error[31:0] = 00000000  Mask = 780377FF
       72           STB_Error[31:0]    = 00000000  Mask = 7F00FFFF
       73           CP_Error0[31:0]    = 00000000  Mask = 580067FF
       74           CP_Error1[31:0]    = 00000000  Mask = 7FFCFFFF
       75           Slot0_Error0[31:0] = 00000000  Mask = 7000FFFF
       76           Slot0_Error1[31:0] = 00000000  Mask = 31444EBF
       77           Slot0_Error2[31:0] = 00000000  Mask = 7FFCFFFF
       78           Slot1_Error0[31:0] = 00000000  Mask = FFFFFFFF
       79           Slot1_Error1[31:0] = 08000000  Mask = FFFFFFFF
       80           Slot1_Error2[31:0] = 00000000  Mask = FFFFFFFF

    No errors, not even a DARB request for Dstop. Also, the masks for Slot 1 
    (lines 58,59) do not permit any transmissions from IO0. IO0 is part of the
    defined domain, as indicated by the dump header (line 9). Again, the question
    is why is the SDI programmed in such a manner? The ASICs can only be programmed 
    from the SCs. It's time to look outside the dump file. 

    The dump header tells us the dump was taken at Feb 7, 00:07:30 2002 (line 2). 
    Looking in the platform message log, we see the following messages reported 
    around that time:

       Feb  7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708520653855 ERR PciComm.cc 195] 
         Cannot access console bus since the board IO0 is OFF
       Feb  7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708521469608 ERR IosramComm.cc 516] 
         Failed to read from offset 1e for key 53444344
       Feb  7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708540886871 ERR PciComm.cc 195] 
         Cannot access console bus since the board IO0 is OFF
       Feb  7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708541793044 ERR IosramComm.cc 516] 
         Failed to read from offset 1e for key 53444344
       Feb  7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708543136069 ERR PciComm.cc 195] 
         Cannot access console bus since the board IO0 is OFF
       Feb  7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708543957908 ERR IosramComm.cc 516] 
         Failed to read from offset c for key 53444344
       Feb  7 00:07:35 2002 xc46-sc0 dsmd[448]: [2517 37708971377554 WARNING EventHandler.cc 155] 
         Domain stop has been detected in domain A.

    IO0 is reported as being off. There are also errors reported for key 53444344;
    this is an IOSRAM key (the IosramComm.cc is a big clue here). IOSRAM access is 
    via console bus and console bus requires power. Thus, we can be highly confident 
    that IO0 was powered off.  Since there were no esmd messages indicating a power
    off reason, it was likely done by a system administrator. 
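
    Correlating the dump time with the platform log can be scripted. The 
    following is a rough sketch under stated assumptions: near() and LINE_RE 
    are hypothetical names, the pattern is inferred only from the messages 
    quoted above, and each wrapped message is assumed to have been rejoined 
    onto a single line first.

```python
import re
from datetime import datetime, timedelta

# Three of the messages quoted above, each rejoined onto one line.
LOG = """\
Feb  7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708520653855 ERR PciComm.cc 195] Cannot access console bus since the board IO0 is OFF
Feb  7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708521469608 ERR IosramComm.cc 516] Failed to read from offset 1e for key 53444344
Feb  7 00:07:35 2002 xc46-sc0 dsmd[448]: [2517 37708971377554 WARNING EventHandler.cc 155] Domain stop has been detected in domain A.
"""

# ASSUMPTION: layout inferred from the sample lines, not from a format spec.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\d{4})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>\w+)\[\d+\]:\s+\[\d+\s+\d+\s+(?P<level>\w+)\s+"
    r"(?P<source>\S+)\s+\d+\]\s*(?P<text>.*)$"
)


def near(log_text, when, window_s=60):
    """Return (daemon, level, source, text) for messages within window_s of when."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that does not look like a log record
        # Collapse the padded day field ("Feb  7") before parsing.
        stamp = datetime.strptime(" ".join(m["stamp"].split()),
                                  "%b %d %H:%M:%S %Y")
        if abs(stamp - when) <= timedelta(seconds=window_s):
            hits.append((m["daemon"], m["level"], m["source"], m["text"]))
    return hits


dump_time = datetime(2002, 2, 7, 0, 7, 30)  # from the dump header (line 2)
for daemon, level, source, text in near(LOG, dump_time):
    print(daemon, level, source, text)
```

    Grouping the hits by daemon and source file makes it easy to see that the 
    hwad console bus and IOSRAM errors come just before the dsmd domain stop 
    message.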

    It's likely that once we saw the Requested/not enabled report in the dump 
    header (line 9) we would have begun to look outside the dump file. But for the 
    purposes of this discussion, the walk through the hardware provided some 
    insight on why the transgression error was reported.  Also, this is a prime 
    example of the limitations of wfail. wfail can only report and analyze the 
    data made available to it. 

- Resolution:

    Repair/replace IO0.

- Summary of part number and patch IDs 

- References and bug IDs

    SunSolve Article 48122

- Additional background information:

    With the conclusion that IO0 was powered off, the programming of the SDIs 
    makes sense. At power off, the power libraries remove the powered off board(s) 
    from the masks in the SDIs (and AXQs) to ensure no transactions can be sent to, 
    or more importantly be sourced from, those board(s).

    This in turn also explains why the master SDI on EX0 did not report any errors.
    We know from the error reporting in the hardware that the DARBs broadcast a
    stop request to all master SDIs in the system. The SDI examines the
    request to determine if either of its Slot 0/1 boards needs to be stopped.
    By the time the SDI on EX0 received the stop request, IO0 had already
    been removed from the mask registers. Thus, the SDI determined it had no slots
    that needed to participate in the stop.

- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category: 

- Keywords

15K, 12K, SF15K, SF12K, starcat, dstop, Slot0 target slot transgression error
            

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.