SRDB ID   Synopsis   Date
48494   Sun Fire[TM] 12K/15K: Dstop: Error on notify wire 0 from other DARB   1 Nov 2002

Status Issued

Description
- Problem Statement:

       Dstop: Error on notify wire [01] from other DARB

- Symptoms:

       'wfail' output reports something similar to the following:

           01  redxl> dumpf load dsmd.dstop.020202.2254.39
           02  Created Sat Feb  2 22:54:42 2002
           03  By hpost v. 1.2 sms1.2_DR_13 Jan 23 2002 01:06:19  executing as pid=24872
           04  On ssc name =  xc15p13-sc0.SD_Lab.West.Sun.COM
           05  Domain =  4=E = xc15p13-b14    Platform = sun15
           06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
           07            EXB[17:0]: 06000
           08          Slot0[17:0]: 04000     Requested/not enabled: 02000
           09          Slot1[17:0]: 04000
           10  'Not enabled' refers to the Console Bus master port on the parent board.
           11  -D option, -d
           12  "DSMD DomainStop Dump"
           13  Created in a Sun Microsystems Inc. internal environment.
           14  4 errors occurred while creating this dump.
           15  redxl> wfail
           16  SDI EX14/S0: All SDI is DStopped and RStopped,         requested by DARB.
           17  DARB C0 MasterStatus[31:0] = E0070003
           18          MStat[17,18,30,31]: Dstop + Rstop, demand from other DARB
           19          MStat[ 0]: Error on notify wire 0 from other DARB
           20          MStat[ 1]: Error on notify wire 1 from other DARB
           21          MStat[29]: Global Domainstop has occurred
           22          MStat[28:25]: ErrorCount[3:0] = 0
           23  DARB C0: enabled ports (expanders)          [17:0]: 052D2
           24  DARB C0: other darb req Dstop+Rstop for exps[17:0]: 052D2
           25  FAIL Non-degraded cplane configurations:  Dstop/Rstop detected by DARB.
           26  Primary service FRU is the logic centerplane.
      
SOLUTION SUMMARY:
- Troubleshooting:

        The dump header tells us that this Dstop was generated by dsmd (lines 11,12)
        while a domain was active. This is also evident from the dump file name:
        dsmd.dstop files are created by dsmd as part of an ASR. Also note that
        SB13 was requested during the dump collection (line 08) but was not accessible
        by hpost (see Additional background information). Walking the error chain:

         - DARB0 reports errors on the notify wires from DARB1 (lines 19,20).
         - DARB0 also records a stop demand from DARB1 (line 18)
           and a Global Domainstop event (line 21).
         - Non-degraded centerplane configurations are FAILed (line 25).
         - The centerplane is named as the FRU (line 26).
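
        The bit assignments called out in the wfail output can be checked against
        the raw register value. The following Python sketch is illustrative only
        and is not part of redx or SMS. It decodes just the MasterStatus bits that
        lines 18-22 name; the helper names (bit, field, decode_master_status) are
        made up for this example. Bits that wfail does not describe (for example
        bit 16, which is also set in E0070003) are left undecoded.

```python
# Bit meanings are taken only from wfail lines 18-22 of the dump above.
# Any bit not listed there is deliberately left undecoded.
MASTER_STATUS = 0xE0070003  # DARB C0 MasterStatus[31:0], dump line 17


def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def field(value, hi, lo):
    """Return the bit field value[hi:lo] as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_master_status(value):
    """Decode the MasterStatus bits that wfail reports in this dump."""
    return {
        "notify_wire_0_error": bool(bit(value, 0)),
        "notify_wire_1_error": bool(bit(value, 1)),
        # wfail groups bits 17, 18, 30 and 31 into one message (line 18).
        "stop_demand_from_other_darb": all(
            bit(value, n) for n in (17, 18, 30, 31)
        ),
        "global_domainstop": bool(bit(value, 29)),
        "error_count": field(value, 28, 25),
    }


status = decode_master_status(MASTER_STATUS)
for name, val in status.items():
    print(f"{name:30s} {val}")
```

        For E0070003, every flag above decodes as set and the error count is 0,
        which matches lines 18-22.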

        The 12K/15K centerplane (CP) is logically divided into two halves. When
        operating at full capacity, the DARB on each logical half of the CP 
        communicates with the other DARB via connections termed notify wires.
        If an error occurs on these pathways, the two logical CP halves can no
        longer maintain lockstep.

        Since the notify wires are only used while the centerplane is running at
        full bandwidth, degrading the centerplane to half bandwidth avoids the
        failure. This is why line 25 FAILs only the non-degraded centerplane
        configurations. Furthermore, since the error implies a loss of lockstep
        within the centerplane, all domains are affected, which is why the global
        domain stop indicator appears (line 21).
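
        The expander masks in the dump support the same reading. The Python sketch
        below is illustrative only. It assumes the [17:0] masks are hexadecimal
        with bit n standing for expander n; this assumption is consistent with
        the header, where the "Requested/not enabled" value 02000 corresponds to
        SB13. The function name is made up for this example.

```python
def expanders(mask, width=18):
    """List the expander numbers whose bit is set in an [17:0] mask."""
    return [n for n in range(width) if (mask >> n) & 1]


ENABLED_PORTS = 0x052D2   # dump line 23: DARB C0 enabled ports (expanders)
STOP_REQUESTS = 0x052D2   # dump line 24: other DARB requested Dstop+Rstop
NOT_ENABLED = 0x02000     # dump line 08: Slot0 requested/not enabled

print("enabled expanders:   ", expanders(ENABLED_PORTS))
print("stop requested for:  ", expanders(STOP_REQUESTS))
print("requested/not enabled:", expanders(NOT_ENABLED))

# Every enabled expander port was asked to stop.
all_enabled_stopped = (ENABLED_PORTS & ~STOP_REQUESTS) == 0
print("all enabled expanders stopped:", all_enabled_stopped)
```

        Under that assumption, the stop request covers expanders 1, 4, 6, 7, 9,
        12 and 14, which is every enabled expander port and not only EX14, where
        the dumped domain resides.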

        The notify wire pathways are copper etch completely contained within the
        centerplane. Since the ASICs are also within the centerplane, no interconnects
        are involved. The centerplane is the FRU.
        

- Resolution:

        Degrade the centerplane to half bandwidth with 'setbus' until the
        centerplane can be repaired or replaced.

- Summary of part number and patch IDs

        http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html       
        
- References and bug IDs

        SunSolve Article 48122 
        DARB ASIC Specification       

- Additional background information:

        The signal sent from DARB to DARB is a serial bit stream. A DARB sends
        two copies of the signal, in phase, to its peer DARB. Included in the
        bit stream is a checksum. The receiving DARB compares the two notify
        signals. If the notifies are not equal and neither checksum is good, a
        Dstop occurs. If one of the checksums is good, that copy is trusted and
        the other is dropped.
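
        The decision rule described above can be sketched in Python as follows.
        This is a model of the prose only, not of the DARB hardware. The checksum
        algorithm is not given here, so the sketch takes the checksum results as
        boolean inputs. The function name and return values are made up for this
        example, and the cases the text does not cover are marked as assumptions
        in the comments.

```python
def check_notify(copy_a, copy_b, a_checksum_ok, b_checksum_ok):
    """Model the receiving DARB's handling of the two notify copies.

    Returns a (decision, value) tuple, where decision is one of
    "accept", "dstop" or "unspecified".
    """
    # A copy with a good checksum is trusted and the other is dropped.
    # Assumption: if both checksums are good, copy A is used; the text
    # does not say which copy wins in that case.
    if a_checksum_ok:
        return ("accept", copy_a)
    if b_checksum_ok:
        return ("accept", copy_b)

    # Neither checksum is good and the copies differ: Dstop.
    if copy_a != copy_b:
        return ("dstop", None)

    # Neither checksum is good but the copies agree: not covered by the
    # description above, so this sketch does not guess.
    return ("unspecified", None)


print(check_notify(0b1011, 0b1011, True, True))
print(check_notify(0b1011, 0b0011, False, True))
print(check_notify(0b1011, 0b0011, False, False))
```

        The last call models the failure in this dump: the two copies differ and
        neither checksum is good, so the result is a Dstop.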

        Also, by examining an SDI's master stop register, we can see the posting
        of the global Dstop:

           27  redxl> shsdi -e 14
           28  Note: Data is displayed from the currently loaded dump file.
           29  SDI EX14/S0    Component ID = 54317049
           30           Master_Stop_Status0[31:0] = 7004000F
           31          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
           32           Master_Stop_Status1[31:0] = F2EE0002
           33          0x0E   CP1StopExp[4:0]        MSS1[20:16]    
           34             3   CP1StopSlot[0:1]       MSS1[22:21]    Dstop is 1st stop
           35             1   CP1StopInfoValid       MSS1[23]       
           36          0x12   CP0StopExp[4:0]        MSS1[28:24]    Global stop
           37             3   CP0StopSlot[0:1]       MSS1[30:29]    Dstop is 1st stop
           38             1   CP0StopInfoValid       MSS1[31]       

        The Expander value 18 (0x12) is used to indicate a global Dstop (line 36). All
        SDIs with configured slot boards must acknowledge such a signal and stop. From
        this output, we also see the disagreement between the centerplane halves. DARB0
        signaled a global Dstop while DARB1 indicated EX14 as the stop requestor (line
        33).
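
        The field values shown by shsdi can be reproduced from the raw register
        value. The Python sketch below is illustrative only. It uses the bit
        positions printed in lines 33-38 and treats 0x12 as the global stop
        marker, as described above. The function names are made up for this
        example.

```python
MSS1 = 0xF2EE0002            # Master_Stop_Status1[31:0], dump line 32
GLOBAL_STOP_EXPANDER = 0x12  # expander value 18 marks a global Dstop


def field(value, hi, lo):
    """Return the bit field value[hi:lo] as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_cp_stop(value, base):
    """Decode one centerplane half's stop fields.

    base is the lowest bit of that half's StopExp field:
    16 for CP1 (MSS1[23:16]) and 24 for CP0 (MSS1[31:24]).
    """
    return {
        "stop_exp": field(value, base + 4, base),
        "stop_slot": field(value, base + 6, base + 5),
        "info_valid": field(value, base + 7, base + 7),
    }


def describe(stop):
    """Name the stop requestor recorded for one centerplane half."""
    if stop["stop_exp"] == GLOBAL_STOP_EXPANDER:
        return "global stop"
    return f"EX{stop['stop_exp']}"


cp0 = decode_cp_stop(MSS1, 24)
cp1 = decode_cp_stop(MSS1, 16)
print("CP0:", cp0, "->", describe(cp0))
print("CP1:", cp1, "->", describe(cp1))
print("halves agree:", cp0["stop_exp"] == cp1["stop_exp"])
```

        For F2EE0002 this gives a global stop for CP0 and EX14 for CP1, so the
        two halves do not agree on the requestor.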

        Also, in the dump header, it's noted that SB13 was requested but not collected
        as part of the dump (line 08). This is an indicator that console bus access to
        this component was disabled (line 10). The console bus repeater to a system
        board is the expander's SDI (SDI4, specifically). Examine the console bus
        repeater on EX13:

           39  redxl> shcbr exb 13
           40  Note: Data is displayed from the currently loaded dump file.
           41  EXBCBR EX13      Component ID = 00000000
           42          Port_Config_Stat[31:0] = 048D83E7
           43             3   SC_PortEnbl[1:0]       CnfSt[1:0]     
           44             1   BBC_PortEnbl           CnfSt[2]       
           45             0   Slot_PortEnbl[1:0]     CnfSt[4:3]     
           46             3   SC_MaskErrs[1:0]       CnfSt[6:5]     
           47             1   BBC_MaskErrs           CnfSt[7]       
           48             3   Slot_MaskErrs[1:0]     CnfSt[9:8]     
           49             0   SC_PortBusy[1:0]       CnfSt[11:10]   
           50             0   BBC_PortBusy           CnfSt[12]      
           51             0   Slot_PortBusy[1:0]     CnfSt[14:13]   
           52             1   AXQ_CB_MaskErrs        CnfSt[15]      Rev 4+

        We see that console bus access to both SB13 and IO13 is not enabled (line 45).
        The platform logs for this system should be investigated to determine the
        state of SB13.
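
        The same check can be made on the console bus repeater register. The
        Python sketch below is illustrative only. The field layout is copied from
        the CnfSt bit ranges in lines 43-52 and covers only bits 15:0, since the
        output above does not describe the upper bits. The names FIELDS and
        decode_port_config are made up for this example.

```python
PORT_CONFIG_STAT = 0x048D83E7  # EXBCBR EX13 Port_Config_Stat[31:0], line 42

# (field name, high bit, low bit), as printed by shcbr in lines 43-52.
FIELDS = [
    ("SC_PortEnbl", 1, 0),
    ("BBC_PortEnbl", 2, 2),
    ("Slot_PortEnbl", 4, 3),
    ("SC_MaskErrs", 6, 5),
    ("BBC_MaskErrs", 7, 7),
    ("Slot_MaskErrs", 9, 8),
    ("SC_PortBusy", 11, 10),
    ("BBC_PortBusy", 12, 12),
    ("Slot_PortBusy", 14, 13),
    ("AXQ_CB_MaskErrs", 15, 15),
]


def decode_port_config(value):
    """Split the low 16 bits of Port_Config_Stat into named fields."""
    decoded = {}
    for name, hi, lo in FIELDS:
        width = hi - lo + 1
        decoded[name] = (value >> lo) & ((1 << width) - 1)
    return decoded


cfg = decode_port_config(PORT_CONFIG_STAT)
for name, val in cfg.items():
    print(f"{val:>3}   {name}")

# Slot_PortEnbl[1:0] == 0 means neither slot port is enabled.
print("slot ports enabled:", cfg["Slot_PortEnbl"] != 0)
```

        The decoded values match lines 43-52, including Slot_PortEnbl = 0.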

- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K,
starcat, dstop, Error on notify wire [01] from other DARB


      
INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.