SRDB ID   Synopsis   Date
48122   Sun Fire[TM] 12K/15K: An Overview of Dstop Diagnosis   29 Oct 2002

Status Issued

Description
- Problem Statement:

    Dstop Diagnosis - An Overview


- Symptoms:

    A domain suffers a Dstop. Messages in the platform log are similar
    to the following:

       Apr 25 08:07:43 2002 sc0 hwad[282]: [1156 3601896159843595 ERR 
        InterruptHandler.cc 2159] Domain Stop interrupt detected, domain B

    Alternatively, during POST execution, the following is noted in the
    POST log:

       DSTOP Detected for Slot SB13
       SDI EX13/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
       System state dumped to /var/opt/SUNWSMS/SMS1.2/adm/F/dump/xcstate.020805.0427.40
                              

SOLUTION SUMMARY:
- Troubleshooting:

    A Dstop occurs when the system hardware encounters a fatal error. The
    source/cause of the stop can and does vary. There is no single answer for
    how to diagnose a Dstop. The intention of this article is to provide a
    framework on which to base a Dstop analysis. Specifics on interpreting 
    'redx' output are left to other articles dealing with specific Dstops, 
    although a brief overview of the objectives of 'wfail' is given in
    "Additional background information" below. 

    A general approach to Dstop analysis is:

    1. Examine the hardware state dump.

       This is the obvious first step. Load the state dump into 'redx' and execute 'wfail'.
       Examine the output from the standpoint of how errors are recorded and
       reported within the Starcat hardware. Refer to "Background Information"
       below for further details on error recording/reporting. 

       In many cases, the cause of the Dstop is obvious. 'wfail' identifies 
       either a single suspect component, or at most two suspects separated
       by a single interconnect (e.g., a System Board and Expander). However, if 
       the cause of the Dstop is not obvious (e.g., multiple, disparate
       suspect components or an undiagnosable timeout), the state dump alone may
       not be sufficient to identify the source of the Dstop.

    2. Characterize the platform at the time of the Dstop.

       When the source/cause of a Dstop is not obvious from the state dump,
       it is wise to characterize the platform activity just prior to the
       Dstop. If multiple Dstops are present in a short time span, focus
       initially on the first Dstop.

       Sources of information include platform message logs, domain message
       logs, domain console logs, and POST logs. Items of interest include
       environmental fluctuation (temperature, voltage, etc.), actions on
       shared components (expanders, centerplanes), or evidence of user 
       actions that could impact the Dstopped domain(s).
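
       The following is a minimal Python sketch of how Dstop messages might
       be pulled from a platform log and ordered so that the first Dstop can
       be examined first. It assumes each message occupies one line in the
       log (the sample under "Symptoms" is wrapped for display) and that all
       Dstop messages share the shape of that sample. 'find_dstops' is a
       hypothetical helper name, not part of SMS.

```python
import re
from datetime import datetime

# Assumed message shape, based only on the sample shown under "Symptoms".
DSTOP_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})\s.*"
    r"Domain Stop interrupt detected, domain (?P<domain>\w+)"
)


def find_dstops(lines):
    """Return (timestamp, domain) pairs for Dstop messages, oldest first."""
    events = []
    for line in lines:
        match = DSTOP_RE.search(line)
        if match:
            # Collapse runs of spaces so single-digit days parse cleanly.
            stamp = " ".join(match.group("ts").split())
            when = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
            events.append((when, match.group("domain")))
    return sorted(events)


# The sample message from "Symptoms", joined onto one line.
sample = (
    "Apr 25 08:07:43 2002 sc0 hwad[282]: [1156 3601896159843595 ERR "
    "InterruptHandler.cc 2159] Domain Stop interrupt detected, domain B"
)
first = find_dstops([sample])[0]
print(first[0].isoformat(), "domain", first[1])  # 2002-04-25T08:07:43 domain B
```

       Sorting by timestamp puts the earliest Dstop at index 0, which is the
       event to characterize first when several occur in a short time span.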

    3. Consider recent changes/services to the system and the failure history.

       When neither the state dump nor the platform characterization yields
       strong suspect component(s), another avenue of investigation is 
       recent changes to the platform/domain. Changes include hardware 
       replacements, upgrades, and additions. Software changes are also 
       notable, such as updated patches to SMS or Solaris[TM] on the domain. 
       An examination of the failure history of a platform/domain may also 
       reveal a trend or pattern that may suggest a suspect component.

       Such investigation may be the only recourse when faced with an
       undiagnosable timeout.

    From the above steps, the goal is to identify one or more suspect
    components. Once identified, there are two goals: problem resolution and 
    problem avoidance. 

       Problem Resolution:
          Typically, but not exclusively, a Dstop translates into a component 
          replacement. Once the suspect component is removed from the system 
          and its replacement installed and tested, resolution is achieved. In 
          'redx', the suspect components to be replaced are indicated by the 
          "Primary/Secondary FRU" lines in 'wfail' output. For example:

             Primary service FRU is EXB EX8.
             Secondary service FRU is CSB C1 or the logic centerplane.

          Or, in the case of iterative debugging sessions, other observed
          behaviors in the system (POST failures, panics, etc.) may be the
          indicator of a suspect component.

          However, customer and/or system constraints may delay the scheduling 
          of a maintenance window for component replacement. Hence, problem 
          avoidance becomes key.

       Problem Avoidance:
          As mentioned above, resolution may be days/weeks/months into the
          future. However, it may be possible to avoid the source of the
          problem in the interim. In 'redx', components listed after "FAIL" in
          'wfail' output indicate those components to remove from the 
          configuration to avoid the problem. For example:

             FAIL Port SB14/P0:  Dstop detected by SDC

          The same style of text is also present in POST logs when a Dstop is
          detected during POST. Components reported as FAILed do NOT necessarily 
          equate to the FRU for problem resolution. In this example, a processor 
          is FAILed, but a processor is not a FRU in the Starcat - the system 
          board is the FRU. 

          The problem can be avoided by blacklisting the component(s) listed
          as FAILed. Blacklisting can also be a useful method for verifying a
          diagnosis prior to replacing any hardware. In cases where there are
          multiple suspect components, one component can be blacklisted. If the Dstop
          does not return, confidence is raised that the blacklisted component
          is the problem source. And, of course, blacklisting provides an
          excellent interim solution to maximize domain uptime.
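
          As a rough illustration, the Python sketch below separates the
          components to avoid (FAIL lines) from the components to replace
          (service FRU lines) in captured 'wfail' text. The line formats are
          assumed from the two excerpts quoted above and may not cover every
          form 'wfail' can print. 'summarize_wfail' is a hypothetical helper,
          and the sample text combines those two unrelated excerpts purely to
          exercise the parser.

```python
import re

# Patterns inferred only from the excerpts quoted in this article.
FAIL_RE = re.compile(r"^\s*FAIL\s+(?P<component>.+?):\s+(?P<reason>.+)$")
FRU_RE = re.compile(
    r"^\s*(?P<rank>Primary|Secondary) service FRU is (?P<fru>.+?)\.?\s*$"
)


def summarize_wfail(text):
    """Split wfail text into FAILed components and suggested service FRUs."""
    summary = {"fail": [], "Primary": [], "Secondary": []}
    for line in text.splitlines():
        match = FAIL_RE.match(line)
        if match:
            reason = match.group("reason").strip()
            summary["fail"].append((match.group("component"), reason))
            continue
        match = FRU_RE.match(line)
        if match:
            summary[match.group("rank")].append(match.group("fru"))
    return summary


sample = """\
FAIL Port SB14/P0:  Dstop detected by SDC
Primary service FRU is EXB EX8.
Secondary service FRU is CSB C1 or the logic centerplane.
"""
result = summarize_wfail(sample)
print("Blacklist candidates:", result["fail"])
print("Replace (primary):   ", result["Primary"])
print("Replace (secondary): ", result["Secondary"])
```

          Keeping the two lists separate mirrors the distinction made above:
          a FAILed component is what to blacklist, while the FRU lines name
          what to consider replacing.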

- Resolution:

    Specific resolution varies from case to case. Typically, resolution to
    a Dstop is the replacement of a failed component. Another possible
    resolution is the application of software patches to SMS/Solaris.

- Summary of part number and patch IDs

    Various.
    
- References and bug IDs

    Other knowledge articles discuss specific Dstops in greater detail.
    http://cpre-amer.west.sun.com/esg/hsg/starcat/xctt/redx_dumpanalysis.html
    http://esp.west.sun.com/starcat/post/dstop_101.html

- Additional background information:

    'wfail' Objectives:

       'redx' provides the 'wfail' command (pronounced "w'fail"), as in
       "what failed". It is the first line of attack for analyzing a 
       hardware state dump. 'wfail' performs a scan of all the DARBs and
       master SDIs in the dump and reports any errors captured in those
       ASICs. If the captured data indicates errors present in other ASICs, 
       'wfail' follows that chain and displays the errors in those ASICs
       as well. Finally, 'wfail' reports what component to FAIL from the 
       configuration and the service FRU(s), if any.

       Understanding the objectives of 'wfail' is crucial to knowing what the
       command is useful for, and what its limitations are. 'wfail' has
       three objectives:

       1. Report the errors detected in the hardware.

          'wfail' generally tries to report which errors occurred first,
          as these are most interesting for diagnosis.

       2. Report what resource(s) should be deconfigured (FAILed) from
          the configuration to leave the maximal fault-free configuration.

          The semantic 'wfail' uses is precisely equivalent to what POST
          would choose to do if the error(s) occurred during POST. The
          FAILed component is NOT necessarily the broken component. 
          Understanding this point is key. But, by preventing the domain
          from using the FAILed component, the error can be avoided.

          For example, suppose the centerplane has a problem communicating
          with Expander 3. 'wfail' will not FAIL the centerplane, because 
          doing so would impact the entire platform, potentially making it
          unusable. Rather, the expander is FAILed and the remainder of
          the system is usable. Even if by some means 'wfail' knew the
          fault was on the centerplane (which it can't), the expander would
          still be FAILed. Overall, this has less impact on the platform.
          Thus, the maximal fault-free configuration is provided.

       3. Recommend which FRU(s) to replace.

          The key word is recommend. If the fault is across an interconnect,
          such as the centerplane/expander example above, isolating to a
          single FRU is not possible. Such is the nature of an interconnect
          architecture. However, 'wfail' will call out a primary FRU and,
          when applicable, a secondary FRU in such cases. The troubleshooter
          must make a judgment on which component to target for replacement.

    Error Recording:

       The error registers in the 12K/15K ASICs are each 32 bits in size and 
       can record up to 15 different errors. The registers are organized to 
       distinguish between first errors and accumulated errors. This 
       organization applies to the SDIs, AXQs, and most of the L1 board
       ASICs, although the handling of the accumulated error bits differs
       slightly. A typical error register is as follows:

          31                          16 15 14                         0
          +------------------------------------------------------------+
          | |                           |  |                           |
          | | Accum Error Flags [30:16] |1E|  First Error Flags [14:0] |
          | |                           |  |                           |
          +------------------------------------------------------------+
                                   1E = 1st Error

       The first error bits [14:0] are only set if no other errors have
       already been recorded in the register. Bit [15] is set only if a
       bit [14:0] is being set and no other error registers in the ASIC
       already have bit [15] set. Thus, an ASIC can accurately report the
       first error it encountered. Note that the first error is with respect
       to the ASIC. It does not necessarily indicate the first error within
       the domain.

       The accumulated error bits [30:16] have a 1-to-1 relationship to
       the first errors [14:0]. For the SDI and AXQ, the accumulated errors
       are always set when the corresponding first error bit is set. Consider
       the following example:

          SDI EX14/S0  Dstop0[31:0] = 10019000
                  Dstop0[16]: D    DARB texp requests all Dstop (M)
                  Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)

       The register's bits are 10019000. This equates to bits 12, 15, 16 and 28
       being set. The first error is bit 12. It is the only bit set in the [14:0]
       range. When bit 12 is set, bit 15 is also set so the ASIC records
       subsequent errors in the accumulated area. Also, when bit 12 is set, its
       corresponding accumulated bit 28 is set. Sometime after the first error
       event, another error is recorded and bit 16 is set. It does not have a
       corresponding first error because the first error has already occurred.

       'redx' decodes the accumulated error bits [30:16] in the register. This
       is intentional - we want to see all of the errors that were recorded in
       the ASIC. To distinguish which error occurred first, first errors are
       flagged with a '1E'.
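
       The bit arithmetic for the SDI example can be checked with a short
       Python sketch. It follows the register layout described in this
       section (first error flags [14:0], the 1E flag at [15], accumulated
       flags [30:16]). 'decode_error_register' is a hypothetical helper for
       illustration, not a 'redx' function.

```python
def decode_error_register(value):
    """Split a 32-bit SDI/AXQ-style error register into its fields."""
    first = [bit for bit in range(0, 15) if (value >> bit) & 1]
    accumulated = [bit for bit in range(16, 31) if (value >> bit) & 1]
    return {
        # Bit 15: a first error has been recorded in this register.
        "first_error_flag": bool((value >> 15) & 1),
        "first": first,
        "accumulated": accumulated,
        # Accumulated bits whose paired first-error bit (bit - 16) is set;
        # these correspond to the lines flagged '1E' in the example.
        "marked_1E": [bit for bit in accumulated if (bit - 16) in first],
    }


decoded = decode_error_register(0x10019000)
print(decoded)
# {'first_error_flag': True, 'first': [12], 'accumulated': [16, 28], 'marked_1E': [28]}
```

       Bit 28 is flagged because its paired first-error bit (28 - 16 = 12)
       is set, while bit 16 has no paired first-error bit set and is
       therefore a later error.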

       For L1 board ASICs, accumulated error bits are treated slightly differently.
       Unlike the AXQ and SDI, accumulated error bits are not set until a 
       repeated error of the same type occurs. Consider this example:

          SDC SB14  PortErr [0][25:0] =  0028001            (Safari Port 0)
                  P0Err[    0]:   1E  Parity Bidi error
                  P0Err[   17]:       Parity Single error

       The register's bits are 0028001. This equates to bits 0, 15 and 17 being
       set. The first error is bit 0. It is the only bit set in the [14:0] range.
       When bit 0 is set, bit 15 is also set so the ASIC records subsequent errors
       in the accumulated area. Unlike the AXQ and SDI, bit 16 (bit 0's 
       corresponding accumulated bit) is not set. Sometime after the first error
       event, another error is recorded and bit 17 is set.

       As before, 'redx' denotes the first error with a '1E'. If repeated errors
       of the same type occur, the notation becomes '1E+'.
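
       A minimal Python sketch of the L1 board notation follows. It assumes
       that '1E+' corresponds to a first-error bit whose paired accumulated
       bit (16 positions higher) is also set, which is an inference from the
       description above and not confirmed 'redx' behavior. It also treats
       the register as having the full [30:16] accumulated range, although
       the SDC port register shown is only 26 bits wide.
       'annotate_l1_register' is a hypothetical helper name.

```python
def annotate_l1_register(value):
    """Map each reported bit position to '1E', '1E+', or '' (later error)."""
    notes = {}
    for index in range(15):
        first = (value >> index) & 1
        accumulated = (value >> (index + 16)) & 1
        if first and accumulated:
            notes[index] = "1E+"   # first error, then repeated (assumed)
        elif first:
            notes[index] = "1E"    # first error, no repeat recorded
        elif accumulated:
            notes[index + 16] = ""  # recorded after the first error
    return notes


print(annotate_l1_register(0x0028001))  # {0: '1E', 17: ''}
```

       For the SDC example this reports bit 0 as the first error and bit 17
       as a later error, matching the P0Err[0] and P0Err[17] lines.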

       For diagnosis purposes, first errors are of most interest. Note that it
       is possible for an ASIC to have multiple first errors set in its error
       registers, which indicates they were all set on the same clock cycle.

    Error Reporting:

       Error reporting in the 12K/15K is a hardware tree. There is an error 
       concentrator at each level of the interconnect, with the root of the tree
       at the SC. Error reporting happens in three "waves". The first wave is to
       centralize the stop request in the centerplane, specifically the DARBs.
       The second wave is to notify the remainder of the domain resources that 
       a stop request is in process. Finally, the third wave informs the System
       Controller that a stop condition exists in the hardware.

       First wave:
          The ASIC detecting an error reports the error to its nearest error
          concentrator. The error concentrators are the EPLDs for L1 boards, the 
          Master SDI for the expanders, and the DARBs for the centerplane. 

          o Any errors detected within an L1 board are reported to the board's EPLD.
            If the error is an ECC error, the EPLD asserts ECC_ERROR to its expander's
            Master SDI. For all other error types, the EPLD asserts ERROR to its Master
            SDI. The Master SDI in turn notifies the DARBs via the texp bus.
          o Any errors detected within an expander are reported to the expander's
            Master SDI. The Master SDI in turn notifies the DARBs via the texp bus.
          o Any errors detected within a centerplane half are reported to the DARB
            servicing that centerplane half via the Xstop bus. The DARBs inform each
            other of any stop requests via the notify (Ntfy) wires connecting the
            DARBs.
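
          The three first-wave paths can be summarized in a small Python
          lookup. This is only a restatement of the bullets above as data;
          'first_wave_path' and the location names are hypothetical labels
          chosen for illustration.

```python
def first_wave_path(location, ecc=False):
    """Return the reporting hops for an error detected at 'location'."""
    if location == "l1_board":
        # The EPLD chooses the signal based on the error type.
        signal = "ECC_ERROR" if ecc else "ERROR"
        return [
            "detecting ASIC",
            "board EPLD",
            "Master SDI (via " + signal + ")",
            "DARBs (via texp bus)",
        ]
    if location == "expander":
        return ["detecting ASIC", "Master SDI", "DARBs (via texp bus)"]
    if location == "centerplane":
        return [
            "detecting ASIC",
            "DARB for that centerplane half (via Xstop bus)",
            "other DARB (via Ntfy wires)",
        ]
    raise ValueError("unknown location: " + repr(location))


for hop in first_wave_path("l1_board", ecc=True):
    print(hop)
```

          In every case the request ends at the DARBs, which is the point of
          the first wave.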

       Second wave:
          Once the detected error(s) has "bubbled" up to the DARBs, it is the DARB
          that declares Dstop and/or Rstop to the remainder of the platform resources.
          The DARBs notify the AMXs/RMXs/DMXs via the Xstop busses and all Master
          SDIs via the texp bus. The stop demand message defines the type of stop
          and the expander/slot(s) in error. The ASIC receiving the demand message
          is responsible for stopping its appropriate ports/slots. 

          o The AMXs/RMXs/DMXs examine the stop message to determine the port (expander)
            in error. If that expander is not a split expander, all other
            non-split expander ports with which the errored port can communicate
            are stopped. If the port (expander) in error is a split expander, only 
            the errored port is stopped.

            The centerplane ASICs cannot blindly stop a port that routes to a split
            expander. Such action could inappropriately halt a domain that is not
            in error, breaching domain isolation. In such cases, stopping the expander
            is deferred to the Master SDI on that expander.

          o Each Master SDI examines the stop message to determine whether it services
            any L1 boards with which the port (expander) in error can communicate. The
            Master SDI uses its configuration registers that define domain membership
            to make this determination. If the Master SDI determines one or both of
            its L1 boards must be stopped, it asserts ERR_PAUSE to that L1 board's AR.
            The Master SDI will also stop itself, the slave SDIs, and the AXQ. In the
            case of a split expander, the appropriate halves of the SDIs and AXQs are
            stopped.
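
          The port selection made by the AMXs/RMXs/DMXs can be sketched in
          Python as below. The data structures (a set of split expanders and
          a map of which expanders each expander can communicate with) are
          assumptions made for illustration, as is the reading that the
          errored port itself is stopped in the non-split case.
          'centerplane_ports_to_stop' is a hypothetical name.

```python
def centerplane_ports_to_stop(errored, split, reachable):
    """Return the expander ports the centerplane ASICs stop directly.

    errored   -- expander number named in the stop message
    split     -- set of expander numbers configured as split expanders
    reachable -- dict mapping an expander to the set it can communicate with
    """
    if errored in split:
        # Split expander in error: only the errored port is stopped.
        return {errored}
    # Non-split expander in error: also stop every non-split peer it can
    # reach. Split peers are left to their own Master SDI.
    peers = {port for port in reachable.get(errored, set()) if port not in split}
    return {errored} | peers


reachable = {3: {4, 5, 6}, 5: {3}}
print(sorted(centerplane_ports_to_stop(3, split={5}, reachable=reachable)))  # [3, 4, 6]
print(sorted(centerplane_ports_to_stop(5, split={5}, reachable=reachable)))  # [5]
```

          Expander 5 is skipped in the first call because it is split;
          stopping the correct half of it is the job of its Master SDI.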

       Third wave:

          When all domain resources have processed the stop message, the DARBs raise an
          interrupt to the System Controller to signal that the hardware requests service.
          hwad in SMS services the interrupt, examines the DARBs to determine the stop
          type and also examines the SDIs to determine which domain(s) are impacted. The
          hardware state dump is taken and SMS proceeds to recover the domain(s).

       The diagram below details the error reporting flow, busses, etc.

                     CENTERPLANE                           EXPANDER                    SLOT 0
     ############################################   #######################   ########################
     #                                          #   #                     #   #                      #
     #  +------+  +------+  +------+  +------+  #   #       +-----+       #   #                      #
     #  | AMX0 |  | RMX0 |  | AMX1 |  | RMX1 |  #   #   +---| AXQ |<--+   #   #                      #
     #  |  x2  |  |      |  |  x2  |  |      |  #   #   |   +-----+   |   #   #                      #
     #  +--^---+  +--^---+  +---^--+  +---^--+  #   #  Stop           |   #   #                      #
     #     |         |          |         |     #   #   |   +-----+   |   #   #                      #
     #     |         |          |         |     #   #   |+--| SDI |<-+|   #   #                      #
     #     |         |          |         |     #   #   ||  +-----+  ||   #   #                      #
     #     |         |          |         |     #   # Requests       Stop #   #                      #
     #  Xstop bus    |      Xstop bus     |     #   #   ||  +-----+  ||   #   #  +------+  +------+  #
     #     |         |          |         |     #   #   ||+-| SDI |<-+|   #   #  |  AR  |  | EPLD |  #
     #     |      Xstop bus     |     Xstop bus #   #   ||| +-----+  ||   #   #  +--^---+  +-^--^-+  #
     #     |         |          |         |     #   #   |||        Demands#   #     |        |  |    #
     #     |         |          |         |     #   #   ||| +-----+  ||   #   ######|########|##|#####  
     #     |         |          |         |     #   #   ||+>|     |--+|   #         |        |  |
     #   +-v---------v-+        |         |     #   #   |+->|     |---+   #         |        |  |
     #   |             |        |         |     #   #   +-->|     |-----ERR_PAUSE---+        |  |
     #   |             |        |         |     #t  #       |     |<----ECC_ERROR_S0---------+  |
     #   |    DARB0    |<-------|---------|------e-b------->|     |<----ERROR_S0----------------+
     #   |             |        |         |     #x u#       | SDI |       #
<--------|             |      +-v---------v-+ /--p-s------->| (M) |       #
Intr #   |             |<---->|             | | #   #       |     |<----ERROR_S1----------------+
 to  #   +------^------+ Ntfy |             |<+ #   #   +-->|     |<----ECC_ERROR_S1---------+  |
SCs  #          |             |    DARB1    |   #   #   |+->|     |-----ERR_PAUSE---+        |  |
<---------------|-------------|             |   #   #   ||+>|     |--+    #         |        |  |
     #          |             |             |   #   #   ||| +-----+  |    #         |        |  |
     #          |             |             |   #   #  Stop          |    #   ######|########|##|#####
     #          |             +-------^-----+   #   #   ||| +-----+  |    #   #     |        |  |    #
     #          |                     |         #   #   ||+-| SDI |<-+    #   #  +--v---+  +------+  #
     #      Xstop bus                 |         #   #   ||  +-----+  |    #   #  |  AR  |  | EPLD |  #
     #          |                     |         #   # Requests      Stop  #   #  +------+  +------+  #
     #          |                 Xstop bus     #   #   ||  +-----+  |    #   #                      #
     #          |                     |         #   #   |+--| SDI |<-+    #   #                      #
     #          |                     |         #   #   |   +-----+  |    #   #                      #
     #      +---v--+              +---v--+      #   #   |          Demands#   #                      #
     #      | DMX0 |              | DMX1 |      #   #   |   +-----+  |    #   #                      #
     #      |  x6  |              |  x6  |      #   #   +---| SDI |<-+    #   #                      #
     #      +------+              +------+      #   #       +-----+       #   #                      #
     #                                          #   #                     #   #                      #
     ############################################   #######################   ########################
                                                                                       SLOT 1

       As an example, take the case where the AXQ detects a parity error. Assume that
       there are no split expanders and both Slot 0 and Slot 1 are part of the domain
       in error.

       1. The AXQ sends a stop request to its Master SDI.
       2. The Master SDI reports this to the DARBs via the texp bus.
       3. The DARBs broadcast the stop request to the AMXs/RMXs/DMXs and all
          Master SDIs in the system.
       4. The Master SDIs examine the stop message and, if appropriate, assert ERR_PAUSE
          to the ARs of the L1 boards in the domain. In this example, ERR_PAUSE is
          asserted to both Slot 0 and Slot 1.
       5. The DARBs raise an interrupt to the System Controller.

- Meta-Data/Problem categorization:
Product/Platform: Sun Fire 12K/15K
Category:

-  Keywords

dstop, error reporting, error recording, overview, primer, 15K, 12K, SF15K, SF12K, starcat                              

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000 ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.