SRDB ID | Synopsis | Date |
48122 | Sun Fire[TM] 12K/15K: An Overview of Dstop Diagnosis | 29 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Dstop Diagnosis - An Overview

- Symptoms: A domain suffers a Dstop. Messages in the platform log are similar to the following:

  Apr 25 08:07:43 2002 sc0 hwad[282]: [1156 3601896159843595 ERR InterruptHandler.cc 2159] Domain Stop interrupt detected, domain B

  Alternatively, during POST execution, the following is noted in the POST log:

  DSTOP Detected for Slot SB13
  SDI EX13/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
  System state dumped to /var/opt/SUNWSMS/SMS1.2/adm/F/dump/xcstate.020805.0427.40
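As a rough illustration of spotting the platform log symptom above, the following Python sketch scans log lines for the Domain Stop message and extracts the timestamp, SC name, and domain. The pattern is inferred only from the single sample line quoted above, and the names `DSTOP_RE` and `find_dstops` are hypothetical; real platform logs may vary, so treat this as a sketch under that assumption rather than a supported tool.

```python
import re

# Pattern inferred from the one sample platform log line in this article.
# Assumption: other Dstop messages follow the same layout.
DSTOP_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) hwad\[\d+\]: .*"
    r"Domain Stop interrupt detected, domain (?P<domain>[A-Za-z])\s*$"
)


def find_dstops(lines):
    """Return (timestamp, host, domain) for each line reporting a Dstop."""
    hits = []
    for line in lines:
        match = DSTOP_RE.match(line.strip())
        if match:
            hits.append(
                (match.group("timestamp"), match.group("host"), match.group("domain"))
            )
    return hits


SAMPLE = [
    # Sample line taken from this article.
    "Apr 25 08:07:43 2002 sc0 hwad[282]: [1156 3601896159843595 ERR "
    "InterruptHandler.cc 2159] Domain Stop interrupt detected, domain B",
    # Hypothetical non-matching line, present only to show filtering.
    "hypothetical unrelated log line",
]

if __name__ == "__main__":
    for timestamp, host, domain in find_dstops(SAMPLE):
        print(f"{timestamp} {host}: Dstop on domain {domain}")
```

Because hits are returned in log order, the first entry is the earliest Dstop in the input.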
SOLUTION SUMMARY:
- Troubleshooting: A Dstop occurs when the system hardware encounters a fatal error. The source/cause of the stop can and does vary, so there is no single answer for how to diagnose a Dstop. The intention of this article is to provide a framework on which to base a Dstop analysis. Specifics on interpreting 'redx' output are left to other articles dealing with specific Dstops, although a brief overview of the objectives of 'wfail' is given in "Background Information" below.

A general approach to Dstop analysis is:

1. Examine the hardware state dump.

This is the obvious first step. Load the state dump into 'redx' and execute 'wfail'. Examine the output from the standpoint of how errors are recorded and reported within the Starcat hardware. Refer to "Background Information" below for further details on error recording/reporting.

In many cases, the cause of the Dstop is obvious: 'wfail' identifies either a single suspect component, or at most two suspects separated by a single interconnect (e.g., a System Board and Expander). However, if the cause of the Dstop is not obvious (e.g., multiple, disparate suspect components, or an undiagnosable timeout), the state dump alone may not be sufficient to identify the source of the Dstop.

2. Characterize the platform at the time of the Dstop.

When the source/cause of a Dstop is not obvious from the state dump, it is wise to characterize the platform activity just prior to the Dstop. If multiple Dstops occur in a short time span, focus initially on the first Dstop. Sources of information include platform message logs, domain message logs, domain console logs, and POST logs. Items of interest include environmental fluctuation (temperature, voltage, etc.), actions on shared components (expanders, centerplanes), or evidence of user actions that could impact the Dstopped domain(s).

3. Consider recent changes/services to the system and the failure history.
When neither the state dump nor the platform characterization yields strong suspect component(s), another avenue of investigation is recent changes to the platform/domain. Changes include hardware replacements, upgrades, and additions. Software changes are also notable, such as updated patches to SMS or Solaris[TM] on the domain. An examination of the failure history of a platform/domain may also reveal a trend or pattern that suggests a suspect component. Such investigation may be the only recourse when faced with an undiagnosable timeout.

From the above steps, the goal is to identify one or more suspect components. Once identified, there are two goals: problem resolution and problem avoidance.

Problem Resolution:

Typically, but not exclusively, a Dstop translates into a component replacement. Once the suspect component is removed from the system and its replacement installed and tested, resolution is achieved. In 'redx', the suspect component(s) to be replaced are indicated by the "Primary/Secondary FRU" lines in 'wfail' output. For example:

  Primary service FRU is EXB EX8.
  Secondary service FRU is CSB C1 or the logic centerplane.

Or, in the case of iterative debugging sessions, other observed behaviors in the system (POST failures, panics, etc.) may be the indicator of a suspect component. However, customer and/or system constraints may delay the scheduling of a maintenance window for component replacement. Hence, problem avoidance becomes key.

Problem Avoidance:

As mentioned above, resolution may be days/weeks/months into the future. However, it may be possible to avoid the source of the problem in the interim. In 'redx', components listed after "FAIL" in 'wfail' output indicate those components to remove from the configuration to avoid the problem. For example:

  FAIL Port SB14/P0: Dstop detected by SDC

The same style of text is also present in POST logs when a Dstop is detected during POST.
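To make the distinction between the two kinds of lines concrete, here is a minimal Python sketch that sorts saved 'wfail' text into "replace" candidates (service FRU lines) and "avoid" candidates (FAIL lines). The line shapes are inferred only from the three sample lines quoted in this article, which come from separate examples and are gathered here purely for illustration. The names `FRU_RE`, `FAIL_RE`, and `summarize` are hypothetical, and actual 'wfail' output may contain other forms.

```python
import re

# Assumption: line formats match the three sample lines in this article.
FRU_RE = re.compile(r"^(Primary|Secondary) service FRU is (.+?)\.?$")
FAIL_RE = re.compile(r"^FAIL (.+?): (.+)$")


def summarize(text):
    """Split text into FRU lines (resolution) and FAIL lines (avoidance)."""
    result = {"replace": [], "avoid": []}
    for raw in text.splitlines():
        line = raw.strip()
        fru = FRU_RE.match(line)
        if fru:
            # (rank, FRU description), e.g. ("Primary", "EXB EX8")
            result["replace"].append((fru.group(1), fru.group(2)))
            continue
        fail = FAIL_RE.match(line)
        if fail:
            # (component to blacklist, reason reported)
            result["avoid"].append((fail.group(1), fail.group(2)))
    return result


# Sample lines copied from this article; not a single real 'wfail' run.
SAMPLE = """\
FAIL Port SB14/P0: Dstop detected by SDC
Primary service FRU is EXB EX8.
Secondary service FRU is CSB C1 or the logic centerplane.
"""

if __name__ == "__main__":
    summary = summarize(SAMPLE)
    print("Replace:", summary["replace"])
    print("Avoid:  ", summary["avoid"])
```

Keeping the two lists separate mirrors the split between resolution and avoidance.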
Components reported as FAILed do NOT necessarily equate to the FRU for problem resolution. In the 'FAIL Port SB14/P0' example above, a processor is FAILed, but a processor is not a FRU in the Starcat - the system board is the FRU.

The problem can be avoided by blacklisting the component(s) listed as FAILed. Blacklisting can also be a useful method for verifying a diagnosis prior to replacing any hardware. In cases where there are multiple suspect components, one component can be blacklisted. If the Dstop does not return, confidence is raised that the blacklisted component is the problem source. And, of course, blacklisting provides an excellent interim solution to maximize domain uptime.

- Resolution: Specific resolution varies from case to case. Typically, resolution to a Dstop is the replacement of a failed component. Other possible resolutions include the application of software patches to SMS/Solaris.

- Summary of part number and patch IDs: Various.

- References and bug IDs: Other knowledge articles discuss specific Dstops in greater detail.

  http://cpre-amer.west.sun.com/esg/hsg/starcat/xctt/redx_dumpanalysis.html
  http://esp.west.sun.com/starcat/post/dstop_101.html

- Additional background information:

'wfail' Objectives:

'redx' provides the 'wfail' command (pronounced "w'fail"), as in "what failed". It is the first line of attack for analyzing a hardware state dump. 'wfail' performs a scan of all the DARBs and master SDIs in the dump and reports any errors captured in those ASICs. If the captured data indicates errors present in other ASICs, 'wfail' follows that chain and displays the errors in those ASICs as well. Finally, 'wfail' reports what component to FAIL from the configuration and the service FRU(s), if any.

Understanding the objectives of 'wfail' is crucial to knowing what the command is useful for and what its limitations are. 'wfail' has three objectives:

1. Report the errors detected in the hardware.
'wfail' generally tries to report which errors occurred first, as these are the most interesting for diagnosis.

2. Report what resource(s) should be deconfigured (FAILed) from the configuration to leave the maximal fault-free configuration.

The semantics 'wfail' uses are precisely equivalent to what POST would choose to do if the error(s) occurred during POST. The FAILed component is NOT necessarily the broken component. Understanding this point is key. But, by preventing the domain from using the FAILed component, the error can be avoided.

For example, suppose the centerplane has a problem communicating with Expander 3. 'wfail' will not FAIL the centerplane, because doing so would impact the entire platform, potentially making it unusable. Rather, the expander is FAILed and the remainder of the system is usable. Even if by some means 'wfail' knew the fault was on the centerplane (which it can't), the expander would still be FAILed. Overall, this has less impact on the platform. Thus, the maximal fault-free configuration is provided.

3. Recommend which FRU(s) to replace.

The key word is recommend. If the fault is across an interconnect, such as the centerplane/expander example above, isolating to a single FRU is not possible. Such is the nature of an interconnect architecture. However, 'wfail' will call out a primary FRU and, when applicable, a secondary FRU in such cases. The troubleshooter must make a judgment on which component to target for replacement.

Error Recording:

The error registers in the 12K/15K ASICs are each 32 bits in size and can record up to 15 different errors. The registers are organized to distinguish between first errors and accumulated errors. This organization applies to the SDIs, AXQs, and most of the L1 board ASICs, although the handling of the accumulated error bits differs slightly.
A typical error register is as follows:

 31                          16 15 14                       0
+--+---------------------------+--+--------------------------+
|  |                           |  |                          |
|  | Accum Error Flags [30:16] |1E| First Error Flags [14:0] |
|  |                           |  |                          |
+--+---------------------------+--+--------------------------+
1E = 1st Error

The first error bits [14:0] are only set if no other errors have already been recorded in the register. Bit [15] is set only if a bit [14:0] is being set and no other error registers in the ASIC already have bit [15] set. Thus, an ASIC can accurately report the first error it encountered. Note that the first error is with respect to the ASIC. It does not necessarily indicate the first error within the domain.

The accumulated error bits [30:16] have a 1-to-1 relationship to the first errors [14:0]. For the SDI and AXQ, the accumulated errors are always set when the corresponding first error bit is set. Consider the following example:

  SDI EX14/S0 Dstop0[31:0] = 10019000
    Dstop0[16]: D DARB texp requests all Dstop (M)
    Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)

The register value is 10019000 (hexadecimal). This equates to bits 12, 15, 16 and 28 being set. The first error is bit 12. It is the only bit set in the [14:0] range. When bit 12 is set, bit 15 is also set so the ASIC records subsequent errors in the accumulated area. Also, when bit 12 is set, its corresponding accumulated bit 28 is set. Sometime after the first error event, another error is recorded and bit 16 is set. It does not have a corresponding first error because the first error has already occurred.

'redx' decodes the accumulated error bits [30:16] in the register. This is intentional - we want to see all of the errors that were recorded in the ASIC. To distinguish which error occurred first, first errors are flagged with a '1E'.

For L1 board ASICs, accumulated error bits are treated slightly differently. Unlike the AXQ and SDI, accumulated error bits are not set until a repeated error of the same type occurs.
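As a cross-check of the SDI EX14/S0 bit arithmetic above, the following Python sketch splits a 32-bit register value into the three fields of the layout: first error bits [14:0], the 1E flag at bit 15, and accumulated bits [30:16]. The function and key names are hypothetical, and the "later_errors" reading assumes the SDI/AXQ convention described above. This is an illustration only, not how 'redx' itself is implemented.

```python
FIRST_FLAG_BIT = 15              # bit 15: the "1E" marker
ACCUM_LOW, ACCUM_HIGH = 16, 30   # bits [30:16]: accumulated error flags


def set_bits(value, low=0, high=31):
    """Return the bit numbers set in value within [low, high]."""
    return [bit for bit in range(low, high + 1) if (value >> bit) & 1]


def split_register(value):
    """Split an error register value into its documented fields."""
    first = set_bits(value, 0, 14)
    accumulated = set_bits(value, ACCUM_LOW, ACCUM_HIGH)
    return {
        "set_bits": set_bits(value),
        "first_errors": first,
        "first_error_flag": bool((value >> FIRST_FLAG_BIT) & 1),
        "accumulated": accumulated,
        # Accumulated bits with no matching first error bit (bit - 16).
        # Under the SDI/AXQ convention these were recorded after the
        # first error.
        "later_errors": [bit for bit in accumulated if (bit - 16) not in first],
    }


if __name__ == "__main__":
    fields = split_register(0x10019000)
    for name, value in fields.items():
        print(f"{name}: {value}")
```

For the value 10019000 this reports bits 12, 15, 16 and 28 as set, bit 12 as the first error, and bit 16 as an error recorded later, matching the walk-through above.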
Consider this example:

  SDC SB14 PortErr [0][25:0] = 0028001 (Safari Port 0)
    P0Err[ 0]: 1E Parity Bidi error
    P0Err[ 17]: Parity Single error

The register value is 0028001 (hexadecimal). This equates to bits 0, 15 and 17 being set. The first error is bit 0. It is the only bit set in the [14:0] range. When bit 0 is set, bit 15 is also set so the ASIC records subsequent errors in the accumulated area. Unlike the AXQ and SDI, bit 16 (bit 0's corresponding accumulated bit) is not set. Sometime after the first error event, another error is recorded and bit 17 is set.

As before, 'redx' denotes the first error with a '1E'. If repeated errors of the same type occur, the notation becomes '1E+'.

For diagnosis purposes, first errors are of most interest. Note that it is possible for an ASIC to have multiple first errors set in its error registers, which indicates they were all set on the same clock cycle.

Error Reporting:

Error reporting in the 12K/15K is a hardware tree. There is an error concentrator at each level of the interconnect, with the root of the tree at the SC. Error reporting happens in three "waves". The first wave centralizes the stop request in the centerplane, specifically the DARBs. The second wave notifies the remainder of the domain resources that a stop request is in process. Finally, the third wave informs the System Controller that a stop condition exists in the hardware.

First wave:

The ASIC detecting an error reports the error to its nearest error concentrator. The error concentrators are the EPLDs for L1 boards, the Master SDI for the expanders, and the DARBs for the centerplane.

o Any errors detected within an L1 board are reported to the board's EPLD. If the error is an ECC error, the EPLD asserts ECC_ERROR to its expander's Master SDI. For all other error types, the EPLD asserts ERROR to its Master SDI. The Master SDI in turn notifies the DARBs via the texp bus.

o Any errors detected within an expander are reported to the expander's Master SDI.
  The Master SDI in turn notifies the DARBs via the texp bus.

o Any errors detected within a centerplane half are reported to the DARB servicing that centerplane half via the Xstop bus. The DARBs inform each other of any stop requests via the notify (Ntfy) wires connecting the DARBs.

Second wave:

Once the detected error(s) have "bubbled" up to the DARBs, it is the DARB that declares Dstop and/or Rstop to the remainder of the platform resources. The DARBs notify the AMXs/RMXs/DMXs via the Xstop busses and all Master SDIs via the texp bus. The stop demand message defines the type of stop and the expander/slot(s) in error. The ASIC receiving the demand message is responsible for stopping its appropriate ports/slots.

o The AMXs/RMXs/DMXs examine the stop message to determine the port (expander) in error. If that expander is not a split expander, all other non-split expander ports with which the errored port can communicate are stopped. If the port (expander) in error is a split expander, only the errored port is stopped. The centerplane ASICs cannot blindly stop a port that routes to a split expander. Such action could inappropriately halt a domain that is not in error, breaching domain isolation. In such cases, stopping the expander is deferred to the Master SDI on that expander.

o Each Master SDI examines the stop message to determine whether it services any L1 boards with which the port (expander) in error can communicate. The Master SDI uses its configuration registers that define domain membership to make this determination. If the Master SDI determines one or both of its L1 boards must be stopped, it asserts ERR_PAUSE to that L1 board's AR. The Master SDI will also stop itself, the slave SDIs, and the AXQ. In the case of a split expander, the appropriate halves of the SDIs and AXQs are stopped.

Third wave:

When all domain resources have processed the stop message, the DARBs raise an interrupt to the System Controller to signal that the hardware requests service.
hwad in SMS services the interrupt, examines the DARBs to determine the stop type, and also examines the SDIs to determine which domain(s) are impacted. The hardware state dump is taken and SMS proceeds to recover the domain(s).

The diagram details the error reporting flow, busses, etc.

[Diagram: error reporting flow across CENTERPLANE, EXPANDER, SLOT 0 and SLOT 1. Centerplane: AMX0 (x2), RMX0, AMX1 (x2), RMX1, DMX0 (x6) and DMX1 (x6) connect to DARB0 and DARB1 over Xstop busses; DARB0 and DARB1 are linked by Ntfy wires and raise Intr to the SCs. Expander: the AXQ and slave SDIs send Stop Requests to, and receive Stop Demands from, the Master SDI (M), which connects to the DARBs over the texp bus. Slot 0 and Slot 1: each L1 board's EPLD drives ERROR_S0/ECC_ERROR_S0 (Slot 0) or ERROR_S1/ECC_ERROR_S1 (Slot 1) to the Master SDI, and the Master SDI drives ERR_PAUSE to each board's AR.]

As an example, take the case where the AXQ detects a parity error. Assume that there are no split expanders and both Slot 0 and Slot 1 are part of the domain in error.

1. The AXQ sends a stop request to its Master SDI.
2. The Master SDI reports this to the DARBs via the texp bus.
3. The DARBs broadcast the stop request to the AMXs/RMXs/DMXs and all Master SDIs in the system.
4. The Master SDIs examine the stop message and, if appropriate, assert ERR_PAUSE to the ARs of the L1 boards in the domain. In this example, ERR_PAUSE is asserted to both Slot 0 and Slot 1.
5. The DARBs raise an interrupt to the System Controller.

- Meta-Data/Problem categorization:
  Product/Platform: Sun Fire 12K/15K
  Category:

- Keywords: dstop, error reporting, error recording, overview, primer, 15K, 12K, SF15K, SF12K, starcat
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: