Document fins/I0788-1
FIN #: I0788-1
SYNOPSIS: Running POST on one domain may Dstop all other running domains on Sun
Fire 15K systems with SMS 1.1
DATE: Mar/08/02
KEYWORDS: Running POST on one domain may Dstop all other running domains on Sun
Fire 15K systems with SMS 1.1
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: Running POST on one domain may Dstop all other running
domains on Sun Fire 15K systems with SMS 1.1.
Sun Alert: Yes
TOP FIN/FCO REPORT: Yes
PRODUCT_REFERENCE: SMS 1.1 on Sun Fire 15K
PRODUCT CATEGORY: Server / SW Admin
PRODUCTS AFFECTED:
Systems Affected
----------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- F15K ALL Sun Fire 15000 -
X-Options Affected
------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- - - - -
PART NUMBERS AFFECTED:
Part Number Description Model
----------- ----------- -----
- - -
REFERENCES:
BugId: 4505473 - AMX data ECC uncorrectable error.
PatchId: 112080 - SMS 1.1: Patch IBIST for pause wafer change.
ESC: 534366 - All domain down due to DSTOP.
SunAlert: 42881
PROBLEM DESCRIPTION:
A software error on one domain (such as a heartbeat failure, panic
timeout, or error-reset) can cause another domain to Dstop on Sun
Fire 15K systems running SMS 1.1. The issue manifests when POST,
running on one domain, Dstops all other running domains.
While the occurrence is rare, the impact is platform wide. Depending
upon domain configuration and applications, down time can be several
hours. This problem is intermittent and may be related to a
domain sync operation on the centerplane (reset of unused ports).
Running POST on one domain means that the power-on self tests are
executed on a domain in the system. POST runs when a domain is
initially brought online, during a DR attach of a board (not currently
supported), or as a recovery action performed by the SMS software to
get a domain back up and running after a reboot, panic, or Dstop.
A message in the platform message log (/var/opt/SUNWSMS/adm/platform/messages)
would report:
Jan 17 20:25:55 2002 swmtft901 hwad[22514]: [1156 1693005732870614 ERR
InterruptHandler.cc 2127] Domain Stop interrupt detected, domain XXX
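To check a platform message log for this event, the message above can
be matched with a short script. The following is a minimal sketch in
Python, not an SMS tool. The function name find_dstops and the regular
expression are assumptions modeled only on the single sample message
shown in this FIN, and the sketch assumes each message occupies one
line in the log file.

```python
import re

# Pattern modeled on the one sample message in this FIN (assumption);
# other hwad message layouts are not covered.
DSTOP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) hwad\[\d+\]:"
    r".*Domain Stop interrupt detected, domain (?P<domain>\S+)"
)


def find_dstops(lines):
    """Return (timestamp, host, domain) for each Dstop message found."""
    hits = []
    for line in lines:
        match = DSTOP_RE.search(line)
        if match:
            hits.append(
                (match.group("stamp"), match.group("host"), match.group("domain"))
            )
    return hits
```

A typical use would be to pass an open file object for
/var/opt/SUNWSMS/adm/platform/messages to find_dstops and review the
returned tuples.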
SMS then creates a Dstop dump file in /var/opt/SUNWSMS/adm/[XXX]/dump.
The file name has the form dsmd.dstop.YYMMDD.hhmm.ss
(dsmd.dstop.020117.2025.55 for this example). If this dump file is
opened with "redx" and the "wfail" command is issued, the output below
is reported. For example:
sc% redx -cl
redx> dumpf load dsmd.dstop.020117.2025.55
redx> wfail
...output below...
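The dump file name follows the pattern dsmd.dstop.YYMMDD.hhmm.ss, so a
candidate name can be derived from the timestamp of the Dstop message
in the platform log. The sketch below is illustrative only. The
function name dstop_dump_name is hypothetical, and it assumes the dump
carries the same timestamp as the log message, as it does in this
example. On a live system the two may differ, so confirm the name
against the contents of the dump directory.

```python
from datetime import datetime


def dstop_dump_name(log_stamp):
    """Build a dsmd.dstop.YYMMDD.hhmm.ss name from a log timestamp.

    log_stamp uses the layout seen in the platform log sample,
    for example "Jan 17 20:25:55 2002".
    """
    stamp = datetime.strptime(log_stamp, "%b %d %H:%M:%S %Y")
    return stamp.strftime("dsmd.dstop.%y%m%d.%H%M.%S")


print(dstop_dump_name("Jan 17 20:25:55 2002"))
```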
The Dstop signature of this issue is as follows:
SDI EX03/S0 Master_Stop_Status0[31:0] = 7004004F
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX03/S0 Dstop0[31:0] = 12018200
Dstop0[16]: D DARB texp requests all Dstop (M)
Dstop0[25]: D 1E AXQ requests all Dstop (M)
Dstop0[28]: D Slot0 asserted Error, enabled to cause Dstop (M)
AXQ EX03 ( 3) Error_Flag_02[31:0] = 04008400 Mask = 0000FFFF
Err2[26]: D 1E AMX 0-3 hs flow control didn't arrive simultaneously
FAIL EXB EX3: Dstop/Rstop detected by AXQ.
Primary service FRU is EXB EX3.
SDI EX04/S0 Master_Stop_Status0[31:0] = 0004000F
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX04/S0 Dstop0[31:0] = 02018200
Dstop0[16]: D DARB texp requests all Dstop (M)
Dstop0[25]: D 1E AXQ requests all Dstop (M)
AXQ EX04 ( 4) Error_Flag_03[31:0] = 30009000 Mask = 21005EFF
Err3[28]: D 1E AMX data ECC uncorrectable error
Err3[29]: R AMX data ECC correctable error
FAIL EXB EX4: Dstop/Rstop detected by AXQ.
Primary service FRU is EXB EX4.
The AMX flow control error shown above is the key message. The system
will recover automatically via ASR (automatic system recovery). After
recording the Dstop information, SMS restarts the domain(s).
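When several dumps need to be reviewed, saved "wfail" output can be
screened for this signature. The following Python sketch is an
illustration under stated assumptions, not part of redx. The helper
names (set_bits, matches_signature, service_frus) are hypothetical,
and the patterns are taken only from the output shown in this FIN. The
bit decoding assumes that an index such as Dstop0[16] is a bit
position with bit 0 as the least significant bit, which agrees with
the register values listed above.

```python
import re

# Key message for this issue, as it appears in the wfail output above.
AMX_FLOW_RE = re.compile(r"Err2\[26\]:.*AMX 0-3 hs flow control didn't arrive")
# FRU lines of the form "Primary service FRU is EXB EX3."
FRU_RE = re.compile(r"Primary service FRU is (.+?)\.\s*$", re.MULTILINE)


def set_bits(hex_value):
    """Return the bit positions (0 = least significant) set in a 32-bit hex value."""
    value = int(hex_value, 16)
    return [bit for bit in range(32) if (value >> bit) & 1]


def matches_signature(wfail_text):
    """True if the AMX flow control error is present in the wfail output."""
    return bool(AMX_FLOW_RE.search(wfail_text))


def service_frus(wfail_text):
    """Return the primary service FRUs named in the wfail output."""
    return FRU_RE.findall(wfail_text)
```

For the first Dstop0 value above, set_bits("12018200") includes bits
16, 25, and 28, the three bits that the output annotates. The register
also has other bits set that the output does not annotate, so the
decoded list is a cross-check rather than a replacement for the redx
annotations.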
Any SMS 1.1 installation without patch 112080 or later is
susceptible to this problem. SMS 1.2 and higher are not affected
by this issue.
The root cause of the problem is the AMX ASIC, which doesn't handle
port resets correctly. The bug fix changes how POST performs the
reset to ensure it's done safely.
A Dstop, or Domain Stop, occurs when the hardware detects an
unrecoverable error. The ASICs in the system cease processing
transactions as quickly as possible to prevent further corruption of
data and to facilitate debugging. In this case a Dstop also occurs
during the centerplane reset of ports, because the AMX has a problem
with port resets that are not done under domain sync. Changing the
reset so that it is done under domain sync causes the problem to go
away.
IMPLEMENTATION:
---
| | MANDATORY (Fully Pro-Active)
---
---
| X | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| | REACTIVE (As Required)
---
CORRECTIVE ACTION:
An Authorized Enterprise Services Field Representative may avoid the
problem described above by following the recommendation below.
For a permanent fix, install SMS 1.1 Patch 112080 or later,
or upgrade to SMS 1.2. This patch is specifically for SMS 1.1 and is
not tied to any particular Solaris OS release.
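To confirm whether the patch is present, the patch list of the system
running SMS can be examined. The sketch below parses text in the
layout printed by the Solaris "showrev -p" command, where each entry
begins with "Patch: <id>-<revision>". It is a minimal illustration,
the function name patch_revision is hypothetical, and the revision
numbers in the usage example are invented for demonstration.

```python
import re


def patch_revision(showrev_output, patch_id="112080"):
    """Return the highest installed revision of patch_id, or None if absent.

    showrev_output is the captured text of "showrev -p" (assumed layout:
    one "Patch: <id>-<rev> ..." entry per line).
    """
    pattern = r"^Patch:\s*%s-(\d+)" % re.escape(patch_id)
    revisions = re.findall(pattern, showrev_output, re.MULTILINE)
    return max(int(rev) for rev in revisions) if revisions else None


# Hypothetical captured output; the revisions shown are not real data.
sample = (
    "Patch: 108528-13 Obsoletes:  Requires:  Incompatibles:  Packages: \n"
    "Patch: 112080-03 Obsoletes:  Requires:  Incompatibles:  Packages: \n"
)
print(patch_revision(sample))
```

A return value of None on an SMS 1.1 installation indicates that the
system is still susceptible and the patch should be installed.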
COMMENTS:
None
=========================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
-------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Supporting Documents:
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.