Document fins/I0796-1
FIN #: I0796-1
SYNOPSIS: Sun Fire Server domain may hang in the event of data parity errors
DATE: Mar/25/02
KEYWORDS: Sun Fire Server domain may hang in the event of data parity errors
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: Sun Fire Server domain may hang in the event of data parity
errors.
Sun Alert: Yes
TOP FIN/FCO REPORT: Yes
PRODUCT_REFERENCE: Sun Fire 3800/4800/4810/6800 Servers
PRODUCT CATEGORY: Server / SW Admin
PRODUCTS AFFECTED:
Systems Affected
----------------
Mkt_ID   Platform   Model   Description     Serial Number
------   --------   -----   -----------     -------------
  -      S8         ALL     Sun Fire 3800         -
  -      S12        ALL     Sun Fire 4800         -
  -      S12i       ALL     Sun Fire 4810         -
  -      S24        ALL     Sun Fire 6800         -
X-Options Affected
------------------
Mkt_ID   Platform   Model   Description     Serial Number
------   --------   -----   -----------     -------------
  -         -         -          -                -
PART NUMBERS AFFECTED:
Part Number Description Model
----------- ----------- -----
- - -
REFERENCES:
BugId:     4431384 - Data path parity error reporting is enabled, 1-bit
                     ECC errors can kill domain.
           4470487 - Data path parity error is enabled, single bit ECC
                     causes domain pause.
PatchId:   112127 or later - Hardware/PROM: Sun Fire 3800/4800/4810/6800
                     Systems flashprom update.
ESC:       534843
Sun Alert: 42804
PROBLEM DESCRIPTION:
In the rare event that a Sun Fire System Controller board detects
data parity errors, the domain may be paused to prevent corrupted
data from being written to permanent storage. The pause is unnecessary
for single-bit data corruption, which can be corrected without
pausing the domain.
Sun Fire servers have many error detection capabilities designed into
the communication network. For errors that cannot be corrected, the
domain in which the error was detected is paused to keep corrupted data
from propagating to persistent storage. The system controller reports
a data parity error, and the domain remains paused until the
"setkeyswitch" off/on sequence is performed.
This issue can occur on the following platforms:
Sun Fire 3800/4800/4810/6800 without patch 112127
NOTE: This issue occurs only in the event of data parity errors.
The error messages are logged to the domain console and can be displayed
with the "showlogs" command in the domain shell, provided they have not
been flushed from the message history by subsequent log messages.
Solaris log facilities do not show this information because it is
generated by the system controller, not by Solaris.
The platform console on the system controller also logs a message
indicating that the domain is paused, although the details of the
domain pause are available only on the domain console. If syslog
facilities have been configured on the system controller, the messages
are also logged to both the domain and platform syslog hosts.
The error messages resemble the following:
Jun 14 11:25:22 k12-5a Domain-A.SC:
/partition0/domain0/SB0/bbcGroup0/cpuAB/dcds7:
>>> Cheetah1ErrorStatus[0x61] : 0x01008100
FE [15:15] : 0x1
AccSafDPerr [24:24] : 0x1
SafDPerr [08:08] : 0x1 Fireplane data parity error
Jun 14 11:25:22 k12-5a Domain-A.SC: Domain A is currently paused
due to an error. This domain must be turned off via "setkeyswitch
off" to recover
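The register dump in the sample above can be cross-checked by hand: the
value 0x01008100 has exactly bits 24, 15 and 8 set, which matches the
three fields listed under it. The following is a minimal Python sketch
of that decoding. The field names and bit positions are copied from the
sample message; the FIELDS table and the decode_status helper are
illustrative names only, not part of any Sun tool, and the table covers
only the bits that appear in this sample.

```python
# Bit positions taken from the sample message in this FIN. Other bits of
# the register are not described here and are therefore not listed.
FIELDS = {
    "AccSafDPerr": 24,
    "FE": 15,
    "SafDPerr": 8,   # Fireplane data parity error
}


def decode_status(value):
    """Return the names of the known fields whose bit is set in value."""
    return [name for name, bit in FIELDS.items() if (value >> bit) & 1]


if __name__ == "__main__":
    # Value from the sample Cheetah1ErrorStatus line.
    print(decode_status(0x01008100))
```

A status value with the SafDPerr bit set is the condition this FIN
addresses.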
The error messages normally appear immediately after the pause event.
If the system controller is very busy (for example, booting another
domain), they may be delayed by a few seconds.
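Where the system controller messages are forwarded to a syslog host, a
site may wish to scan the collected log for paused domains. The
following Python sketch shows one possible approach. It assumes the
message wording shown in the sample above; the pattern and the function
name find_paused_domains are illustrative assumptions, not part of any
Sun tool, and real logs may word or wrap the message differently.

```python
import re

# Pattern based only on the sample message in this FIN (an assumption).
PAUSED_RE = re.compile(r"Domain\s+([A-Z])\s+is\s+currently\s+paused")


def find_paused_domains(log_text):
    """Return a sorted list of domain letters reported as paused."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    return sorted(set(PAUSED_RE.findall(flat)))


if __name__ == "__main__":
    sample = (
        "Jun 14 11:25:22 k12-5a Domain-A.SC: Domain A is currently paused\n"
        'due to an error. This domain must be turned off via "setkeyswitch\n'
        'off" to recover\n'
    )
    print(find_paused_domains(sample))
```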
One of the error detection capabilities is parity checking on the
data interconnect. In the initial release of firmware, a detected
data parity error will cause the domain to pause, as mentioned
above. However, since data transfers are also covered by Error
Correction Codes (ECC), the upset of a single data bit can be
corrected and therefore need not be cause for pausing the domain.
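The difference between parity checking and ECC can be illustrated with
a toy example. The Python sketch below uses a textbook Hamming(7,4)
code, chosen only for brevity. It is NOT the ECC scheme used on the Sun
Fire data path, which this FIN does not describe. The sketch shows that
a single flipped bit changes the overall parity, so a parity check
detects it, and that the ECC syndrome also locates the bit, so the data
can be corrected.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers 1-based positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers 1-based positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers 1-based positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code):
    """Return (corrected codeword, 1-based error position, 0 if none)."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s4 = code[3] ^ code[4] ^ code[5] ^ code[6]
    pos = s1 + 2 * s2 + 4 * s4   # syndrome gives the flipped position
    fixed = list(code)
    if pos:
        fixed[pos - 1] ^= 1
    return fixed, pos


def parity(bits):
    """Overall parity of a bit list: 0 if even number of ones, else 1."""
    return sum(bits) % 2


if __name__ == "__main__":
    good = hamming74_encode([1, 0, 1, 1])
    bad = list(good)
    bad[4] ^= 1                                # upset a single bit
    print(parity(good) != parity(bad))         # parity detects the upset
    print(hamming74_correct(bad)[0] == good)   # ECC corrects it
```

Parity alone reports that something changed; the ECC syndrome
identifies which bit to flip back, which is why a single-bit upset need
not stop the domain.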
Pausing a domain is a Serengeti hardware feature: when a hardware error
is detected, all traffic on the Safari bus is stopped immediately to
prevent data corruption. A core file is of no use in this case because
this is not a software problem. This differs from a normal Solaris
panic, in which Solaris continues to commit data and dumps a core
file.
Depending on the setting of reboot-on-error (see the setupdomain and
showdomain commands), the domain will either be reset automatically
or must be manually reset via setkeyswitch.
With the latest firmware release for the Sun Fire platform, data parity
is still checked and detected errors are still reported. However, in
the unlikely event of a data parity error, the domain is no longer
paused, which avoids unnecessary unscheduled downtime. See the
Corrective Action section for the recommended firmware update.
IMPLEMENTATION:
---
| | MANDATORY (Fully Proactive)
---
---
| X | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| | REACTIVE (As Required)
---
CORRECTIVE ACTION:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.
Please adhere to the following procedure as necessary to perform the
flashprom update on Sun Fire 3800/4800/4810/6800 systems:
Install PatchId 112127 or later.
NOTE: Improved data parity handling has been incorporated into the
solution provided above.
COMMENTS:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.