

FIN #: I0796-1

SYNOPSIS: Sun Fire Server domain may hang in the event of data parity errors

DATE: Mar/25/02

KEYWORDS: Sun Fire Server domain may hang in the event of data parity errors


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)

 

SYNOPSIS: Sun Fire Server domain may hang in the event of data parity 
          errors.
         

Sun Alert:          Yes              

TOP FIN/FCO REPORT: Yes 
 
PRODUCT_REFERENCE:  Sun Fire 3800/4800/4810/6800 Servers
 
PRODUCT CATEGORY:   Server / SW Admin


PRODUCTS AFFECTED:  
 
Systems Affected
----------------
Mkt_ID   Platform   Model      Description       Serial Number
------   --------   -----      -----------       -------------
  -        S8        ALL       Sun Fire 3800           -   
  -        S12       ALL       Sun Fire 4800           -
  -        S12i      ALL       Sun Fire 4810           -
  -        S24       ALL       Sun Fire 6800           -


X-Options Affected
------------------
Mkt_ID   Platform   Model   Description                   Serial Number
------   --------   -----   -----------                   -------------
  -         -         -          -                              -


PART NUMBERS AFFECTED: 

Part Number   Description                    Model
-----------   -----------                    -----
     -             -                           -


REFERENCES:

BugId:     4431384 - Data path parity error reporting is enabled, 1-bit 
                     ECC errors can kill domain.
           4470487 - Data path parity error is enabled, single bit ECC 
                     causes domain pause.

PatchId:   112127 or later - Hardware/PROM: Sun Fire 3800/4800/4810/6800 
                                Systems flashprom update.
 
ESC:       534843

Sun Alert: 42804


PROBLEM DESCRIPTION:

In the rare event that a Sun Fire System Controller board detects
data parity errors, the domain may be paused to prevent corrupted
data from being written to permanent storage.  Pausing is unnecessary
for single-bit data corruption, because such corruption can be
corrected without pausing the domain.

Sun Fire servers have many error detection capabilities designed into
the communication network.  For errors that cannot be corrected, the
domain in which the error was detected is paused to keep any data
corruption from propagating to persistent storage.  The system
controller reports a data parity error and the domain remains paused
until the "setkeyswitch" off/on sequence is performed.

This issue can occur with the following platform: 

       Sun Fire 3800/4800/4810/6800 without patch 112127 

NOTE: This issue occurs only in the event of data parity errors.

The error messages are logged to the domain console and appear in the
output of the "showlogs" command on the domain shell, provided they
have not been flushed out of the message history by subsequent log
messages. Solaris log facilities will not show this information,
because it is generated by the system controller and not by Solaris.

The platform console on the system controller will also log a message
indicating that the domain is paused, although the details of the
domain pause are available only on the domain console. If the syslog
facilities have been configured on the system controller, the messages
will be logged there as well, for both the domain and platform syslog
hosts.

The error messages resemble the following:

        Jun 14 11:25:22 k12-5a Domain-A.SC:
        /partition0/domain0/SB0/bbcGroup0/cpuAB/dcds7:
        >>> Cheetah1ErrorStatus[0x61] : 0x01008100
                      FE [15:15] : 0x1
             AccSafDPerr [24:24] : 0x1
                SafDPerr [08:08] : 0x1 Fireplane data parity error
        Jun 14 11:25:22 k12-5a Domain-A.SC: Domain A is currently paused  
        due to an error.  This domain must be turned off via "setkeyswitch
        off" to recover
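
As a reading aid for the sample message above, the following minimal
Python sketch decodes the Cheetah1ErrorStatus value using only the
three bit positions printed in that message. The helper names
(decode_status, residual_bits) are hypothetical, and nothing is assumed
about the remaining bits of the register.

```python
# Bit positions copied from the sample console message in this notice.
# Only these three single-bit fields are known here; the rest of the
# register layout is deliberately not assumed.
FIELDS = {
    "FE": 15,
    "AccSafDPerr": 24,
    "SafDPerr": 8,   # annotated "Fireplane data parity error" in the log
}


def decode_status(value):
    """Return the sorted names of the listed fields that are set in value."""
    return sorted(name for name, bit in FIELDS.items() if (value >> bit) & 1)


def residual_bits(value):
    """Return whatever bits of value are not covered by FIELDS."""
    mask = sum(1 << bit for bit in FIELDS.values())
    return value & ~mask


status = 0x01008100  # value shown in the sample message
print(decode_status(status))       # ['AccSafDPerr', 'FE', 'SafDPerr']
print(hex(residual_bits(status)))  # 0x0
```

For the sample value, all three listed bits are set and no other bits
remain, which matches the three field lines printed in the message.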

The error messages normally appear immediately after the pause event.
If the system controller is very busy (for example, booting another
domain), they may be delayed by a few seconds.
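
Where system controller messages are forwarded to a syslog host, a
small script can flag paused domains. The Python sketch below is an
illustration only: the pattern is modelled on the single sample
message in this notice, real messages may differ in wording or line
wrapping, and the function name paused_domains is hypothetical.

```python
import re

# Pattern based on the sample message text in this notice (assumption:
# the domain is identified by a single capital letter, as in "Domain A").
PAUSE_RE = re.compile(r"Domain ([A-Z]) is currently paused")


def paused_domains(lines):
    """Return the sorted domain letters reported as paused in lines."""
    found = set()
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


# Lines taken from the sample message shown earlier in this notice.
sample = [
    "Jun 14 11:25:22 k12-5a Domain-A.SC: Domain A is currently paused",
    'due to an error.  This domain must be turned off via "setkeyswitch',
    'off" to recover',
]
print(paused_domains(sample))  # ['A']
```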

One of the error detection capabilities is parity checking on the
data interconnect. In the initial release of firmware, a detected
data parity error will cause the domain to pause, as mentioned 
above. However, since data transfers are also covered by Error 
Correction Codes (ECC), the upset of a single data bit can be 
corrected and therefore need not be cause for pausing the domain.
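
The distinction between parity and ECC can be illustrated with a small
Python sketch. This notice does not describe the ECC scheme actually
used on the interconnect; the textbook Hamming(7,4) code below is a
stand-in chosen only to show why a single flipped bit can be detected
by parity but located and corrected by ECC. All function names are
hypothetical.

```python
def parity(bits):
    """Even-parity bit over a list of 0/1 values."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits, error position); position 0 means no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4      # syndrome = 1-based error position
    if pos:
        c[pos - 1] ^= 1             # flip the single faulty bit back
    return [c[2], c[4], c[5], c[6]], pos


data = [1, 0, 1, 1]

# Parity alone: a flipped bit is detected but cannot be located.
corrupted = data.copy()
corrupted[1] ^= 1
print(parity(corrupted) != parity(data))   # True

# ECC: the same kind of single-bit upset is located and corrected.
word = hamming74_encode(data)
word[4] ^= 1                               # upset position 5
print(hamming74_correct(word))             # ([1, 0, 1, 1], 5)
```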

Domain pause is a Serengeti hardware feature: when the hardware detects
an error, it immediately stops all traffic on the Safari bus in order
to prevent data corruption.  A core file is useless in this case
because this is not a software problem.  This differs from a normal
Solaris panic, in which Solaris continues committing data and dumps a
core file.

Depending on the setting of reboot-on-error (see the setupdomain and
showdomain commands), the domain will either be reset automatically
or must be manually reset via setkeyswitch.

With the latest firmware release for the Sun Fire platform, data parity
will still be checked and detected errors reported.  However, in the
unlikely event of a data parity error, the domain will no longer be
paused, thus avoiding unnecessary unscheduled downtime.  See the
Corrective Action section for the recommended firmware update.


IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        | X |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem. 

Please adhere to the following procedure as necessary to perform the
flashprom update on Sun Fire 3800/4800/4810/6800 systems:

   Install PatchId 112127 or later.  

NOTE: Improved data parity handling has been incorporated into the 
      solution provided above. 
   
   
COMMENTS:  

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.