Document fins/I0760-1

FIN #: I0760-1

SYNOPSIS: Too many Memory DIMMs are being unnecessarily replaced on UltraSPARC
          III family of systems utilizing NG-DIMM memory

DATE: Aug/04/02

KEYWORDS: Too many Memory DIMMs are being unnecessarily replaced on UltraSPARC
          III family of systems utilizing NG-DIMM memory


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS:  Too many Memory DIMMs are being unnecessarily replaced on
           UltraSPARC III family of systems utilizing NG-DIMM memory.          
             
             
Sun Alert:          No             

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  UltraSPARC III family of systems utilizing NG-DIMM memory
 
PRODUCT CATEGORY:   Server / Service 


PRODUCTS AFFECTED:  

Systems Affected   
----------------
Mkt_ID     Platform    Model   Description          Serial Number
------     --------    -----   -----------          -------------
  -        S8           ALL    Sun Fire 3800              -
  -        S12          ALL    Sun Fire 4800              -
  -        S12i         ALL    Sun Fire 4810              -
  -        S24          ALL    Sun Fire 6800              -
  -        F15K         ALL    Sun Fire 15000             -
  -        A28          ALL    Sun Blade 1000             -
  -        A35          ALL    Sun Fire 280R              -
  -        A30          ALL    Sun Fire V880              -
  -        N28          ALL    Netra 20                   -


X-Options Affected
------------------
Mkt_ID     Platform   Model   Description         Serial Number
------     --------   -----   -----------         -------------
  -           -         -          -                    -


PART NUMBERS AFFECTED:

Part Number   		Description   		Model
-----------   		-----------   		-----
     -                       -                    -


REFERENCES:

N/A

      
PROBLEM DESCRIPTION:  

Memory components (DIMMs) for UltraSPARC III family of systems
utilizing NG-DIMM memory are being returned from the field as failed
based on Correctable Error (CE) reports.  However, upon failure
analysis, most of these memory parts show No Trouble Found (NTF).  The
intent of this FIN is to provide Sun Service Representatives an
overview of ECC and to give criteria for replacing DIMMs.  This FIN is
expected to reduce or eliminate unnecessary DIMM replacements.
 
For the UltraSPARC III family of systems utilizing NG-DIMM memory, it
has been reported that DIMMs suspected of failing because of ECC
errors are being replaced in the field unnecessarily.  This may be
caused in part by uncertainty among Field Engineers about what ECC
actually is, how the terms related to ECC are defined, and what
criteria determine when ECC errors are to be considered excessive.
        
Failure analysis of suspect DIMMs returned from the field has
determined that nearly 100% are NTF.  This places a substantial burden
on the resources of Engineering, Operations and Service.  Further, the
cost of procuring additional DIMMs in order to maintain Target
Stocking Levels (TSL) in the field is very high.
          
The following ECC overview should help in providing an understanding
of this issue:
     
  An Overview of ECC
			     
    Introduction:
    =============

    The recent launch of the UltraSPARC III family of systems utilizing
    NG-DIMM memory, coupled with recent changes that have been
    implemented and released in the Solaris 8 Operating Environment,
    has led to some confusion about reported ECC errors and whether
    these events are indications of hardware that needs replacement.
    This document is intended to give a brief overview of ECC to
    explain why these events occur and what action, if any, should be
    taken when they do. 

    The scope of this discussion is limited to soft errors that occur
    in memory and how they are reported by Solaris.  It does not
    account for hard errors or errors that occur while data travels
    through the system interconnect.  It also does not account for
    information made available to the service processor.  As such, the
    concepts discussed here can be applied to any system, not just
    UltraSPARC III family of systems utilizing NG-DIMM memory.  For
    this discussion, soft errors are transient errors in memory that
    can be corrected by rewriting the affected memory cell.  Hard
    errors occur when a cell is permanently damaged and cannot hold the
    correct information.  Sometimes the cell can be stuck at "0",
    sometimes it can be stuck at "1".

    ECC Concepts:
    =============

    Any non-persistent storage device, whether it be Dynamic Random
    Access Memory (DRAM) used for main memory or Static Random Access
    Memory (SRAM) used for caches, is subject to occasional data loss
    due to the impact of energetic alpha particles or cosmic rays.  This
    data loss manifests itself as a change in the value stored in the
    memory location affected by the collision.  Typically only a single
    bit is affected, but there is a small probability (<10%) that
    multiple cells are upset.

    When a bit flips due to this phenomenon, it is referred to as a soft 
    error.  This is to distinguish it from a hard error resulting from 
    faulty hardware.  These soft errors happen at a rate, called the soft 
    error rate, that can be predicted as a function of the memory density, 
    the memory technology, and the geographic location of the memory system.

    ECC was invented to facilitate the survival of these naturally 
    occurring losses of data.  The concept is that every word of data stored 
    in memory also has check information stored along with it.  This check 
    information serves two purposes.  First, when a word of data is read out 
    of memory, the check information can be used to detect if any of the 
    bits of the word have changed, and whether just a single bit has 
    changed or more than one bit has changed.  Second, in the event that a 
    single bit has changed, the check information can be used to determine 
    which bit in the word changed and therefore correct the word by 
    flipping this bit back to its complementary value.

    When an ECC check mechanism has detected that one or more bits in a
    word of data have changed, this is broadly categorized as an ECC error.
    These errors can be further categorized as a function of the number of 
    bits in error.  Because ECC can correct single bit flips, single bit 
    errors are referred to as Correctable Errors (CEs).  Multi-bit errors 
    are referred to as Uncorrectable Errors (UEs).
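
    The detect-and-correct behavior described above can be illustrated
    with a very small code.  The sketch below is only an illustration: it
    uses a textbook Hamming(7,4) code plus one overall parity bit on a
    4-bit word, which is an assumption made to keep the example short and
    is not the ECC layout used by any Sun memory subsystem.  The function
    names encode and decode are hypothetical.

```python
# Illustrative single-error-correct, double-error-detect code on a 4-bit
# word: Hamming(7,4) plus one overall parity bit.  Real memory systems
# protect much wider words; this layout is an assumption for illustration.

def encode(nibble):
    """Return 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d          # data bits
    code[1] = code[3] ^ code[5] ^ code[7]           # check bit for 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]           # check bit for 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]           # check bit for 4,5,6,7
    code[0] = sum(code[1:]) % 2                     # make total parity even
    return code

def decode(code):
    """Return (status, nibble); status is 'OK', 'CE', or 'UE'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                         # XOR of set positions
    overall_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "OK"
    elif overall_odd:
        # An odd number of flips is treated as a single-bit error.  The
        # syndrome names the flipped position (0 = the overall parity bit).
        fixed[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "UE", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

    Flipping one bit of an encoded word yields a CE with the original
    data recovered; flipping two bits yields a UE with no data returned.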

    Solaris Behavior:
    =================

    When a CE is detected, the device that read the word and detected the
    event can correct the data and continue unimpeded.  However, this
    does not address the fact that the referenced word is still resident
    in memory uncorrected, i.e., a subsequent read of this word would
    result in another CE event.  If this word in memory is never
    corrected, the chance grows over time that another bit will flip in
    the same word.  This would lead to a UE event.  To avoid this
    possibility, the detection of a CE causes a trap to Solaris.  The
    resultant error handling code scrubs the affected memory word by
    writing the corrected word back into memory.

    As part of handling the error, Solaris will proceed to log a fair 
    amount of diagnostic information.  One such event log, taken from a 
    Sun Fire 6800 running Solaris 8, looks like the following:

Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 796192 kern.notice]
NOTICE:[AFT0] Corrected system bus (CE) Event on CPU18 at TL=0, errID 0x0000c9b9.19d92690
Oct 25 09:06:25 wpc26  AFSR 0x00000002<CE>.00000097 AFAR 0x00000001.04bdf7d0
Oct 25 09:06:25 wpc26  Fault_PC 0x10024a74 Esynd 0x0097 /N0/SB5/P3/B0/D2 J16500
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 154767 kern.notice] [AFT0] errID
0x0000c9b9.19d92690 Corrected Memory Error on /N0/SB5/P3/B0/D2 J16500 is Persistent
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 682217 kern.notice] [AFT0] errID
0x0000c9b9.19d92690 Data Bit 3 was in error and corrected
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 422650 kern.info] [AFT2] errID
0x0000c9b9.19d92690 E$tag PA=0x00000000.00bdf7c0 does not match AFAR=0x00000001.04bdf7c0
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 904800 kern.info] [AFT2] errID
0x0000c9b9.19d92690 PA=0x00000000.00bdf7c0
Oct 25 09:06:25 wpc26  E$tag 0x00000000.01000001 E$state_7 Invalid
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x00)
0x5a8d0016.00000a20 0x20202020.37333231 ECC 0x128
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x10)
0x37333330.32062c00 0x5a8f000c.00000a20 ECC 0x1f6
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 895151 kern.info] [AFT2] E$Data (0x30)
0x20202020.37333330 0x34062c00.5a8f000d ECC 0x1fc
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 929717 kern.info] [AFT2] D$ data not available
Oct 25 09:06:25 wpc26 SUNW,UltraSPARC-III: [ID 335345 kern.info] [AFT2] I$ data not available

For the case of a CE, the lines tagged with AFT0 are the most important 
ones to examine.  The lines tagged with AFT2 provide data for detailed 
diagnostic evaluation, which is not expected to occur in the field. 
Points that need explanation are the following: 

    1. The event was detected by CPU18 (line 2).  All this really means is 
       that CPU18 is the processor that took the trap, thus invoking the 
       Solaris CE error handling code.
       
    2. The DIMM containing the affected memory location is /N0/SB5/P3/B0/D2, 
       which has a reference designator on the Sun Fire system board of 
       J16500 (lines 4 and 6).  This is the important information, not by 
       itself, but in conjunction with other events reported over time, as 
       will be described in the next section.
       
    3. Solaris describes this event as Persistent (line 6).  The Solaris 
       error handling code provides a disposition code as a result of the 
       scrub operation.  This disposition is one of Intermittent, 
       Persistent, or Sticky.  The definition of each of these codes is:
  
       o  Intermittent means the error was not detected on a reread of the 
          affected memory location.
         
       o  Persistent means the error was detected again on a reread of the 
          affected memory location but the scrub operation corrected it.
         
       o  Sticky means that the error still exists in memory even after the 
          scrub operation.  These events should be investigated further to 
          determine if some hardware replacement is necessary since this is 
          indicative of a hard failure.
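
The three dispositions can be pictured as the outcome of a reread,
a scrub, and a second reread.  The sketch below is a toy model under
stated assumptions, not the Solaris error handler: SimulatedWord and
classify are hypothetical names, and the order of operations (reread
first, scrub only if the error is still present) is assumed from the
definitions above.

```python
class SimulatedWord:
    """Hypothetical model of one memory word, for illustration only."""

    def __init__(self, good, stored, stuck_mask=0, stuck_value=0):
        self.good = good                  # value the word should hold
        self.stuck_mask = stuck_mask      # bits that cannot change (hard fault)
        self.stuck_value = stuck_value    # values those bits are stuck at
        self.stored = self._apply_stuck(stored)

    def _apply_stuck(self, value):
        # Stuck bits always read back their stuck value.
        return (value & ~self.stuck_mask) | (self.stuck_value & self.stuck_mask)

    def read_has_error(self):
        return self.stored != self.good

    def scrub(self):
        # Write the corrected value back; stuck bits ignore the write.
        self.stored = self._apply_stuck(self.good)


def classify(word):
    """Return 'Intermittent', 'Persistent', or 'Sticky' for a reported CE."""
    if not word.read_has_error():
        return "Intermittent"             # error not seen on reread
    word.scrub()
    if word.read_has_error():
        return "Sticky"                   # error survives the scrub
    return "Persistent"                   # seen on reread, fixed by scrub
```

A word whose stored value is already correct on reread classifies as
Intermittent, a flipped bit that the scrub repairs classifies as
Persistent, and a stuck bit classifies as Sticky.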

Servicing Memory Based on Soft Errors:
--------------------------------------

  As discussed earlier, soft errors are naturally occurring events.  As 
  such, a single report of a CE should not be the basis for servicing/
  replacing a memory device.  In fact, one should expect the number of CEs 
  reported by a system to correlate with the soft error rate that can be 
  predicted by the amount of memory in the system and the geographic 
  location of the system.  Rather than going through system specific 
  calculations to determine acceptable soft error rates, the guideline 
  that is recommended for servicing of memory in the presence of soft 
  errors is the following:

NOTE: Three or more CEs attributed to the same memory module within
      a 24 hour period are not acceptable.
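
As an illustration of applying this guideline to a list of CE events
that has already been extracted from the system log, the following
sketch flags any memory module with three or more events inside any
24 hour window.  The function name, the (timestamp, module) input
format, and the treatment of events exactly 24 hours apart as inside
the window are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def modules_over_guideline(events, threshold=3, window=timedelta(hours=24)):
    """Return the set of modules with `threshold` or more CEs in any `window`.

    `events` is an iterable of (datetime, module_name) pairs.
    """
    by_module = defaultdict(list)
    for when, module in events:
        by_module[module].append(when)

    flagged = set()
    for module, times in by_module.items():
        times.sort()
        # Slide over each run of `threshold` consecutive events and check
        # whether the first and last fall within one window.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                flagged.add(module)
                break
    return flagged


# Example with made-up timestamps; only J16500 has 3 CEs within 24 hours.
sample = [
    (datetime(2001, 10, 25, 9, 6), "J16500"),
    (datetime(2001, 10, 25, 15, 0), "J16500"),
    (datetime(2001, 10, 26, 8, 0), "J16500"),
    (datetime(2001, 10, 25, 9, 0), "J16600"),
    (datetime(2001, 10, 27, 9, 0), "J16600"),
    (datetime(2001, 10, 29, 9, 0), "J16600"),
]
print(sorted(modules_over_guideline(sample)))
```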

Effective with Solaris 9 KU1 and Solaris 8 KU16, Solaris notifies
administrators of excessive CE events.

In Solaris 9 KU1 and Solaris 8 KU16, two new tunables, ecc_softerr_limit
and ecc_softerr_interval, are introduced.  A per-DIMM count is kept of
the intermittent CEs that have occurred.  This count is decremented
every (ecc_softerr_interval/ecc_softerr_limit) minutes.  If the count ever
exceeds ecc_softerr_limit, the following message is printed out:

Mar 22 13:12:31 wpc26 unix: WARNING: [AFT0] 3 soft errors in less than
24:00 (hh:mm) detected from Memory Module U1004

The default values are 1440 minutes (24 hours) for ecc_softerr_interval
and 2 for ecc_softerr_limit.  These values can be changed by adding
entries in /etc/system.  For example:

set ecc_softerr_interval=2880
set ecc_softerr_limit=4

Note that ecc_softerr_interval is defined in minutes.
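
The counting scheme described above behaves like a leaky counter.
The sketch below models it for a single DIMM; it is not the Solaris
implementation.  The class name, the choice to floor the count at
zero, and the alignment of decrements to multiples of the decay
period from time zero are all assumptions.  Times are plain numbers
in the same unit as ecc_softerr_interval.

```python
class SoftErrorCounter:
    """Hypothetical model of the per-DIMM soft error count."""

    def __init__(self, limit=2, interval=1440):
        self.limit = limit
        # The count drops by one every (interval / limit) time units.
        self.decay_period = interval / limit
        self.count = 0
        self.last_decay = 0.0

    def record_ce(self, now):
        """Record one CE at time `now`; return True if a warning is due."""
        periods = int((now - self.last_decay) // self.decay_period)
        if periods > 0:
            # Apply every decrement that came due since the last update.
            self.count = max(0, self.count - periods)
            self.last_decay += periods * self.decay_period
        self.count += 1
        return self.count > self.limit


# With the default values, three CEs close together exceed the limit of 2.
counter = SoftErrorCounter()
print([counter.record_ce(t) for t in (0, 10, 20)])
```

In this model, CEs spaced further apart than the decay period never
accumulate, so widely separated soft errors do not trigger the warning.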


IMPLEMENTATION: 
 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        |   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---
         

CORRECTIVE ACTION: 
        
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the
situation described above.

For the three categories of ECC memory errors that Solaris reports 
(Intermittent, Persistent, Sticky), the following guidelines should be 
followed for replacement of DIMMs on Sun Fire Midframe and High-End 
servers.

  . Intermittent: Check for the reporting of parity errors, otherwise 
                  ignore.

  . Persistent: Replace DIMM if a message is output to the console 
                warning of excess errors.

  . Sticky: Replace DIMM on first occurrence.
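
The three guidelines above can be written as a small lookup, shown
here only as a reading aid.  The function name and the returned
strings are hypothetical, and the result for a Persistent error
without an excess-error warning ("no replacement indicated") is an
interpretation of the guideline rather than wording from this FIN.

```python
def recommended_action(disposition, excess_warning_logged=False):
    """Map a Solaris CE disposition to the guideline in this section."""
    disposition = disposition.strip().lower()
    if disposition == "sticky":
        # Sticky errors indicate a hard failure: replace on first occurrence.
        return "replace DIMM"
    if disposition == "persistent":
        # Replace only if the console warned of excess errors.
        if excess_warning_logged:
            return "replace DIMM"
        return "no replacement indicated"
    if disposition == "intermittent":
        return "check for reported parity errors, otherwise ignore"
    raise ValueError("unknown disposition: " + disposition)
```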
         

COMMENTS:  

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------