Document fins/I0887-1

FIN #: I0887-1

SYNOPSIS: Guidelines for understanding and diagnosing UltraSPARC III Level 2
          (L2) SRAM Cache Memory Errors

DATE: Oct/02/02

KEYWORDS: Guidelines for understanding and diagnosing UltraSPARC III Level 2
          (L2) SRAM Cache Memory Errors


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)




SunAlert:           No

TOP FIN/FCO REPORT: No 
  
PRODUCT_REFERENCE:  UltraSPARC III Level 2 SRAM
 
PRODUCT CATEGORY:   Server / Service


PRODUCTS AFFECTED:

Systems Affected:
-----------------  
Mkt_ID   Platform     Model        Description           Serial Number
------   --------     -----        -----------           -------------
  -        A28         ALL         Sun Blade 1000              - 
  -        A29         ALL         Sun Blade 2000              -
  -        A35         ALL         Sun Fire 280R               -
  -        A37         ALL         Sun Fire V480               -
  -        A30         ALL         Sun Fire V880               -
  -        S8          ALL         Sun Fire 3800               -
  -        S12         ALL         Sun Fire 4800               -
  -        S12i        ALL         Sun Fire 4810               -
  -        S24         ALL         Sun Fire 6800               -
  -        F12K        ALL         Sun Fire 12K                -
  -        F15K        ALL         Sun Fire 15K                -
  -        N28         ALL         Netra 20                    -


X-Options Affected:
-------------------
Mkt_ID          Platform   Model   Description                  Serial Number
------          --------   -----   -----------                  -------------
X4007A             -         -     ASSY CPU-4PROC USIIIP 900MHz       -
X4525A             -         -     MAXCPU 900MHz CNFIG F15K           -
X4004A             -         -     CPU-2PROC USIII  750MHz            -
X4005A             -         -     CPU-4PROC USIII  900MHz            -
X4006A             -         -     CPU-2PROC USIIIP 900MHz            -
X4046A             -         -     CPU DUAL 750MHz AL A30             -
X4047A             -         -     CPU DUAL 750MHz AL A30             -
XCPUBD-4049        -         -     CPU-4GB/4PROC USIII 900+M          -
XCPUBD-F4089       -         -     CPU-8GB/4PROC USIII 900+M          -
XCPUBD-F4169       -         -     CPU-16GB/4PROC USIII 900+M         -
XCPUBD-F4329       -         -     CPU-32GB/4PROC USIII 900+M         -
XCPUBD-2029        -         -     CPU-2GB/2PROC USIII 900+M          -
XCPUBD-2049        -         -     CPU-4GB/2PROC USIII 900+M          -
XCPUBD-2089        -         -     CPU-8GB/2PROC USIII 900+M          -
SF-XCPUBD-227      -         -     CPU-2GB/2PROC USIII 750MHz         -
SF-XCPUBD-447      -         -     CPU-4GB/4PROC USIII 750MHz         -
SF-XCPUBD-487      -         -     CPU-8GB/4PROC 512MB USIII          -


PART NUMBERS AFFECTED:

Part Number             Description                           Model
-----------             -----------                           -----
540-5052-02 or below    ASSY CPU-4PROC USIIIP 900+ MHz          -
540-4729-04 or below    ASSY CPU-2PROC USIII 750MHz             -
540-4730-04 or below    ASSY CPU-4PROC USIII 750MHz             -
540-5051-02 or below    ASSY CPU-2PROC USIIIP 900+ MHz          -
501-5818-06 or below    ASSY CPU DUAL 750MHz AL A30             -
540-4934-03 or below    ASSY CPU-4GB/4PROC USIII 900+ MHz       -
540-4992-02 or below    ASSY CPU-8GB/4PROC USIII 900+ MHz       -
540-4990-03 or below    ASSY CPU-16GB/4PROC USIII 900+ MHz      -
540-4993-02 or below    ASSY CPU-32GB/4PROC USIII 900+ MHz      -
540-4984-02 or below    ASSY CPU-2GB/2PROC USIII 900+ MHz       -


REFERENCES:

BugId:      4484338 - Need to improve handling of correctable errors. 
            4741848 - Invalid AFSR message is only HW diagnostic on VSP 
                      US-III platforms.

PatchId:    108528: SunOS 5.8: kernel update patch. 
            112233: SunOS 5.9: Kernel Patch.

FIN:        I0856-1: UltraSPARC III and III+ based platforms could be 
                     susceptible to UCC errors that may cause system 
                     panics.

SUN Alert:  45527

URL:        http://sram.eng/MTG/Quality/24hr_estimate_3.pdf
            http://onestop/ecache/index.shtml?menu


PROBLEM DESCRIPTION:

This FIN is to communicate and provide vital information to Sun
employees, especially from Sun Support Services, and Authorized Sun
Service Provider employees regarding two major points: 

1. Using the correct terminology associated with UltraSPARC III  
   Level 2 (L2) SRAM memory errors. 

2. The number of such correctable errors that can occur in either 
   Sun Fire 4MB or 8MB Level 2 (L2) cache modules before there is
   cause for concern.

The underlying assumption is that such memory errors may have been
incorrectly identified as "E-cache" errors or, upon detection, have
caused unnecessary replacement of the entire main system board
module, since the Level 2/SRAM cache module is not a field replaceable
unit.  Replacing a perfectly good main system board module is likely
to create new problems, causing additional downtime for the customer
and extra hardware costs to Sun.

This issue can occur in the following releases:

UltraSPARC-III / UltraSPARC-III+ family:

     Solaris 8
     Solaris 9

Detection:
----------
  Correctable Level 2/SRAM cache errors are indicated by the following
  AFT0 error events: UCC Event, EDC Event, CPC Event, WDC Event.  By
  default, starting in S8 KU-16 and S9U1, these events are logged by all
  platforms to /var/adm/messages.

  Uncorrectable Level 2/SRAM cache errors are indicated by the following
  AFT1 error events: UCU Event, EDU:BLD Event, EDU:ST Event, CPU Event,
  WDU Event. By default these events are logged to the console and
  /var/adm/messages on all platforms for all releases of S8 and S9.

  For a description of these events, see below or Infodoc 43642, which
  describes in detail all Level 1 cache (L1$), Level 2/SRAM cache (L2$) and
  Memory errors of the UltraSPARC-III family processors.

  For Sun Blade 1000, Sun Blade 2000, Sun Fire 280R, Sun Fire V480 and
  Sun Fire V880 platforms prior to S8 KU-16 and S9U1, correctable Level
  2/SRAM cache messages were not logged to either the console or
  /var/adm/messages.  The only symptoms of a failing Level 2/SRAM cache
  module will typically be: "Invalid AFSR" messages, "UCC Event at
  TL>0" messages and/or "ptl1 panic" events.  However, these messages
  may also indicate a failing memory DIMM, a failing CPU or a failing
  system board.  To diagnose errors on these platforms, it is
  recommended to either upgrade to S8 KU-16 or S9U1 or to set the
  configuration variable "ce_verbose" to 1 in /etc/system.


SRAM Level 2 Cache Memory Terminology Explained:
------------------------------------------------
  . The UltraSPARC III (USIII) processor has cache memory on the chip;
    64KB 4-way associative for data and 32KB 4-way associative for
    instruction, along with 2KB write buffer and 2KB prefetch.  This
    memory is called Level 1 (L1) as it is the first level that the
    processor uses.

  . Level 2 (L2) cache on the USIII processor refers to off-chip cache;
    it is therefore not on the processor itself.

  . The USIII architecture has both Error Detection and Error Correction
    Code (ECC).

  . An ECC event in USIII is reported simply as an ECC event in which a
    certain bit was corrected.  This can occur in either the cache or
    the main DRAM memory.  Details in the error message allow for
    determining whether the event occurred in Level 2/SRAM cache or in
    main memory.

  . There is no Level 3 cache for the USIII processor line.  One can
    think of the memory as hierarchical, with the processor having
    different levels that it accesses.  The first level is closest to the
    processor, and can be accessed the fastest.  The second level is
    farther away, and takes more time.  In Sun's case, there is no 3rd
    level; the next step is accessing main memory, which takes many
    cycles.


Types of L2/SRAM Cache Errors:
------------------------------
  Level 2/SRAM cache errors are either single-bit, which are
  correctable, or multi-bit, which are uncorrectable.  Level 2/SRAM
  cache errors are reported depending upon how the processor detects
  the error.  The USIII architecture uses an extremely robust error
  correcting code called SEC-DED (single error correct - double error
  detect) that minimizes the impact of Level 2/SRAM cache and memory
  errors.  A single bit error in a 144 bit checkword is corrected and a
  double bit error is detected and trapped in the processor.  An entire
  576 bit word consists of 4 checkwords, so up to 4 independent bit
  errors (one per checkword) are handled by the SEC-DED code without
  any impact to data integrity.
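
  The checkword arithmetic above can be sketched as follows.  This is
  an illustration only: the contiguous mapping of bit positions to
  checkwords and the function name classify_word are assumptions made
  for the example, not a description of the actual hardware layout.

```python
from collections import Counter

CHECKWORD_BITS = 144        # bits covered by one SEC-DED checkword
CHECKWORDS_PER_WORD = 4     # 4 x 144 = 576 bits per word
WORD_BITS = CHECKWORD_BITS * CHECKWORDS_PER_WORD


def classify_word(flipped_bits):
    """Return the outcome for each checkword given flipped bit positions.

    Assumption: bits 0-143 belong to checkword 0, 144-287 to checkword 1,
    and so on. The real interleaving is not described in this document.
    """
    for bit in flipped_bits:
        if not 0 <= bit < WORD_BITS:
            raise ValueError(f"bit {bit} is outside the {WORD_BITS}-bit word")
    # Count distinct flipped bits that land in each checkword.
    per_checkword = Counter(bit // CHECKWORD_BITS for bit in set(flipped_bits))
    outcomes = []
    for index in range(CHECKWORDS_PER_WORD):
        errors = per_checkword.get(index, 0)
        if errors == 0:
            outcomes.append("clean")
        elif errors == 1:
            outcomes.append("corrected")   # single error correct
        elif errors == 2:
            outcomes.append("detected")    # double error detect, trapped
        else:
            outcomes.append("beyond SEC-DED guarantee")
    return outcomes
```

  Under this assumed mapping, four flipped bits that fall in four
  different checkwords are all corrected, while two flipped bits in the
  same checkword are only detected.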

  The errors reported in the Asynchronous Fault Status Registers
  (AFSR) are:

  1. UCC -  SW correctable Level 2/SRAM cache ECC error for instruction
            fetch or data access other than block load.  Some paths
            accessing Level 2/SRAM cache do not have sufficient time for
            the ECC algorithm to present corrected data to the processor.
            In these instances, software must intervene and flush Level
            2/SRAM cache and perhaps D$ to ensure correct operation.

  2. UCU -  Uncorrectable Level 2/SRAM cache error for instruction fetch 
            or data access other than block load.

  3. EDC -  HW corrected Level 2/SRAM cache ECC error for store merge or 
            block load.

  4. EDU -  Uncorrectable Level 2/SRAM cache ECC error for store merge or 
            block load.  In most cases, software can differentiate between 
            an error from a store merge, which it indicates with EDU:ST,
            and an error from a block load, which it indicates with
            EDU:BLD.

  5. WDC -  HW corrected Level 2/SRAM cache ECC error for writeback 
            (victimization).

  6. WDU -  Uncorrectable Level 2/SRAM cache ECC error for writeback 
            (victimization).

  7. CPC -  HW corrected Level 2/SRAM cache ECC error for copyout 
            (snoop request).

  8. CPU -  Uncorrectable Level 2/SRAM cache ECC error for copyout 
            (snoop request).
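
  The eight mnemonics above follow a regular pattern: a two-letter
  access path followed by C (correctable) or U (uncorrectable).  A
  minimal lookup sketch is shown below.  The function name describe and
  the assumption that a logged event names itself as "<MNEMONIC> Event"
  (optionally with an :ST or :BLD qualifier) are illustrative; the full
  message layout is not reproduced here.

```python
import re

# Access paths transcribed from the AFSR list above.
_PATHS = {
    "UC": "instruction fetch or data access other than block load",
    "ED": "store merge or block load",
    "WD": "writeback (victimization)",
    "CP": "copyout (snoop request)",
}

# Mnemonic -> (correctable?, access path). A trailing "C" marks the
# correctable variant, a trailing "U" the uncorrectable one.
AFSR_L2_EVENTS = {
    prefix + suffix: (suffix == "C", path)
    for prefix, path in _PATHS.items()
    for suffix in ("C", "U")
}

# Assumption: the event appears in text as e.g. "UCC Event" or "EDU:ST Event".
EVENT_RE = re.compile(
    r"\b(UCC|UCU|EDC|EDU|WDC|WDU|CPC|CPU)(?::(ST|BLD))? Event\b"
)


def describe(line):
    """Return details of the first L2/SRAM event named in line, or None."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    mnemonic, qualifier = match.group(1), match.group(2)
    correctable, path = AFSR_L2_EVENTS[mnemonic]
    return {
        "mnemonic": mnemonic,
        "qualifier": qualifier,      # "ST", "BLD" or None
        "correctable": correctable,
        "path": path,
    }
```

  For example, describe("UCC Event") reports a correctable error on the
  instruction fetch or data access path.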


Causes of SRAM Errors
=====================

For an extensive discussion on the causes of Level 2/SRAM cache memory
errors, please refer to the white paper: "Estimate of Threshold for
ECC in Serengeti L2 SRAM Modules", C. Slayman, 08/20/02.

This document can be found at: 

    http://sram.eng/MTG/Quality/24hr_estimate_3.pdf


IMPLEMENTATION:

           ---
          |   |   MANDATORY (Fully Proactive)
           ---


           ---
          |   |   CONTROLLED PROACTIVE (per Sun Geo Plan)
           ---


           ---
          | X |   REACTIVE (As Required)
           ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above
mentioned problem.

It is recommended that all UltraSPARC-III / UltraSPARC-III+ systems
be upgraded to these kernel patch levels:

	Solaris 8 108528 or later
	Solaris 9 112233 or later

The following information should be used when determining what action 
to take when Level 2/SRAM Cache Memory Errors occur.

RECOMMENDATIONS:
================

1. Don't replace a module that shows only one or two XXC errors
   in a 24hr period:
   It is difficult to determine if two errors on the same Level 2/SRAM
   cache module are due to a multi-cell cosmic ray event or the onset
   of a different failure mechanism. Therefore, replacement of such a
   module runs the risk of creating more downtime at the customer site
   by swapping out a motherboard with perfectly good components.
   Unless a diagnostic program can be run to determine that the two
   errors ARE NOT from nearest neighbor cells, it is recommended that
   the module not be replaced. (Conversely, if it can be verified that
   the two errors are indeed independent or come from different SRAMs
   on the same module, this cannot be explained by cosmic ray
   phenomena and the reliability of the Level 2/SRAM cache module
   should be viewed as suspect).


2. Take a closer look at three XXC errors in a 24hr period and
   replace if necessary:
   Three errors on the same Level 2/SRAM cache module within a 24hr
   period are highly unlikely to come from cosmic ray events: ~10 to
   100 times less frequent than a double-cell event and ~100 to 10,000
   times less frequent than a single-cell upset. Therefore, the risk of
   replacing a perfectly good module using this criterion is much
   lower. Please note that only XXC errors that have unique error ids
   should be considered separate errors. For example: A UCC/WDC with
   the same error id should be considered as one error and not two.


3. Take a very close look at all XXU errors:
   Cosmic ray events can only corrupt single bits in each checkword.
   This is handled by SEC-DED code and should not lead to any system
   crash. An XXU error therefore suggests a cause other than a single
   cosmic ray upset. Not every XXU will lead to a system crash, since
   the system is able to recover from many of them. Nonetheless, every
   XXU is serious and should be attended to promptly. Until we have
   software that does this automatically, it would be advisable to scan
   /var/adm/messages periodically (weekly, daily or even every few
   hours depending on the severity of the problem) to see if any
   recent XXU events (UCU, WDU, EDU or CPU) have occurred. One
   possible command to do this would be:

     egrep AFT /var/adm/messages /var/adm/messages.0 | egrep "U Event"

   If any are found, then the full context of the message should be
   examined (not just the lines printed by the command above, but
   the complete set of associated messages in /var/adm/messages) to
   see what board or module should be replaced.

4. Watch for patterns in the errors:
   Cosmic ray SER (soft error rate) is random in space and time, so
   all Level 2/SRAM cache modules are equally likely to be hit. If a
   particular module, motherboard or bit shows a higher error rate, or
   the errors appear to occur at a particular time of day, then the
   events are not cosmic ray induced (unless the time is correlated
   with scrubber operation, which exercises 100% of memory).
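
The counting rule in recommendations 1 and 2 can be sketched as below.
This is a minimal illustration, not a supported tool: the (timestamp,
module, error id) tuple layout, the function name recommend and the
"monitor" / "investigate" labels are all assumptions made for the
example, and extracting those fields from /var/adm/messages is left
out.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def recommend(events, window=WINDOW):
    """Suggest an action per module from correctable (XXC) events.

    events: iterable of (timestamp, module, error_id) tuples. This layout
    is hypothetical; it is not a format produced by Solaris.
    Returns {module: "monitor" or "investigate"}.
    """
    # Records sharing an error id count as one error (e.g. UCC plus WDC).
    first_seen = {}
    for when, module, error_id in events:
        key = (module, error_id)
        if key not in first_seen or when < first_seen[key]:
            first_seen[key] = when

    by_module = {}
    for (module, _error_id), when in first_seen.items():
        by_module.setdefault(module, []).append(when)

    result = {}
    for module, times in by_module.items():
        times.sort()
        worst = 0
        start = 0
        # Sliding window: largest number of distinct errors less than
        # `window` apart from the oldest error still in the window.
        for end, when in enumerate(times):
            while when - times[start] >= window:
                start += 1
            worst = max(worst, end - start + 1)
        # One or two errors: do not replace. Three or more: look closer.
        result[module] = "investigate" if worst >= 3 else "monitor"
    return result
```

A result of "investigate" only means the module deserves a closer
look as described in recommendation 2; it does not by itself justify
replacement.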


COMMENTS:

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------