Document fins/I0755-1


FIN #: I0755-1

SYNOPSIS: Tuning the ecache_scan_rate parameter of the Solaris cache scrubber
          provides improved Ecache parity error protection on non-mirrored SRAM
          UltraSPARC II-based systems

DATE: Aug/09/02

KEYWORDS: Tuning the ecache_scan_rate parameter of the Solaris cache scrubber
          provides improved Ecache parity error protection on non-mirrored SRAM
          UltraSPARC II-based systems


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



Sun Alert:          No

TOP FIN/FCO REPORT: Yes 
 
PRODUCT_REFERENCE:  Solaris cache scrubber
 
PRODUCT CATEGORY:   Software / Solaris


PRODUCTS AFFECTED:  

Systems Affected
----------------  
Mkt_ID   Platform   Model   Description              Serial Number
------   --------   -----   -----------              -------------
  -      E3000       ALL    Ultra Enterprise 3000          -  
  -      E3500       ALL    Ultra Enterprise 3500          -  
  -      E4000       ALL    Ultra Enterprise 4000          -  
  -      E4500       ALL    Ultra Enterprise 4500          -  
  -      E5000       ALL    Ultra Enterprise 5000          -  
  -      E5500       ALL    Ultra Enterprise 5500          -  
  -      E6000       ALL    Ultra Enterprise 6000          -
  -      E6500       ALL    Ultra Enterprise 6500          -
  -      E450-HPC    ALL    Ultra Enterprise 450 HPC       -
  -      A25         ALL    Enterprise 450                 -
  -      A33         ALL    Enterprise 420R                -
  -      A26         ALL    Enterprise 250                 -
  -      A34         ALL    Enterprise 220R                -
  -      N14         ALL    Netra T-1405                   -
  -      N15         ALL    Netra T-1400                   -
  -      N07         ALL    Netra T1 100                   -
  -      N06         ALL    Netra T1 105                   -
  -      N04         ALL    Netra T-1125                   -
  -      N03         ALL    Netra T-1120                   -
  -      A27         ALL    Ultra 80                       -


X-Options Affected
------------------
Mkt_ID   Platform   Model   Description                             Serial Number
------   --------   -----   -----------                             -------------
X2248A      -         -     480MHz UltraSPARC II Module 8MB Cache         -
X2244A      -         -     400MHz UltraSPARC II Module 4MB Cache         -
X1994A      -         -     400MHz UltraSPARC II Module 2MB Cache         -
X2240A      -         -     300MHz UltraSPARC II Module 2MB Cache         -
X2230A      -         -     250MHz UltraSPARC II Module 1MB Cache         -
X1995A      -         -     450MHz UltraSPARC II Module 4MB Cache         -
X1997A      -         -     440MHz UltraSPARC II Module 4MB Cache         -
X2580A      -         -     400MHz UltraSPARC II Module 8MB Cache         -
X2570A      -         -     400MHz UltraSPARC II Module 4MB Cache         -
X1993A      -         -     400MHz UltraSPARC II Module 2MB Cache         -
X1992A      -         -     360MHz UltraSPARC II Module 4MB Cache         -
X2560A      -         -     336MHz UltraSPARC II Module 4MB Cache         -


PART NUMBERS AFFECTED:

Part Number            Description                              Model
-----------            -----------                              -----
501-5729-04 or lower   480 MHz UltraSPARC II Module 8MB Cache     -
501-5344-06 or lower   450 MHz UltraSPARC II Module 4MB Cache     -
501-5539-06 or lower   450 MHz UltraSPARC II Module 4MB Cache     -
501-5682-04 or lower   440 MHz UltraSPARC II Module 4MB Cache     -
501-5235-04 or lower   400 MHz UltraSPARC II Module 8MB Cache     -
501-4995-03 or lower   400 MHz UltraSPARC II Module 4MB Cache     -
501-5239-05 or lower   400 MHz UltraSPARC II Module 4MB Cache     -
501-5420-04 or lower   400 MHz UltraSPARC II Module 4MB Cache     -
501-5425-04 or lower   400 MHz UltraSPARC II Module 4MB Cache     -
501-5446-04 or lower   400 MHz UltraSPARC II Module 4MB Cache     -
501-5500-03 or lower   400 MHz UltraSPARC II Module 4MB Cache     -
501-5585-02 or lower   400 MHz UltraSPARC II Module 4MB Cache     -
501-5237-04 or lower   400 MHz UltraSPARC II Module 2MB Cache     -
501-5445-05 or lower   400 MHz UltraSPARC II Module 2MB Cache     -
501-5541-02 or lower   400 MHz UltraSPARC II Module 2MB Cache     -
501-5545-01            400 MHz UltraSPARC II Module 2MB Cache     -
501-5149-A1 or lower   440 MHz UltraSPARC IIi Module 2MB Cache    -
501-5740-01            400 MHz UltraSPARC IIi Module 2MB Cache    -
501-5741-01            400 MHz UltraSPARC IIi Module 2MB Cache    -
501-4178-04 or lower   250 MHz UltraSPARC II Module 1MB Cache     -
501-4363-08 or lower   336 MHz UltraSPARC II Module 4MB Cache     -


REFERENCES:

PatchId: 103640 or higher - Kernel Patch (Solaris 2.5.1)
         105181 or higher - Kernel Patch (Solaris 2.6)
         106541 or higher - Kernel Patch (Solaris 7)
         108528 or higher - Kernel Patch (Solaris 8)

FIN:     I0570-3
         I0616-1

URL:     http://bestpractices.central/
         http://cte-www.uk/cgi-bin/afsr/afsr.pl
         http://onestop.eng/ecache/scrubber_tuning.txt


PROBLEM DESCRIPTION:

UltraSPARC II modules with non-mirrored SRAM are susceptible to Ecache
parity errors.  Systems shipped with mixed-vendor IBM/Sony SRAM CPU
modules have a higher susceptibility to Ecache (E$) errors due to
higher particle emissions (a less favorable soft error rate, SER) in
the IBM SRAM components.

To reduce the likelihood of Ecache Data, Writeback and CopyOut Parity
errors, a "Cache Scrubber" has been implemented in the Solaris Kernel
that periodically flushes modified cache lines out to main memory and
invalidates cache lines that have not been modified.  By reducing the
likelihood that an otherwise nonfatal error in the Ecache will result
in a system failure, this procedure improves the system's reliability.
 
Solaris Kernel patches are available that provide improved handling and
reduction of Ecache errors in systems using UltraSPARC-II and -IIi
processors.  Ecache parity errors on non-mirrored SRAM UltraSPARC
II-based systems result in unplanned system downtime.  All customers
using Solaris 2.5.1, 2.6, 7 and 8 are encouraged to upgrade to the
latest kernel patches as they become available.
 
The risk of Ecache parity errors can be further reduced by tuning the
ecache_scan_rate parameter of the Cache Scrubber.  It is recommended
that ecache_scan_rate be raised on affected systems, but not above
1000.  An ecache_scan_rate of 1000 causes the entire cache to be
scrubbed once per second.  Higher values have shown little to no
additional benefit in mitigating Ecache errors.
 
The default setting for ecache_scan_rate is 100.  Setting this
parameter to 1000 has been demonstrated to provide additional
mitigation against Ecache errors.  Because the measure works mainly by
shortening the time that meaningful data resides in the cache, values
above 100 but below 1000 may also provide effective mitigation against
Ecache errors.
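
As a rough illustration of the residency argument, the sketch below
converts a scan rate into the approximate time for one full pass over
the Ecache.  It assumes, beyond the one data point stated in this FIN
(1000 gives one full scrub per second), that the pass time scales
linearly with the setting.  scrub_period is a hypothetical helper name,
not a Solaris command.

```shell
#!/bin/sh
# Hypothetical helper: approximate seconds per full Ecache scrub pass.
# ASSUMPTION: pass time scales linearly with ecache_scan_rate, anchored on
# the documented point that 1000 scrubs the whole cache once per second.
# Requires a POSIX awk (nawk on Solaris) for the -v option.
scrub_period() {
    awk -v rate="$1" 'BEGIN {
        # Reject anything that is not a positive decimal integer.
        if (rate !~ /^[0-9]+$/ || rate + 0 == 0) exit 1
        printf "%.1f\n", 1000 / rate
    }'
}

for rate in 100 250 500 1000; do
    echo "ecache_scan_rate=$rate -> about $(scrub_period "$rate") s per pass"
done
```

Under that assumption, intermediate settings trade a longer pass time
for a lower CPU cost, which is the reasoning behind trying a value
between 100 and 1000 on heavily loaded systems.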

Identifying Candidate Systems
-----------------------------
 
The following criteria should be used to identify which systems
will benefit most from modifying the Cache Scrubber ecache_scan_rate:

  1) System is UltraSPARC-II and does not have mirrored SRAM
     ("Sombras").
 
  2) System has had 1 or more Ecache errors or similarly configured
     systems in the same install base have experienced Ecache errors.
 
  3) Business impact of unplanned downtime on the system is significant.
 
  4) System resides in an environment that has a history of temperature
     and humidity control problems or in a region with typically dry
     winters.
 
Scrubber Tuning Performance Impact
----------------------------------
   
  Increasing ecache_scan_rate does require additional CPU resources,
  though testing has demonstrated that the typical CPU utilization
  penalty of setting ecache_scan_rate to 1000 on a 400MHz+ workgroup or
  Telco server is less than 1%.  If a system appears to be a good
  candidate for scrubber tuning, and is known or believed to have
  periods of 90%+ CPU utilization, test the setting on a test system
  approximating the production environment and load to identify any
  performance impact of the scrubber setting change.
 
  The following command can be used to observe CPU utilization for 24
  hours.  The resulting file can then be graphed using a spreadsheet or
  other graphing environment.  This example samples system CPU idle time
  every 10 seconds.  The interval and count can be modified to take
  more frequent samples or to observe a longer total period.

    # vmstat [ interval ] [ count ] | ...

    # vmstat 10 8640 | awk '{print $22}' | grep -v id | \
        grep -v '^$' > /path/loadtest.csv &

  If utilization on the target system is very high, it may be appropriate
  to set ecache_scan_rate to a value greater than the default 100 but
  less than 1000.
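
  As an alternative to graphing, the captured idle samples can be
  summarized directly.  The following is a minimal sketch, assuming the
  file holds one integer idle percentage per line as written by the
  vmstat pipeline above.  summarize_idle is a hypothetical helper name,
  and the 10% idle threshold is simply the complement of the 90%
  utilization figure mentioned in this section.

```shell
#!/bin/sh
# Hypothetical helper: summarize a file of CPU idle percentages, one per line.
# Prints the sample count, mean idle, minimum idle, and the number of samples
# at or below 10% idle (that is, 90%+ CPU utilization).
summarize_idle() {
    awk '
        /^[0-9]+$/ {
            v = $1 + 0                      # force numeric comparison
            n++; sum += v
            if (n == 1 || v < min) min = v
            if (v <= 10) busy++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d mean_idle=%.1f min_idle=%d busy_samples=%d\n",
                   n, sum / n, min, busy
        }
    ' "$1"
}

# Example (the path is a placeholder):
#   summarize_idle /path/loadtest.csv
```

  A non-zero busy_samples count suggests the system does reach the
  utilization range where testing the new setting beforehand is
  advisable.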

The risk of Ecache parity errors is diminished by tuning the
ecache_scan_rate parameter of the cache scrubber.  The only other fix
for Ecache parity errors is mirrored SRAM, which is not available for
UltraSPARC II-based midrange and Telco platforms.
  

IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        | X |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.
 
Using the criteria given above to identify candidate systems, adjust
the kernel Cache Scrubber ecache_scan_rate as needed.  Note that
although the procedure to adjust ecache_scan_rate is non-intrusive and
does not require a reboot, it is recommended that it be done during a
scheduled maintenance window.
 
To adjust ecache_scan_rate:

  1. As root, run the following command to adjust ecache_scan_rate.

     # echo 'ecache_scan_rate/W 0t1000' | adb -kw
 
     NOTE: This does not require downtime.  Be very careful, though,
           as mis-typing the command could result in downtime.
 
  2. To make the change permanent, add the parameter setting to
     /etc/system.  It is best to insert all 3 parameters together into
     /etc/system if the settings are not already there:

       set ecache_scrub_enable=1
       set ecache_scan_rate=1000
       set ecache_calls_a_sec=100

     NOTE: ecache_scrub_enable is set to 1 by default.
 
     NOTE: If the settings already exist in /etc/system, simply modify
           "ecache_scan_rate=100" to "ecache_scan_rate=1000".

NOTE: The ecache_scan_rate value should be 1000.  A lower value, though
      potentially beneficial in theory, has not been demonstrated to be
      beneficial, whereas 1000 has.  In the unlikely event that a
      negative performance impact is observed, the value can then be
      set back to some lower value.

To check a system's current setting use the following command.
This does not modify the setting in any way:
 
  # echo 'ecache_scan_rate/D' | adb -k
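
The adb check reports the value in the running kernel, while the value
applied at the next boot comes from /etc/system.  The following sketch
reads the latter so the two can be compared by hand.
configured_scan_rate is a hypothetical helper name; where the setting
appears more than once, this sketch simply reports the last active
line it finds.

```shell
#!/bin/sh
# Hypothetical helper: print the ecache_scan_rate value configured in an
# /etc/system-style file, or "unset" if no active "set" line is present.
# Lines starting with "*" are comments in /etc/system and do not match.
configured_scan_rate() {
    awk '
        /^set[ \t]+ecache_scan_rate[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")   # drop everything up to the "="
            sub(/[ \t].*$/, "")        # drop trailing whitespace or text
            value = $0
        }
        END { print (value == "" ? "unset" : value) }
    ' "$1"
}

# Example:
#   configured_scan_rate /etc/system
```

If the configured value differs from the one adb reports, the live
setting will not survive a reboot.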

Additional reference:
 
  http://onestop.eng/ecache/scrubber_tuning.txt
               

COMMENTS:

None  

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
  
* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.
   
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------



Copyright (c) 1997-2003 Sun Microsystems, Inc.