Document fins/I0798-1


FIN #: I0798-1

SYNOPSIS: When an ECC error occurs on Sun Blade 100, Solaris incorrectly
          identifies the faulty DIMM

DATE: Apr/03/02

KEYWORDS: When an ECC error occurs on Sun Blade 100, Solaris incorrectly
          identifies the faulty DIMM


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)

                                    

SYNOPSIS: When an ECC error occurs on Sun Blade 100, Solaris incorrectly
          identifies the faulty DIMM.
      

Sun Alert:          No             

TOP FIN/FCO REPORT: Yes 
 
PRODUCT_REFERENCE:  DIMM on Sun Blade 100 
 
PRODUCT CATEGORY:   Desktop / SW Admin


PRODUCTS AFFECTED:
  
Systems affected:
-------------------  
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   -------------
           A36       ALL    Sun Blade 100       -
           

X-Options affected:
-------------------
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   -------------
  -          -        -          -             -


PART NUMBERS AFFECTED: 

Part Number   Description   Model
-----------   -----------   -----
    -             -           -


REFERENCES:

BugId:   4624001 - grover DIMM reporting off by one.

PatchId: 111179 - Hardware/PROM: Blade 100 Flash PROM Update.
 
ESC:     534363 - It appears the OBP is incorrectly converting 
                  an AFAR to the wrong UNUM when.

DOC:     806-3416-10: Sun Blade 100 Service Manual.

     
PROBLEM DESCRIPTION:  
   
When an ECC memory error occurs on a Sun Blade 100 system, Solaris logs
diagnostic information that includes the suspect memory module.  However,
the wrong DIMM can be reported as faulty.  This can lead to unnecessary
outages and additional service calls if the wrong DIMM is replaced.

In the first example below, Solaris reports physical address 0x3e52e030
as located within DIMM2, whereas it is actually located within DIMM1
according to the Sun Blade 100 Service Manual, 806-3416-10.

Here are two cases of error messages:

  Case1:

    AFSR 0x00000001<ME>.80300000<PRIV,UE,CE> AFAR 0x00000000.3e52e030
    AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1009421c
    UDBH 0x0362<UE,CE> UDBH.ESYND 0x62 UDBL 0x0000 UDBL.ESYND 0x00
    UDBH Syndrome 0x62 Memory Module DIMM2
                     ^^^^^^^^^^^^^^^^^^^

    From the manual 806-3416-10:

    DIMM#  UNUM   Dimm Starting Address
    -----------------------------------
    DIMM0   U2      0x00000000      
    DIMM1   U3      0x20000000
    DIMM2   U4      0x40000000
    DIMM3   U5      0x60000000

    AFAR 0x3e52e030 lies between 0x20000000 and 0x40000000, so it belongs
    to DIMM1.  Yet it was reported as DIMM2.
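
    The lookup against the table above can be sketched as follows.  This
    is an illustrative sketch only, not part of the OBP or Solaris code.
    The names DIMM_START, SLOT_SPAN and dimm_for_afar are hypothetical, and
    the 0x20000000 span per slot (including the upper bound for DIMM3) is
    an assumption inferred from the spacing of the starting addresses.

```python
# Starting addresses copied from the table in manual 806-3416-10.
DIMM_START = [
    ("DIMM0", "U2", 0x00000000),
    ("DIMM1", "U3", 0x20000000),
    ("DIMM2", "U4", 0x40000000),
    ("DIMM3", "U5", 0x60000000),
]

# Assumption: every slot covers the same span as the gap between
# consecutive starting addresses in the table.
SLOT_SPAN = 0x20000000


def dimm_for_afar(afar):
    """Return (dimm, unum) for a physical address, or None if out of range."""
    for name, unum, start in DIMM_START:
        if start <= afar < start + SLOT_SPAN:
            return name, unum
    return None


if __name__ == "__main__":
    print(dimm_for_afar(0x3E52E030))  # ('DIMM1', 'U3')
    print(dimm_for_afar(0x1489BDB0))  # ('DIMM0', 'U2')
```

    Both addresses are taken from the error messages in this notice.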

  Case2:

    WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL>0,
             errID 0x0000d11d.7f890248
          AFSR 0x00000000.80300000<PRIV,UE,CE> AFAR 0x00000000.1489bdb0
          AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x10023c08
          UDBH 0x03c2<UE,CE> UDBH.ESYND 0xc2 UDBL 0x0000 UDBL.ESYND 0x00
          UDBH Syndrome 0xc2 Memory Module DIMM1
        [AFT2] errID 0x0000d11d.7f890248 E$tag != PA from AFAR; E$line 
        was victimized dumping memory from PA 0x00000000.1489bd80 instead
        [AFT2] E$Data (0x00): 0x00000000.00000000
        [AFT2] E$Data (0x08): 0x00000000.00000000
        [AFT2] E$Data (0x10): 0x00000000.00000000
        [AFT2] E$Data (0x18): 0x00000000.00000000
        [AFT2] E$Data (0x20): 0x00000000.00000000
        [AFT2] E$Data (0x28): 0x00000000.00000000
        [AFT2] E$Data (0x30): 0x00000000.00008800
        [AFT2] E$Data (0x38): 0x00000000.00000000

    panic[cpu0]/thread=2a100017d40: [AFT1] errID 0x0000d11d.7f890248 UE Error(s)
  
In the example above, AFAR 1489bdb0 indicates DIMM0, but the error
message reports "Memory Module DIMM1".

On the Sun Blade 100, the OBP incorrectly converts an AFAR to the wrong
UNUM.  It generates UNUM strings of DIMM1-4 instead of DIMM0-3, so every
reported DIMM number is one too high.
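
To spot the off-by-one in a captured message, the AFAR and the reported
module can be pulled out of the log text and compared.  The sketch below
is a hypothetical helper, not a Sun tool.  The regular expressions are
modeled only on the two messages quoted in this notice, and the
0x20000000 slot span is an assumption inferred from the Service Manual
starting addresses.

```python
import re

# Assumption: each DIMM slot spans 0x20000000 bytes of physical address.
SLOT_SPAN = 0x20000000

AFAR_RE = re.compile(r"AFAR\s+0x([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})")
MODULE_RE = re.compile(r"Memory Module DIMM(\d+)")


def cross_check(log_text):
    """Return (expected_dimm, reported_dimm), or None if fields are missing."""
    afar = AFAR_RE.search(log_text)
    module = MODULE_RE.search(log_text)
    if afar is None or module is None:
        return None
    # The AFAR is printed as two 32-bit halves separated by a dot.
    address = (int(afar.group(1), 16) << 32) | int(afar.group(2), 16)
    return address // SLOT_SPAN, int(module.group(1))


if __name__ == "__main__":
    sample = (
        "AFSR 0x00000000.80300000<PRIV,UE,CE> AFAR 0x00000000.1489bdb0\n"
        "UDBH Syndrome 0xc2 Memory Module DIMM1\n"
    )
    expected, reported = cross_check(sample)
    print("expected DIMM%d, reported DIMM%d" % (expected, reported))
    # prints: expected DIMM0, reported DIMM1
```

A result where the reported number is exactly one higher than the
expected number matches the behavior described in this notice.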

This Sun Blade 100 DIMM reporting problem has been fixed in the latest
release of the OBP firmware.  Changes were made to
obp/arch/sun4u/grover/memprobe.fth.  Please follow the recommendations
provided in the Corrective Action below.


IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        | X |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the
above-mentioned problem.

Please follow one of the guidelines below:

  Guideline 1:
  ------------

  For Sun Blade 100 platforms with OBP 4.5.0 and below, remap the
  reported DIMM#: subtract one (1) from the reported number.
  
  Example: If Solaris reports DIMM2, DIMM1 is the defective DIMM. 
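
  The remapping rule can be written down as a small sketch.  The function
  names are hypothetical, and the OBP version is assumed to be a dotted
  numeric string such as "4.5.0".  This notice only covers OBP 4.5.0 and
  below (affected) and OBP 4.5.9 (fixed), so the sketch deliberately
  refuses to guess for versions in between.

```python
def parse_obp_version(version):
    """Turn a dotted version string such as '4.5.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def actual_dimm(reported, obp_version):
    """Map the DIMM number reported by Solaris to the actual DIMM number."""
    version = parse_obp_version(obp_version)
    if version <= (4, 5, 0):
        # Affected firmware: the reported number is one too high.
        if reported < 1:
            raise ValueError("reported DIMM number must be at least 1")
        return reported - 1
    if version >= (4, 5, 9):
        # Fixed firmware: the reported number is already correct.
        return reported
    # Versions between 4.5.0 and 4.5.9 are not described in this notice.
    raise ValueError("OBP %s is not covered by this notice" % obp_version)


if __name__ == "__main__":
    print(actual_dimm(2, "4.5.0"))  # 1
    print(actual_dimm(2, "4.5.9"))  # 2
```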

OR:

  Guideline 2:
  ------------

  This problem has been fixed with the new "Sun Blade 100 Flash PROM". 
  Apply patch 111179 which upgrades the OBP to 4.5.9.  If OBP 4.5.9
  is used, the DIMM# reported will be correct.


COMMENTS:  

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.