Document fins/I0839-1


FIN #: I0839-1

SYNOPSIS: DIMM error messages on Sun Fire V880 systems with Solaris 8 Updates 5
          and 6 make it difficult to identify the failing DIMM

DATE: Jun/17/02

KEYWORDS: DIMM error messages on Sun Fire V880 systems with Solaris 8 Updates 5
          and 6 make it difficult to identify the failing DIMM


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: DIMM error messages on Sun Fire V880 systems with Solaris 8 
          Updates 5 and 6 make it difficult to identify the failing DIMM.
              

Sun Alert:          No

TOP FIN/FCO REPORT: Yes 
 
PRODUCT_REFERENCE:  DIMM on Sun Fire V880 
 
PRODUCT CATEGORY:   Server / SW Admin 


PRODUCTS AFFECTED:  

Mkt_ID     Platform     Model   Description        Serial Number
------     --------     -----   -----------        -------------
Systems Affected
----------------
  -          A30         ALL    Sun Fire V880           -

X-Options Affected
------------------
  -           -           -          -                   -
 

PART NUMBERS AFFECTED: 

Part Number   Description   Model
-----------   -----------   -----
     -             -          -


REFERENCES:

BugId:   4491362 - CE error reporting in Daktari and Cherrystone is 
                   ambiguous.   

PatchId: 108528 or higher: SunOS 5.8: kernel update patch.  

      
PROBLEM DESCRIPTION: 

DIMM error messages on Sun Fire V880 systems running Solaris 8 U5
(07/01) or Solaris 8 U6 (10/01) without Patch 108528 do not provide
the exact location of the failing DIMM.  This serviceability issue
could cause support personnel to replace the wrong DIMMs and lead to
unnecessary service calls and system down time.

The Sun Fire V880 platform is capable of having up to 4 dual-CPU/Memory
Boards installed into the chassis.  Memory DIMMs are installed into the
CPU/Memory Boards themselves, not onto the System Board.  With Solaris
8 U5 (07/01) and Solaris 8 U6 (10/01) the error messaging only reports
a DIMM connector slot (e.g. JXXXX), but not the CPU/Memory Board where
that DIMM resides.  CPU/Memory Boards plug into the System Board in one
of four slots.

When an error occurs and Solaris logs the message, the service person
is given a particular DIMM slot on some CPU, but is not provided a
specific CPU Board on which to replace the DIMM.  As a result, the DIMM
may be pulled from that DIMM slot on each installed CPU Board.  DIMMs
may also be mistakenly pulled from the CPU Board where the "victim" CPU
is located (the CPU which reported the event).  This may not be where
the DIMM actually resides.  This can cause the wrong DIMM/DIMMs to be
replaced.  It is believed that this has contributed to the high rate of
NTF (No Trouble Found) for DIMMs replaced on the V880.  

Here are some examples taken from a system running Solaris 8 U6, but they 
will be the same for Solaris 8 U5.  

Here is a single bit error that was corrected.  It shows that J8000 is
the failing DIMM.  However, it does not give the location of the
CPU/Memory Board on which J8000 resides.

   May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 423498 kern.notice] 
          NOTICE: [AFT0] Corrected system bus (CE) Event on CPU7 at TL=0, 
          errID 0x00000052.228491d4
   May 16 04:05:02 bm006     AFSR 0x00000002<CE>.000001c3 AFAR 
          0x00000020.c40330b0
   May 16 04:05:02 bm006     Fault_PC 0x19490 Esynd 0x01c3 J8000 <------*
   May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 266361 kern.notice] 
          [AFT0] errID 0x00000052.228491d4 Corrected Memory Error on J3200 
          is Intermittent
   May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 466347 kern.notice] 
          [AFT0] errID 0x00000052.228491d4 Data Bit 72 was in error and 
          corrected

Here is an example of a multi-bit problem which results in a UE error.  
Multiple CPUs across different CPU/Memory Boards report the failure.  
The reporting CPU will not always be where the failing DIMM is located.

   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 774876 kern.warning] 
          WARNING: [AFT1] Uncorrectable system bus (UE)   Event on CPU7 
          User Data Access at TL=0, errID 0x00000063.74c9f1d0
   May 16 04:06:16 bm006     AFSR 0x00000004<UE>.000000b1 AFAR 
          0x00000080.c802c390
   May 16 04:06:16 bm006     Fault_PC 0x19490 Esynd 0x00b1 J3100 J3101 
          J3201 J3200
   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 820243 kern.notice] 
          [AFT1] errID 0x00000063.74c9f1d0 More than four Bits were in 
          error

   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 920920 kern.warning] 
          WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU3 
          User Data Access at TL=0, errID 0x00000063.74ca36f4
   May 16 04:06:16 bm006     AFSR 0x00000004<UE>.000000b1 AFAR 
          0x00000080.c802c390
   May 16 04:06:16 bm006     Fault_PC 0x19590 Esynd 0x00b1 J3100 J3101 
          J3201 J3200
   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 774076 kern.notice] 
          [AFT1] errID 0x00000063.74ca36f4 More than four Bits were in 
          error

   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 147098 kern.warning] 
          WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU6 
          User Data Access at TL=0, errID 0x00000063.74ca70c4
   May 16 04:06:16 bm006     AFSR 0x00000004<UE>.000000b1 AFAR 
          0x00000080.c802c390
   May 16 04:06:16 bm006     Fault_PC 0x19490 Esynd 0x00b1 J3100 J3101 
          J3201 J3200
   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 356796 kern.notice] 
          [AFT1] errID 0x00000063.74ca70c4 More than four Bits were in 
          error

   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 679866 kern.warning] 
          WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU2 
          User Data Access at TL=0, errID 0x00000063.74ca876c
   May 16 04:06:16 bm006     AFSR 0x00000004<UE>.000000b1 AFAR 
          0x00000080.c802c390
   May 16 04:06:16 bm006     Fault_PC 0x19910 Esynd 0x00b1 J3100 J3101 
          J3201 J3200
   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 565709 kern.notice] 
          [AFT1] errID 0x00000063.74ca876c More than four Bits were in 
          error
   
This DIMM serviceability issue has been addressed in Solaris 8 U7
(02/02) and in Kernel PatchId 108528 or greater.  For those customers
who cannot upgrade to the 02/02 release, it is strongly recommended
that PatchId 108528 or greater be installed to alleviate this issue.
      
Once Patch 108528 or Solaris 8 Update 7 is installed, the memory
error messages will include the location of the failing DIMM, including
the CPU/Memory Board.  Here is an example of a DIMM error from Solaris
8 U7.  Note that the CPU/Memory Board is listed in the error message.

   May 23 04:01:37 cl304 SUNW,UltraSPARC-III+: [ID 375193 kern.notice] 
          NOTICE: [AFT0] Corrected system bus (CE) Event on CPU0 at TL=0, 
          errID 0x00000043.b4f99f78
   May 23 04:01:37 cl304     AFSR 0x00000002<CE>.000000f8 AFAR
          0x00000040.ff431100
   May 23 04:01:37 cl304     Fault_PC 0x10031120 Esynd 0x00f8 Slot B: 
          J2901
   May 23 04:01:37 cl304 SUNW,UltraSPARC-III+: [ID 737772 kern.notice] 
          [AFT0] errID 0x00000043.b4f99f78 Corrected Memory Error on  
          Slot B: J2901 is Sticky
          ^^^^^^                                            
                                                 
   May 23 04:01:37 cl304 SUNW,UltraSPARC-III+: [ID 594816 kern.notice] 
         [AFT0] errID 0x00000043.b4f99f78 Data Bit 56 was in error and 
         corrected

The error message above points to DIMM J2901 on the CPU/Memory Board in 
Slot B.

    
IMPLEMENTATION:  
 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        | X |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---
         

CORRECTIVE ACTION: 

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.

To aid in diagnosing DIMM failures on Sun Fire V880 systems, perform 
the following actions.  

NOTE: It is recommended to run Extended POST after any failing DIMMs are 
      replaced to verify that the memory is OK.  If time permits, also 
      run the Memory Test from VTS.

1. Install Solaris 8 Kernel PatchId 108528 or greater, or upgrade the
   OS to Solaris 8 Update 7 (02/02) or later.
   
   OR
   
2. If it is not possible to install PatchId 108528 or upgrade to Solaris 8 
   U7, use the procedure below to determine the location of failed DIMMs on 
   the V880 platform. 
   
DIMM FAILURE ANALYSIS PROCEDURE
-------------------------------
   
  A. Correctable Error (CE):
  --------------------------
  Use the first error message shown in the Problem Description, which is
  a CE error.  The data needed to determine where the failing DIMM is
  located is the AFAR associated with the DIMM and the Memory
  Configuration that OBP sets up prior to booting the System.  The AFAR
  can be gathered from the error message.  The AFAR is
  0x00000020.c40330b0 as shown below.

    May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 423498 kern.notice] 
           NOTICE: [AFT0] Corrected system bus (CE) Event on CPU7 at TL=0, 
           errID 0x00000052.228491d4
    May 16 04:05:02 bm006     AFSR 0x00000002<CE>.000001c3 
           AFAR 0x00000020.c40330b0
           ^^^^^^^^^^^^^^^^^^^^^^^^
    May 16 04:05:02 bm006     Fault_PC 0x19490 Esynd 0x01c3 J8000
    May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 266361 kern.notice] 
           [AFT0] errID 0x00000052.228491d4 Corrected Memory Error on J3200 
           is Intermittent
    May 16 04:05:02 bm006 SUNW,UltraSPARC-III: [ID 466347 kern.notice] 
           [AFT0] errID 0x00000052.228491d4 Data Bit 72 was in error and 
           corrected
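
  When messages have been saved to a file, the AFAR can also be pulled
  out with a regular expression.  The following is a minimal Python
  sketch, not a Sun-supplied tool.  The name extract_afars is
  hypothetical, and the pattern assumes the AFAR is always printed as two
  8-digit hex halves separated by ".", as in the messages above.

```python
import re

# AFAR as printed in the messages above: upper and lower 32-bit halves
# separated by "."; \s+ also tolerates a line wrap after "AFAR".
AFAR_RE = re.compile(r"AFAR\s+0x([0-9a-fA-F]{8})\.([0-9a-fA-F]{8})")


def extract_afars(log_text):
    """Return each AFAR found in log_text as an (upper, lower) hex-string pair."""
    return AFAR_RE.findall(log_text)


sample = (
    "May 16 04:05:02 bm006     AFSR 0x00000002<CE>.000001c3\n"
    "       AFAR 0x00000020.c40330b0\n"
)
for upper, lower in extract_afars(sample):
    print("AFAR upper half:", upper, "lower half:", lower)
```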
        
  Next, collect the Memory Configuration.  There are two possible methods
  to gather this information, 1 or 2 below.

    1. Use 'cfgadm -av | grep "base address"' from Solaris.  This should
       be captured from the system which is producing the DIMM error.  Do
       NOT rely on the output below or any other system's output.  Memory
       configurations can vary from system to system depending on their
       own unique memory layout.

       The following output is displayed:
  
       SBa::memory                    connected    configured   ok   base
       address 0x2000000000, 1048576 KBytes total, unconfigurable
       SBb::memory                    connected    configured   ok   base
       address 0x4000000000, 1048576 KBytes total, unconfigurable
       SBc::memory                    connected    configured   ok   base
       address 0x6000000000, 1048576 KBytes total, unconfigurable
       SBd::memory                    connected    configured   ok   base
       address 0x8000000000, 1048576 KBytes total, unconfigurable
  
   This shows the base address associated with each cpu/mem Module.
  
       SBa: cpu/mem Module in Slot A, base address 0x2000000000
       SBb: cpu/mem Module in Slot B, base address 0x4000000000
       SBc: cpu/mem Module in Slot C, base address 0x6000000000
       SBd: cpu/mem Module in Slot D, base address 0x8000000000
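
   The same slot-to-base-address table can be built from a saved copy of
   the cfgadm output.  This is an illustrative Python sketch only.
   parse_cfgadm is a hypothetical helper, and the pattern assumes the
   output format shown above (an SBx::memory entry followed by
   "base address 0x...").

```python
import re

# One entry per board: "SBa::memory ... base address 0x2000000000, ..."
# DOTALL lets the match span the line wrap seen in the listing above.
CFGADM_RE = re.compile(
    r"SB([a-d])::memory.*?base\s+address\s+(0[xX][0-9a-fA-F]+)",
    re.DOTALL,
)


def parse_cfgadm(text):
    """Map slot letter ('A'..'D') to the base address reported for that board."""
    return {slot.upper(): int(base, 16) for slot, base in CFGADM_RE.findall(text)}


sample = """\
SBa::memory                    connected    configured   ok   base
address 0x2000000000, 1048576 KBytes total, unconfigurable
SBb::memory                    connected    configured   ok   base
address 0x4000000000, 1048576 KBytes total, unconfigurable
"""
for slot, base in sorted(parse_cfgadm(sample).items()):
    print("Slot", slot, "base address", hex(base))
```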
  
   2. Perform the following three steps from OBP to gather the OBP
      memory Configuration.  The memory configuration must be captured from 
      the system which is producing the DIMM error.  Do NOT rely on the
      output below or any other system's output.  Memory configurations 
      can vary from system to system depending on their own unique memory 
      layout.

      Set up the OBP to gather the Memory Configuration:
   	
         ok setenv diag-switch? true
         ok setenv diag-level min
         ok reset-all
   
      The system key switch could also be set to the "diag" position,
      or the NVRAM variable "diag-level" could be set to "max"
      (ok setenv diag-level max).  The quickest way would be the 3 steps
      above.

   Here is output from the OBP showing the Memory Configuration:

   03:57:51 Memory Configuration: 
   03:57:51 CPU0 Bank0  128 +  128 +  128 +  128 :  512MB @  2000000000  
            8-way #0
   03:57:51 CPU0 Bank1  128 +  128 +  128 +  128 :  512MB @  2000000000  
            8-way #2
   03:57:52 CPU0 Bank2  128 +  128 +  128 +  128 :  512MB @  2000000000  
            8-way #4
   03:57:52 CPU0 Bank3  128 +  128 +  128 +  128 :  512MB @  2000000000  
            8-way #6
   03:57:52 CPU1 Bank0  128 +  128 +  128 +  128 :  512MB @  4000000000  
            8-way #0
   03:57:52 CPU1 Bank1  128 +  128 +  128 +  128 :  512MB @  4000000000  
            8-way #2
   03:57:52 CPU1 Bank2  128 +  128 +  128 +  128 :  512MB @  4000000000  
            8-way #4
   03:57:52 CPU1 Bank3  128 +  128 +  128 +  128 :  512MB @  4000000000  
            8-way #6
   03:57:52 CPU2 Bank0  128 +  128 +  128 +  128 :  512MB @  2000000000  
            8-way #1
   03:57:52 CPU2 Bank1  128 +  128 +  128 +  128 :  512MB @  2000000000  
            8-way #3
   03:57:52 CPU2 Bank2  128 +  128 +  128 +  128 :  512MB @  2000000000  
            8-way #5
   03:57:52 CPU2 Bank3  128 +  128 +  128 +  128 :  512MB @  2000000000  
            8-way #7
   03:57:52 CPU3 Bank0  128 +  128 +  128 +  128 :  512MB @  4000000000   
            8-way #1
   03:57:52 CPU3 Bank1  128 +  128 +  128 +  128 :  512MB @  4000000000  
            8-way #3
   03:57:52 CPU3 Bank2  128 +  128 +  128 +  128 :  512MB @  4000000000  
            8-way #5
   03:57:53 CPU3 Bank3  128 +  128 +  128 +  128 :  512MB @  4000000000  
            8-way #7
   03:57:53 CPU4 Bank0  128 +  128 +  128 +  128 :  512MB @  6000000000  
            8-way #0
   03:57:53 CPU4 Bank1  128 +  128 +  128 +  128 :  512MB @  6000000000  
            8-way #2
   03:57:53 CPU4 Bank2  128 +  128 +  128 +  128 :  512MB @  6000000000  
            8-way #4
   03:57:53 CPU4 Bank3  128 +  128 +  128 +  128 :  512MB @  6000000000  
            8-way #6
   03:57:53 CPU5 Bank0  128 +  128 +  128 +  128 :  512MB @  8000000000  
            8-way #0
   03:57:53 CPU5 Bank1  128 +  128 +  128 +  128 :  512MB @  8000000000  
            8-way #2
   03:57:53 CPU5 Bank2  128 +  128 +  128 +  128 :  512MB @  8000000000  
            8-way #4
   03:57:53 CPU5 Bank3  128 +  128 +  128 +  128 :  512MB @  8000000000  
            8-way #6
   03:57:53 CPU6 Bank0  128 +  128 +  128 +  128 :  512MB @  6000000000  
            8-way #1
   03:57:53 CPU6 Bank1  128 +  128 +  128 +  128 :  512MB @  6000000000  
            8-way #3
   03:57:53 CPU6 Bank2  128 +  128 +  128 +  128 :  512MB @  6000000000  
            8-way #5
   03:57:53 CPU6 Bank3  128 +  128 +  128 +  128 :  512MB @  6000000000  
            8-way #7
   03:57:53 CPU7 Bank0  128 +  128 +  128 +  128 :  512MB @  8000000000  
            8-way #1
   03:57:54 CPU7 Bank1  128 +  128 +  128 +  128 :  512MB @  8000000000  
            8-way #3
   03:57:54 CPU7 Bank2  128 +  128 +  128 +  128 :  512MB @  8000000000  
            8-way #5
   03:57:54 CPU7 Bank3  128 +  128 +  128 +  128 :  512MB @  8000000000  
            8-way #7
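
   A captured copy of this listing can be reduced to "which CPUs share
   which base address" with a short script.  The Python sketch below is
   an assumption-laden illustration: cpus_by_base is a hypothetical name,
   and the pattern is written only against the line format shown above.

```python
import re

# Matches e.g. "CPU0 Bank0  128 +  128 +  128 +  128 :  512MB @  2000000000"
BANK_RE = re.compile(
    r"CPU(\d+)\s+Bank(\d+)\s+[\d+\s]+:\s*(\d+)MB\s*@\s*([0-9a-fA-F]+)"
)


def cpus_by_base(obp_text):
    """Group CPU numbers by the base address printed after '@'."""
    groups = {}
    for cpu, _bank, _size_mb, base in BANK_RE.findall(obp_text):
        groups.setdefault(int(base, 16), set()).add(int(cpu))
    return {base: sorted(cpus) for base, cpus in groups.items()}


sample = """\
03:57:51 CPU0 Bank0  128 +  128 +  128 +  128 :  512MB @  2000000000
         8-way #0
03:57:52 CPU2 Bank0  128 +  128 +  128 +  128 :  512MB @  2000000000
         8-way #1
03:57:53 CPU5 Bank0  128 +  128 +  128 +  128 :  512MB @  8000000000
         8-way #0
03:57:53 CPU7 Bank0  128 +  128 +  128 +  128 :  512MB @  8000000000
         8-way #1
"""
for base, cpus in sorted(cpus_by_base(sample).items()):
    print(hex(base), "-> CPUs", cpus)
```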

Now, all of the data has been collected to associate the failing DIMM 
with the correct CPU/Memory Board.

The AFAR is 0x00000020.c40330b0.  Take the two digits to the left of
the "."  In this case it is 20.  The value of 20 corresponds to
2000000000 which is seen from the Memory Configuration output from the
OBP.  2000000000 is associated with CPUs 0 and 2, or the CPU/Memory
Board in Slot A.  
                                                             
Here is how the CPUs map out to each slot on the System Board:

   Slot A   CPU0 & CPU2
   Slot B   CPU1 & CPU3
   Slot C   CPU4 & CPU6
   Slot D   CPU5 & CPU7
   
This information is hard coded.  For example, if there was one
CPU/Memory Board located in Slot B, the CPUs would show up as 1 and 3. 
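
The lookup described above can be sketched in a few lines of Python.
This is an illustration of the manual rule, not a Sun tool.  The names
slot_for_afar and EXAMPLE_CPU_BASE are hypothetical, and the CPU base
addresses are transcribed from the example OBP output in this FIN; they
must be replaced with values captured from the failing system.

```python
# Hard-coded CPU-to-slot assignment from the table above.
CPU_SLOT = {0: "A", 2: "A", 1: "B", 3: "B", 4: "C", 6: "C", 5: "D", 7: "D"}

# CPU -> base address, transcribed from the example OBP output above.
# Assumption: replace with values captured from the failing system.
EXAMPLE_CPU_BASE = {
    0: 0x2000000000, 2: 0x2000000000,
    1: 0x4000000000, 3: 0x4000000000,
    4: 0x6000000000, 6: 0x6000000000,
    5: 0x8000000000, 7: 0x8000000000,
}


def slot_for_afar(afar, cpu_base):
    """Apply the rule above: the two hex digits left of '.' select the base."""
    upper_half = afar.split(".")[0]
    base = int(upper_half[-2:], 16) << 32  # "20" -> 0x2000000000
    slots = {CPU_SLOT[cpu] for cpu, b in cpu_base.items() if b == base}
    if len(slots) != 1:
        raise ValueError("AFAR %s does not match exactly one board" % afar)
    return slots.pop()


print(slot_for_afar("0x00000020.c40330b0", EXAMPLE_CPU_BASE))
```

An AFAR that matches no configured base address raises ValueError rather
than guessing a board.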

   B. Uncorrectable Error (UE):
   ----------------------------

   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 920920 kern.warning] 
          WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU3 User 
          Data Access at TL=0, errID 0x00000063.74ca36f4
   May 16 04:06:16 bm006     AFSR 0x00000004<UE>.000000b1 AFAR 
          0x00000080.c802c390
   May 16 04:06:16 bm006     Fault_PC 0x19590 Esynd 0x00b1 J3100 J3101 
          J3201 J3200
   May 16 04:06:16 bm006 SUNW,UltraSPARC-III: [ID 774076 kern.notice] 
          [AFT1] errID 0x00000063.74ca36f4 More than four Bits were in 
          error

The AFAR above is 0x00000080.c802c390.  Take the two digits to the left
of the "."  In this case it is 80.  Using the same memory configuration
shown above, 80 corresponds to 8000000000, which relates to
CPUs 5 and 7, or the CPU/Memory Board in Slot D.  Notice that the error
message was generated from CPU3.  However, the failing DIMMs are not on
the same board as CPU3.  They are on the CPU/Memory Board in Slot D.
Also note that this is a UE error where more than four bits were in
error.  In this case you need to change all four DIMMs (J3100 J3101
J3201 J3200) located on the CPU/Memory Board in Slot D.  The failure
cannot be isolated to one DIMM in this case, so all four must be
replaced.
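
The UE analysis above can be sketched the same way: take the AFAR and
the DIMM connector list from one message and pair them with the board
slot.  The Python below is a hedged illustration.  locate_dimms and
EXAMPLE_BASE_SLOT are hypothetical names, the base-to-slot table holds
only for the example configuration in this FIN, and the patterns assume
the message layout shown above.

```python
import re

# Upper 32-bit half of the AFAR, and the J-connector list after Esynd.
AFAR_RE = re.compile(r"AFAR\s+0x([0-9a-fA-F]{8})\.[0-9a-fA-F]{8}")
DIMM_RE = re.compile(r"Esynd\s+0x[0-9a-fA-F]+((?:\s+J\d+)+)")

# Base address -> slot, valid for the example configuration in this FIN only.
EXAMPLE_BASE_SLOT = {
    0x2000000000: "A", 0x4000000000: "B",
    0x6000000000: "C", 0x8000000000: "D",
}


def locate_dimms(message, base_slot):
    """Return (slot, [connectors]) for one pre-patch error message."""
    afar = AFAR_RE.search(message)
    dimms = DIMM_RE.search(message)
    if afar is None or dimms is None:
        raise ValueError("message has no AFAR or no DIMM list")
    # Upper half shifted back into place; a base address that is not in
    # base_slot raises KeyError instead of guessing a board.
    base = int(afar.group(1), 16) << 32
    return base_slot[base], dimms.group(1).split()


ue_message = """\
May 16 04:06:16 bm006     AFSR 0x00000004<UE>.000000b1 AFAR
       0x00000080.c802c390
May 16 04:06:16 bm006     Fault_PC 0x19590 Esynd 0x00b1 J3100 J3101
       J3201 J3200
"""
slot, dimms = locate_dimms(ue_message, EXAMPLE_BASE_SLOT)
print("Slot %s: %s" % (slot, " ".join(dimms)))
```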


COMMENTS:

None
   
----------------------------------------------------------------------------

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.