FIN #: I0895-1

SYNOPSIS: Sun Fire 12K/15K domains may fail to boot due to LPOST issue

DATE: Dec/01/02

KEYWORDS: Sun Fire 12K/15K domains may fail to boot due to LPOST issue


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: Sun Fire 12K/15K domains may fail to boot due to LPOST issue.


SunAlert:           No

TOP FIN/FCO REPORT: Yes 
  
PRODUCT_REFERENCE:  Sun Fire 12K/15K
 
PRODUCT CATEGORY:   Server / SW Admin

     
PRODUCTS AFFECTED:  
  
Mkt_ID   Platform   Model   Description         Serial Number
------   --------   -----   -----------         -------------
  -        F12K      ALL    Sun Fire 12000            -
  -        F15K      ALL    Sun Fire 15000            -


X-Options Affected
------------------
Mkt_ID   Platform   Model     Description        Serial Number
------   --------   -----     -----------        -------------
  -         -         -            -                   -


PART NUMBERS AFFECTED:

Part Number      Description           Model
-----------      -----------           -----
     -                -                  -


REFERENCES:

BugId:   4728549 - starcat won't boot: "Memory Address not Aligned".

PatchId: 112829: SMS 1.2: SMS lpost patch.


PROBLEM DESCRIPTION: 

In rare instances, Sun Fire 12K/15K domains may fail to boot due to an
LPOST issue.  This issue has only been observed during Sun internal
testing, but has the potential to cause loss of availability for
customer domains.

This issue affects any Sun Fire 12K/15K domain running Solaris 8 or
Solaris 9 in which a CPU/Memory Board or MaxCPU Board is at LPOST level
5.13.3 or lower.
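
The affected range can be checked mechanically. The following is a
minimal sketch, assuming the LPOST level is available as a plain dotted
numeric string such as "5.13.3". The helper names are hypothetical, and
the sample levels other than 5.13.3 are arbitrary test inputs, not
known LPOST releases. Levels are compared as integer tuples because a
plain string comparison would rank "5.9.0" above "5.13.3".

```python
# Sketch: decide whether an LPOST level falls in the affected range.
# AFFECTED_MAX comes from this FIN; the helper names are hypothetical.

AFFECTED_MAX = (5, 13, 3)


def parse_level(level):
    """Turn a dotted level such as '5.13.3' into a tuple of ints."""
    return tuple(int(part) for part in level.strip().split("."))


def is_affected(level):
    """Return True if the level is 5.13.3 or lower."""
    return parse_level(level) <= AFFECTED_MAX


if __name__ == "__main__":
    # Arbitrary sample inputs, chosen only to exercise the comparison.
    for level in ("5.9.0", "5.13.3", "5.13.4"):
        print(level, "affected" if is_affected(level) else "not affected")
```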

To determine the LPOST version for a domain:

   Use 'flashupdate -d X -f /opt/SUNWSMS/hostobjs/sgcpu.flash -n',
   where X is the letter [A-R] of the domain.
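
When several domains need to be checked, the command above can be
generated once per domain letter. The sketch below only builds and
prints the command lines; it does not run them and makes no assumption
about what 'flashupdate' prints. The function and constant names are
hypothetical.

```python
# Sketch: build the flashupdate command from this FIN for each domain.
import string

FLASH_IMAGE = "/opt/SUNWSMS/hostobjs/sgcpu.flash"  # path as given in this FIN
DOMAIN_LETTERS = string.ascii_uppercase[:18]       # 'A' through 'R'


def version_check_command(domain):
    """Return the command line from this FIN for one domain letter."""
    # Upper-cased only to match the [A-R] notation used in this FIN.
    domain = domain.upper()
    if len(domain) != 1 or domain not in DOMAIN_LETTERS:
        raise ValueError("domain must be a single letter A-R")
    return f"flashupdate -d {domain} -f {FLASH_IMAGE} -n"


if __name__ == "__main__":
    for letter in DOMAIN_LETTERS:
        print(version_check_command(letter))
```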
	
To date, this issue has only been experienced during in-house testing
of Solaris 9 9/02, and all known failures have occurred during boot.
However, a CPU/MCPU DR configure operation on a board that was POST'ed
using the faulty LPOST firmware could also put the domain, or customer
data, at risk.

When a domain fails to boot due to this issue, one of three signatures
is seen on the domain console:

 1. Memory corruption

   {20} ok boot
   Boot device:  
   /pci@1c,600000/pci@1/SUNW,qlc@4/fp@0,0/disk@w220000203709cd23,0:d  
      File and args:
   Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
   FCode UFS Reader 1.12 00/07/17 15:48:16.
   Loading: /platform/SUNW,Sun-Fire-15000/ufsboot
   Loading: /platform/sun4u/ufsboot
   SunOS Release 5.9 Version cs3:s9-phase2-build:08/03/02 64-bit
   Copyright 1983-2002 Sun Microsystems, Inc.  All rights reserved.
   Use is subject to license terms.
   obpsym: symbolic debugging is available.
   Read 320799 bytes from misc/forthdebug
   Memory Address not Aligned
   {23} ok
    
 2. Send mondo panic

   {20} boot -v
   ...
   mc-us37 at root: SAFARI 0x223 0x400000 ...
   mc-us37 is /memory-controller@223,400000
   cpu544: SUNW,UltraSPARC-III+ (upaid 544 impl 0x15 ver 0x22 clock 900 MHz)
   cpu512: SUNW,UltraSPARC-III+ (upaid 512 impl 0x15 ver 0x22 clock 900 MHz)
   cpu513: SUNW,UltraSPARC-III+ (upaid 513 impl 0x15 ver 0x22 clock 900 MHz)
   cpu514: SUNW,UltraSPARC-III+ (upaid 514 impl 0x15 ver 0x22 clock 900 MHz)
   cpu515: SUNW,UltraSPARC-III+ (upaid 515 impl 0x15 ver 0x22 clock 900 MHz)
   send mondo timeout [833134 NACK 0 BUSY]
   IDSR 0x4  aids: 202
   panic: failed to stop cpu514

   panic[cpu512]/thread=2a1001fdd40: send_mondo_set: timeout
 
 3. Hang

   {a0} ok boot
   Boot device:
      /pci@17c,600000/pci@1/SUNW,qlc@4/fp@0,0/disk@w210000203719b3f4,0:a
      File and args: 
   SunOS Release 5.9 Version cs3:s9-phase2-build:SF15K_Phase_II_build_17 
   64-bit
   Copyright 1983-2002 Sun Microsystems, Inc.  All rights reserved.
   Use is subject to license terms.
   |             <-----  hangs right here

  OR
	 
   {20} boot -v
      ...
   mc-us37 at root: SAFARI 0x223 0x400000 ...
   mc-us37 is /memory-controller@223,400000
   cpu544: SUNW,UltraSPARC-III+ (upaid 544 impl 0x15 ver 0x22 clock 900 MHz)
   cpu512: SUNW,UltraSPARC-III+ (upaid 512 impl 0x15 ver 0x22 clock 900 MHz)
   cpu513: SUNW,UltraSPARC-III+ (upaid 513 impl 0x15 ver 0x22 clock 900 MHz)
   cpu514: SUNW,UltraSPARC-III+ (upaid 514 impl 0x15 ver 0x22 clock 900 MHz)
   cpu515: SUNW,UltraSPARC-III+ (upaid 515 impl 0x15 ver 0x22 clock 900 MHz)
	           <-----  hangs right here
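
The two signatures that print a message can be searched for in a saved
console capture. The sketch below is illustrative only: it assumes the
capture is available as plain text, uses marker strings copied from the
excerpts above, and uses hypothetical function and signature names. A
hang produces no distinguishing message, so it cannot be detected this
way.

```python
# Sketch: match a saved domain console capture against the text signatures.
# The marker strings are copied from the console excerpts in this FIN.

SIGNATURES = (
    ("memory corruption", ("Memory Address not Aligned",)),
    ("send mondo panic", ("send mondo timeout", "send_mondo_set: timeout")),
)


def classify_console(text):
    """Return the name of the first matching signature, or None."""
    for name, markers in SIGNATURES:
        if any(marker in text for marker in markers):
            return name
    # A hang prints no distinguishing message, so it is not matched here.
    return None


if __name__ == "__main__":
    sample = "Read 320799 bytes from misc/forthdebug\nMemory Address not Aligned\n"
    print(classify_console(sample))
```

A result of None does not rule out this issue, since the hang signature
leaves no text to match.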
   
The root cause of this issue is that LPOST fails to clean the W$
(write cache) either before or after it cleans (zeros) the E$
(external cache) tags.  This omission not only leaves the W$ dirty, but
also puts the processor in an invalid state in which the W$ is out of
sync with the E$.  Later, when the dirty W$ lines are evicted, the
processor assumes the W$ and E$ are in sync and does not confirm that
the W$ and E$ line tags match before merging the data.

When OBP is loaded and Solaris begins to boot, the stale contents of
the W$ are known to become merged with OBP and Solaris data.  This is
how the memory corruption occurs.  The invalid processor state is also
known to cause the CPU to stop making forward progress in the
instruction stream.
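
The corruption mechanism can be illustrated with a toy model. This is a
sketch under strong simplifying assumptions and does not describe the
real UltraSPARC-III+ cache design: one W$ line and one E$ line, each
holding a tag and four bytes, with made-up tags and data values. It
also assumes that the fix amounts to cleaning the W$ while the tags
still match. All names are hypothetical.

```python
# Toy model only: one W$ line and one E$ line, each a tag plus four bytes.
# It is not a description of the real UltraSPARC-III+ cache hardware.


class Line:
    def __init__(self, tag, data, dirty):
        self.tag = tag
        self.data = list(data)
        self.dirty = list(dirty)


def evict(wline, eline):
    """Merge dirty W$ bytes into the E$ line without comparing tags."""
    for i, is_dirty in enumerate(wline.dirty):
        if is_dirty:
            eline.data[i] = wline.data[i]
    wline.dirty = [False] * len(wline.dirty)


def simulate(clean_wcache):
    # During POST the W$ holds two dirty bytes for the line tagged 0x10.
    wline = Line(0x10, [0xAA, 0xAA, 0, 0], [True, True, False, False])
    eline = Line(0x10, [0, 0, 0, 0], [False] * 4)
    if clean_wcache:
        evict(wline, eline)  # clean the W$ while the tags still match
    eline.tag = 0            # LPOST zeros the E$ tag
    # Later the E$ line is refilled with unrelated (OBP/Solaris) data.
    eline.tag, eline.data = 0x20, [1, 2, 3, 4]
    evict(wline, eline)      # eviction of whatever is still dirty
    return eline.data


if __name__ == "__main__":
    print("W$ left dirty:", simulate(clean_wcache=False))
    print("W$ cleaned:   ", simulate(clean_wcache=True))
```

With the W$ left dirty, the first two bytes of the refilled line are
overwritten by stale data; with the W$ cleaned first, the refilled line
is left intact.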

This issue is addressed in SMS 1.2 patch 112829 (or higher), which
contains an updated LPOST flash image.  No patch for SMS 1.1 is planned
at this time.  An upgrade to SMS 1.2 (or higher) is recommended.
  

IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        | X |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the
above-mentioned problem.

1. Apply SMS 1.2 patch 112829 to both System Controllers per the
   patch README file.  Pay attention to the special installation
   instructions.

2. Use the 'flashupdate' command to update the LPOST image on CPU/MCPU
   boards.  Refer to the 'flashupdate' man page for specific command
   syntax.
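
Whether patch 112829 is already present on a System Controller can be
checked from a saved patch listing. The sketch below assumes the
listing was produced by the Solaris 'showrev -p' command and that each
installed patch appears on a line beginning "Patch: <id>-<rev>". That
line format, the function name, and the sample line (including its
package name) are assumptions for illustration, not taken from this
FIN.

```python
# Sketch: look for patch 112829 in saved 'showrev -p' output.
# Assumes lines of the form "Patch: 112829-01 Obsoletes: ...".
import re

PATCH_BASE = "112829"  # SMS 1.2 LPOST patch named in this FIN


def installed_revisions(showrev_output, base=PATCH_BASE):
    """Return the sorted revision numbers of `base` found in the listing."""
    pattern = re.compile(
        r"^Patch:\s+" + re.escape(base) + r"-(\d+)\b", re.MULTILINE
    )
    return sorted(int(rev) for rev in pattern.findall(showrev_output))


if __name__ == "__main__":
    # Invented sample line; the package name is a placeholder.
    sample = (
        "Patch: 112829-01 Obsoletes:  Requires:  "
        "Incompatibles:  Packages: EXAMPLEpkg\n"
    )
    revisions = installed_revisions(sample)
    print("installed revisions:", revisions or "none")
```

An empty result means the patch was not found in the listing. The
LPOST image on the boards still has to be updated with 'flashupdate'
after the patch is applied, as described in step 2.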


COMMENTS:  

None.

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Sun Services will attempt to contact  
     all affected customers to recommend implementation of the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Sun Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Sun Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.