Document fins/I0788-1


FIN #: I0788-1

SYNOPSIS: Running POST on one domain may Dstop all other running domains on Sun
          Fire 15K systems with SMS 1.1

DATE: Mar/08/02

KEYWORDS: Running POST on one domain may Dstop all other running domains on Sun
          Fire 15K systems with SMS 1.1


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: Running POST on one domain may Dstop all other running 
          domains on Sun Fire 15K systems with SMS 1.1.
              

Sun Alert:          Yes

TOP FIN/FCO REPORT: Yes
 
PRODUCT_REFERENCE:  SMS 1.1 on Sun Fire 15K  
 
PRODUCT CATEGORY:   Server / SW Admin
 

PRODUCTS AFFECTED:

Systems Affected
----------------    
Mkt_ID   Platform   Model   Description       Serial Number
------   --------   -----   -----------       -------------
  -        F15K      ALL    Sun Fire 15000          -
  

X-Options Affected
------------------

Mkt_ID   Platform   Model   Description       Serial Number
------   --------   -----   -----------       -------------
  -         -         -          -                  -


PART NUMBERS AFFECTED:

Part Number   Description   Model
-----------   -----------   -----
     -             -          -

REFERENCES:

BugId:    4505473 - AMX data ECC uncorrectable error.

PatchId:  112080 - SMS 1.1: Patch IBIST for pause wafer change.

ESC:      534366 - All domain down due to DSTOP.

SunAlert: 42881


PROBLEM DESCRIPTION: 

A software error on one domain (such as a heartbeat failure, panic
timeout, or error-reset) can cause another domain to Dstop on Sun
Fire 15K systems running SMS 1.1.  The issue manifests as POST running
on one domain causing all other running domains to Dstop.  While the
occurrence is rare, the impact is platform wide.  Depending upon domain
configuration and applications, down time can be several hours.  This
problem is intermittent and may be related to a domain sync operation
on the centerplane (reset of unused ports).

Running POST on one domain means that the power-on self tests are
executed on any domain in the system.  POST runs when a domain is
initially brought online, during a DR attach of a board (not currently
supported), or as a recovery action performed by the SMS software to
get a domain back up and running after a reboot, panic, or Dstop.

When the problem occurs, the platform message log
(/var/opt/SUNWSMS/adm/platform/messages) reports a message such as:

    Jan 17 20:25:55 2002 swmtft901 hwad[22514]: [1156 1693005732870614 ERR 
    InterruptHandler.cc 2127] Domain Stop interrupt detected, domain XXX 
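
As a rough illustration of how such messages could be picked out of the
platform log, the following Python sketch matches lines shaped like the
sample above.  The pattern, the function name find_dstops, and the
assumption that each message occupies a single line in the log (the
sample above is wrapped for display) are illustrative only and are not
part of SMS.

```python
import re

# Pattern modeled on the single sample message in this FIN (assumption:
# other hwad messages and other SMS releases may be formatted differently).
DSTOP_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) hwad\[\d+\]: "
    r".*Domain Stop interrupt detected, domain (?P<domain>\S+)"
)


def find_dstops(lines):
    """Return (timestamp, host, domain) for each Dstop interrupt message."""
    hits = []
    for line in lines:
        match = DSTOP_RE.search(line)
        if match:
            hits.append(
                (match.group("stamp"), match.group("host"), match.group("domain"))
            )
    return hits


# The sample message from this FIN, joined onto one line.
SAMPLE = (
    "Jan 17 20:25:55 2002 swmtft901 hwad[22514]: [1156 1693005732870614 ERR "
    "InterruptHandler.cc 2127] Domain Stop interrupt detected, domain XXX"
)

print(find_dstops([SAMPLE]))
```

To scan a real log, the lines of a copy of the platform messages file
could be passed to find_dstops in place of the sample.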
               
SMS then creates a Dstop dump file in /var/opt/SUNWSMS/adm/[XXX]/dump.
The file name has the form dsmd.dstop.YYMMDD.hhmm.ss
(dsmd.dstop.020117.2025.55 for this example).  If this dump file is
opened with "redx" and the "wfail" command is issued, the output below
is reported.  For example:

        sc% redx -cl
        redx> dumpf load dsmd.dstop.020117.2025.55
        redx> wfail
        ...output below...
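
The dump file naming pattern can be worked through with a short Python
sketch.  It assumes that the dump file carries the same timestamp as the
platform log message, as it does in this example; on a real system the
two may differ by a few seconds.  The function name dstop_dump_name is
hypothetical.

```python
from datetime import datetime


def dstop_dump_name(log_stamp):
    """Build a dsmd.dstop.YYMMDD.hhmm.ss name from a platform-log timestamp.

    Assumes the dump timestamp equals the log timestamp, which holds for
    the example in this FIN but is not guaranteed in general.
    """
    when = datetime.strptime(log_stamp, "%b %d %H:%M:%S %Y")
    return when.strftime("dsmd.dstop.%y%m%d.%H%M.%S")


print(dstop_dump_name("Jan 17 20:25:55 2002"))  # dsmd.dstop.020117.2025.55
```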

The Dstop signature of this issue is as follows: 

        SDI EX03/S0  Master_Stop_Status0[31:0] = 7004004F
              MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
        SDI EX03/S0  Dstop0[31:0] = 12018200
              Dstop0[16]: D    DARB texp requests all Dstop (M)   
              Dstop0[25]: D 1E AXQ requests all Dstop (M)
              Dstop0[28]: D    Slot0 asserted Error, enabled to cause Dstop (M)
        AXQ EX03 ( 3) Error_Flag_02[31:0] = 04008400  Mask = 0000FFFF
              Err2[26]: D 1E AMX 0-3 hs flow control didn't arrive simultaneously
        FAIL EXB EX3:  Dstop/Rstop detected by AXQ.
        Primary service FRU is EXB EX3.
        SDI EX04/S0  Master_Stop_Status0[31:0] = 0004000F
              MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
        SDI EX04/S0  Dstop0[31:0] = 02018200
              Dstop0[16]: D    DARB texp requests all Dstop (M)
              Dstop0[25]: D 1E AXQ requests all Dstop (M)
        AXQ EX04 ( 4) Error_Flag_03[31:0] = 30009000  Mask = 21005EFF
              Err3[28]: D 1E AMX data ECC uncorrectable error            
              Err3[29]: R    AMX data ECC correctable error      
        FAIL EXB EX4:  Dstop/Rstop detected by AXQ.
        Primary service FRU is EXB EX4.
        
The AMX flow control error shown above is the key message.  The system
will recover automatically via ASR (automatic system recovery).  After
recording the Dstop information, SMS restarts the domain(s).        
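
The register values in the signature above can be cross-checked against
the bit numbers that wfail prints.  The Python sketch below lists the
bits set in a 32-bit value and tests for bit 26 of Error_Flag_02, the
AMX flow control error.  The helper names are hypothetical.  Some set
bits are not listed by wfail; this FIN does not describe them or the
Mask semantics, so the sketch makes no attempt to interpret them.

```python
def set_bits(value):
    """Return the positions of the bits set in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


# Register values copied from the EX03 portion of the signature above.
DSTOP0_EX03 = 0x12018200
ERROR_FLAG_02_EX03 = 0x04008400

# Err2[26]: "AMX 0-3 hs flow control didn't arrive simultaneously"
AMX_FLOW_CONTROL_BIT = 26


def has_amx_flow_control_error(error_flag_02):
    """True if bit 26 is set in an AXQ Error_Flag_02 value."""
    return bool((error_flag_02 >> AMX_FLOW_CONTROL_BIT) & 1)


print(set_bits(DSTOP0_EX03))         # [9, 15, 16, 25, 28]
print(set_bits(ERROR_FLAG_02_EX03))  # [10, 15, 26]
print(has_amx_flow_control_error(ERROR_FLAG_02_EX03))  # True
```

Bits 16, 25, and 28 of Dstop0 and bit 26 of Error_Flag_02 match the
lines wfail reports for EX03.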
        
Any SMS 1.1 installations without patch 112080 or later installed
are susceptible to this problem.  SMS 1.2 and higher are not affected 
by this issue.

The true cause of the problem is the AMX ASIC, which doesn't handle
port resets correctly.  The bug fix changes how POST performs the
reset to ensure it's done safely.

A Dstop, or Domain Stop, occurs when the hardware detects an
unrecoverable error.  The ASICs in the system cease processing
transactions as quickly as possible to prevent further corruption of
data and to facilitate debugging.  With this bug, a Dstop also occurs
during the centerplane reset of ports, because the AMX has a problem
when the reset of ports is not done under domain sync.  Changing the
reset so that it is done under domain sync causes the problem to go
away.

 
IMPLEMENTATION: 
 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        | X |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---
         

CORRECTIVE ACTION:

An Authorized Enterprise Services Field Representative may avoid the
above-mentioned problem by following the recommendations shown below.

For a permanent fix, install SMS 1.1 Patch 112080 or later, or upgrade
to SMS 1.2.  This patch is specifically for SMS 1.1 and is not tied to
any one particular Solaris OS release.
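
One way to confirm whether the patch is present is to inspect the
output of the Solaris "showrev -p" command captured on the system
controller.  The Python sketch below assumes that each installed patch
appears on a line beginning with "Patch: NNNNNN-RR"; that format and the
function name installed_revision are assumptions and should be checked
against actual output before relying on the result.

```python
import re

# Assumed line format of "showrev -p" output: "Patch: 112080-03 Obsoletes: ..."
PATCH_RE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b", re.MULTILINE)


def installed_revision(showrev_output, patch_id="112080"):
    """Return the highest installed revision of patch_id, or None if absent."""
    revisions = [
        int(rev)
        for pid, rev in PATCH_RE.findall(showrev_output)
        if pid == patch_id
    ]
    return max(revisions) if revisions else None
```

A result of None on an SMS 1.1 system would suggest that the system is
still susceptible to the problem described in this FIN.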

COMMENTS:

None 

=========================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PRO-ACTIVE FINs, Enterprise Services mission critical
     support teams will recommend implementation of the FIN (to their
     respective accounts), at the convenience of the customer.

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
-------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Supporting Documents:
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.