Document fins/I0822-1


FIN #: I0822-1

SYNOPSIS: Sun Fire 15K domains containing Expanders with hardware dash revision
          16 or less may dstop due to bug 4505200

DATE: Jun/05/02

KEYWORDS: Sun Fire 15K domains containing Expanders with hardware dash revision
          16 or less may dstop due to bug 4505200


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: Sun Fire 15K domains containing Expanders with hardware
          dash revision 16 or less may dstop due to bug 4505200.
      

Sun Alert:          No

TOP FIN/FCO REPORT: Yes 
 
PRODUCT_REFERENCE:  Sun Fire 15K
 
PRODUCT CATEGORY:   Server / Service


PRODUCTS AFFECTED:
  
Systems affected:
-----------------  
Mkt_ID   Platform   Model   Description      Serial Number
------   --------   -----   -----------      -------------
  -        F15K      ALL    Sun Fire 15K           -


X-Options affected:
-------------------
Mkt_ID   Platform   Model   Description                   Serial Number
------   --------   -----   -----------                   -------------
X6727A      -         -     PCI Dual FC Network Adapter+        -
X6799A      -         -     PCI Single FC Host Adapter          -
X6767A      -         -     PCI 2G Single FC Host Adapter       -
X6768A      -         -     PCI 2G Dual FC Network Adapter+     -
X1074A      -         -     PCI Cluster SCI Adapter             -
  

PART NUMBERS AFFECTED: 

Part Number            Description        Model
-----------            -----------        -----
501-5179-16 or lower   Expander Board       -


REFERENCES:

BugId: 4505200 - Limit PIOs in Schizo nexus on Starcat.

FCO:   A0192-1 
       A0193-1

ESC:   536433 - Customer gets DStops during benchmark testing with very
                high IO load.
       536089 - Two domains had `Command reissue timeouts` within 12 
                hours on same platform.

     
PROBLEM DESCRIPTION:

Sun has identified a bug in revision 6.0 of the AXQ ASIC located on
the Expander board.  All Sun Fire 15K systems shipped prior to April 1,
2002 can experience a dstop failure.

All Sun Fire 15K servers shipped since April 1, 2002 are configured
with the new 6.1 AXQ ASIC and are not impacted by this bug.  This bug
has no impact on any Sun Fire 12K system.

The configurations that are more susceptible to the bug are larger I/O
configurations with streaming I/O mode turned on.  The bug is also more
readily hit when stressing I/O paths.  The bug has been seen in
customer configurations using:

     X6727A - PCI Dual FC Network Adapter+   
     X6799A - PCI Single FC Host Adapter    
     X6767A - PCI 2G Single FC Host Adapter
     X6768A - PCI 2G Dual FC Network Adapter+
     X1074A - PCI Cluster SCI Adapter

NOTE: The bug is not limited to configurations using only these
      adapters.

Expanders with a hardware dash revision less than 17 are subject to the
bug.  The 'xcrev-all' tool available on the CPRE website can assist the
field in determining the dash revision of Expanders present in the
system.  It is available at the following URL:

   http://cpre-amer.west/esg/hsg/starcat/tools/xcrev.html
    
An example run showing eight problematic Expanders is included below.
Use the ECO dash revision, if present, to identify down-revision
Expanders.  If no ECO dash revision is present, use the manufacturer
dash level instead.

xc46-sc0:sms-svc:12> ./xcrev-all ex

                         Manuf       ECO
   Location  Part Num  Dash  Rev  Dash  Rev  Serial No.
   --------  --------  ---------  ---------  ----------
   EX0       501-5179   -11  -01   -16  -51  447984
   EX1       501-5179   -11  -01   -16  -51  448950
   EX2       501-5179   -11  -01   -16  -51  447855
   EX3       501-5179   -11  -01   -16  -51  448945   
   EX4       501-5179   -11  -01   -16  -51  448946
   EX5       501-5179   -09  -04   --   --   398961
   EX6       501-5179   -09  -04   --   --   396041
   EX7       501-5179   -09  -04   --   --   394338
   EX8       Not present/access error.
   EX9       Not present/access error.
   EX10      Not present/access error.
   EX11      Not present/access error. 
   EX12      Not present/access error.
   EX13      Not present/access error.
   EX14      Not present/access error.
   EX15      Not present/access error.
   EX16      Not present/access error.
   EX17      Not present/access error.

NOTE: This example is from a lab machine with only 8 Expanders present.
All F15K customer systems include 18 Expanders.
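
The selection rule described above (ECO dash if present, otherwise the
manufacturer dash, flagged when below 17) can be sketched in Python.
This is an illustrative helper only and is not part of xcrev-all.  The
function name and the column layout it expects are assumptions taken
from the example run above.

```python
import re

# Dash levels of 17 or higher carry the corrected AXQ, per this FIN.
FIXED_DASH = 17

# One populated row (layout assumed from the example run):
# location, part number, manuf dash, manuf rev, ECO dash, ECO rev, serial.
ROW = re.compile(
    r"^\s*(EX\d+)\s+(\d{3}-\d{4})\s+(-\S+)\s+(-\S+)\s+(-\S+)\s+(-\S+)\s+(\S+)\s*$"
)

def down_rev_expanders(xcrev_text):
    """Return [(location, effective_dash)] for Expanders below FIXED_DASH."""
    hits = []
    for line in xcrev_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, rule, or "Not present/access error." row
        loc, _part, manuf_dash, _mrev, eco_dash, _erev, _serial = m.groups()
        # "--" means no ECO record, so fall back to the manufacturer dash.
        chosen = manuf_dash if eco_dash == "--" else eco_dash
        dash = int(chosen.lstrip("-"))
        if dash < FIXED_DASH:
            hits.append((loc, dash))
    return hits
```

Feeding the function the captured text of an 'xcrev-all ex' run returns
one entry per Expander that needs replacement.  Rows reported as not
present are skipped rather than treated as errors.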

The xcrev-all tool uses the 'prtfru' command to generate its output.
'prtfru' may also be directly used to verify presence of problematic
Expanders.  An example 'prtfru' run is included below.  This example shows 
Expander 0 with a hardware dash revision of "16".

   xc46-sc0:sms-svc:5> prtfru "/frutree/chassis/CP/ex0?Label=ex0/EXB" | \
                       egrep "ManR|ECO_CurrentR"
   /ECO_CurrentR
   /ECO_CurrentR/UNIX_Timestamp32: Sat Nov  3 16:56:06 PST 2001
   /ECO_CurrentR/Firmware_Revision: 
   /ECO_CurrentR/Hardware_Revision: 51
   /ECO_CurrentR/HW_Dash_Level: 16
   /ManR
   /ManR/UNIX_Timestamp32: Mon Aug 13 14:18:30 PDT 2001
   /ManR/Fru_Description: ASSY,ECB,ELEC,SYS EXP,STARCAT
   /ManR/Manufacture_Loc: ENDICOTT NY USA
   /ManR/Sun_Part_No: 5015179
   /ManR/Sun_Serial_No: 447984
   /ManR/Vendor_Name: IBM
   /ManR/Initial_HW_Dash_Level: 11
   /ManR/Initial_HW_Rev_Level: 01
   /ManR/Fru_Shortname: EXB
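
The same check can be applied to captured 'prtfru' text.  The sketch
below is a hypothetical helper, not a Sun tool.  It assumes the
"key: value" layout shown in the example above and prefers
/ECO_CurrentR/HW_Dash_Level over /ManR/Initial_HW_Dash_Level.

```python
def effective_dash_level(prtfru_text):
    """Return the ECO dash level if recorded, else the initial one, else None."""
    fields = {}
    for line in prtfru_text.splitlines():
        # Split "path: value"; lines without a value are ignored.
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value.strip()
    for key in ("/ECO_CurrentR/HW_Dash_Level", "/ManR/Initial_HW_Dash_Level"):
        if fields.get(key):
            return int(fields[key])
    return None

def is_down_rev(prtfru_text, fixed_dash=17):
    """True when the Expander's effective dash level is below the fixed level."""
    dash = effective_dash_level(prtfru_text)
    return dash is not None and dash < fixed_dash
```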

Customers who hit bug 4505200 will experience a domain stop (i.e.,
dstop).  This can be identified by the following example error messages
logged to the domain message log (/var/opt/SUNWSMS/adm/[A-R]/messages).  
Customers will also note that the domain has crashed.

=======================================================================
Mar 27 14:57:36 2002 sc0 dsmd[1246]-B(): [2516 55278088712478 ERR 
EventHandler.cc 126] Domain stop has been detected in domain B
Mar 27 14:57:36 2002 sc0 dsmd[1246]-B(): [2525 55278436694370 NOTICE 
SysControl.cc 2149] Taking hardware configuration dump. Dump file: 
-D/var/opt/SUNWSMS/SMS1.2/adm/B/dump/dsmd.dstop.020327.1457.36
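
To find which domains have logged a domain stop, the message text shown
above can be searched for.  The following is a minimal sketch, assuming
the dsmd wording in the example is stable across log entries.  The
function name is hypothetical, and the pattern tolerates the line
wrapping used in this document.

```python
import re

# Matches the dsmd message from the example log; \s+ tolerates wrapped lines.
DSTOP = re.compile(
    r"Domain\s+stop\s+has\s+been\s+detected\s+in\s+domain\s+([A-R])\b"
)

def dstopped_domains(log_text):
    """Return the domain letters, in order of appearance, with a logged dstop."""
    return DSTOP.findall(log_text)
```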

NOTE: redx should be used ONLY by Sun personnel and not by customers.

Sun personnel can use the SMS redx command to examine the resulting
dstop dump file and match known failure signatures.  Example wfail
output is provided for signatures matched at existing customer sites.
The examples show both an SMS 1.1 dump file and an SMS 1.2 dump file.
One primary condition, a reissue timeout, is treated differently in
SMS 1.2 than in SMS 1.1.

The primary condition may also be manifested as "SDI Command pool
timeouts" and "AXQ WATRANSID deallocation timeouts".

First verify existence of AXQ revisions that exhibit the bug.  The redx
'shaxq' command will display the AXQ revision.  Below is example output
showing an AXQ revision 6.0.  Any AXQ with revision less than 6.1 can
hit bug 4505200.

   redx> shaxq 0
   AXQ  EX0 (0)   Component ID = C4312049   Rev 6.0  <======AXQ revision
        ExpID[4:0] = 00
        Config0[31:0] = 1B2808F9
        Config1[31:0] = 00241BC0
        Timeout_Conf 1[19:0] = 7BDEF  0[31:0] = 1E07BE0F
        Sec_Config[22:0] = 000000
        Csr0_status[4:0] = 00
        ID_Mask[31:0] = 00000000  Home_Mask[31:0] = 00000000
        Flow_Ctl_Config[28:0] = 00CF0888
        Config6[31:0] = 00000000
        Config4[31:0] = 00400000
        Slot0_Domain_Mask[17:0]: Slot1 = 00000  Slot0 = 00000 Where Slot SB0
        Slot0_DomInt_Mask[17:0]: Slot1 = 00000  Slot0 = 00000     can send.
        Slot1_Domain_Mask[17:0]: Slot1 = 00001  Slot0 = 00002 Where Slot IO0
        Slot1_DomInt_Mask[17:0]: Slot1 = 00000  Slot0 = 00002     can send.
              Error_Flag_00[31:0] = 00000000  Mask = FFFFFFFF
              Error_Flag_01[31:0] = 00000000  Mask = 4000FFFF
              Error_Flag_02[31:0] = 00000000  Mask = 0000FFFF
              Error_Flag_03[31:0] = 00000000  Mask = 65005EFF
              Error_Flag_04[31:0] = 00000000  Mask = 01FEFFFF
              Error_Flag_05[31:0] = 00000000  Mask = 1024FFFF
              Error_Flag_06[31:0] = 00000000  Mask = 7E3DFFFF
              Error_Flag_07[31:0] = 00000000  Mask = 63FFFFFF
              Error_Flag_08[31:0] = 00000000  Mask = FFFFFFFF
              Error_Flag_09[31:0] = 00000000  Mask = 7E00FFFF
              Error_Flag_10[31:0] = 00000000  Mask = FFFFFFFF
              Error_Flag_11[31:0] = 00000000  Mask = 7FF0FFFF
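
The revision check can be expressed as a short sketch over saved
'shaxq' text.  It assumes the header line format shown above
("AXQ  EXn (n) ... Rev M.m"); the function name is hypothetical and the
code is not part of redx.

```python
import re

# Header line of each shaxq report, e.g. "AXQ  EX0 (0) ... Rev 6.0".
AXQ_LINE = re.compile(r"^\s*AXQ\s+(EX\d+)\s.*?\bRev\s+(\d+)\.(\d+)", re.M)

def affected_axqs(shaxq_text, fixed=(6, 1)):
    """Return [(location, revision)] for AXQs older than the fixed revision."""
    out = []
    for loc, major, minor in AXQ_LINE.findall(shaxq_text):
        # Compare as (major, minor) integers rather than as strings.
        if (int(major), int(minor)) < fixed:
            out.append((loc, f"{major}.{minor}"))
    return out
```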

The first evidence of the bug is an AXQ "Timeout on command reissue
transaction" error.  The field should look for its occurrence in redx
'wfail' output.  Under SMS 1.2 this is a Dstop condition and is noted
by the 'D' in the following example error line:

   AXQ EX10 (10) Error_Flag_01[31:0] = 00048004  Mask = 4000FFFF
        Err1[18]: D 1E Timeout on command reissue transaction to Slot1
                  ^

Under SMS 1.1 the error is an Rstop condition as shown by the 'R' in this
example:

   AXQ EX11 (11) Error_Flag_01[31:0] = 00048004  Mask = 40047FFB
        Err1[18]: R 1E Timeout on command reissue transaction to Slot1
                  ^

Example wfail output for SMS 1.2 showing bug 4505200 failure signature:
=======================================================================

redxl> dumpf load dsmd.dstop.020327.1457.36
Created Wed Mar 27 14:57:38 2002
By hpost v. 1.2 Generic 112488 Feb 15 2002 13:40:50  executing as pid=23247
On ssc name =  sc0.
Domain =  1=B    Platform = domain
Boards in dump: master SC    CPs/CSBs[1:0]: 3
          EXB[17:0]: 07E00
        Slot0[17:0]: 07E00
        Slot1[17:0]: 00E00
-D option, -d
"DSMD DomainStop Dump"
0 errors occurred while creating this dump.
redxl> wfail
SDI EX09/S0: All SDI is DStopped and RStopped,         requested by DARB.
SDI EX10/S0  Master_Stop_Status0[31:0] = 9004004F
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX10/S0  Dstop0[31:0] = 12018200
        Dstop0[16]: D    DARB texp requests all Dstop (M)
        Dstop0[25]: D 1E AXQ requests all Dstop (M)
        Dstop0[28]: D    Slot0 asserted Error, enabled to cause Dstop (M)
AXQ EX10 (10) Error_Flag_00[31:0] = 00048004  Mask = 0000FFFF
        Err0[18]: D 1E Timeout on command reissue transaction to Slot0
AXQ EX10 (10) Error_Flag_01[31:0] = 00048004  Mask = 4000FFFF
        Err1[18]: D 1E Timeout on command reissue transaction to Slot1
FAIL Slot SB10:  Dstop/Rstop detected by AXQ.
The FRU for this failure cannot be identified from the available information.
        This error is not diagnosable. The FAIL action is just a guess to
        satisfy the POST design requirement that something must be
        deconfigured after a stop to guarantee that the process terminates.
        The FAILed component is no more suspect than any other hardware
        in the domain.
SDI EX11/S0: All SDI is DStopped and RStopped,         requested by DARB.
SDI EX12/S0: All SDI is DStopped and RStopped,         requested by DARB.
SDI EX13/S0: All SDI is DStopped and RStopped,         requested by DARB.
SDI EX14/S0: All SDI is DStopped and RStopped,         requested by DARB.
DARB C0: enabled ports (expanders)          [17:0]: 07E3F
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00400
DARB C1: enabled ports (expanders)          [17:0]: 07E3F
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00400
redxl> 


Example wfail output for SMS 1.1 showing bug 4505200 failure signature:
=======================================================================

redxl> dumpf load dsmd.dstop.020325.1220.39
Created Mon Mar 25 12:20:39 2002
By hpost v. 1.1 Generic 112080 Feb  5 2002 11:16:14  executing as pid=19231
On ssc name =  sc0.
Domain =  1=B    Platform = domain
Boards in dump: master SC    CPs/CSBs[1:0]: 3
          EXB[17:0]: 07E00
        Slot0[17:0]: 07E00
        Slot1[17:0]: 00E00
-D option, -d
"DSMD DomainStop Dump"
0 errors occurred while creating this dump.
redxl> wfail
SDI EX09/S0: All SDI is DStopped and RStopped,         requested by DARB.
SDI EX09/S0  Core_Error0[31:0]  = 02008200  Mask = 0051FFFF
        CoreErr0[25]: D 1E Command pool timeout, non-split exp (M)
            valid_{slot_wr[1:0],read}_TO = 1 (rev 4+)
            {cmd_pool_loc[5:0],cmd4io,retired,half_used} = 020
SDI EX10/S0: All SDI is DStopped and RStopped,         requested by DARB.
SDI EX11/S0  Master_Stop_Status0[31:0] = 100400CF
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX11/S0  Dstop0[31:0] = 35018400
        Dstop0[16]: D    DARB texp requests all Dstop (M)
        Dstop0[24]: D    SDI internal Slot1 port requested Dstop
        Dstop0[26]: D 1E AXQ requests Slot0 Dstop (M)
        Dstop0[28]: D    Slot0 asserted Error, enabled to cause Dstop (M)
        Dstop0[29]: D    Slot1 asserted Error, enabled to cause Dstop (M)
SDI EX11/S0  Recordstop0[31:0]  = 00818080
        Rstop0[16]: R    DARB texp request Recordstop (M)
        Rstop0[23]: R 1E AXQ requests all Recordstop (M)
SDI EX11/S0  Slot1_Error2[31:0] = 00028002  Mask = 7FFCFFFF
        S1Err2[17]: D 1E Slot1 command pool timeout (M)
            valid_{slot_wr[1:0],read}_TO = 4 (rev 4+)
            {cmd_pool_loc[5:0],cmd4io,retired,half_used} = 054
AXQ EX11 (11) Error_Flag_00[31:0] = 40048004  Mask = 00047FFB
        Err0[18]: R 1E Timeout on command reissue transaction to Slot0
        Err0[30]: D    Home lock timeout for Slot0
AXQ EX11 (11) Error_Flag_01[31:0] = 00048004  Mask = 40047FFB
        Err1[18]: R 1E Timeout on command reissue transaction to Slot1
FAIL Slot SB11:  Dstop/Rstop detected by AXQ.
The FRU for this failure cannot be identified from the available information.
        This error is not diagnosable. The FAIL action is just a guess to
        satisfy the POST design requirement that something must be
        deconfigured after a stop to guarantee that the process terminates.
        The FAILed component is no more suspect than any other hardware
        in the domain.
SDI EX12/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
SDI EX13/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
SDI EX14/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
DARB C0: enabled ports (expanders)          [17:0]: 07E3F
DARB C0: exps request Dstop+Rstop           [17:0]: 00800
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00800
DARB C1: enabled ports (expanders)          [17:0]: 07E3F
DARB C1: exps request Dstop+Rstop           [17:0]: 00800
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00800
redxl> 

Next, the bug is characterized by one or both of the following Safari
timeout errors on an ioc present in the domain:

       Safari_Err_Int_Enbl[63:0] = 80000000 00000017
        ErrLog[ 7]: ErrOut  Timeout on head of SFP queue
        ErrLog[ 9]: ErrOut  Timeout on head of CI queue
        ErrLog[63]:         Error Out asserted (S_ERROR_L pin)
        
A 'shioc' example with only the "Timeout on head of SFP queue" error is
shown below:

redxl> shioc 9 1 1

Note: Data is displayed from the currently loaded dump file.
PCI IOC IO09/P1   Component ID = 1824C06D    TO_2.2
       Safari_Control[57:0] = 1555555 13D00007
           1   SSM_Mode               SafCtl[0]      
           1   HierBusMode            SafCtl[1]      
           1   SlowSnoop              SafCtl[2]      
        0x1D   AgentId[4:0]           SafCtl[24:20]  
        0x09   ExpanderId[4:0]        SafCtl[29:25]  
           1   SafTimeout[1:0]        SafCtl[33:32]  2**28 cycles (norm)
    0x555555   DTL_Mode[23:0]         SafCtl[57:34]  
       Safari_Bus_Enbl[3:0] = 0000
           0   SafariDbgEnbl          BusEn[0]       
           0   UPA_DbgEnbl            BusEn[1]       
           0   SoftReset_l            BusEn[2]       
           0   Soft_PwrOK             BusEn[3]       
       Safari_Pause = 0
       PCI_Arb_Halt = 0
       Safari_Err_Log[63:0]      = 80000000 00000080
       Safari_Err_Enbl[63:0]     = FC000000 000003E0
       Safari_Err_Int_Enbl[63:0] = 80000000 00000017
        ErrLog[ 7]: ErrOut  Timeout on head of SFP queue
        ErrLog[63]:         Error Out asserted (S_ERROR_L pin)

redxl> 

It is necessary to step through and display 'shioc' output for each
ioc present in the domain.  Each hsPCI board has two iocs, so two
'shioc' commands are required per board.  For the hsPCI in Expander 9,
for example, run 'shioc 9 1 0' and 'shioc 9 1 1'.  Display each ioc in
turn until either an SFP or a CI queue timeout error is found.
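
The list of 'shioc' commands can be generated rather than typed by
hand.  The sketch below rests on several assumptions that are not
stated in this FIN: that bit n of the "Slot1[17:0]" mask in the dump
header marks a Slot1 board in Expander n, that each such board is an
hsPCI, and that the middle 'shioc' argument selects Slot1 as in the
Expander 9 example.  Verify these against the actual domain before
relying on the output.

```python
def shioc_commands(slot1_mask_hex):
    """Return the 'shioc' commands for both iocs on each Slot1 board.

    slot1_mask_hex is the hex string from the dump header, e.g. "00E00".
    ASSUMPTION: bit n set means Expander n has a Slot1 (hsPCI) board.
    """
    mask = int(slot1_mask_hex, 16)
    cmds = []
    for exp in range(18):  # a Sun Fire 15K has Expanders 0 through 17
        if mask & (1 << exp):
            for ioc in (0, 1):  # two iocs per hsPCI board
                cmds.append(f"shioc {exp} 1 {ioc}")
    return cmds
```

With the mask "00E00" from the example dumps, this yields the six
commands for Expanders 9, 10, and 11.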

If both of the following conditions are found, the failure signature
has been matched:

  1.) AXQ command reissue timeout

  2.) ioc SFP/CI queue timeout
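
The two-condition test can be sketched as a text match over saved
'wfail' and 'shioc' output.  This is an illustration only: the function
name is hypothetical, the patterns are taken from the example output in
this FIN, and a match does not replace the CPRE review described below.

```python
import re

# Condition 1: AXQ reissue timeout, flagged D (SMS 1.2) or R (SMS 1.1).
REISSUE = re.compile(
    r"Err\d\[18\]:\s+[DR]\s.*Timeout on command reissue transaction"
)
# Condition 2: SFP (bit 7) or CI (bit 9) queue timeout logged on an ioc.
QUEUE_TO = re.compile(
    r"ErrLog\[\s*(?:7|9)\]:.*Timeout on head of (?:SFP|CI) queue"
)

def signature_matched(wfail_text, shioc_texts):
    """True only when both conditions of the failure signature are present."""
    has_reissue = bool(REISSUE.search(wfail_text))
    has_queue_to = any(QUEUE_TO.search(text) for text in shioc_texts)
    return has_reissue and has_queue_to
```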

Definitive diagnosis of Bug 4505200 requires special postrc
directives.  If the above failure signature is detected, the field
should escalate through the regular CPRE escalation path for further
review.
   
Engineering has released revision 6.1 of the AXQ ASIC which corrects
bug 4505200.  Expanders with hardware dash revision 17, or greater,
include an AXQ ASIC that fixes the bug.

The probability of a customer experiencing this bug is very low.  Sun
has created a fix for this bug and is in the process of implementing it
in the field at no cost to the customer.

Mandatory FCO A0192-1 will be released on or before 05/15/02.  This will
require all down-revision Expanders (part 501-5179) to be replaced.  Prior
to the release of the FCO, customers who match the failure signature of
bug 4505200 should have their case escalated to CPRE.  CPRE will verify
the diagnosis of bug 4505200 and will submit the confirmation to the GEO
VPs for expedited approval of parts.

Please note that Sun recommends implementing FCO A0193-1 (Schizo 2.2
ASIC based hsPCI boards) and FCO A0192-1 at the same time to minimize
customer disruption.


IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        | X |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.

If the above failure is detected, please escalate to HES CPRE and
provide the following data/information:

   1. Explorer output from the System Controller.
   
   2. What is the system configuration (software packages, I/O types and
      number, etc.)?
  
   3. What is the current urgency of your particular customer?

   4. How often have they experienced dstops since initially reported?

   5. Is the customer willing to replace hardware? 

      For procuring replacement parts, the following is needed:
 
      A. Shipping address and contact information for customer location.

      B. Quantity of EXPANDER boards (501-5179-17) required at the 
         location.

      C. A requested ship date for the location.
  

Relief from bug 4505200 is available through two workarounds that may
be used until replacement Expanders are available.

1. Reconfiguring Domain Hardware

   The first workaround requires reconfiguring the domain board
   layout.  The domain must be reconfigured such that system boards and
   I/O boards are on distinct expanders.  No expanders may contain
   system boards and I/O boards for the same domain.  This will require
   split slot configuration for multi-domain systems or physically
   splitting the expander for single smaller domains.  Example valid
   and invalid configurations are below.

       Valid Configuration:
           Domain A - SB0, IO1, SB2, IO3
           Domain B - IO0, SB1, SB3

       Invalid Configuration:
           Domain A - SB0, SB1, IO1       (NOTE: Expander 1 is shared)
           Domain B - IO0, SB2, SB3, IO3  (NOTE: Expander 3 is shared)
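
The layout rule (no Expander may hold both a system board and an I/O
board for the same domain) can be checked with a short sketch.  The
function name and the input format are hypothetical.  Board names are
assumed to follow the SBn/IOn form used in the examples above, where n
is the Expander number.

```python
import re
from collections import defaultdict

def shared_expanders(domains):
    """Map each offending domain to the Expanders it uses for both SB and IO.

    domains: dict of domain name -> list of board names, e.g.
             {"A": ["SB0", "IO1"], "B": ["IO0", "SB1"]}
    An empty result means the layout satisfies the workaround.
    """
    bad = {}
    for name, boards in domains.items():
        kinds = defaultdict(set)  # expander number -> {"SB", "IO"}
        for board in boards:
            m = re.fullmatch(r"(SB|IO)(\d+)", board)
            if not m:
                raise ValueError(f"unrecognized board name: {board!r}")
            kinds[int(m.group(2))].add(m.group(1))
        shared = sorted(exp for exp, k in kinds.items() if k == {"SB", "IO"})
        if shared:
            bad[name] = shared
    return bad
```

Applied to the two examples above, the valid configuration returns an
empty result, while the invalid one reports Expander 1 for Domain A and
Expander 3 for Domain B.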


2. Disabling Streaming I/O

   Streaming I/O can be disabled by placing the following two lines in the
   domain's /etc/system file and rebooting.

	   set pcisch:pci_stream_buf_enable=0
	   set pcisch:pci_stream_buf_exists=0
	
Customers may see degraded I/O performance after disabling streaming
I/O.  The degradation varies by application.  For example,
transactional processing environments will likely experience minimal
degradation, while database processing environments will likely
experience extreme degradation; in the latter case this workaround is
not recommended.
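
Before rebooting, it may help to confirm that both tunables are present
in the domain's /etc/system.  The sketch below is a hypothetical check,
not a Sun utility.  It assumes simple "set module:variable=value" lines
and treats lines beginning with '*' as comments.

```python
import re

# The two settings given in workaround 2, both set to 0.
REQUIRED = {
    "pcisch:pci_stream_buf_enable": 0,
    "pcisch:pci_stream_buf_exists": 0,
}

def missing_settings(etc_system_text):
    """Return the required tunables that are absent or not set to 0."""
    found = {}
    for line in etc_system_text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # blank line or /etc/system comment
        m = re.fullmatch(r"set\s+(\S+?)\s*=\s*(\S+)", line)
        if m:
            try:
                found[m.group(1)] = int(m.group(2), 0)
            except ValueError:
                pass  # non-numeric value; not one this check understands
    return sorted(k for k, v in REQUIRED.items() if found.get(k) != v)
```

An empty list means both lines are in place.  The settings take effect
only after the domain is rebooted.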


COMMENTS:  

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.