
FIN #: I0777-1

SYNOPSIS: In some configurations, after a FC loop disruption the SOC+ HBA
          intermittently takes the FC loop down

DATE: Feb/25/02

KEYWORDS: In some configurations, after a FC loop disruption the SOC+ HBA
          intermittently takes the FC loop down


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: In some configurations, after a FC loop disruption the SOC+ 
          HBA intermittently takes the FC loop down.


Sun Alert:          No

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  SOC+ Host Bus Adapter  
 
PRODUCT CATEGORY:   Storage / Service


PRODUCTS AFFECTED:

Systems Affected
---------------- 
Mkt_ID    Platform    Model    Description                 Serial Number
------    --------    -----    -----------                 -------------
  -       Anysys        -      System Platform Independent       -

  
X-Options Affected
------------------  
Mkt_ID      Platform   Model   Description                   Serial Number
------      --------   -----   -----------                   -------------
  -         A5000       ALL    StorEdge A5X00                      -
  -         A3500       ALL    StorEdge A3500                      -
  -         T3          ALL    StorEdge T3                         -
X6538A         -         -     X-OPT A3500FC CONTROLLER            -
X2611A         -         -     OPT INT I/O BD FOR EXX00            -
X2612A         -         -     OPT INT I/O BD EXX00 W/FC-AL        -
X2622A         -         -     OPT INT GRAPHICS I/O BD EXX00       -
X6730A         -         -     SBus/FC HA w/GBIC                   -
X6757A         -         -     SBus Dual FC Network Adapter        -


PART NUMBERS AFFECTED: 

Part Number      Description                          Model
-----------      -----------                          -----
375-3048-01      SBus Dual FC Network Adapter           -
540-4026-01      A3500FC FC-AL Array Ctrlr w/Memory     -          
501-4266-08      SBus I/O Board with SOC+               -
501-4833-01      I/O Board with SOC+                    -
501-4884-07      Graphics I/O Board with SOC+           -
501-5266-04      FC-AL SBus Card (FC100/S)              -
501-5202-03      FC-AL SBus Card (FC100/S)              -


REFERENCES:

BugId: 4525143 - SOC+ SAN: SOC+ & FC switch or hub - LIP on one port kills 
                 I/O on both ports.
       4479045 - SOC+ & FC-AL hub: storage not reliably seen after reboot 
                 or loop interruption.
  
     
PROBLEM DESCRIPTION:

In some configurations, after a Fibre Channel (FC) loop disruption such
as a system reboot or the plugging or unplugging of an FC cable at the
FC hub, the SOC+ HBA might intermittently take the FC loop down.  In
addition, a LIP (Loop Initialization Primitive) received on an idle
port of a SOC+ HBA can disrupt I/O activity on the second port of that
HBA.
 
Problem 1 - Bug 4479045

Bug 4479045 Problem Description:
---------------------------------
In some configurations, after an FC loop disruption (e.g., a system
reboot, or plugging or unplugging an FC cable at the FC hub), the SOC+
HBA intermittently takes the FC loop down.  When the FC loop connected
to a hub is idle, any loop interruption (e.g., rebooting one host, or
plugging or unplugging a host or storage device on the loop) can cause
the SOC+ chip on one of the HBAs to hang the entire loop.  This
prevents the host(s) that are still running from accessing the storage
on that loop.  The problem can also occur when there is I/O load on the
loop, but in that case the driver's error recovery procedure often
resets the affected SOC+ chip, causing it to run correctly again.  I/O
errors are sometimes, but not always, seen in this case.

Bug 4479045 Susceptible Configurations:
----------------------------------------
A Sun FC hub with 2 x host connections using SOC+ HBAs and storage can
be affected.  The type of storage appears to be irrelevant (problem has
been seen with T3x and A3500FC, but not yet with A5x00).  This is not
related to any cluster software so it can occur with Sun Cluster, VCS
or simply a loop with 2 hosts sharing storage using VxVM to manage data
ownership with manual failover (poor man's cluster).


Problem 2 - Bug 4525143

Bug 4525143 Problem Description:
--------------------------------
This problem is seen if there is heavy I/O load on one port of a SOC+
HBA, and a LIP (Loop Initialization Primitive) occurs on the other port
of the same SOC+ HBA while that second SOC+ port is idle.  The LIP
could be caused by plugging or unplugging a fibre cable, rebooting a
second host attached to the same FC switch or hub, or a marginal FC
loop component (cable, GBIC).  The SOC+ port on the loop which receives
the LIP does not initialize correctly, resulting in that port going
offline.  This means that it loses contact with the storage on that
port.  The I/Os via the other SOC+ port to the other loop sometimes
continue successfully, but sometimes they fail. The affected SOC+ port
will recover if all I/Os are stopped on the other SOC+ port on that
HBA.

Bug 4525143 Susceptible Configurations:
---------------------------------------
A SOC+ HBA where both ports are used.
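
The susceptible configurations for both bugs can be read as a short
checklist.  The Python sketch below is illustrative only: the LoopConfig
structure and its field names are hypothetical and do not correspond to
any Sun tool.  It encodes only the conditions stated in this FIN, and it
treats "2 x host connections" as "two or more", which is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoopConfig:
    """Hypothetical description of one SOC+ HBA and the loop it sits on."""
    uses_fc_hub: bool            # loop is built on a Sun FC hub, not a switch
    socplus_hosts_on_loop: int   # hosts attached to the loop via SOC+ HBAs
    has_storage: bool            # any storage on the loop (type seems irrelevant)
    hba_ports_in_use: int        # how many ports of this SOC+ HBA are connected


def exposed_bugs(cfg: LoopConfig) -> List[str]:
    """Return the bug IDs this configuration appears to be exposed to."""
    bugs = []
    # Bug 4479045: FC hub with two SOC+ host connections and storage.
    # Assumption: more than two hosts is treated the same as two.
    if cfg.uses_fc_hub and cfg.socplus_hosts_on_loop >= 2 and cfg.has_storage:
        bugs.append("4479045")
    # Bug 4525143: both ports of one SOC+ HBA are in use.
    if cfg.hba_ports_in_use >= 2:
        bugs.append("4525143")
    return bugs
```

For example, a hub-based loop with two SOC+ hosts, storage, and only one
HBA port in use would be flagged for Bug 4479045 only.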

There are currently several workarounds available for avoiding Bugs
4479045 and 4525143.  See the Corrective Action section below.  In
addition, Sun Engineering is developing a new HBA which, when used in
conjunction with Solaris 8 (or higher) and SunCluster 3.0, will
prevent the problems seen with Bug 4479045.
  

IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---



CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.

Problem 1 - Bug 4479045 Workarounds:
------------------------------------

NOTE: There are five workarounds for Bug 4479045.  Workarounds 1, 2,
      and 3 are proactive and may be implemented to prevent or reduce
      the frequency of future occurrences.  Workarounds 4 and 5 are
      reactive and may be used to recover a loop that has hung.

Proactive
---------
1. Use Sun FC switches instead of FC hubs.  However, if both ports on
   one SOC+ HBA are connected to (different) Sun FC switches, then the
   configuration is exposed to bug 4525143 (see below).

   NOTE: This requires an escalation to CPRE for technical review and a 
         CIC.

Proactive
---------
2. [Only helps with T3s]
   Use this device order in the FC hub: T3/host/T3/host.
   Do not use T3/T3/host/host or host/host/T3/T3.

This partial workaround relies on a side effect of the T3's error
recovery: the QLogic 2100 firmware currently used in the T3 firmware
resets the 2100 chip during error recovery.  If the resulting brief
loss-of-sync is 'seen' by a SOC+ HBA which has entered the faulty
state, the SOC+ HBA recovers.  See bug 4430163 for more details.  This
hub ordering is not expected to help with any other type of storage on
the loop, unless its error recovery behavior causes a similar
loss-of-sync.

Since it depends on which SOC+ HBA has entered the faulty state, this
change in hub order is not completely effective in preventing the
problem from occurring, but has been seen to reduce the incidence from
1 in 5 host reboots to 1 in 20 host reboots on a customer site.
Furthermore, there is no certainty that this T3 behavior will always
occur in the future since it is not required by the FC-AL spec.

For example:
Host/Host/T3/T3:        1 failure out of 5 tries   (80% reliable)
Host/T3/Host/T3:        1 failure out of 20 tries  (95% reliable)
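
The reliability percentages above follow directly from the failure
counts.  As a small worked example (the function name is made up for
illustration), the arithmetic is:

```python
def reliability_percent(failures: int, tries: int) -> float:
    """Percentage of tries (host reboots) that did not hang the loop."""
    if tries <= 0:
        raise ValueError("tries must be positive")
    return 100.0 * (tries - failures) / tries


# Figures reported from one customer site in this FIN.
print(reliability_percent(1, 5))    # Host/Host/T3/T3 -> 80.0
print(reliability_percent(1, 20))   # Host/T3/Host/T3 -> 95.0
```

These figures come from a single site, so they indicate the size of the
improvement rather than a guaranteed rate.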

NOTE: If this is not successful, open an escalation with CPRE for 
      technical review.

NOTE: If the configuration has 2 x hosts and 2 x T3s, consider this
      workaround.

NOTE: If there are more than 2 T3s, this workaround does not help.
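
The hub ordering rule for workaround 2 can be checked mechanically.  The
Python sketch below is a hypothetical helper, not a Sun utility.  It
assumes the hub port order is supplied as a list of the strings "T3" and
"host", and that the hub closes the loop so that the last port is a
neighbour of the first.

```python
from typing import Sequence


def alternates_t3_and_host(hub_order: Sequence[str]) -> bool:
    """True if no two neighbouring hub ports hold the same kind of device."""
    kinds = [kind.strip().lower() for kind in hub_order]
    if len(kinds) < 2:
        return False
    # Assumption: the loop wraps, so the last port neighbours the first.
    return all(
        kinds[i] != kinds[(i + 1) % len(kinds)] for i in range(len(kinds))
    )


def workaround_2_applies(hub_order: Sequence[str]) -> bool:
    """True only for the case the FIN names: 2 hosts and 2 T3s on the hub."""
    kinds = [kind.strip().lower() for kind in hub_order]
    return kinds.count("t3") == 2 and kinds.count("host") == 2


order = ["T3", "host", "T3", "host"]
print(workaround_2_applies(order) and alternates_t3_and_host(order))
```

With this helper, T3/host/T3/host passes, while T3/T3/host/host and
host/host/T3/T3 are rejected.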


Proactive
---------
3.  [Only helps with T3s]
    Use only 1 x T3 per FC hub.  This relies on the same T3 error recovery
    behavior as described in (2) above.  It "appears" to be completely
    effective in preventing this issue from being seen.

NOTE: If this is not successful, open an escalation with CPRE for 
      technical review.


Reactive
--------
4. Use a 'luxadm -e forcelip' to send a LIP to the affected loop, from
   all hosts on that loop.  This needs to be done from all hosts, since
   it *must* be done from the SOC+ HBA which has been affected, but
   it is not always possible to identify which HBA that is, without
   access to a FC analyzer.  Therefore, do it on all HBAs on the
   affected loop.  In short, this is a partial recovery method, but it
   does work sometimes.

NOTE: If one host sees the devices, the loop is functional and the
      devices will be seen on the other host as well, so there is no
      need to issue 'luxadm -e forcelip' from the other host.  After
      each 'luxadm -e forcelip', run the format command to check for
      the devices.  If the devices are seen, stop issuing 'forcelip',
      because the loop has recovered.

   
NOTE: If this is not successful, open an escalation with CPRE for 
      technical review.
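
The procedure in workaround 4 (force a LIP, check for devices, stop as
soon as they reappear) can be sketched as a loop.  The Python outline
below is a sketch under stated assumptions, not a supported tool.  The
function names are hypothetical, the HBA paths must be supplied by the
caller, and the device check is left to a caller-supplied function
because the FIN only says to run the format command.  The default
runner issues 'luxadm -e forcelip' on the local host only; reaching
other hosts is outside this sketch.

```python
import subprocess
from typing import Callable, Iterable


def run_forcelip(hba_path: str) -> None:
    """Issue 'luxadm -e forcelip' for one HBA port path on the local host."""
    subprocess.run(["luxadm", "-e", "forcelip", hba_path], check=False)


def recover_loop(
    hba_paths: Iterable[str],
    devices_visible: Callable[[], bool],
    forcelip: Callable[[str], None] = run_forcelip,
) -> bool:
    """Force a LIP from each HBA in turn; stop once devices are seen again.

    devices_visible is a hypothetical check supplied by the caller, for
    example an operator confirming that format lists the devices.
    Returns True if the loop recovered, or False if every HBA was tried
    without success, in which case the FIN says to escalate to CPRE.
    """
    for path in hba_paths:
        forcelip(path)
        if devices_visible():
            return True
    return False
```

Passing the forcelip step in as a parameter keeps the stop-on-recovery
logic separate from how the command is actually delivered to each host.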


Reactive
--------
5. One can achieve the same result by unplugging and replugging the
   fibre going to any GBIC on the hub connected to the hung loop.
   Repeat that process until the loop stabilizes.  Depending on which
   fibre goes to which device on the loop, you might have to unplug and
   replug more than 1 fibre to get the loop to recover.  There is also a
   risk that the customer will unplug things from the wrong loop if they
   have several in close proximity.

NOTE: This workaround requires physical access to the hub; it cannot be
      performed remotely.

NOTE: If this is not successful, open an escalation with CPRE for 
      technical review.


Problem 2 - Bug 4525143 Workaround
----------------------------------

In the case of Bug 4525143, please use the following workaround:

  . Only use 1 port on a SOC+ HBA.  

  . Do not connect anything to the second port on a SOC+ card.

    NOTE: If this is not successful, open an escalation with CPRE for 
          technical review.
                                    

COMMENTS:  

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
     support teams will recommend implementation of the FIN (to their
     respective accounts), at the convenience of the customer.

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.
 
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------