Document fins/I0552-1


FIN #: I0552-1

SYNOPSIS: SCSI devices (especially in a multi-hosted configuration) may go
          offline after isp errors.

DATE: Jan/24/00

KEYWORDS: SCSI devices (especially in a multi-hosted configuration) may go
          offline after isp errors.


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: SCSI devices (especially in a multi-hosted configuration) may
          go offline after isp errors.	  

	               
TOP FIN/FCO REPORT: Yes 
 
PRODUCT_REFERENCE:  isp driver bug
 
PRODUCT CATEGORY:   Storage / Sw Admin;   

PRODUCTS AFFECTED:  
  
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   -------------
Systems Affected
----------------

  -       A11        ALL    Ultra Enterprise 1              -
  -       A12        ALL    Ultra Enterprise 1E             -
  -       A14        ALL    Ultra Enterprise 2              -
  -       E3000      ALL    Ultra Enterprise 3000           -
  -       E3500      ALL    Ultra Enterprise 3500           -
  -       E4000      ALL    Ultra Enterprise 4000           -
  -       E4500      ALL    Ultra Enterprise 4500           -
  -       E5000      ALL    Ultra Enterprise 5000           -
  -       E5500      ALL    Ultra Enterprise 5500           -
  -       E6000      ALL    Ultra Enterprise 6000           -
  -       E6500      ALL    Ultra Enterprise 6500           -
  -       E10000     ALL    Ultra Enterprise 10000          -

(See Corrective Action)

X-Options Affected
------------------

  -       -          ALL    StorEdge UniPack                -
  -       -          ALL    StorEdge MultiPack              -
  -       -          ALL    StorEdge MultiPack2             -
  -       -          ALL    StorEdge A1000                  -
  -       -          ALL    Netra st A1000                  -
  -       -          ALL    StorEdge D1000                  -
  -       -          ALL    Netra st D1000                  -
  -       -          ALL    StorEdge A3500                  -
  -       -          ALL    StorEdge L280 tape library      -
  -       -          ALL    StorEdge L700 tape library      -
  -       -          ALL    StorEdge L1000 tape library     -
  -       -          ALL    StorEdge L1800 tape library     -
  -       -          ALL    StorEdge L3500 tape library     -
  -       -          ALL    StorEdge L11000 tape library    -




PART NUMBERS AFFECTED:

Part Number   Description                              Model
-----------   -----------                              -----
370-2443-0X   Differential Ultra/Wide SCSI (UDWIS/S)     -
370-1704-0X   Differential Fast/Wide SCSI (DWIS/S)       -
370-1703-0X   Single-Ended Fast/Wide SCSI (SWIS/S)       - 


REFERENCES:

BugId:    4280783
Esc:      523110 522262
FIN:      I0547-1
PatchId:  105600-XX (Solaris 2.6)
PatchId:  106924-XX (Solaris 7)


PROBLEM DESCRIPTION: 

When a SCSI bus reset is issued under heavy i/o, the isp driver can cause the
sd driver to report i/o errors.  Any configuration with an isp driver version
that predates the fix for bug 4280783 may be affected.

The configurations most likely to experience this problem are multi-hosted,
shared storage devices (e.g. A3x00 (SCSI version only), A1000, D1000)
connected to differential SCSI cards using the isp driver, or MultiPack,
UniPack etc. connected to single-ended SCSI cards using the isp driver.

However, since a SCSI bus reset can occur as part of the error recovery
controlled by the sd driver, this problem can occur under error conditions
even with SCSI devices connected just to a single HBA which uses the isp
driver.

Third party storage products attached to controllers which use the isp driver
may also be affected.

Running cluster software does not prevent the problem from occurring.


Example 1
---------

Here is an example of the sequence of events in a multi-hosted
configuration, with shared SCSI storage devices.  In this
configuration, when one node is rebooted, it issues SCSI bus resets
during the process of restarting and the other (running) node would
receive those resets.  This is normal.

The following message is registered on the running node, when the other
node reboots.  There should be one such message per shared SCSI bus.

	
	unix:  Received unexpected SCSI Reset

The i/os are returned with the reset flag set, as reported by the sd driver:
		
	unix: WARNING: /sbus@a,0/QLGC,isp@2,10000/sd@4,5 (sd229):
	unix: SCSI transport failed: reason 'reset': retrying command

This is also normal.
	
As a result of the SCSI bus reset, commands are transferred from the
request queue to the response queue with the reset flag set.  They are
not sent to the device until a marker is sent from the isp driver to the
firmware on the HBA.  This transfer is memory to memory and is therefore
very fast.  (Refer to bug# 4283089 for isp chip function during reset
handling.)
	
The sd driver retries the i/o up to sd_retry_count times, as shown by the
following message.

	unix: WARNING: /sbus@a,0/QLGC,isp@2,10000/sd@4,5 (sd229):
	unix: SCSI transport failed: reason 'timeout': retrying command

Due to this bug, under heavy i/o load, all of the sd driver retries can fail
before the marker is accepted by the firmware on the HBA.  The sd driver then
fails the i/o ("giving up") and returns an error to the upper layer.  This is
identified by the following message.
 
	unix: WARNING: /sbus@a,0/QLGC,isp@2,10000/sd@4,5 (sd229):
	unix: SCSI transport failed: reason 'timeout': giving up
  
After this i/o error, the sd driver gives up trying to communicate with that
SCSI device.

[End of Example 1]
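
The race described in Example 1 can be illustrated with a toy model.  This
is only a sketch under stated assumptions: the function name, the fixed retry
interval and the marker acceptance time are hypothetical illustration values,
not isp or sd driver internals.  Only sd_retry_count comes from the text.

```python
def io_outcome(marker_accepted_at, retry_interval, retry_count):
    """Toy model of the race in Example 1 (not driver code).

    marker_accepted_at: time at which the HBA firmware accepts the marker
    retry_interval:     assumed spacing between sd driver retries
    retry_count:        number of retries (sd_retry_count in the text)
    """
    for attempt in range(1, retry_count + 1):
        retry_time = attempt * retry_interval
        if retry_time >= marker_accepted_at:
            # The marker is in place, so this retry can reach the device.
            return "retry %d succeeded" % attempt
        # Marker not yet accepted: the command is returned again.
    # Every retry was used up before the marker was accepted.
    return "giving up"
```

In this model, a short retry interval combined with a late marker exhausts
all retries, which corresponds to the "giving up" message above.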


If this situation occurs, the system may lose access to one or more
devices on a SCSI bus.  This could include devices holding normal
filesystems, the root disk, raw devices used by databases etc.
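
One way to find devices that have reached this state is to scan the system
log for the "giving up" message shown in Example 1.  The following is a
minimal sketch that assumes the two-line message layout quoted in this
notice; the function name is hypothetical, and real log lines may carry
extra prefixes or differ in detail.

```python
import re

# Matches the two-line warning shown in Example 1:
#   unix: WARNING: <device path> (sdN):
#   unix: SCSI transport failed: reason '<reason>': <action>
WARNING_RE = re.compile(r"WARNING:\s+(\S+)\s+\((sd\d+)\):")
FAILED_RE = re.compile(r"SCSI transport failed: reason '(\w+)': (.+)$")


def find_given_up_devices(lines):
    """Return (path, instance) pairs for isp-attached devices that hit 'giving up'."""
    given_up = []
    current = None  # device named by the most recent WARNING line
    for line in lines:
        line = line.strip()
        warning = WARNING_RE.search(line)
        if warning:
            current = (warning.group(1), warning.group(2))
            continue
        failed = FAILED_RE.search(line)
        if failed and current:
            action = failed.group(2).strip()
            # Only report devices whose path goes through an isp controller.
            if action == "giving up" and "isp@" in current[0]:
                if current not in given_up:
                    given_up.append(current)
            current = None
    return given_up
```

Passing the lines of a log file to find_given_up_devices() returns each
affected device once, in the order first seen.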
        

IMPLEMENTATION: 
 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        | X |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---



CORRECTIVE ACTION: 

Enterprise Customers and authorized Field Service Representatives may
avoid the above mentioned problems by following the recommendations
as shown below:

	Apply isp patch 105600 or greater for Solaris 2.6 or
	patch 106924 or greater for Solaris 7.

(If this situation occurs before the patch above is applied, then the
system may have to be rebooted to regain control of the affected
devices.)
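
To see which revision of the isp patch is already installed, the output of
"showrev -p" can be searched for the patch ID.  The sketch below assumes
each installed patch appears on a line beginning "Patch: <id>-<rev>"; the
function name is hypothetical.  This notice does not state which revision
contains the fix, so the result should be compared against the current
patch README.

```python
import re


def installed_revisions(showrev_output, base_id):
    """Return the sorted revision numbers of patch `base_id` in the listing.

    showrev_output: text of a patch listing (assumed "Patch: <id>-<rev>" lines)
    base_id:        patch ID without revision, e.g. "105600"
    """
    pattern = re.compile(
        r"^Patch:\s+%s-(\d+)\b" % re.escape(base_id), re.MULTILINE
    )
    return sorted(int(rev) for rev in pattern.findall(showrev_output))
```

An empty list means no revision of the patch was found in the listing.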

The recommendation is to evaluate customer configurations to determine
whether this change applies.  This problem can affect any devices
connected to UDWIS, DWIS, and SWIS cards including MultiPacks, D1000,
A1000, A3x00, and A7000 Sun Storage products, as well as SCSI-attached
OEM storage products, especially (but not only) when connected in
multi-initiator configurations.

It is also strongly recommended that any mission-critical sites implement
this change.


COMMENTS: 
  

--------------------------------------------------------------------------
Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
  
* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.
   
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------
                                                        



 


Copyright (c) 1997-2003 Sun Microsystems, Inc.