Document fins/I0886-1


FIN #: I0886-1

SYNOPSIS: StorEdge A1000, A3x00 or A3500FC LUNs running under volume manager
          software may cause upper level applications to timeout when too many
          I/O error retries occur

DATE: Oct/04/02

KEYWORDS: StorEdge A1000, A3x00 or A3500FC LUNs running under volume manager
          software may cause upper level applications to timeout when too many
          I/O error retries occur


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS:  StorEdge A1000, A3x00 or A3500FC LUNs running under volume
           manager software may cause upper level applications to timeout
           when too many I/O error retries occur.
           
SunAlert:           No

TOP FIN/FCO REPORT: No 
  
PRODUCT_REFERENCE:  Sun StorEdge A1000, A3x00 or A3500FC
 
PRODUCT CATEGORY:   StorEdge / Service


PRODUCTS AFFECTED:

Systems Affected:
-----------------  
Mkt_ID   Platform   Model   Description                  Serial Number
------   --------   -----   -----------                  -------------
  -      ANYSYS       -     System Platform Independent        -


X-Options Affected:
-------------------
Mkt_ID           Platform   Model   Description              Serial Number
------           --------   -----   -----------              -------------
  -              A1000       ALL    A1000 Storage Array            -
  -              A3000       ALL    A3000 Storage Array            -
  -              A3500       ALL    A3500 Storage Array            -
  -              A3500FC     ALL    A3500FC Storage Array          -


PART NUMBERS AFFECTED:

Part Number   Description                             Model
-----------   -----------                             -----

825-3869-02   MNL Set SUN RSM ARRAY 2000                -
798-0188-01   SS CD ASSY RAID Manager 6.1               -
798-0522-01   RAID Manager 6.1.1                        -
798-0522-02   RAID Manager 6.1.1 Update 1               -
798-0522-03   RAID Manager 6.1.1 Update 2               -
704-6708-10   CD, SUN STOREDGE RAID Manager 6.22        -
704-7937-05   CD, SUN STOREDGE RAID Manager 6.22.1      -


REFERENCES:

BugId:  4722564 - Customer has experienced multiple A3500FC controllers 
                  offlining needs root cause.
        4423716 - I/O failure recovery exceeds Oracle aiowait timeout -- 
                  DB crashes 27062.
        4400536 - ipsserver install should stop only the relevant 
                  processes.

FIN:    I0634-1 - StorEdge A3x00 Array controller failover.

ESC:    539253 - bug 4722564/Root cause analysis for failed a3500 
                 controllers.

NOTICE: Infodoc 28087 - Oracle crashes with asynchronous I/O (AIO) - 
                        ORA-27062.
                27248 - ORA-27062 Causes / Remedy.


PROBLEM DESCRIPTION:

Systems with StorEdge A1000, A3x00 or A3500FC Arrays that are
configured to run under volume manager software may experience
database or application problems when repeated disk I/O error retries
occur.  Excessive RDAC and Solaris disk driver retry attempts on I/O
errors may cause database managers (such as Oracle) or applications
to time out.  The applications may then crash or run very slowly
following the I/O timeout.

This issue applies to any system type with StorEdge A1000, A3x00 or
A3500FC Arrays whose LUNs are managed by a volume manager, such as
Veritas Volume Manager or Solstice DiskSuite.

To identify installed volume manager software:

     For Veritas Volume Manager

        # pkginfo -l VRTSvxvm

     For Solstice DiskSuite

        # pkginfo -l SUNWmdr
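
The two package checks can also be combined in a small script.  The
following is a minimal sketch, assuming a Bourne-compatible shell; the
function name check_volume_managers is hypothetical and not part of
any Sun tool.  It relies on "pkginfo -q", which prints nothing and
reports the result through its exit status.

```shell
#!/bin/sh
# Sketch only: report whether each volume manager package named in this
# notice is installed.  check_volume_managers is a hypothetical helper.
check_volume_managers() {
    for pkg in VRTSvxvm SUNWmdr; do
        # pkginfo -q is silent; exit status 0 means the package is installed.
        if pkginfo -q "$pkg" 2>/dev/null; then
            echo "$pkg: installed"
        else
            echo "$pkg: not installed"
        fi
    done
}

check_volume_managers
```

If either package is reported as installed, the LUNs may be under
volume manager control and this notice may apply.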
  
Symptoms for this issue may include error messages in the affected
application's error log.  For example, in Oracle 8, if an Oracle
asynchronous I/O (AIO) does not complete within 10 minutes, Oracle
logs messages such as the following in its alert log:
       
     LGWR: terminating instance due to error 27062
     Instance terminated by LGWR, pid=1351 
     
In addition to logging error messages, the Oracle server may crash as a
result of the I/O timeout.  In this case it would appear
non-responsive.
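
To check whether an instance has already hit this condition, the alert
log can be searched for the termination message shown above.  This is
a minimal sketch; the function name count_27062 is hypothetical, and
the location of the alert log is site-specific and must be supplied by
the caller.

```shell
#!/bin/sh
# Sketch only: count instance terminations caused by error 27062.
# $1 = path to the Oracle alert log (site-specific, supplied by the caller).
count_27062() {
    # grep -c prints the number of matching lines, including 0.
    grep -c 'terminating instance due to error 27062' "$1"
}
```

A non-zero count indicates that the instance has been terminated at
least once because an I/O did not complete in time.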

When there is an I/O error in an A1000, A3x00 or A3500FC LUN, Solaris
disk drivers (sd or ssd) will retry the I/O.  Under certain
circumstances, such as heavy I/O, this retry behavior of the Solaris
disk drivers, when combined with error recovery actions on the part of
RM6, can result in I/O error recovery attempts taking a long time to
complete.

Unfortunately, some database managers or applications, such as Oracle
8, do not tolerate I/O that takes a long time to complete.  Oracle may
time out the I/O and then crash.  The Oracle crash prevents the volume
manager from failing over to a working disk volume that has a mirrored
image of the customer's data.

A configuration workaround is available that resolves this issue.
Setting "Rdac_RetryCount" in rmparams to "1" eliminates RDAC retries
on I/O errors (by default, RDAC retries 7 times).  This removes one
layer of error recovery: RM6 is bypassed and the error passes from
the disk driver directly to the volume manager software.  This
greatly reduces the probability of exceeding database manager or
application I/O timeout limits.  See details below.


IMPLEMENTATION:

           ---
          |   |   MANDATORY (Fully Proactive)
           ---


           ---
          |   |   CONTROLLED PROACTIVE (per Sun Geo Plan)
           ---


           ---
          | X |   REACTIVE (As Required)
           ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above
mentioned problem.

To eliminate RDAC retries on I/O errors:

1.  Edit /etc/raid/rmparams and set:

        Rdac_RetryCount=1     (The default value is 7.)

2.  Restart the amdaemon in order to make the change effective:

        # /etc/init.d/amdemon stop
        # /etc/init.d/amdemon start
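
The edit in step 1 can also be scripted.  The following is a minimal
sketch, assuming the parameter appears in rmparams on a line of its
own in the form "Rdac_RetryCount=<n>" as shown in step 1.  The
function name set_rdac_retry_count and the ".orig" backup suffix are
hypothetical.  The sketch does not restart the daemon; step 2 still
applies.

```shell
#!/bin/sh
# Sketch only: set Rdac_RetryCount in an rmparams file, keeping a backup.
# $1 = path to rmparams, $2 = new retry count.
set_rdac_retry_count() {
    file=$1
    value=$2
    # Keep a copy of the original so the change can be backed out.
    cp -p "$file" "$file.orig" || return 1
    # Rewrite only the Rdac_RetryCount line; other parameters are untouched.
    sed "s/^Rdac_RetryCount=.*/Rdac_RetryCount=$value/" "$file.orig" > "$file" || return 1
    # Succeed only if the new setting is actually present in the file.
    grep "^Rdac_RetryCount=$value\$" "$file" > /dev/null
}

# Example (as root):
#   set_rdac_retry_count /etc/raid/rmparams 1
```

The function returns non-zero if the parameter line is not found, in
which case the file should be inspected and edited by hand.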


COMMENTS:

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Sun Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Sun Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Sun Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sun.com
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.