Document fins/I0736-1
FIN #: I0736-1
SYNOPSIS: Current replacement procedures for an A3500FC controller in a
clustered environment could result in the controller going offline.
DATE: Oct/26/01
KEYWORDS: Current replacement procedures for an A3500FC controller in a
clustered environment could result in the controller going offline.
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: Current replacement procedures for an A3500FC controller in a
clustered environment could result in the controller going offline.
Sun Alert: No
TOP FIN/FCO REPORT: No
PRODUCT_REFERENCE: A3500FC Controller
PRODUCT CATEGORY: Storage / Service
PRODUCTS AFFECTED:
Systems Affected
----------------
Mkt_ID      Platform   Model   Description                      Serial Number
------      --------   -----   -----------                      -------------
  -         ANYSYS       -     System Platform Independent            -
X-Options Affected
------------------
Mkt_ID      Platform   Model   Description                      Serial Number
------      --------   -----   -----------                      -------------
X6532A      A3000        -     A3000 15*4.2GB/7200 FWSCSI             -
X6533A        -          -     A3000 35*4.2GB/7200 FWSCSI             -
X6534A        -          -     A3000 15*9.1GB/7200 FWSCSI             -
X6535A        -          -     A3000 35*9.1GB/7200 FWSCSI             -
X6536A        -          -     A3000 StorEdge Controller              -
X6537A      A3500        -     A3500 SCSI controller                  -
X6538A      A3500FC      -     A3500FC StorEdge Controller            -
SG-XARY3*     -          -     A3500 7200/10K Controller              -
PART NUMBERS AFFECTED:
Part Number Description Model
----------- ----------- -----
- - -
REFERENCES:
BugId:  4476951 - A3500FC intermittent controller offlines during
        normal operations.
ESC:    531240
MANUAL: 805-7854-11  A3x00 Controller Replacement Guide.
        805-6887-10  Sun StorEdge RAID Manager 6.2 User's Guide.
        806-7073-10  Sun Cluster 3.0 U1 System Administration Guide.
        805-7076-10  Sun Cluster 3.0 U1 Error Messages Manual.
        806-5343-10  Sun Cluster 2.2 System Administration Guide.
        805-4202-10  Sun Cluster 2.2 Error Messages Manual.
        805-3991-10  Sun Cluster 2.1 System Administration Guide.
        805-4106-10  Sun Cluster 2.1 Error Messages Manual.
PROBLEM DESCRIPTION:
If the proper steps are not followed when replacing an A3500FC
controller in a clustered environment, one of the nodes will not
recognize the new controller and will force it offline, leaving a
single point of failure. In addition, the WWN (World Wide Name) of the
new controller will not be updated on one of the nodes in the cluster.
The problem can occur in the following configuration:
    Any host hardware that supports A3500FC and clustering
    StorEdge A3500FC
    RAID Manager 6.22 or higher
    Solaris 2.6 or higher
    raidutil -c cXtXdXs2 -i | grep StorEdgeA3500FC
Within 5 minutes of a controller being replaced, the other node will
attempt to communicate with the WWN of the old controller. Because it
cannot do so, it will take the controller offline.
The WWN is part of the device path for an A3500FC RAID Module and it is
unique to a specific controller. When you replace that controller
using Recovery Guru from only one node, the other node will not be
updated. This will result in that node forcing that data path
offline.
Here's an example of an A3500FC data path:
# cd /dev/osa/dev/dsk
# ls -l c3t5d*s2
lrwxrwxrwx 1 root root 70 Jun 25 12:58 c3t5d0s2 ->
../../devices/sbus@3,0/SUNW,socal@2,0/sf@1,0/ssd@w200600a0b8078c3b,0:c
lrwxrwxrwx 1 root root 70 Jun 25 12:58 c3t5d1s2 ->
../../devices/sbus@3,0/SUNW,socal@2,0/sf@1,0/ssd@w200600a0b8078c3b,1:c
lrwxrwxrwx 1 root root 70 Jun 25 12:58 c3t5d2s2 ->
../../devices/sbus@3,0/SUNW,socal@2,0/sf@1,0/ssd@w200600a0b8078c3b,2:c
lrwxrwxrwx 1 root root 70 Jun 25 12:58 c3t5d3s2 ->
../../devices/sbus@3,0/SUNW,socal@2,0/sf@1,0/ssd@w200600a0b8078c3b,3:c
#
The WWN is the w200600a0b8078c3b portion of the path. If it changes and
a node's device tree is not updated, the RAID Manager software on that
node will not be able to communicate with the controller. This is why
it is necessary to run Recovery Guru from BOTH nodes. Doing so ensures
that the device trees are updated on both nodes.
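As an illustration of where the WWN sits in the device path, the
following is a minimal shell sketch that pulls the 16 hex digits out of
an ssd@w... path like the ones listed above. The function names
extract_wwn and list_wwns are hypothetical helpers written for this
example; they are not part of RAID Manager. The sketch assumes the path
layout shown in the example listing.

```shell
# Print the 16-hex-digit WWN embedded in a single ssd@w... device path.
# extract_wwn is a hypothetical helper name, not an RM6 command.
extract_wwn() {
    printf '%s\n' "$1" |
        sed -n 's/.*ssd@w\([0-9a-fA-F]\{16\}\),.*/\1/p'
}

# Read "ls -l" output on stdin and print each distinct WWN once.
# Lines that contain no ssd@w... component are ignored.
list_wwns() {
    sed -n 's/.*ssd@w\([0-9a-fA-F]\{16\}\),.*/\1/p' | sort -u
}

# Possible use on a node (path taken from the example above):
#   ls -l /dev/osa/dev/dsk/c3t5d*s2 | list_wwns
```

Run against the four links in the example listing, list_wwns would be
expected to print the single value 200600a0b8078c3b, since all four
LUNs sit behind the same controller.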
The most likely cause of a controller going offline is a combination of
a hardware failure and the customer running RM6 commands from two hosts
at the same time (which could also compound an initial hardware
failure). Because a controller can log a PCI error and then continue to
function with no further hardware errors, such errors are transient;
therefore, do not swap hardware when a second error occurs within a set
period of time.
IMPLEMENTATION:
        ---
       |   |   MANDATORY (Fully Proactive)
        ---
        ---
       |   |   CONTROLLED PROACTIVE (per Sun Geo Plan)
        ---
        ---
       | X |   REACTIVE (As Required)
        ---
CORRECTIVE ACTION:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives to avoid the above mentioned
problem.
Please adhere to the following guidelines to prevent controllers going
offline:
. Recovery Guru must be run from both nodes in the cluster (one at a
  time) after the controller is replaced. Otherwise, the node on which
  Recovery Guru was not run will have a device tree (WWN) that is not
  in sync with the new controller.
. Do not run RM6 commands (including explorer) from multiple hosts at
  the same time.
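One possible way to confirm that a node is no longer referencing the
replaced controller is sketched below. It assumes the WWN of the old
controller was noted before the replacement, and that the device links
live under /dev/osa/dev/dsk as in the example in the problem
description. The function name stale_wwn_present is a hypothetical
helper for this illustration, not an RM6 command, and the check is not
a substitute for running Recovery Guru on each node.

```shell
# Read "ls -l" output on stdin. Exit status is 0 if any device link
# still points at the WWN given as $1 (16 hex digits, no leading "w"),
# and 1 if no link references it.
# stale_wwn_present is a hypothetical helper name.
stale_wwn_present() {
    grep -i "ssd@w$1," >/dev/null
}

# Possible use on each node, one node at a time, after Recovery Guru
# has been run there (OLD_WWN is the value recorded beforehand):
#   if ls -l /dev/osa/dev/dsk/c3t5d*s2 | stale_wwn_present "$OLD_WWN"
#   then
#       echo "`hostname`: device tree still references the old WWN"
#   fi
```

Output is redirected to /dev/null rather than using a quiet flag so
that the sketch does not depend on a particular grep implementation.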
COMMENTS:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist. Edist can be
accessed internally at the following URL: http://edist.corp/.
* From there, follow the hyperlink path of "Enterprise Services
Documentation" and click on "FIN & FCO attachments", then choose the
appropriate folder, FIN or FCO. This will display supporting
directories/files for FINs or FCOs.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.