InfoDoc ID: 19163
Synopsis:   GBIC replacement guidelines for FCO A0144-2
Date:       28 Jun 1999
FCO: A0144-2 GBIC Replacement Guidelines DATE: June 1999
=========================================================================
Authors: Andy Fox (andrew.fox@uk), Ross Keeley (ross.keeley@uk)
The GBIC Logistics FRU part number for FCO 144 is: 370-2303-FCO
OVERVIEW:
GBIC's are found predominantly within the A5x00 product line, but they
also occur in E3500 servers that utilise FC disc technology, in hubs,
and on the server side of Photon connections (Sbus FC host adaptors).
This document is designed to provide some 'best practices' guidelines
for implementing FCO A0144 on A5x00's. This FCO relates to the
replacement of ALL rev -02 (or earlier) Vixel GBIC's.
***************************************************************************
These guidelines are NOT designed to cover all possible scenarios, so they
should be used carefully in conjunction with common sense and the knowledge
of suitably trained engineers.
It should also be noted that the only 100% safe way to change A5x00 GBIC's
is in an off-line environment - any other method carries a risk of data
unavailability, owing to the (unavoidable) loss of redundancy during GBIC
exchange.
***************************************************************************
PRELIMINARY ACTIVITIES:
The ultimate goal when undertaking the GBIC replacement procedure is to
minimise or eliminate down-time in the customer storage configuration.
In some cases, it will simply not be possible to avoid downtime, so it
is vital that some form of audit of the customer configuration is
performed, prior to undertaking the replacement procedure.
In this audit, you should be looking for three major things:
1) Redundancy in the storage configuration.
Can you pull GBIC's, one at a time for replacement, without causing loss
of service? This is particularly important in a cluster, as loss of
service will result in a failover to another node.
Things to look for are redundancy in loop connections, HBAs, and IBs.
Also check the software configuration to ensure that suitable redundancy
is implemented. In particular, it is of significant benefit to check that
sites using SEVM as their volume management utility have the latest SEVM
patches loaded. These patches allow use of the SEVM DMP path management
utility via STORtools, which is the only safe way of failing a DMP
redundant path.
The applicable SEVM patches are:
   SEVM 2.5: Patch-ID# 105463-07
   SEVM 2.6: Patch-ID# 106606-02
These patches supersede patches 107146 and 107142. STORtools 3.0 will need
to have the stormenu script edited to replace instances of 107146 with 105463
and instances of 107142 with 106606 (a sketch follows the note below). We are
advised that STORtools 3.1 will include these changes. If these changes are
not made, the DMP path enable / disable option will NOT be displayed in the
STORtools menu.
Note: DMP functionality is NOT currently supported for cluster configurations.
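As a rough, hedged illustration only (the patch revisions shown are those
current at the time of writing, and the stormenu install path is an
assumption - check both against the actual system), the patch check and the
STORtools 3.0 stormenu edit could look like this:

   # Check whether the applicable SEVM patches are already installed
   showrev -p | egrep '105463|106606'

   # STORtools 3.0 only: substitute the superseding patch IDs in the
   # stormenu script, keeping a backup copy first
   # (/opt/STORtools/bin is an assumed install path - adjust as needed)
   cd /opt/STORtools/bin
   cp stormenu stormenu.orig
   sed -e 's/107146/105463/g' -e 's/107142/106606/g' stormenu.orig > stormenu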
2) Vixel GBICs which may need replacing.
At this stage you will be able to get an accurate count of how many
replacements may be required. Vixel GBIC's can easily be identified as
they do not have the metal bail handle that IBM GBIC's have. The Vixel GBIC's
are all removed by simply pushing the two plastic tabs on either side
together. Rev -02 appears as "02/xx" on the top of the connector.
As the current mode of this FCO is "on failure" replacement, the best way
to know whether suspect GBIC's may be present is to monitor the messages file
for loop errors (OFFLINEs and CRC's). You can use STORtools automatic
monitoring with email notification for this. Details from the messages should
provide you with a loop or loop instance, i.e. an "sfx" (Sbus) FC-AL path or
an "ifpx" (PCI) FC-AL path, on which to focus the GBIC remedial work.
The only way to check for the rev code is to remove the GBIC, which will
take the loop or loops offline. Given the current implementation mode
of this FCO, monitoring the messages file for loop error indicators is the
recommended way to find loops which may contain suspect GBIC's (see the
sketch after this item).
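As a minimal sketch (the exact text logged by the sf and ifp drivers varies
with driver and patch revisions, so treat the search strings as examples),
suspect loops can be located by searching the messages files for OFFLINE and
CRC entries and noting which sfX or ifpX instance reported them:

   # Search current and rotated messages files for loop error indicators
   # and show which FC-AL instance (sfX or ifpX) logged them
   egrep -i 'offline|crc' /var/adm/messages* | egrep 'sf[0-9]|ifp[0-9]'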
3) Any other applicable A5x00 patches, FIN's or FCO's that require
implementation, such as FIN I0400-1
These items can be checked for and then implemented either before or
after FCO-144 implementation. NOTE that a period of stability should be
achieved after any intervention BEFORE proceeding with a subsequent one.
For example, DO NOT implement patch(es), FCO-144 and other outstanding
FINs/FCOs at the same time; rather, treat each as a separate intervention
and achieve stability of the system between each one.
The official A5x00 patch matrix can be found at
http://storageweb.eng/techmark_site/photon/patch/index.html
A5x00 FIN's and FCO's can be found on SunSolve or at
http://storageweb.eng/techmark_site/photon/main/fin.html
http://storageweb.eng/techmark_site/photon/main/fco.html
See also the NEW "A5x00 Troubleshooting Guide", an 88 page PostScript doc:
http://storageweb.eng.sun.com/techmark_site/photon/main/index.html
under "Technical Information -> Service"
STORtools is available for download from:
http://storageweb.eng/tm/STORtools/StorTools1.html
As a result of this audit, you should then be able to decide whether
downtime will be necessary, and request the customer to schedule it
appropriately. In an ideal world, all GBIC replacements should be
undertaken during scheduled downtime, as this avoids the risk of
unscheduled downtime should something go wrong.
It is important to note that, where redundant configurations allow, GBIC's
should ideally only be changed on one loop at a time. That is, where an
A5x00 enclosure and its associated disks have two active FC-AL paths to
it, the GBIC's should be changed on only one of these paths. Then at
a later time (ideally the next day or later) and after checking that there
are NO error messages in the messages files relating to the loop(s) which
have previously had GBIC's changed, the GBIC's on the other loop should be
changed. This will reduce the risk that any failures of the new (but unknown)
GBIC components will bring down both paths to a customer's data simultaneously.
Any FC-AL loop having GBIC's changed should first be tested via the
STORtools Loop Integrity Test (Option 6 from the Main STORtools menu and
then Option 1 from the Diagnostics Menu) to ensure that it is operational
and its condition is known PRIOR to any GBIC / component change. The same also
applies to any DMP or AP partner loop that will carry the data for the loop
that is having GBIC's changed. This should alleviate any risk that the
alternate loop has a major problem and will fail when placed under the
additional data load.
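Where STORtools is available, the Loop Integrity Test remains the recommended
check; as a supplementary, hedged sketch, the Solaris luxadm utility can also
be used to confirm that each enclosure and its interface boards are visible
before (and again after) any GBIC is pulled (the enclosure name 'array0' is
an example):

   # List all A5x00 enclosures visible to this host
   luxadm probe

   # Display the status of a specific enclosure, including its
   # interface boards and disk slots ('array0' is an example name -
   # use the enclosure name reported by luxadm probe)
   luxadm display array0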
Due to a limitation of the PCI FC-AL HBA controller, it is not possible
to run the STORtools Loop Integrity Test on A5x00 arrays connected to
these controller cards. In this situation you will only be able to review the
/var/adm/messages files to try to ascertain the stability and integrity of
the PCI FC-AL loops, both prior to and after any GBIC replacements have
been instigated.
Any STORtools failures noted should be accurately diagnosed and resolved
PRIOR to implementing FCO 144 (contrary to popular belief, GBIC's are not
always the cause of loop issues).
REPLACEMENT FOR NON CLUSTER SYSTEMS:
Assuming that you cannot get scheduled downtime, and that there is
the required redundancy in the configuration, you may proceed as
follows with the GBIC replacement:
The loop that is to have remedial actions completed should be quiesced,
i.e. offlined. STORtools SEVM DMP path management should be utilised if
available, with SEVM mirrors detached and SEVM volumes unmounted if necessary.
There are flow charts in the A5x00 Troubleshooting Guide which
can be utilized or referenced if required.
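Where mirrors do have to be detached and volumes unmounted by hand, the
following is a minimal sketch only; the disk group 'datadg', volume 'vol01',
plex 'vol01-02' and mount point are illustrative names, and the STORtools DMP
path management option should be preferred wherever it is available:

   # Unmount the filesystem on the affected volume if necessary
   # (mount point is an example)
   umount /export/data01

   # Detach the plex (mirror half) that resides on the loop being
   # worked on (disk group, volume and plex names are examples)
   vxplex -g datadg det vol01-02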
1) First ensure that all nodes are seen on the loop, check the array FPMs
for correct node counts (node count of zero indicates a hung loop).
2) Ensure the integrity of ALL loops is known by running at least
   one full iteration of the STORtools Loop Integrity Test (light load)
   or, if a PCI based host, by reviewing the /var/adm/messages files for
   any FC-AL loop errors.
3) Note the location of and remove the GBIC with cable still attached.
4) Check that the GBIC is a Vixel (IBM GBIC's are OK).
5) If GBIC is Vixel rev -02 or earlier, remove it from the cable, insert
the replacement GBIC into EXACTLY the same location as noted in 3)
and reconnect the cable.
6) Repeat as necessary until all affected GBIC's for the loop are replaced.
7) Ensure that the loop node count as displayed on the array FPMs is
   correct, i.e. as previously noted in action 1) (a node count of zero
   indicates a hung loop). If not, then fault find and correct before
   proceeding further.
8) Ensure the integrity of the loop(s) that have had GBIC's changed is good
   by running at least one full iteration of the STORtools Loop Integrity
   Test (light load) or, if a PCI based host, by reviewing the
   /var/adm/messages files for any FC-AL loop errors. If any failures are
   noted, accurately diagnose and replace faulty components until the
   STORtools Loop Integrity Test (light load) runs without failure, or in
   the case of PCI FC-AL loops the messages files are clear of loop related
   errors.
9) At this point, some software intervention may be required in order to
   re-initialise the failed path. If the DMP path management utility was
   used to quiesce the loop originally, then it should be used to bring the
   loop back online. If not, then in the case of Volume Manager this means
   you will need to run 'vxdctl enable' for the failed path.
10) If any SEVM mirrors have been detached / unmounted, they will now need
   to be re-attached and mounted as necessary, at which point the resync
   process will commence (see the sketch after this list).
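As a sketch of the software re-initialisation in steps 9) and 10), using the
same illustrative object names as earlier (if the STORtools DMP option was
used to quiesce the path, use that option to re-enable it instead):

   # Have Volume Manager rescan devices and re-enable the previously
   # failed path
   vxdctl enable

   # Re-attach the detached plex; Volume Manager then starts the resync
   # (disk group, volume and plex names are examples)
   vxplex -g datadg att vol01 vol01-02

   # Remount the filesystem if it was unmounted
   # (device and mount point are examples)
   mount /dev/vx/dsk/datadg/vol01 /export/data01

   # Verify plex and volume states once the resync completes
   vxprint -g datadg -ht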
REPLACEMENT FOR CLUSTER SYSTEMS WITH REDUNDANT STORAGE CONFIGURATION:
If the cluster has redundant storage configured on each node, then it
is possible to replace the GBIC's as detailed above for non cluster
systems. Be aware that the cluster will failover if any loss of service
occurs. This should not happen in the case of correctly configured
A5x00 arrays.
Note that as DMP is NOT currently supported for clusters, you will not be
able to use the SEVM DMP path management facility for cluster configurations.
Note: For cluster configurations there is a trade-off between long resyncs and
system downtime to consider, so this is best discussed with the customer and
a decision made as to which option is best for them. Long resyncs are bad due
to the performance impact and the potential loss of redundancy (some resyncs
can take 8 hours or more). The customer might decide to be cautious and schedule a few
hours of downtime, rather than risk any data loss or an unscheduled loss of
service.
REPLACEMENT FOR MULTI HOSTED ARRAYS IN A NON-REDUNDANT STORAGE CONFIGURATION:
In the unlikely event that you encounter this configuration, replacement
can still be carried out, but will need to be done as follows:
1) Identify the host node which is not currently in use.
2) Ensure that all nodes are seen on the loop, check the array FPMs
for correct node counts (node count of zero indicates a hung loop).
3) Ensure the integrity of ALL loops is known to be good by running at least
   one full iteration of the STORtools Loop Integrity Test (light load) or,
   if a PCI based host, by reviewing the /var/adm/messages files for any
   FC-AL loop error conditions.
4) Replace GBIC's as described above for non cluster systems.
5) Ensure that the loop node count as displayed on the array FPMs is
   correct, i.e. as previously noted in action 2) (a node count of zero
   indicates a hung loop). If not, then fault find and correct before
   proceeding further.
6) Ensure the integrity of the loop(s) that have had GBIC's changed is good
   by running at least one full iteration of the STORtools Loop Integrity
   Test (light load) or, if a PCI based host, by reviewing the
   /var/adm/messages files for any FC-AL loop error conditions. If any
   failures are noted, accurately diagnose and replace faulty components.
7) "Failover" to the alternate node to gain access the alternate nodes loop(s)
which have not had GBIC's replaced. Replace GBIC's in those loops.
8) Repeat steps 2) thru 6) as necessary until all GBIC's have been replaced.
Once the GBIC replacement exercise is complete, please be sure to return any
unused good modules and the replaced modules to stores, ensuring that the
rev -02 (or earlier) Vixel GBIC's are marked as BAD, and the unused GBIC's
are marked as GOOD.
GLOSSARY:
GBIC - GigaBit Interface Converter
HBA  - Host Bus Adaptor
IB   - A5x00 Interface Board
SUBMITTER: Ross Keeley
APPLIES TO: Hardware/Disk Storage Subsystem/StorEdge Disk Array/StorEdge A5000
ATTACHMENTS:
Copyright (c) 1997-2003 Sun Microsystems, Inc.