InfoDoc ID   Synopsis   Date
19163   GBIC replacement guidelines for FCO A0144-2   28 Jun 1999

Status Issued

Description
 FCO: A0144-2  GBIC Replacement Guidelines	      DATE: June 1999
 =========================================================================
 Authors: Andy Fox (andrew.fox@uk), Ross Keeley (ross.keeley@uk)
 
 The GBIC Logistics FRU part number for FCO 144 is : 370-2303-FCO
  
 OVERVIEW:
 
 GBIC's are found predominantly within the A5x00 product line, but they
 also occur within those E3500 servers that utilise FC disc technology,
 within HUB's and on the server-side of Photon connections (Sbus FC host
 adaptors).
 
 This document is designed to provide some 'best practices' guidelines
 for implementing FCO A0144 on A5x00's. This FCO relates to the
 replacement of ALL rev -02 (or earlier) Vixel GBIC's. 

 ***************************************************************************
 These guidelines are NOT designed to cover all possible scenarios, so they
 should be used carefully in conjunction with common sense and the knowledge 
 of suitably trained engineers.

 It should also be noted that the only 100% safe way to change A5x00 GBIC's
 is in an off-line environment - any other method carries risk of data
 unavailability, owing to (unavoidable) loss of redundancy during GBIC
 exchange.
 ***************************************************************************
 
 
 PRELIMINARY ACTIVITIES:
 
 The ultimate goal when undertaking the GBIC replacement procedure is to
 minimise or eliminate down-time in the customer storage configuration.
 In some cases, it will simply not be possible to avoid downtime, so it
 is vital that some form of audit of the customer configuration is
 performed, prior to undertaking the replacement procedure.
 
 In this audit, you should be looking for three major things:
 
 1) Redundancy in the storage configuration. 
 
 Can you pull GBIC's, one at a time for replacement, without causing loss 
 of service? This is particularly important in a cluster, as loss of 
 service will result in a failover to another node.
 Things to look for are redundancy in loop connections, HBAs, and IBs.
 Also make sure to check the software configuration to ensure that 
 suitable redundancy is implemented. In particular it would be of significant
 benefit to check if those sites that are using SEVM as their volume
 management utility have the latest SEVM patches loaded. These patches will
 allow the use of the SEVM DMP path management utility via STORtools which is
 the only safe way of failing a DMP redundant path.

 The applicable SEVM patches are

 SEVM 2.5:                    SEVM 2.6:
 Patch-ID# 105463-07          Patch-ID# 106606-02
  
 These patches supersede patches 107146 and 107142. STORtools 3.0 will need
 to have the stormenu script edited to replace instances of 107146 with 105463
 and instances of 107142 with 106606. We are advised that STORtools 3.1 will
 include these changes. If these changes are not made then the DMP path 
 enable / disable option will NOT be displayed in the STORtools menu.
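
 The substitution described above can be sketched as follows. This is a
 minimal illustration only: the function name and the sample text are
 hypothetical, the layout of the real stormenu script is not shown in this
 document, and a backup copy of the script should be taken before editing it.

```python
# Superseded -> current patch base numbers, as listed in the text above.
PATCH_MAP = {"107146": "105463", "107142": "106606"}


def update_patch_ids(script_text):
    """Return script_text with each superseded patch base number replaced."""
    for old, new in PATCH_MAP.items():
        script_text = script_text.replace(old, new)
    return script_text


# Made-up sample text, NOT an extract from the real stormenu script.
sample = "check 107146\ncheck 107142\n"
print(update_patch_ids(sample))
```

 The same replacement can of course be made by hand in any text editor.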

 Note: DMP functionality is NOT currently supported for cluster configurations.

 2) Vixel GBICs which may need replacing. 
 
 At this stage you will be able to get an accurate count of how many 
 replacements may be required. Vixel GBIC's can be easily identified as 
 they do not have the metal bail handle (a la IBM GBIC's). The Vixel GBIC's 
 are all removed by simply pushing the two plastic tabs on either side 
 together. Rev -02 appears as "02/xx" on top of the connector.
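
 When recording GBIC's during an audit, the replacement rule (Vixel, rev -02
 or earlier) can be expressed as a small check. This is a sketch only: the
 function name is hypothetical, and it assumes the label is written down in
 the "NN/xx" form described above, with the revision before the slash.

```python
import re


def needs_replacement(vendor, rev_label):
    """True for a Vixel GBIC whose label shows rev -02 or earlier."""
    if vendor.strip().lower() != "vixel":
        return False  # this FCO only covers Vixel GBICs
    match = re.match(r"\s*(\d+)\s*/", rev_label)
    if match is None:
        # Assumption: anything not in "NN/xx" form needs a human to look at it.
        raise ValueError("unreadable rev label: %r" % rev_label)
    return int(match.group(1)) <= 2


print(needs_replacement("Vixel", "02/xx"))
print(needs_replacement("Vixel", "03/xx"))
```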
 
 As the current mode of this FCO is for "on failure" replacement the best way 
 to know if suspect GBIC's may be present is by monitoring the messages file
 for loop errors (OFFLINEs and CRC's). You can use STORtools automatic
 monitoring with email notification for this. Details from the messages should
 provide you with a loop or loop instances, i.e. an "sfx" (Sbus) FC-AL path or
 an "ifpx" (PCI) FC-AL path, to focus the GBIC remedial work on.

 The only way to check for the rev code is to remove the GBIC which will
 take the loop or loops offline. Given the current implementation mode
 of this FCO, monitoring the messages file for loop error indicators is the
 currently recommended process to find loops which may contain suspect GBIC's.
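
 Where STORtools monitoring is not available, a rough count of loop errors
 per driver instance can be taken from the messages file. The sketch below
 is illustrative only: the exact format of the driver messages is an
 assumption (the instance is taken to appear as "sf<N>" or "ifp<N>" on a
 line containing OFFLINE or CRC), and the sample lines are made up.

```python
import re
from collections import Counter

# Assumption: the driver instance appears as "sf<N>" or "ifp<N>" on the line.
INSTANCE = re.compile(r"\b(sf|ifp)(\d+)\b")
KEYWORDS = ("OFFLINE", "CRC")


def count_loop_errors(lines):
    """Count OFFLINE/CRC lines per sf/ifp instance."""
    counts = Counter()
    for line in lines:
        if not any(word in line for word in KEYWORDS):
            continue
        match = INSTANCE.search(line)
        if match:
            counts[match.group(1) + match.group(2)] += 1
    return counts


# Made-up lines for illustration; real messages will differ.
sample = [
    "Jun 28 10:00:01 host unix: sf0: OFFLINE (made-up example)",
    "Jun 28 10:00:05 host unix: ifp1: CRC error (made-up example)",
    "Jun 28 10:01:00 host unix: sf0: OFFLINE (made-up example)",
    "Jun 28 10:02:00 host sendmail[123]: unrelated message",
]
for instance, total in sorted(count_loop_errors(sample).items()):
    print(instance, total)
```

 In practice the lines would be read from /var/adm/messages, and the
 patterns adjusted to match what the drivers actually log on the host.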
 
 3) Any other applicable A5x00 patches, FIN's or FCO's that require 
    implementation, such as FIN I0400-1
 
 These items can be checked for and then implemented either before or
 after FCO-144 implementation. NOTE that a period of stability should be 
 achieved after any intervention, BEFORE proceeding with a subsequent one.
 
 For example, DO NOT implement patch(es), FCO-144 and other outstanding
 FINs/FCOs at the same time; rather, treat each as a separate intervention
 and achieve stability of the system between each one.
 
 The official A5x00 patch matrix can be found at
 
    http://storageweb.eng/techmark_site/photon/patch/index.html
 
 A5x00 FIN's and FCO's can be found on SunSolve or at
 
    http://storageweb.eng/techmark_site/photon/main/fin.html
 
    http://storageweb.eng/techmark_site/photon/main/fco.html
   
 See also the NEW "A5x00 Troubleshooting Guide" (88-page PostScript document):
 
   http://storageweb.eng.sun.com/techmark_site/photon/main/index.html
   
 under "Technical Information -> Service"
 
 STORtools is available for download from:
 
   http://storageweb.eng/tm/STORtools/StorTools1.html
 
 As a result of this audit, you should then be able to decide whether
 downtime will be necessary, and request the customer to schedule it
 appropriately. In an ideal world, all GBIC replacements should be
 undertaken during scheduled downtime, as this prevents the risk of
 unscheduled downtime should something go wrong.
 
 It is important to note that, where redundant configurations allow, GBIC's
 should ideally only be changed on one loop at a time. That is, where an
 A5x00 enclosure and its associated disks have two active FC-AL paths to
 it, the GBIC's should be changed on only one of these paths. Then at
 a later time (ideally the next day or later) and after checking that there
 are NO error messages in the messages files relating to the loop(s) which
 have previously had GBIC's changed, the GBIC's on the other loop should be
 changed. This will reduce the risk that any failures of the new (but unknown)
 GBIC components will bring down both paths to a customer's data simultaneously.
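
 The "one loop at a time" rule above can be summarised as a simple gate
 before work starts on the second loop. This is a sketch under stated
 assumptions: the function name is hypothetical, the one-day minimum wait
 reflects the "ideally the next day or later" guidance, and the error count
 is whatever the engineer has found in the messages files for the loop(s)
 already changed.

```python
from datetime import date, timedelta


def second_loop_may_proceed(first_loop_changed_on, today, errors_since_change,
                            min_wait_days=1):
    """True only if enough time has passed AND no loop errors were logged."""
    waited = (today - first_loop_changed_on) >= timedelta(days=min_wait_days)
    return waited and errors_since_change == 0


# First loop changed on 28 Jun 1999, checked the next day with a clean log.
print(second_loop_may_proceed(date(1999, 6, 28), date(1999, 6, 29), 0))
```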
 
 Any FC-AL loop having GBIC's changed should first be tested via the
 STORtools Loop Integrity Test (Option 6 from the Main STORtools menu and
 then Option 1 from the Diagnostics Menu) to ensure that it is operational 
 and its condition is known PRIOR to any GBIC / component change. The same also 
 applies to any DMP or AP partner loop that will carry the data for the loop 
 that is having GBIC's changed. This should alleviate any risk that the 
 alternate loop has a major problem and will fail when placed under the 
 additional data load.

 Due to a limitation of the PCI FC-AL HBA controller, it is not possible
 to run the STORtools Loop Integrity Test on A5x00 arrays connected to
 these controller cards. In this situation you will only be able to review the
 /var/adm/messages files to try to ascertain the stability and integrity of 
 the PCI FC-AL loops, both prior to and after any GBIC replacements have
 been instigated.
 
 Any STORtools failures noted should be accurately diagnosed and resolved
 PRIOR to implementing FCO 144 (contrary to popular belief, GBIC's are not
 always the cause of loop issues).
 
 REPLACEMENT FOR NON CLUSTER SYSTEMS:
 
 Assuming that you cannot get scheduled downtime, and that there is
 the required redundancy in the configuration, you may proceed as 
 follows with the GBIC replacement:
 
 The loop that is to have remedial work carried out should be quiesced,
 i.e. taken offline. STORtools SEVM DMP path management, if available, should
 be used, with SEVM mirrors detached and SEVM volumes unmounted if necessary.

 There are flow charts in the A5x00 Troubleshooting Guide which
 can be utilized or referenced if required.

 1) First ensure that all nodes are seen on the loop, check the array FPMs 
    for correct node counts (node count of zero indicates a hung loop).
 2) Ensure the integrity of ALL loops is known by running at least 
    one full iteration of STORtools Loop Integrity Test (light load)
    or, if a PCI based host, review the /var/adm/messages files for any 
    FC-AL loop errors. 
 3) Note the location of and remove the GBIC with cable still attached.
 4) Check that GBIC is Vixel (IBM's are OK). 
 5) If GBIC is Vixel rev -02 or earlier, remove it from the cable, insert 
    the replacement GBIC into EXACTLY the same location as noted in 3) 
    and reconnect the cable.
 6) Repeat as necessary until all affected GBIC's for the loop are replaced.
 7) Ensure that the loop node count as displayed on the array FPMs is
    correct, i.e. as previously noted in action 1) (a node count of zero 
    indicates a hung loop). If not, then fault find and correct before
    proceeding further.
 8) Ensure the integrity of the loop(s) that have had GBIC's changed is good 
    by running at least one full iteration of the STORtools Loop Integrity 
    Test (light load) or, if a PCI based host, review the /var/adm/messages
    files for any FC-AL loop errors. If any failures are noted, accurately
    diagnose and replace faulty components until the STORtools Loop Integrity
    Test (light load) runs without failure or, in the case of PCI FC-AL loops,
    the messages files are clear of loop related errors.
 9) At this point, some software intervention may be required in order to
    re-initialise the failed path. If the DMP path management utility was
    used to quiesce the loop originally, then it should be used to bring the
    loop back online. If not, then in the case of Volume Manager, this means
    you will need to run 'vxdctl enable' for the failed path.  
10) If any SEVM mirrors have been detached / unmounted they will now need to be 
    re-attached and mounted if necessary and the resync process will commence.
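
 The node count comparison between steps 1) and 7) can be sketched as
 below. This is illustrative only: the loop names and counts are made up,
 the function name is hypothetical, and the counts themselves still have to
 be read from the array FPMs by the engineer.

```python
def check_node_counts(before, after):
    """Compare FPM node counts noted in step 1) with those read in step 7)."""
    problems = []
    for loop, expected in before.items():
        seen = after.get(loop)
        if seen == 0:
            problems.append("%s: node count 0 (hung loop)" % loop)
        elif seen != expected:
            problems.append("%s: expected %s, saw %s" % (loop, expected, seen))
    return problems


# Made-up counts: loop B reports zero nodes after the GBIC change.
noted_in_step_1 = {"loop A": 12, "loop B": 12}
read_in_step_7 = {"loop A": 12, "loop B": 0}
for problem in check_node_counts(noted_in_step_1, read_in_step_7):
    print(problem)
```

 An empty result means the counts match; anything else should be fault
 found and corrected before proceeding.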
 
 REPLACEMENT FOR CLUSTER SYSTEMS WITH REDUNDANT STORAGE CONFIGURATION:
 
 If the cluster has redundant storage configured on each node, then it
 is possible to replace the GBIC's as detailed above for non cluster
 systems. Be aware that the cluster will failover if any loss of service
 occurs. This should not happen in the case of correctly configured
 A5x00 arrays.

 Note that, as DMP is NOT currently supported for clusters, you will not be
 able to use the SEVM DMP path management facility for cluster configurations.

 Note: For cluster configurations there is a trade off between long resyncs and 
 system downtime to consider so this is best discussed with the customer and 
 a decision made as to which option is best for them.  Long resyncs are bad due
 to the performance impact and the potential loss of redundancy (some resyncs can 
 take 8 hours+). The customer might decide to be cautious and schedule a few
 hours of downtime, rather than risk any data loss or an unscheduled loss of
 service.

 REPLACEMENT FOR MULTI HOSTED ARRAYS IN A NON-REDUNDANT STORAGE CONFIGURATION:
 
 In the unlikely event that you encounter this configuration, replacement
 can still be carried out, but will need to be done as follows:
 
 1) Identify a host node which is not currently in use.
 2) Ensure that all nodes are seen on the loop, check the array FPMs 
    for correct node counts (node count of zero indicates a hung loop).
 3) Ensure that ALL loops are known good by running at least 
    one full iteration of STORtools Loop Integrity Test (light load) or, if 
    a PCI based host, review the /var/adm/messages files for any FC-AL loop
    error conditions.
 4) Replace GBIC's as described above for non cluster systems.
 5) Ensure that the loop node count as displayed on the array FPMs is
    correct, i.e. as previously noted in action 2) (a node count of zero 
    indicates a hung loop). If not, then fault find and correct before
    proceeding further.
 6) Ensure the integrity of the loop(s) that have had GBIC's changed is good 
    by running at least one full iteration of STORtools Loop Integrity Test
    (light load) or if a PCI based host review the /var/adm/messages files for
    any FC-AL loop error conditions. If any failures are noted, accurately
    diagnose and replace faulty components.
 7) "Failover" to the alternate node to gain access to the alternate node's
    loop(s) which have not had GBIC's replaced. Replace GBIC's in those loops.
 8) Repeat steps 2) thru 6) as necessary until all GBIC's have been replaced.
 
 Once the GBIC replacement exercise is complete, please be sure to return any
 unused good modules and the replaced modules to stores, ensuring that the 
 rev -02 (or earlier) Vixel GBIC's are marked as BAD, and the unused GBIC's 
 are marked as GOOD. 
 
 GBIC - GigaBit Interface Converter
 HBA  - Host Bus Adaptor
 IB   - A5x00 Interface Board
 

 


SUBMITTER: Ross Keeley
APPLIES TO: Hardware/Disk Storage Subsystem/StorEdge Disk Array/StorEdge A5000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.