

FIN #: I0536-2

SYNOPSIS: E10000 systems with an A3X00 attached may encounter Dynamic
          Reconfiguration errors

DATE: May/21/02

KEYWORDS: E10000, HPC10000, A3000, A3500, A3500FC, Dynamic
          Reconfiguration, DR, RAID Manager, rdriver


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS:  E10000 systems with an A3X00 attached may encounter  
           Dynamic Reconfiguration errors. 
              

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  E10000 with A3X00 DR information  
 
PRODUCT CATEGORY:   Server / SW Admin;  Storage / SW Admin 

PRODUCTS AFFECTED:  
  
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   -------------
Systems Affected
----------------

  -      E10000     All      Sun Enterprise 10000 Server    -
  -      HPC10000   All      Sun Enterprise 10000 Server    -


X-Options Affected
------------------
6534A              A3000  -   A3000 15*9.1GB/7200 FWSCSI      -       
6535A              -  -   A3000 35*9.1GB/7200 FWSCSI      -
SG-ARY351A-180G    A3500  -   A3500 1 CONT MOD./5 TRAYS/18GB  -   
SG-ARY353A-360G    -  -   A3500 2 CONT/7 TRAYS/18GB       -  
SG-ARY360A-90G     -  -   A3500 1 CONT/5 TRAYS/9GB(10K)   -  
SG-ARY362A-180G    -  -   A3500 2 CONT/7 TRAYS/9GB(10K)   -
SG-ARY366A-72G     -  -   A3500 1 CONT/2 TRAYS/9GB(10K)   -  
SG-ARY366A-72GR5   -  -   A3500 1 CONT/2 TRAYS/9GB(10K)   -  
SG-ARY370A-91G     -  -   91-GB A3500 (1x5x9-GB)          -  
SG-ARY372A-182G    -  -   182-GB A3500 (2x7x9-GB)         - 
SG-ARY374A-273G    -  -   273-GB A3500 w/(3x15x9-GB)      -
SG-ARY380A-182G    -  -   182-GB A3500 (1x5x18-GB)        - 
SG-ARY382A-364G    -  -   364-GB A3500 (2x7x18-GB)        - 
SG-ARY384A-546G    -  -   546-GB A3500 (3x15x18-GB)       - 
SG-XARY351A-180G   -  -   A3500 1 CONT MOD/5 TRAYS/18GB   - 
SG-XARY353A-1008G  -  -   A3500 2 CONT/7 TRAYS/18GB       -
SG-XARY353A-360G   -  -   A3500 2 CONT/7 TRAYS/18GB       -
SG-XARY355A-2160G  -  -   A3500 3 CONT/15 TRAYS/18GB      -
SG-XARY360A-545G   -  -   545-GB A3500 (1X5X9-GB)         -
SG-XARY360A-90G    -  -   A3500 1 CONT/5 TRAYS/9GB(10K)   - 
SG-XARY362A-180G   -  -   A3500 2 CONT/7 TRAYS/9GB(10K)   -
SG-XARY362A-763G   -  -   A3500 2 CONT/7 TRAYS/9GB(10K)   -
SG-XARY364A-1635G  -  -   A3500 3 CONT/15 TRAYS/9GB(10K)  -
SG-XARY366A-72G    -  -   A3500 1 CONT/2 TRAYS/9GB(10K)   - 
SG-XARY380A-1092G  -  -   1092-GB A3500 (1x5x18-GB)       -    


PART NUMBERS AFFECTED: 

Part Number   Description                               Model
-----------   -----------                               -----
798-0522-0X   RAID Manager 6.1.1                          -
704-6708-10   RAID Manager 6.22                           -
704-7937-05   CD RAID Manager 6.22.1                      -
380-0083-XX   A3000 Assembly for the StorEdge A3500       -


REFERENCES:

BugId: 4274772 - rdriver not suspend safe.
       4100212 - Sonoma daemon cannot be suspended.
       4348062 - RM 6.22 unable to detach a system board which has an 
                 A3500FC.
       4347782 - Need to correct documentation in FIN-I0536-1.
       4618948 - DR: 'drshow io' says "No unsafe device currently open",
                 but complete_detach fails.

ESC:   520860 - Cannot DR system board that has bad CPU.
       526218 - Unable to detach board, "ioctl failed....I/O error".
                Regression of BugId 4274772.
       534046 - Attempts to perform a DR while I/O is in progress, but
                it does not work.

DOC:   805-3656-12  Sun StorEdge RAID Manager 6.1.1 Release Notes.
       805-3656-12  Sun StorEdge RAID Manager 6.1.1 Update 2 Release Notes.
       805-7758-11  Sun StorEdge RAID Manager 6.22 Release Notes.
       805-7756-10  Sun StorEdge RAID Manager 6.22 Installation and Support 
                    Guide for Solaris.
       806-7792-13  Sun StorEdge RAID Manager 6.22.1 Upgrade Guide.
       806-7758-13  Sun StorEdge RAID Manager 6.22.1 Release Notes.


PROBLEM DESCRIPTION: 

For Solaris 2.5.1 or 2.6 OS software:
-------------------------------------

System downtime may be required to add entries in /etc/system that
identify the rdriver as a DR-safe device. Dynamic Reconfiguration (DR)
detach of a system board that contains non-pageable memory may fail to
quiesce the OS if the domain is configured with an A3000, A3500, or
A3500FC storage array, RM 6.1.1 or 6.22 RAID controller software, and
Solaris 2.5.1 or 2.6 OS software.

Error message:

  DR op: DRAIN BOARD (board 1)...
  DR op: DETACH BOARD (board 1)...
  NOTICE: hswp: Performing OS QUIESCE...
  WARNING: hswp: unsafe device (rdriver)
  WARNING: dr_mem_detach_unit: OS Quiesce failed (error = 8)
  WARNING: dr_mem_detach_unit: errors occurred.  rv = 0x4803
  WARNING: dr_mem_detach: detach unit returned 0x4803 reason 0x20
  dr_daemon[1530]: Error detaching board (mem-unit1): OS Quiesce failed.
  
The RM 6.1.1 and 6.22 rdriver is known to be DR-suspend-unsafe. It is
necessary to add entries to the /etc/system file for Solaris 2.5.1 and 
2.6 and reboot the domain to update the detach_safe and suspend_safe 
lists.  It is also necessary to stop the Array Monitor and RDAC daemons 
and any parityck processes, and to exit the RM6 GUI and any RM6 
applications, before a DR detach is attempted.  See the corrective 
actions below.

For Solaris 2.5.1, 2.6 and 8 OS software:
-----------------------------------------
Dynamic Reconfiguration (DR) detach of a system board configured
with an A3000, A3500, or A3500FC storage array and RM 6.1.1 or 6.22
RAID controller software can also fail because some devices have a
layered open count greater than 0.

Error message:

   ssp% deleteboard -b 5
   deleteboard: Attempting to acquire DR lock
   deleteboard: Attempting to initialize daemon communications
   Checking environment...
   Establishing Control Board Server connection...
   Initializing SSP SNMP MIB...
   Establishing communication with DR daemon...

                   xfiredm5: System Status - Summary

   BOARD #: 0 1 2 3 physically present
   BOARD #: 5 detach in progress. Board Draining.
   BOARD #: 4 being used by domain xfiredm5
   deleteboard: Testing eligibility of board 5 for detachment
   deleteboard: Starting complete detachment stage for board 5
   Completing detach of board 5.
   DR Error: Error detaching board: ioctl failed....I/O error
   Board detachment failed.
   Retry the COMPLETE or ABORT the operation.
   deleteboard: Failed in complete detachment stage for board 5
   Aborting detach of board 5.
   Abort boarddetach completed successfully.
   deleteboard: Attempting to release DR lock
   deleteboard: dr_detach_complete failed
   ssp%

Since the rdriver depends on active I/O to signal that a LUN path needs
to be switched, layered opens will exist on inactive LUNs, causing DR to
fail.  These layered opens can be observed with the SSP 'drshow <sb> io'
command, as shown in the CORRECTIVE ACTION section below.

 ------------------------
| Update for FIN I0536-2 |
 ------------------------
In this -2 revision, the following has been updated from FIN I0536-1:

  1) The REFERENCES section has been updated as follows:
     . BugIds 4100212, 4348062, 4347782 and 4618948 have been added.
     . ESCs 526218 and 534046 have been added.
     . DOCs 805-7758-11, 805-7756-10, 806-7792-13, and 806-7758-13
       have been added. 

  2) P/Ns 704-6708-10 and 704-7937-05 have been added to the PART 
     NUMBERS AFFECTED section.

  3) The PROBLEM DESCRIPTION has been updated.  When FIN I0536-1 was 
     released, only RM 6.1.1 was known to be affected; RM 6.22 has 
     since been found to be affected as well, and the problem with 
     layered opens had not yet been discovered, resulting in this 
     update to the FIN.
 

IMPLEMENTATION:  
 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        |   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---
         

CORRECTIVE ACTION: 

Authorized Enterprise Service Representatives may avoid the above-
mentioned problems by following the recommendations on Dynamic 
Reconfiguration errors shown below.

To perform DR on E10000 system boards with non-pageable memory when
an A3x00 storage device is attached to the domain, the following
steps are required.

1. Add the following to the /etc/system file for Solaris 2.5.1 and
   Solaris 2.6 systems (the safe lists are not used with Solaris 7 
   and above):

   set dr:detach_safe_list1="rdriver"
   set hswp:suspend_safe_list1="rdriver"

   For Solaris 8 systems, there is no need to set hswp:suspend_safe_list1,
   and dr:detach_safe_list1 no longer exists.
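
   As a quick check that the entries are in place before rebooting
   (a minimal sketch; the lines must appear exactly as above):

        # grep safe_list /etc/system
        set dr:detach_safe_list1="rdriver"
        set hswp:suspend_safe_list1="rdriver"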

2. For Solaris 2.5.1 and Solaris 2.6 systems:
   Note that the domain must have been rebooted since the
   /etc/system entries were added for these settings to take effect.

   For Solaris 8 systems, ignore this step as /etc/system was not
   modified.
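
   If a reboot of a Solaris 2.5.1 or 2.6 domain is still needed, any
   normal reboot is sufficient; for example (a sketch, assuming the
   domain can be taken down at this time):

        # shutdown -y -g0 -i6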

3. If the A3x00 is connected to the board which is being detached,
   move the LUNs for that controller to the other controller using
   the Maintenance and Tuning application -> LUN Balancing utility or
   the lad and rdacutil commands. 

   Note that the rdacutil failover command is issued against the 
   controller that will remain in the system; the other controller 
   is the one that is failed.  To return a failed controller to the 
   configuration after DR, use the -U option to rdacutil. 

   Some examples of command line LUN manipulation:

        # ls -la /dev/osa/dev/rdsk | grep c5t5d0s0
        lrwxrwxrwx   1 root     other         53 Nov 19 03:38 c5t5d0s0 -> 
        ../../devices/sbus@49,0/QLGC,isp@1,10000/sd@5,0:a,raw

        # lad
        c4t4d1s0 1T71017866 LUNS: 1 3 
        c5t5d0s0 1T71017874 LUNS: 0 2 4 
        # rdacutil -i <array name>

        <array name>:   dual-active
                Active    controller a (c5t5d0s0)             units:    0 2 4 
                Active    controller b (c4t4d1s0)             units:    1 3 

        rdacutil succeeded!
        # rdacutil -F c5t5d0s0

        rdacutil succeeded!
        # rdacutil -i <array name>

        <array name>:   active/passive
                Active    controller a (c5t5d0s0)         units:    0 1 2 3 4
                Failed    controller b (1T71017866)       units:    none

        rdacutil succeeded!
        # lad
        WARNING: /sbus@4c,0/QLGC,isp@0,10000/sd@4,0 (sd64):
                offline

        c5t5d0s0 1T71017874 LUNS: 0 1 2 3 4 
        # rdacutil -U c5t5d0s0

        rdacutil succeeded!
        # lad
        sd64:   disk okay
        sd523:  disk okay
        c4t4d1s0 1T71017866 LUNS: 1 3 
        c5t5d0s0 1T71017874 LUNS: 0 2 4 

 I/O must be done to all moved LUNs before attempting to detach the
 board.  If needed, this can be done via dd using the raw devices:

   # dd if=/dev/rdsk/c#t#d#s# of=/dev/null count=3

     Slice s2 or any other existing slice can be used; reading only
     a few blocks is enough.
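
     If many LUNs were moved, a small shell loop saves typing (a
     sketch; substitute the actual device names reported by lad):

       # for d in c5t5d0 c5t5d1 c5t5d2 c5t5d3 c5t5d4
       > do
       >         dd if=/dev/rdsk/${d}s2 of=/dev/null count=3
       > done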

     Once I/O has been sent to a LUN, the layer_count associated with
     that device may go to zero, or the sd/ssd structure may no longer
     exist because the path has switched and another sd/ssd structure
     has been created for the new path.  This can be verified with adb:

   # adb -k /dev/ksyms /dev/mem
     physmem 79e9
     *(*sd_state)+(0tXXX*8)/J      (where XXX is the instance number
                                    of the device, sdXXX)
     30002543760$<scsi_disk
     ...
     ...
     0x30002543af8:  detach_count    layer_count     opens_in_progress
                           0              0                  0

     NOTE: the 'drshow <sb> io' command lists all devices attached to
           the controllers on the system board to be detached.  For
           RAID Manager devices, 'drshow <sb> io' therefore lists all
           the LUNs, so its device list can differ from the device
           list given by format, which reflects the LUN distribution
           (see the lad command) at reconfiguration boot time.

     EXAMPLE:

     With controller c3 attached to system board 1 and controller c2
     attached to system board 0, the format output is:

     11. c2t3d0 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@2/rdriver@3,0
     12. c2t3d2 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@2/rdriver@3,2
     13. c2t3d4 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@2/rdriver@3,4
     14. c2t3d6 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@2/rdriver@3,6
     15. c2t5d0 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@2/rdriver@5,0
     16. c2t5d2 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@2/rdriver@5,2
     17. c2t5d4 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@2/rdriver@5,4
     18. c2t5d6 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@2/rdriver@5,6
     19. c3t1d1 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@3/rdriver@1,1
     20. c3t1d3 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@3/rdriver@1,3
     21. c3t1d5 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@3/rdriver@1,5
     22. c3t4d1 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@3/rdriver@4,1
     23. c3t4d3 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@3/rdriver@4,3
     24. c3t4d5 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
          /pseudo/rdnexus@3/rdriver@4,5

The 'drshow 1 io' command shows:

device  opens  name                usage               // Ctrl  Lun  Partitions
------  -----  ----                -----
sd46           /dev/dsk/c3t1d0s0                       // c2t3   0    0, 
sd49           /dev/dsk/c3t4d0s0                       // c2t5   0    0, 2
sd274     0    /dev/dsk/c3t1d1s0   /oracle/ficomop/d7
          1    /dev/rdsk/c3t1d1s2
sd275          /dev/dsk/c3t1d2s0                       // c2t3   2    0, 2
sd276     1    /dev/rdsk/c3t1d3s2
sd277          /dev/dsk/c3t1d4s0                       // c2t3   4    0, 2
sd278     1    /dev/rdsk/c3t1d5s2
sd279          /dev/dsk/c3t1d6s0                       // c2t3   6    0, 2
sd295     0    /dev/dsk/c3t4d1s0   /oracle/ficomop/d3
          1    /dev/rdsk/c3t4d1s2
sd296          /dev/dsk/c3t4d2s0                       // c2t5   2    0, 2
sd297     0    /dev/dsk/c3t4d3s0   /oracle/ficomop/d4
          1    /dev/rdsk/c3t4d3s2
sd298          /dev/dsk/c3t4d4s0                       // c2t5   4    0, 2
sd299     1    /dev/rdsk/c3t4d5s2
sd300          /dev/dsk/c3t4d6s0                       // c2t5   6    0, 2


4. Exit and close the RM6 GUI and any RM6 applications that may be
   running.

5. Before stopping the amdemon, ensure that I/O has been sent to all 
   LUNs that were previously on the board to be removed. This can be 
   accomplished with the dd command, for example:

        # dd if=/dev/rdsk/c2t4d1s0 of=/dev/null count=3
        # dd if=/dev/rdsk/c2t4d2s0 of=/dev/null count=3
        (etc...)

6. Stop the Array Monitor and RDAC daemons with:
        # /etc/init.d/amdemon stop

7. Stop any parityck processes:

        # ps -ef | grep parity

   Save the output from this command, then kill each parityck
   process it lists.
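
   For example (a sketch; the PID and process listing shown are
   illustrative only):

        # ps -ef | grep parity
            root  1234     1  0 10:15:02 ?        0:01 parityck
        # kill 1234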

8. Perform AP switch, mirror plex dissociation, removal of disks
   from VM control, offline of disks on controllers to be detached, 
   AP database removal, and other necessary DR preparations (a sketch
   of the Volume Manager portion follows). 
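
   A minimal sketch of the Volume Manager portion, assuming VERITAS
   Volume Manager with a disk group named datadg and a disk named
   disk01 on the affected controller (all names are illustrative;
   the AP steps depend on the AP version and are not shown):

        # vxdg -g datadg rmdisk disk01
        # vxdisk offline c3t1d1s2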

9. Perform the DR operation.
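
    For example, from the SSP (matching the transcript earlier in
    this FIN; the board number is illustrative):

        ssp% deleteboard -b 5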

10. Restart the Array Monitor and RDAC daemons with:
        # /etc/init.d/amdemon start 
        
11. Restart the RM6 GUI and any parityck operations that were in
    progress.  Use the information from the output saved in Step 7 
    to reissue the parityck command.

12. If the board is to be re-installed after reconfiguration or 
    maintenance, the LUNs should be re-balanced using the Maintenance
    and Tuning application -> LUN Balancing utility once the board 
    has been re-installed.

If the system board to be detached does not have non-pageable 
memory, DR can be performed using the same steps as above, with
the exception that Step 6, stopping the Array Monitor and RDAC 
daemons, is not a requirement (though it may be included with no 
harmful effects).


COMMENTS:

None  

------------------------------------------------------------------------------
Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
  
* From there, follow the hyperlink path of "Enterprise Services 
  Documentation" and click on "FIN & FCO attachments", then choose the 
  appropriate folder, FIN or FCO.  This will display supporting 
  directories/files for FINs or FCOs.
   
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------
                                                        


Copyright (c) 1997-2003 Sun Microsystems, Inc.