Document fins/I0536-2
FIN #: I0536-2
SYNOPSIS: E10000 systems with an A3X00 attached may encounter Dynamic
Reconfiguration errors
DATE: May/21/02
KEYWORDS: E10000 systems with an A3X00 attached may encounter Dynamic
Reconfiguration errors
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: E10000 systems with an A3X00 attached may encounter
Dynamic Reconfiguration errors.
TOP FIN/FCO REPORT: No
PRODUCT_REFERENCE: E10000 with A3X00 DR information
PRODUCT CATEGORY: Server / SW Admin; Storage / SW Admin
PRODUCTS AFFECTED:
Mkt_ID             Platform  Model  Description                      Serial Number
------             --------  -----  -----------                      -------------
Systems Affected
----------------
-                  E10000    All    Sun Enterprise 10000 Server      -
-                  HPC10000  All    Sun Enterprise 10000 Server      -
X-Options Affected
------------------
6534A              A3000     -      A3000 15*9.1GB/7200 FWSCSI       -
6535A              -         -      A3000 35*9.1GB/7200 FWSCSI       -
6534A              -         -      A3000 15*9.1GB/7200 FWSCSI       -
6535A              -         -      A3000 35*9.1GB/7200 FWSCSI       -
SG-ARY351A-180G    A3500     -      A3500 1 CONT MOD./5 TRAYS/18GB   -
SG-ARY353A-360G    -         -      A3500 2 CONT/7 TRAYS/18GB        -
SG-ARY360A-90G     -         -      A3500 1 CONT/5 TRAYS/9GB(10K)    -
SG-ARY362A-180G    -         -      A3500 2 CONT/7 TRAYS/9GB(10K)    -
SG-ARY366A-72G     -         -      A3500 1 CONT/2 TRAYS/9GB(10K)    -
SG-ARY366A-72GR5   -         -      A3500 1 CONT/2 TRAYS/9GB(10K)    -
SG-ARY370A-91G     -         -      91-GB A3500 (1x5x9-GB)           -
SG-ARY372A-182G    -         -      182-GB A3500 (2x7x9-GB)          -
SG-ARY374A-273G    -         -      273-GB A3500 w/(3x15x9-GB)       -
SG-ARY380A-182G    -         -      182-GB A3500 (1x5x18-GB)         -
SG-ARY382A-364G    -         -      364-GB A3500 (2x7x18-GB)         -
SG-ARY384A-546G    -         -      546-GB A3500 (3x15x18-GB)        -
SG-XARY351A-180G   -         -      A3500 1 CONT MOD/5 TRAYS/18GB    -
SG-XARY353A-1008G  -         -      A3500 2 CONT/7 TRAYS/18GB        -
SG-XARY353A-360G   -         -      A3500 2 CONT/7 TRAYS/18GB        -
SG-XARY355A-2160G  -         -      A3500 3 CONT/15 TRAYS/18GB       -
SG-XARY360A-545G   -         -      545-GB A3500 (1X5X9-GB)          -
SG-XARY360A-90G    -         -      A3500 1 CONT/5 TRAYS/9GB(10K)    -
SG-XARY362A-180G   -         -      A3500 2 CONT/7 TRAYS/9GB(10K)    -
SG-XARY362A-763G   -         -      A3500 2 CONT/7 TRAYS/9GB(10K)    -
SG-XARY364A-1635G  -         -      A3500 3 CONT/15 TRAYS/9GB(10K)   -
SG-XARY366A-72G    -         -      A3500 1 CONT/2 TRAYS/9GB(10K)    -
SG-XARY380A-1092G  -         -      1092-GB A3500 (1x5x18-GB)        -
PART NUMBERS AFFECTED:
Part Number  Description                            Model
-----------  -----------                            -----
798-0522-0X  RAID Manager 6.1.1                     -
704-6708-10  RAID Manager 6.22                      -
704-7937-05  CD RAID Manager 6.22.1                 -
380-0083-XX  A3000 Assembly for the StorEdge A3500  -
REFERENCES:
BugId: 4274772 - rdriver not suspend safe.
       4100212 - Sonoma daemon cannot be suspended.
       4348062 - RM 6.22 Unable to detach a system board which has an
                 A3500FC.
       4347782 - Need to correct documentation in FIN-I0536-1.
       4618948 - DR: drshow io says "No unsafe device currently open"
                 but complete_detach fails.
ESC:   520860 - Cannot DR system board that has bad CPU.
       526218 - Unable to detach board, "ioctl failed....I/O error".
                Regression of BugId 4274772.
       534046 - Customer tries to perform DR while there is I/O, but
                it does not work.
DOC:   805-3656-12 Sun StorEdge RAID Manager 6.1.1 Release Notes.
       805-3656-12 Sun StorEdge RAID Manager 6.1.1 Update 2 Release Notes.
       805-7758-11 Sun StorEdge RAID Manager 6.22 Release Notes.
       805-7756-10 Sun StorEdge RAID Manager 6.22 Installation and Support
                   Guide for Solaris.
       806-7792-13 Sun StorEdge Raid Manager RM6.22.1 Upgrade Guide.
       806-7758-13 Sun StorEdge Raid Manager RM6.22.1 Release Notes.
PROBLEM DESCRIPTION:
For Solaris 2.5.1 or 2.6 OS software:
-------------------------------------
System downtime may be required to add entries to /etc/system that
identify the rdriver as a DR-safe device. Dynamic Reconfiguration (DR)
detach of a system board that contains non-pageable memory may fail to
quiesce the OS if the domain is configured with an A3000, A3500, or
A3500FC storage array, RM 6.1.1 or 6.22 RAID controller software, and
Solaris 2.5.1 or 2.6 OS software.
Error message:
DR op: DRAIN BOARD (board 1)...
DR op: DETACH BOARD (board 1)...
NOTICE: hswp: Performing OS QUIESCE...
WARNING: hswp: unsafe device (rdriver)
WARNING: dr_mem_detach_unit: OS Quiesce failed (error = 8)
WARNING: dr_mem_detach_unit: errors occurred. rv = 0x4803
WARNING: dr_mem_detach: detach unit returned 0x4803 reason 0x20
dr_daemon[1530]: Error detaching board (mem-unit1): OS Quiesce failed.
The RM 6.1.1 and 6.22 rdriver is known to be dr-suspend-unsafe. On
Solaris 2.5.1 and 2.6, entries must be added to the /etc/system file
and the domain rebooted to update the detach_safe and suspend_safe
lists. Before a DR detach is attempted, it is also necessary to stop
the Array Monitor and RDAC daemons and any parityck processes, and to
exit the RM6 GUI and any RM6 applications. See CORRECTIVE ACTION below.
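The /etc/system requirement above can be checked before a detach is
attempted. The following is a minimal sketch only, not a Sun-supplied
tool. It assumes a POSIX shell and a grep that accepts -F, -x and -q
(on Solaris this may mean /usr/xpg4/bin/grep). check_safe_lists is a
hypothetical helper name. It reports whether the two entries listed
under CORRECTIVE ACTION are present as whole, uncommented lines in the
given file.

```shell
#!/bin/sh
# Sketch: report whether a system file carries the two safe-list entries
# this FIN requires on Solaris 2.5.1 and 2.6.
# check_safe_lists is a hypothetical helper name used for illustration.
check_safe_lists() {
    file=${1:-/etc/system}
    missing=0
    for entry in \
        'set dr:detach_safe_list1="rdriver"' \
        'set hswp:suspend_safe_list1="rdriver"'
    do
        # -F fixed string, -x whole-line match, -q no output.
        # A whole-line match ignores commented-out copies of the entry,
        # but also rejects lines with extra leading or trailing blanks.
        if grep -Fxq "$entry" "$file"; then
            echo "present: $entry"
        else
            echo "missing: $entry"
            missing=1
        fi
    done
    # Non-zero when at least one entry is absent.
    return $missing
}
```

For example, "check_safe_lists /etc/system" returns non-zero if either
entry is missing. The check says nothing about whether the domain has
been rebooted since the entries were added.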
For Solaris 2.5.1, 2.6 and 8 OS software:
-----------------------------------------
Dynamic Reconfiguration (DR) detach of a system board configured
with an A3000, A3500, or A3500FC storage array and RM 6.1.1 or 6.22
RAID controller software can also fail because some devices have a
layered open count greater than 0.
Error message:
ssp% deleteboard -b 5
deleteboard: Attempting to acquire DR lock
deleteboard: Attempting to initialize daemon communications
Checking environment...
Establishing Control Board Server connection...
Initializing SSP SNMP MIB...
Establishing communication with DR daemon...
xfiredm5: System Status - Summary
BOARD #: 0 1 2 3 physically present
BOARD #: 5 detach in progress. Board Draining.
BOARD #: 4 being used by domain xfiredm5
deleteboard: Testing eligibility of board 5 for detachment
deleteboard: Starting complete detachment stage for board 5
Completing detach of board 5.
DR Error: Error detaching board: ioctl failed....I/O error
Board detachment failed.
Retry the COMPLETE or ABORT the operation.
deleteboard: Failed in complete detachment stage for board 5
Aborting detach of board 5.
Abort boarddetach completed successfully.
deleteboard: Attempting to release DR lock
deleteboard: dr_detach_complete failed
ssp%
Since the rdriver depends on active I/O to signal that a LUN path needs
to be switched, layered opens will exist on inactive LUNs, causing DR
to fail.
--------------------------
| Update for FIN I0536-2 |
--------------------------
This -2 revision updates FIN I0536-1 as follows:
1) The REFERENCES section has been updated as follows:
. BugIds 4100212, 4348062, 4347782, and 4618948 have been added.
. ESCs 526218 and 534046 have been added.
. DOCs 805-7758-11, 805-7756-10, 806-7792-13, and 806-7758-13 have
been added.
2) P/Ns 704-6708-10 and 704-7937-05 have been added to the PART
NUMBERS AFFECTED section.
3) The PROBLEM DESCRIPTION has been updated. When FIN I0536-1 was
released, only RM 6.1.1 was known to be affected and the problem
with layered opens had not yet been discovered. RM 6.22 is now
known to be affected as well.
IMPLEMENTATION:
---
| | MANDATORY (Fully Pro-Active)
---
---
| | CONTROLLED PRO-ACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
CORRECTIVE ACTION:
Authorized Enterprise Service Representatives may avoid the above
mentioned problems by following the recommendations on Dynamic
Reconfiguration errors shown below.
The following steps are required to perform DR on E10000 system boards
with non-pageable memory when an A3x00 storage device is attached to
the domain.
1. Add the following to the /etc/system file for Solaris 2.5.1 and
Solaris 2.6 systems (the safe lists are not used with Solaris 7 and
above):
set dr:detach_safe_list1="rdriver"
set hswp:suspend_safe_list1="rdriver"
For Solaris 8 systems, there is no need to set hswp:suspend_safe_list1,
and dr:detach_safe_list1 no longer exists.
2. For Solaris 2.5.1 and Solaris 2.6 systems:
The domain must have been rebooted after the /etc/system entries were
added for these settings to take effect.
For Solaris 8 systems, skip this step because /etc/system was not
modified.
3. If the A3x00 is connected to the board which is being detached,
move the LUNs for that controller to the other controller using
the Maintenance and Tuning App -> LUN Balancing utility or the
lad and rdacutil commands.
Note that the rdacutil failover command is issued using the
controller that will remain in the system; the other controller
is the one that is failed. To return a failed controller to the
configuration after DR, use the -U option to rdacutil.
Some examples of command line LUN manipulation:
# ls -la /dev/osa/dev/rdsk | grep c5t5d0s0
lrwxrwxrwx 1 root other 53 Nov 19 03:38 c5t5d0s0 ->
../../devices/sbus@49,0/QLGC,isp@1,10000/sd@5,0:a,raw
# lad
c4t4d1s0 1T71017866 LUNS: 1 3
c5t5d0s0 1T71017874 LUNS: 0 2 4
# rdacutil -i <array name>
<array name>: dual-active
Active controller a (c5t5d0s0) units: 0 2 4
Active controller b (c4t4d1s0) units: 1 3
rdacutil succeeded!
# rdacutil -F c5t5d0s0
rdacutil succeeded!
# rdacutil -i <array name>
<array name>: active/passive
Active controller a (c5t5d0s0) units: 0 1 2 3 4
Failed controller b (1T71017866) units: none
rdacutil succeeded!
# lad
WARNING: /sbus@4c,0/QLGC,isp@0,10000/sd@4,0 (sd64):
offline
c5t5d0s0 1T71017874 LUNS: 0 1 2 3 4
# rdacutil -U c5t5d0s0
rdacutil succeeded!
# lad
sd64: disk okay
sd523: disk okay
c4t4d1s0 1T71017866 LUNS: 1 3
c5t5d0s0 1T71017874 LUNS: 0 2 4
I/O must be sent to all moved LUNs before attempting to detach the board.
If needed, this can be done with dd on the raw devices:
# dd if=/dev/rdsk/c#t#d#s# of=/dev/null count=3
Slice s2 or any other existing slice can be used.
A few blocks are enough.
Once I/O has been sent to a LUN, the layered open count for the
device (layer_count in the adb output below) may go to zero, or the
sd/ssd structure may no longer exist because the path has switched
and another sd/ssd structure has been created for the new path.
# adb -k /dev/ksyms /dev/mem
physmem 79e9
*(*sd_state)+(0tXXX*8)/J where XXX is the instance number of the
device (sdXXX)
30002543760$<scsi_disk
...
...
0x30002543af8: detach_count layer_count opens_in_progress
0 0 0
NOTE: The 'drshow <sb> io' command lists all devices attached to
the controllers of the system board to be detached. For
RAID Manager devices, 'drshow <sb> io' lists all the LUNs,
so its device list can differ from the device list given by
format, which reflects the LUN distribution (lad command)
at reconfiguration boot time.
EXAMPLE:
With controller c3 attached to system board 1 and controller c2
attached to system board 0, the format output is:
11. c2t3d0 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
/pseudo/rdnexus@2/rdriver@3,0
12. c2t3d2 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
/pseudo/rdnexus@2/rdriver@3,2
13. c2t3d4 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
/pseudo/rdnexus@2/rdriver@3,4
14. c2t3d6 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
/pseudo/rdnexus@2/rdriver@3,6
15. c2t5d0 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
/pseudo/rdnexus@2/rdriver@5,0
16. c2t5d2 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
/pseudo/rdnexus@2/rdriver@5,2
17. c2t5d4 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
/pseudo/rdnexus@2/rdriver@5,4
18. c2t5d6 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
/pseudo/rdnexus@2/rdriver@5,6
19. c3t1d1 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
/pseudo/rdnexus@3/rdriver@1,1
20. c3t1d3 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
/pseudo/rdnexus@3/rdriver@1,3
21. c3t1d5 <SYMBIOS-RSMArray2000-0205 cyl 8106 alt 2 hd 64 sec 64>
/pseudo/rdnexus@3/rdriver@1,5
22. c3t4d1 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
/pseudo/rdnexus@3/rdriver@4,1
23. c3t4d3 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
/pseudo/rdnexus@3/rdriver@4,3
24. c3t4d5 <SYMBIOS-RSMArray2000-0205 cyl 17192 alt 2 hd 64 sec 64>
/pseudo/rdnexus@3/rdriver@4,5
The 'drshow 1 io' output shows:
device  opens  name                usage // Ctrl Lun Partitions
------  -----  ----                -----
sd46           /dev/dsk/c3t1d0s0   // c2t3 0 0,
sd49           /dev/dsk/c3t4d0s0   // c2t5 0 0, 2
sd274   0      /dev/dsk/c3t1d1s0   /oracle/ficomop/d7
        1      /dev/rdsk/c3t1d1s2
sd275          /dev/dsk/c3t1d2s0   // c2t3 2 0, 2
sd276   1      /dev/rdsk/c3t1d3s2
sd277          /dev/dsk/c3t1d4s0   // c2t3 4 0, 2
sd278   1      /dev/rdsk/c3t1d5s2
sd279          /dev/dsk/c3t1d6s0   // c2t3 6 0, 2
sd295   0      /dev/dsk/c3t4d1s0   /oracle/ficomop/d3
        1      /dev/rdsk/c3t4d1s2
sd296          /dev/dsk/c3t4d2s0   // c2t5 2 0, 2
sd297   0      /dev/dsk/c3t4d3s0   /oracle/ficomop/d4
        1      /dev/rdsk/c3t4d3s2
sd298          /dev/dsk/c3t4d4s0   // c2t5 4 0, 2
sd299   1      /dev/rdsk/c3t4d5s2
sd300          /dev/dsk/c3t4d6s0   // c2t5 6 0, 2
4. Exit and close the RM6 GUI and any of the applications that may be
running.
5. Before stopping the amdemon, ensure that I/O has been sent to all
LUNs that were previously on the board to be removed. This can be
accomplished with the dd command, for example:
# dd if=/dev/rdsk/c2t4d1s0 of=/dev/null count=3
# dd if=/dev/rdsk/c2t4d2s0 of=/dev/null count=3
(etc...)
6. Stop the Array Monitor and RDAC daemons with:
# /etc/init.d/amdemon stop
7. Stop any parityck processes.
# ps -ef | grep parity
Save the output from this command.
Kill each parityck process listed.
8. Perform AP switch, mirror plex dissociation, removal of disks
from VM control, offline of disks on controllers to be detached,
AP database removal, and other necessary DR preparations.
9. Perform the DR operation.
10. Restart the Array Monitor and RDAC daemons with:
# /etc/init.d/amdemon start
11. Restart the RM6 GUI and any parityck operations that were in
progress. Use the information from the output from Step 7 to
reissue the parityck command.
12. If the board is to be re-installed after reconfiguration or
maintenance, the LUNs should be re-balanced using the Maintenance
and Tuning App -> LUN Balancing utility once the board has been
re-installed.
If the system board to be detached does not have non-pageable
memory, DR can be performed using the same steps as above, with
the exception that Step 6, stopping the Array Monitor and RDAC
daemons, is not a requirement (though it may be included with no
harmful effects).
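The dd commands needed for Steps 3 and 5 can be generated from lad
output rather than typed by hand. The sketch below is an illustration
under stated assumptions, not part of RAID Manager. It assumes that lad
prints lines of the form shown in Step 3 (device, serial number,
"LUNS:", LUN numbers) and that each LUN is reachable as c#t#d<LUN> on
the controller's c#t# prefix, as the format listing in the example
suggests. print_dd_commands is a hypothetical helper name. It only
prints commands; it does not run them.

```shell
#!/bin/sh
# Sketch: turn lad output on stdin into dd commands, one per LUN.
# print_dd_commands is a hypothetical helper name used for illustration.
# Assumption: LUN n of controller c#t#d#s# is the raw device c#t#d<n>.
print_dd_commands() {
    slice=${1:-s2}    # s2 or any other existing slice
    while read -r ctrl serial tag luns; do
        # Only lines like "c5t5d0s0 1T71017874 LUNS: 0 2 4" qualify;
        # driver warnings and "disk okay" lines are skipped.
        [ "$tag" = "LUNS:" ] || continue
        base=${ctrl%d*}    # c5t5d0s0 -> c5t5
        for lun in $luns; do
            echo "dd if=/dev/rdsk/${base}d${lun}${slice} of=/dev/null count=3"
        done
    done
}
```

A possible use is "lad | print_dd_commands s2", run after the LUNs
have been moved. Review the printed commands, then run the ones for
the moved LUNs. Reading a few blocks from LUNs that were not moved is
harmless, so the list may be a superset of what is strictly needed.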
COMMENTS:
None
------------------------------------------------------------------------------
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist. Edist can be
accessed internally at the following URL: http://edist.corp/.
* From there, follow the hyperlink path of "Enterprise Services
Documentation" and click on "FIN & FCO attachments", then choose the
appropriate folder, FIN or FCO. This will display supporting
directories/files for FINs or FCOs.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
---------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.