Troubleshooting Solaris Volume Manager
This chapter describes how to troubleshoot problems related to Solaris Volume Manager. This chapter provides both general troubleshooting guidelines and specific procedures for resolving some particular known problems.
This chapter includes the following information:
This chapter describes some Solaris Volume Manager problems and their appropriate solutions. It is not intended to be all-inclusive, but rather presents common scenarios and recovery procedures.
Troubleshooting Solaris Volume Manager (Task Map)
The following task map identifies some procedures needed to troubleshoot Solaris Volume Manager.
Task | Description | Instructions |
---|---|---|
Replace a failed disk | Replace a disk, then update state database replicas and logical volumes on the new disk. | "How to Replace a Failed Disk" |
Recover from improper /etc/vfstab entries | Use the fsck command on the mirror, then edit the /etc/vfstab file so the system will boot correctly. | |
Recover from a boot device failure | Boot from a different submirror. | "How to Recover From a Boot Device Failure" |
Recover from insufficient state database replicas | Delete unavailable replicas by using the metadb command. | |
Recover configuration data for a lost soft partition | Use the metarecover command to recover configuration data for soft partitions. | |
Recover a Solaris Volume Manager configuration from salvaged disks | Attach disks to a new system and have Solaris Volume Manager rebuild the configuration from the existing state database replicas. | "How to Recover a Configuration" |
Overview of Troubleshooting the System
Prerequisites for Troubleshooting the System
To troubleshoot storage management problems related to Solaris Volume Manager, you need to do the following:
Have root privilege
Have a current backup of all data
General Guidelines for Troubleshooting Solaris Volume Manager
You should have the following information on hand when you troubleshoot Solaris Volume Manager problems:
Output from the metadb command.
Output from the metastat command.
Output from the metastat -p command.
Backup copy of the /etc/vfstab file.
Backup copy of the /etc/lvm/mddb.cf file.
Disk partition information, from the prtvtoc command (SPARC systems) or the fdisk command (IA-based systems)
Solaris version
Solaris patches installed
Solaris Volume Manager patches installed
Tip - Any time you update your Solaris Volume Manager configuration, or make other storage or operating environment-related changes to your system, generate fresh copies of this configuration information. You could also generate this information automatically with a cron job.
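As a sketch, crontab entries such as the following would capture the key output nightly; the schedule, the /etc/lvm/save directory, and the disk name are illustrative choices, not standard locations:

0 2 * * * /usr/sbin/metadb > /etc/lvm/save/metadb.out 2>&1
5 2 * * * /usr/sbin/metastat -p > /etc/lvm/save/metastat.out 2>&1
10 2 * * * /usr/sbin/prtvtoc /dev/rdsk/c0t0d0s2 > /etc/lvm/save/c0t0d0.vtoc 2>&1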
General Troubleshooting Approach
Although there is no one procedure that will enable you to evaluate all problems with Solaris Volume Manager, the following process provides one general approach that might help.
Gather information about the current configuration.
Look at the current status indicators, including the output from the metastat and metadb commands. There should be information here that indicates which component is faulty.
Check the hardware for obvious points of failure. (Is everything connected properly? Was there a recent electrical outage? Have you recently added or changed equipment?)
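For the status check described above, the following commands are one convenient way to capture the relevant output; the -i option simply adds a legend that explains the status flags:

# metadb -i
# metastat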
Replacing Disks
This section describes how to replace disks in a Solaris Volume Manager environment.
How to Replace a Failed Disk
Identify the failed disk to be replaced by examining the /var/adm/messages file and the metastat command output.
Locate any state database replicas that might have been placed on the failed disk.
Use the metadb command to find the replicas.
The metadb command might report errors for the state database replicas located on the failed disk. In this example, c0t1d0 is the problem device.
# metadb
        flags           first blk       block count
     a m  u         16              1034            /dev/dsk/c0t0d0s4
     a    u         1050            1034            /dev/dsk/c0t0d0s4
     a    u         2084            1034            /dev/dsk/c0t0d0s4
      W pc luo      16              1034            /dev/dsk/c0t1d0s4
      W pc luo      1050            1034            /dev/dsk/c0t1d0s4
      W pc luo      2084            1034            /dev/dsk/c0t1d0s4
The output shows three state database replicas on slice 4 of each of the local disks, c0t0d0 and c0t1d0. The W in the flags field of the c0t1d0s4 slice indicates that the device has write errors. The three replicas on the c0t0d0s4 slice are still good.
Record the slice name where the state database replicas reside and the number of state database replicas, then delete the state database replicas.
The number of state database replicas is obtained by counting the number of appearances of a slice in the metadb command output in Step 2. In this example, the three state database replicas that exist on c0t1d0s4 are deleted.
# metadb -d c0t1d0s4
Caution - If, after deleting the bad state database replicas, you are left with three or fewer, add more state database replicas before continuing. This will help ensure that configuration information remains intact.
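For example, to add two additional replicas on a healthy slice, you might run a command like the following; the slice name here is illustrative and must be an unused slice on a working disk:

# metadb -a -c 2 c0t2d0s4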
Locate any submirrors that use slices on the failed disk and detach them.
The metastat command can show the affected mirrors. In this example, one submirror, d10, is using c0t1d0s4. The mirror is d20.
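One quick way to see which volumes reference the failed disk is to search the concise metastat output for the device name; this is a convenience, not a required step:

# metastat -p | grep c0t1d0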
# metadetach d20 d10
d20: submirror d10 is detached
Delete any hot spares on the failed disk.
# metahs -d hsp000 c0t1d0s6
hsp000: Hotspare is deleted
Halt the system and boot to single-user mode.
# halt
...
ok boot -s
...
Physically replace the failed disk.
Repartition the new disk.
Use the format command or the fmthard command to partition the disk with the same slice information as the failed disk. If you have the prtvtoc output from the failed disk, you can format the replacement disk with fmthard -s /tmp/failed-disk-prtvtoc-output
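A sketch of this approach, assuming the VTOC of the failed disk was captured to /tmp/c0t1d0.vtoc (an illustrative file name) before the failure, for example as part of the regular configuration capture described earlier:

# prtvtoc /dev/rdsk/c0t1d0s2 > /tmp/c0t1d0.vtoc
# fmthard -s /tmp/c0t1d0.vtoc /dev/rdsk/c0t1d0s2

The prtvtoc command must be run while the original disk is still readable; the fmthard command is run after the replacement disk is installed.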
If you deleted state database replicas in Step 3, add the same number back to the appropriate slice.
In this example, /dev/dsk/c0t1d0s4 is used.
# metadb -a -c 3 c0t1d0s4
Depending on how the disk was used, you have a variety of tasks to complete. Use the following table to decide what to do next. An illustrative example of the mirror and RAID-5 commands follows the table.
Table 24-1 Disk Replacement Decision Table
Type of Device | Task |
---|---|
Slice | Use normal data recovery procedures. |
Unmirrored RAID 0 volume or soft partition | If the volume is used for a file system, run the newfs command, mount the file system, then restore data from backup. If the RAID 0 volume is used for an application that uses the raw device, that application must have its own recovery procedures. |
RAID 1 volume (submirror) | Run the metattach command to reattach a detached submirror, which causes the resynchronization to start. |
RAID 5 volume | Run the metareplace command to re-enable the slice, which causes the resynchronization to start. |
Transactional volume | Depends on the underlying volume type. If the transactional volume is on a RAID 5 volume, for example, follow those instructions in this table. |
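For the RAID 1 and RAID 5 cases in the table, the commands might look like the following; d20 and d10 match the earlier mirror example, while d30 is an illustrative RAID 5 volume name not defined in this example:

# metattach d20 d10
# metareplace -e d30 c0t1d0s2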
Replace hot spares that were deleted, and add them to the appropriate hot spare pool or pools.
# metahs -a hsp000 c0t1d0s6
hsp000: Hotspare is added
Validate the data.
Check the user/application data on all volumes. You might have to run an application-level consistency checker or use some other method to check the data.
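For a UFS file system on a volume, one basic check before returning the file system to service might be the following; the volume name is illustrative and the file system should be unmounted when you run it:

# fsck /dev/md/rdsk/d20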