Recovering From State Database Replica Failures
How to Recover From Insufficient State Database Replicas
If the state database replica quorum is not met, for example, due to a drive failure, the system cannot be rebooted into multiuser mode. This situation could follow a panic (when Solaris Volume Manager discovers that fewer than half the state database replicas are available) or could occur if the system is rebooted with exactly half or fewer functional state database replicas. In Solaris Volume Manager terms, the state database has gone "stale." This task explains how to recover from this problem.
Boot the system to determine which state database replicas are down.
Determine which state database replicas are unavailable
Use the following format of the metadb command:
metadb -i
If one or more disks are known to be unavailable, delete the state database replicas on those disks. Otherwise, delete enough errored state database replicas (W, M, D, F, or R status flags reported by metadb) to ensure that a majority of the existing state database replicas are not errored.
Delete the state database replica on the bad disk using the metadb -d command.
Tip - State database replicas with a capitalized status flag are in error, while those with lowercase status flags are functioning normally.
Verify that the replicas have been deleted by using the metadb command.
Reboot.
If necessary, you can replace the disk, format it appropriately, then add any state database replicas needed to the disk. Following the instructions in "Creating State Database Replicas".
Once you have a replacement disk, halt the system, replace the failed disk, and once again, reboot the system. Use the format command or the fmthard command to partition the disk as it was configured before the failure.
Example--Recovering From Stale State Database Replicas
In the following example, a disk containing seven replicas has gone bad. This leaves the system with only three good replicas, and the system panics, then cannot reboot into multi-user mode.
panic[cpu0]/thread=70a41e00: md: state database problem 403238a8 md:mddb_commitrec_wrapper+6c (2, 1, 70a66ca0, 40323964, 70a66ca0, 3c) %l0-7: 0000000a 00000000 00000001 70bbcce0 70bbcd04 70995400 00000002 00000000 40323908 md:alloc_entry+c4 (70b00844, 1, 9, 0, 403239e4, ff00) %l0-7: 70b796a4 00000001 00000000 705064cc 70a66ca0 00000002 00000024 00000000 40323968 md:md_setdevname+2d4 (7003b988, 6, 0, 63, 70a71618, 10) %l0-7: 70a71620 00000000 705064cc 70b00844 00000010 00000000 00000000 00000000 403239f8 md:setnm_ioctl+134 (7003b968, 100003, 64, 0, 0, ffbffc00) %l0-7: 7003b988 00000000 70a71618 00000000 00000000 000225f0 00000000 00000000 40323a58 md:md_base_ioctl+9b4 (157ffff, 5605, ffbffa3c, 100003, 40323ba8, ff1b5470) %l0-7: ff3f2208 ff3f2138 ff3f26a0 00000000 00000000 00000064 ff1396e9 00000000 40323ad0 md:md_admin_ioctl+24 (157ffff, 5605, ffbffa3c, 100003, 40323ba8, 0) %l0-7: 00005605 ffbffa3c 00100003 0157ffff 0aa64245 00000000 7efefeff 81010100 40323b48 md:mdioctl+e4 (157ffff, 5605, ffbffa3c, 100003, 7016db60, 40323c7c) %l0-7: 0157ffff 00005605 ffbffa3c 00100003 0003ffff 70995598 70995570 0147c800 40323bb0 genunix:ioctl+1dc (3, 5605, ffbffa3c, fffffff8, ffffffe0, ffbffa65) %l0-7: 0114c57c 70937428 ff3f26a0 00000000 00000001 ff3b10d4 0aa64245 00000000 panic: stopped at edd000d8: ta %icc,%g0 + 125 Type 'go' to resume ok boot -s Resetting ... Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz), No Keyboard OpenBoot 3.11, 128 MB memory installed, Serial #9841776. Ethernet address 8:0:20:96:2c:70, Host ID: 80962c70. Rebooting with command: boot -s Boot device: /pci@1f,0/pci@1,1/ide@3/disk@0,0:a File and args: -s SunOS Release 5.9 Version s81_39 64-bit Copyright 1983-2001 Sun Microsystems, Inc. All rights reserved. configuring IPv4 interfaces: hme0. Hostname: dodo metainit: dodo: stale databases Insufficient metadevice database replicas located. Use metadb to delete databases which are broken. Ignore any "Read-only file system" error messages. Reboot the system when finished to reload the metadevice database. After reboot, repair any broken database replicas which were deleted. Type control-d to proceed with normal startup, (or give root password for system maintenance): root password single-user privilege assigned to /dev/console. Entering System Maintenance Mode Jun 7 08:57:25 su: 'su root' succeeded for root on /dev/console Sun Microsystems Inc. SunOS 5.9 s81_39 May 2002 # metadb -i flags first blk block count a m p lu 16 8192 /dev/dsk/c0t0d0s7 a p l 8208 8192 /dev/dsk/c0t0d0s7 a p l 16400 8192 /dev/dsk/c0t0d0s7 M p 16 unknown /dev/dsk/c1t1d0s0 M p 8208 unknown /dev/dsk/c1t1d0s0 M p 16400 unknown /dev/dsk/c1t1d0s0 M p 24592 unknown /dev/dsk/c1t1d0s0 M p 32784 unknown /dev/dsk/c1t1d0s0 M p 40976 unknown /dev/dsk/c1t1d0s0 M p 49168 unknown /dev/dsk/c1t1d0s0 # metadb -d c1t1d0s0 # metadb flags first blk block count a m p lu 16 8192 /dev/dsk/c0t0d0s7 a p l 8208 8192 /dev/dsk/c0t0d0s7 a p l 16400 8192 /dev/dsk/c0t0d0s7 # |
The system paniced because it could no longer detect state database replicas on slice /dev/dsk/c1t1d0s0, which is part of the failed disk or attached to a failed controller. The first metadb -i command identifies the replicas on this slice as having a problem with the master blocks.
When you delete the stale state database replicas, the root (/) file system is read-only. You can ignore the mddb.cf error messages displayed.
At this point, the system is again functional, although it probably has fewer state database replicas than it should, and any volumes that used part of the failed storage are also either failed, errored, or hot-spared; those issues should be addressed promptly.