Recovering From State Database Replica Failures

How to Recover From Insufficient State Database Replicas

If the state database replica quorum is not met, for example, due to a drive failure, the system cannot be rebooted into multiuser mode. This situation could follow a panic (when Solaris Volume Manager discovers that fewer than half the state database replicas are available) or could occur if the system is rebooted with exactly half or fewer functional state database replicas. In Solaris Volume Manager terms, the state database has gone "stale." This task explains how to recover from this problem.

Boot the system to determine which state database replicas are down.

Determine which state database replicas are unavailable
Use the following format of the metadb command:
metadb -i

If one or more disks are known to be unavailable, delete the state database replicas on those disks. Otherwise, delete enough errored state database replicas (W, M, D, F, or R status flags reported by metadb) to ensure that a majority of the existing state database replicas are not errored.
Delete the state database replica on the bad disk using the metadb -d command.

Tip - State database replicas with a capitalized status flag are in error, while those with lowercase status flags are functioning normally.

Verify that the replicas have been deleted by using the metadb command.

Reboot.

If necessary, you can replace the disk, format it appropriately, then add any state database replicas needed to the disk. Following the instructions in "Creating State Database Replicas".
Once you have a replacement disk, halt the system, replace the failed disk, and once again, reboot the system. Use the format command or the fmthard command to partition the disk as it was configured before the failure.

Example--Recovering From Stale State Database Replicas

In the following example, a disk containing seven replicas has gone bad. This leaves the system with only three good replicas, and the system panics, then cannot reboot into multi-user mode.

panic[cpu0]/thread=70a41e00: md: state database problem

403238a8 md:mddb_commitrec_wrapper+6c (2, 1, 70a66ca0, 40323964, 70a66ca0, 3c)
  %l0-7: 0000000a 00000000 00000001 70bbcce0 70bbcd04 70995400 00000002 00000000
40323908 md:alloc_entry+c4 (70b00844, 1, 9, 0, 403239e4, ff00)
  %l0-7: 70b796a4 00000001 00000000 705064cc 70a66ca0 00000002 00000024 00000000
40323968 md:md_setdevname+2d4 (7003b988, 6, 0, 63, 70a71618, 10)
  %l0-7: 70a71620 00000000 705064cc 70b00844 00000010 00000000 00000000 00000000
403239f8 md:setnm_ioctl+134 (7003b968, 100003, 64, 0, 0, ffbffc00)
  %l0-7: 7003b988 00000000 70a71618 00000000 00000000 000225f0 00000000 00000000
40323a58 md:md_base_ioctl+9b4 (157ffff, 5605, ffbffa3c, 100003, 40323ba8, ff1b5470)
  %l0-7: ff3f2208 ff3f2138 ff3f26a0 00000000 00000000 00000064 ff1396e9 00000000
40323ad0 md:md_admin_ioctl+24 (157ffff, 5605, ffbffa3c, 100003, 40323ba8, 0)
  %l0-7: 00005605 ffbffa3c 00100003 0157ffff 0aa64245 00000000 7efefeff 81010100
40323b48 md:mdioctl+e4 (157ffff, 5605, ffbffa3c, 100003, 7016db60, 40323c7c)
  %l0-7: 0157ffff 00005605 ffbffa3c 00100003 0003ffff 70995598 70995570 0147c800
40323bb0 genunix:ioctl+1dc (3, 5605, ffbffa3c, fffffff8, ffffffe0, ffbffa65)
  %l0-7: 0114c57c 70937428 ff3f26a0 00000000 00000001 ff3b10d4 0aa64245 00000000

panic: 
stopped at      edd000d8:       ta      %icc,%g0 + 125
Type  'go' to resume

ok boot -s
Resetting ... 

Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 270MHz), No Keyboard
OpenBoot 3.11, 128 MB memory installed, Serial #9841776.
Ethernet address 8:0:20:96:2c:70, Host ID: 80962c70.



Rebooting with command: boot -s                                       
Boot device: /pci@1f,0/pci@1,1/ide@3/disk@0,0:a  File and args: -s
SunOS Release 5.9 Version s81_39 64-bit

Copyright 1983-2001 Sun Microsystems, Inc.  All rights reserved.
configuring IPv4 interfaces: hme0.
Hostname: dodo

metainit: dodo: stale databases

Insufficient metadevice database replicas located.

Use metadb to delete databases which are broken.
Ignore any "Read-only file system" error messages.
Reboot the system when finished to reload the metadevice database.
After reboot, repair any broken database replicas which were deleted.

Type control-d to proceed with normal startup,
(or give root password for system maintenance): root password
single-user privilege assigned to /dev/console.
Entering System Maintenance Mode

Jun  7 08:57:25 su: 'su root' succeeded for root on /dev/console
Sun Microsystems Inc.   SunOS 5.9       s81_39  May 2002
# metadb -i
        flags           first blk       block count
     a m  p  lu         16              8192            /dev/dsk/c0t0d0s7
     a    p  l          8208            8192            /dev/dsk/c0t0d0s7
     a    p  l          16400           8192            /dev/dsk/c0t0d0s7
    M     p             16              unknown         /dev/dsk/c1t1d0s0
    M     p             8208            unknown         /dev/dsk/c1t1d0s0
    M     p             16400           unknown         /dev/dsk/c1t1d0s0
    M     p             24592           unknown         /dev/dsk/c1t1d0s0
    M     p             32784           unknown         /dev/dsk/c1t1d0s0
    M     p             40976           unknown         /dev/dsk/c1t1d0s0
    M     p             49168           unknown         /dev/dsk/c1t1d0s0
# metadb -d c1t1d0s0
# metadb
        flags           first blk       block count
     a m  p  lu         16              8192            /dev/dsk/c0t0d0s7
     a    p  l          8208            8192            /dev/dsk/c0t0d0s7
     a    p  l          16400           8192            /dev/dsk/c0t0d0s7
#

The system paniced because it could no longer detect state database replicas on slice /dev/dsk/c1t1d0s0, which is part of the failed disk or attached to a failed controller. The first metadb -i command identifies the replicas on this slice as having a problem with the master blocks.

When you delete the stale state database replicas, the root (/) file system is read-only. You can ignore the mddb.cf error messages displayed.

At this point, the system is again functional, although it probably has fewer state database replicas than it should, and any volumes that used part of the failed storage are also either failed, errored, or hot-spared; those issues should be addressed promptly.


24. Troubleshooting Solaris Volume Manager Boot Problems How to Recover From a Boot Device Failure