SRDB ID   Synopsis   Date
49095   Data corruption on the Sun StorEdge [TM] T3 after a hardware failure   26 Nov 2002

Status Issued

Description

It is important to understand how the Sun StorEdge[TM] T3 cache settings work in order to avoid data corruption when components fail. There is plenty of documentation about the recommended settings, but very little about what can happen if the Sun StorEdge[TM] T3 is configured with non-standard settings. Occasionally we see instances of data corruption on the Sun StorEdge[TM] T3 after a hardware failure that can be traced back to the cache settings. This affects all types of Sun StorEdge[TM] T3s, including the Sun StorEdge[TM] T3+ or Sun StorEdge[TM] T3B.

SOLUTION SUMMARY:

The command to check the settings on the Sun StorEdge[TM] T3 is `sys list`. The settings we're interested in are "cache" and "mirror". By default, both parameters are set to "auto" to allow the T3 to set them appropriately without user intervention, like so:

my_t3:/:<1>sys list
blocksize : 64k
cache : auto
mirror : auto
....

The behavior of the "auto" settings is documented in the Sun StorEdge[TM] T3 Disk Tray Configuration Guide. With this configuration, the data is protected even in the event of hardware failures because the cache is mirrored.

The Sun StorEdge[TM] T3 allows users to configure it in a way that leaves it vulnerable to data corruption if the cache is lost. If the cache is not mirrored (either because it is a single-brick configuration with cache mirroring set to "auto", or because it is a partner pair with sys mirror set to "off") and the Sun StorEdge[TM] T3 is set to "writebehind" mode (forcing it to use the cache for writes), there is a danger of data corruption. For example:

my_t3:/:<1>sys list
blocksize : 64k
cache : writebehind
mirror : off
....

In this situation, if there is a hardware failure that prevents cache data from being flushed to disk (such as a controller failure) and there are pending writes in the Sun StorEdge[TM] T3 cache, data corruption will occur. Even though the OS received acknowledgement that the writes were completed, they have not actually been written to disk yet, and when access to the cache is lost, so is the data in the pending writes. Since the writes never complete, the data on disk becomes inconsistent and cannot be repaired. Even after the hardware problem is fixed, some or all of the data will generally need to be restored.
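The check described above can be sketched as a small script that reads captured `sys list` output and flags the risky combination. This is an illustrative sketch only, not a Sun-supplied tool: the function names `parse_sys_list` and `cache_at_risk` are hypothetical, the script assumes the output has the "name : value" layout shown in the examples, and whether the array is a partner pair must be supplied by the administrator because it is not part of the output shown. It also assumes that a single brick never has a mirrored cache, whatever the "mirror" setting says.

```python
import re

# Matches settings lines such as "cache : writebehind" and skips the
# prompt line and the "...." continuation marker.
SETTING_LINE = re.compile(r"^\s*(\w+)\s+:\s+(\S+)\s*$")


def parse_sys_list(output):
    """Return the 'name : value' pairs found in `sys list` output as a dict."""
    settings = {}
    for line in output.splitlines():
        match = SETTING_LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def cache_at_risk(settings, partner_pair):
    """Return True if pending writes could be lost with the cache.

    partner_pair is supplied by the caller (assumption: it cannot be
    derived from the settings shown in this article).
    """
    cache = settings.get("cache")
    mirror = settings.get("mirror")
    # Assumption: only a partner pair can mirror its cache, and it does so
    # unless mirroring has been explicitly turned off.
    mirrored = partner_pair and mirror != "off"
    return cache == "writebehind" and not mirrored


if __name__ == "__main__":
    captured = """my_t3:/:<1>sys list
blocksize : 64k
cache : writebehind
mirror : off
....
"""
    settings = parse_sys_list(captured)
    if cache_at_risk(settings, partner_pair=True):
        print("WARNING: writebehind cache without mirroring")
    else:
        print("Cache settings do not match the risky combination")
```

The script only reports the configured values. It does not model how the "auto" settings behave at run time, which is covered in the Sun StorEdge[TM] T3 Disk Tray Configuration Guide.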

INTERNAL SUMMARY:

SUBMITTER: John Mountain
APPLIES TO: AFO Vertical Team Docs/Storage, Hardware/Disk Storage Subsystem/StorEdge Disk Array/StorEdge T3
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.