CHAPTER 2
Introduction to DR on the Sun Fire 15K Server
This chapter describes general concepts that pertain to the dynamic reconfiguration (DR) feature on the Sun Fire 15K system.
The dynamic reconfiguration (DR) feature on the Sun Fire 15K server enables you to perform hardware configuration changes to a live domain that is running the Solaris operating environment, without causing machine downtime. You can also use DR in conjunction with hot-swap to physically remove boards from, or add them to, the server.
You can execute DR operations either from the Sun Fire 15K system controller (SC) by using the system management services (SMS) commands addboard(1M), moveboard(1M), deleteboard(1M), and rcfgadm(1M), or from the domain by using the cfgadm(1M) command.
The DR software has a command line interface using the cfgadm(1M) command, which runs the configuration administration program that you use to perform DR operations on a Solaris domain.
The optional Sun Management Center 3.0 Platform Update 4 software, which is designed for use on Sun Fire 15K systems, provides features such as domain management, as well as a graphical user interface (GUI) where you perform DR operations. If you prefer to use a graphical user interface instead of a command line interface, use the Sun Management Center 3.0 software.
To use the Sun Management Center 3.0 Platform Update 4 software, you must attach the system controller (SC) board to a network. With a network connection, you can view both the command line interface and the graphical user interface. For instructions on how to use the Sun Management Center 3.0 Platform Update 4 software, refer to the Sun Management Center 3.0 User's Guide, shipped with the Sun Management Center 3.0 Platform Update 4 software. For instructions on how to connect the system controller (SC) board to a network, see your system's installation documentation.
Automatic DR enables an application to execute DR operations without requiring user interaction. This ability is provided by an enhanced DR framework that includes the reconfiguration coordination manager (RCM) and the system event facility, called sysevent. The RCM enables application-specific loadable modules to register callbacks. The callbacks perform preparatory tasks before a DR operation, error recovery during a DR operation, and clean-up after a DR operation. The sysevent facility enables applications to register for system events and receive notifications of those events. The automatic DR framework interfaces with the RCM and the sysevent facility to enable applications to automatically give up resources before those resources are unconfigured, and to capture new resources as they are configured into the domain.
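The callback arrangement described above can be pictured with a small model. The following Python sketch is illustrative only: it is not the RCM interface, and the class name, the phase names ("prepare", "recover", "cleanup"), and the method names are all assumptions made for this example. It shows only the ordering of callbacks around an operation.

```python
from typing import Callable, Dict, List

# Hypothetical phase names; the real RCM interface is not shown in this chapter.
PHASES = ("prepare", "recover", "cleanup")


class CallbackRegistry:
    """Toy stand-in for a coordination manager: modules register callbacks per phase."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callable[[str], None]]] = {p: [] for p in PHASES}

    def register(self, phase: str, callback: Callable[[str], None]) -> None:
        """Register a callback for one of the known phases."""
        if phase not in self._callbacks:
            raise ValueError(f"unknown phase: {phase}")
        self._callbacks[phase].append(callback)

    def run_operation(self, resource: str, operation: Callable[[str], None]) -> bool:
        """Run prepare callbacks, then the operation.

        On success, run the cleanup callbacks and return True.
        On failure, run the recover callbacks and return False.
        """
        for callback in self._callbacks["prepare"]:
            callback(resource)
        try:
            operation(resource)
        except Exception:
            for callback in self._callbacks["recover"]:
                callback(resource)
            return False
        for callback in self._callbacks["cleanup"]:
            callback(resource)
        return True
```

In this model an application would register a "prepare" callback that releases its hold on a resource, so that the operation can proceed without user interaction.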
The DR feature enables you to hot-swap system boards without bringing the server down. It is used to unconfigure the resources on a faulty system board from a domain so that the system board can be removed from the server. The repaired or replacement board can be inserted into the domain while the Solaris operating environment continues to run. DR then configures the resources on the board into the domain. If you use DR to add or remove a system board or component, DR always leaves the board or component in a known configuration state. See Chapter 3 for more information about configuration states for system boards and components.
This section contains descriptions of general DR concepts that pertain to Sun Fire 15K domains. For more information about DR concepts on the SC, refer to the System Management Services (SMS) 1.2 Dynamic Reconfiguration User Guide.
For a device to be detachable, it must conform to the following items:
Critical resources must be redundant or accessible through an alternate pathway. CPUs and memory banks can be redundant critical resources. Disk drives are examples of critical resources that can be accessible through an alternate pathway.
Some boards cannot be detached because their resources cannot be moved. For example, if a domain has only one CPU board, that CPU board cannot be detached. An I/O board is not detachable if it controls the boot drive.
If there is no alternate pathway for an I/O board, you can:
Add a second path to the device through a second I/O board so that the I/O board can be detached without losing access to the secondary disk chain.
Note - If you are unsure whether a device is detachable, consult your Sun service representative.
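The two examples given above (a domain's only CPU board, and an I/O board that controls the boot drive) can be expressed as a simple check. This Python sketch is a simplified model, not a DR interface; the Board class and all of its field names are hypothetical, and real detachability depends on more conditions than are modeled here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Board:
    # All field names are hypothetical, chosen for this illustration only.
    name: str
    kind: str                         # "cpu" or "io"
    controls_boot_drive: bool = False
    has_alternate_path: bool = True   # critical devices reachable through another board


def is_detachable(board: Board, domain_boards: Iterable[Board]) -> bool:
    """Apply the chapter's two examples to decide whether a board may be detached."""
    if board.kind == "cpu":
        # A domain's only CPU board cannot be detached.
        cpu_boards = [b for b in domain_boards if b.kind == "cpu"]
        return len(cpu_boards) > 1
    if board.kind == "io":
        # An I/O board that controls the boot drive is not detachable.
        if board.controls_boot_drive:
            return False
        # Otherwise its critical devices need an alternate pathway.
        return board.has_alternate_path
    raise ValueError(f"unknown board kind: {board.kind}")
```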
During an unconfigure operation on a system board that has permanent memory (OpenBoot PROM or kernel memory), the operating environment is briefly paused. This pause is known as operating environment quiescence. All operating environment and device activity on the domain must cease during this critical phase of the operation.
Before it can achieve quiescence, the operating environment must temporarily suspend all processes, CPUs, and device activities. If the operating environment cannot achieve quiescence, it displays the reasons, which may include the following:
Real-time processes do not prevent quiescence.
The conditions that cause processes to fail to suspend are generally temporary. Examine the reasons for the failure, and if the operating environment encountered a failure to suspend a process, simply try the operation again.
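Because a failure to suspend is usually temporary, retrying is often sufficient. The following Python sketch shows one way a wrapper script might retry an operation a bounded number of times. The function name, the use of RuntimeError to signal a failed attempt, and the default attempt count are assumptions for illustration; the operation itself is supplied by the caller.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_operation(operation: Callable[[], T], attempts: int = 3,
                    delay_seconds: float = 0.0) -> T:
    """Call operation() until it succeeds or the attempts are used up.

    A RuntimeError from operation() is treated as a temporary failure
    (an assumption of this sketch). The last error is re-raised if every
    attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[RuntimeError] = None
    for attempt in range(attempts):
        try:
            return operation()
        except RuntimeError as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # wait before the next try
    assert last_error is not None
    raise last_error
```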
When DR suspends the operating environment, all device drivers that are attached to the operating environment must also be suspended. If a driver cannot be suspended (or subsequently resumed), the DR operation fails.
A suspend-safe device does not access memory or interrupt the system while the operating environment remains in quiescence. A driver is suspend-safe if it supports operating environment quiescence (if it can be suspended and then resumed). A suspend-safe driver also guarantees that when a suspend request is successfully completed, the device that the driver manages will not attempt to access memory, even if the device is open when the suspend request is made.
A suspend-unsafe device allows a memory access or a system interruption to occur while the operating environment is in quiescence.
DR uses an unsafe driver list in the dr.conf file to prevent unsafe devices from accessing memory or interrupting the operating environment during a DR operation. The dr.conf file resides in the following directory: /platform/SUNW,Sun-Fire-15000/kernel/drv/. The unsafe driver list is a property in the dr.conf file with the following format:
DR reads this list when it prepares to suspend the operating environment so that it can unconfigure a memory component. If DR finds an active driver in the unsafe driver list, it aborts the DR operation and returns an error message that identifies the active, unsafe driver. You must manually stop usage of the device by performing one or more of the following tasks:
Kill processes that are using the device.
Unload the driver using the modunload(1M) command.
Disconnect the cables (depending on the type of device).
You can retry the DR operation after you have stopped usage of the device.
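The check that DR performs against the unsafe driver list can be modeled as a set intersection. The following Python sketch is an illustration of the logic only; it does not read dr.conf, and the function names and the driver names used with it are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_active_unsafe_drivers(active_drivers: Iterable[str],
                               unsafe_list: Iterable[str]) -> List[str]:
    """Return the active drivers that also appear in the unsafe driver list."""
    unsafe = set(unsafe_list)
    return sorted(d for d in set(active_drivers) if d in unsafe)


def try_suspend(active_drivers: Iterable[str],
                unsafe_list: Iterable[str]) -> Tuple[bool, str]:
    """Model the abort decision: fail and name the drivers if any unsafe driver is active."""
    blocked = find_active_unsafe_drivers(active_drivers, unsafe_list)
    if blocked:
        return False, "active unsafe drivers: " + ", ".join(blocked)
    return True, "ok"
```

In this model, after the processes that use the device are stopped and the driver is unloaded, the driver no longer appears among the active drivers and the retry succeeds.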
Note - If you are unsure whether a device is suspend-safe, contact your Sun service representative.
An attachment point is a collective term that refers to a board slot, a system board installed in the slot, and any devices connected to the board. DR can display the status of the board, the board slot, and the attachment point. The term occupant refers to the combination of a board and its attached devices.
A board slot (sometimes referred to as a receptacle ) has the ability to electrically isolate the occupant from the host machine. The software can put a board slot into power-off mode.
Board slots can be named according to slot numbers, or can be anonymous (for example, a SCSI chain).
There are two types of attachment points:
A physical attachment point describes the software driver and location of the slot. Examples of physical attachment point names are:
/devices/pseudo/dr@0:SBx (for a CPU/memory board in slot 0)
-OR-
/devices/pseudo/dr@0:IOx (for an I/O board or Max CPU board in slot 1)
where x represents the number of a particular board (from 0 through 17).
Note - CPU/memory boards are installed only in slot 0. I/O boards and Max CPU boards are installed only in slot 1.
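The naming pattern for physical attachment points can be captured in a few lines. This Python sketch builds a name from a board type and board number according to the two patterns and the 0 through 17 range given above. The function name and the board-type keys are hypothetical labels chosen for this example.

```python
# Hypothetical board-type keys mapped to the name prefixes given in the text.
# CPU/memory boards use SB; I/O boards and Max CPU boards use IO.
_PREFIXES = {"cpu_memory": "SB", "io": "IO", "max_cpu": "IO"}


def physical_attachment_point(board_type: str, number: int) -> str:
    """Build a physical attachment point name such as /devices/pseudo/dr@0:SB4."""
    if board_type not in _PREFIXES:
        raise ValueError(f"unknown board type: {board_type}")
    if not 0 <= number <= 17:
        raise ValueError("board number must be from 0 through 17")
    return f"/devices/pseudo/dr@0:{_PREFIXES[board_type]}{number}"
```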
A logical attachment point is an abbreviated name created by the system to refer to the physical attachment point. Logical attachment points take one of the following two forms:
To obtain a list of all available logical attachment points, use the cfgadm(1M) command with its -l option.
A state is the operational status of either a board slot or its occupant. A condition is the operational status of an attachment point. The cfgadm(1M) command can display nine types of states and conditions. See Chapter 3 for descriptions of the conditions and states for system boards and components.
There are four main types of operations related to boards: connect, configure, unconfigure, and disconnect. A board that is brought into a domain is first connected and then configured. A board that is removed from a domain is first unconfigured and then disconnected.
During the connect operation, the system provides power to the slot, and the operating environment begins monitoring the board's temperature. For I/O boards, the connect operation is included in the configure operation. You connect a board before you configure it.
During the configure operation, the operating environment assigns functional roles to the board, and loads device drivers for the board and for devices attached to the board. You always connect a board before you configure it.
During the unconfigure operation, the system detaches the board logically from the operating environment and takes the associated device drivers offline. Environmental monitoring continues, but devices on the board are not available for system use. You unconfigure a board before you disconnect it.
During the disconnect operation, the system stops monitoring the board and power to the slot is turned off. You always unconfigure a board before you disconnect it.
To power-off a board that is in use (configured), first stop its use (unconfigure it), and then disconnect it from the domain. After a new or upgraded system board is inserted into the slot, connect the board and configure it.
The cfgadm (1M) command can connect and configure (or unconfigure and disconnect) in a single command. To connect and configure a board using a single command, see the section . To unconfigure and disconnect a board using a single command, see the section .
If necessary, each operation (connect, configure, unconfigure, or disconnect) can be performed separately using the cfgadm(1M) command.
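The required ordering of the four operations can be summarized as a small state machine. The following Python sketch uses three simplified, hypothetical state names ("disconnected", "connected", "configured"); they are not the states that cfgadm(1M) reports, which are described in Chapter 3. The sketch only enforces the ordering rules stated above.

```python
from typing import Dict, Iterable, Tuple

# (current state, operation) -> next state. State names are simplified assumptions.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("disconnected", "connect"): "connected",
    ("connected", "configure"): "configured",
    ("configured", "unconfigure"): "connected",
    ("connected", "disconnect"): "disconnected",
}


def apply_operation(state: str, operation: str) -> str:
    """Return the next state, or raise ValueError if the operation is out of order."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"cannot {operation} a board that is {state}") from None


def run_operations(state: str, operations: Iterable[str]) -> str:
    """Apply several operations in sequence and return the final state."""
    for operation in operations:
        state = apply_operation(state, operation)
    return state
```

For example, bringing a board into a domain corresponds to the sequence connect, configure; removing it corresponds to unconfigure, disconnect. Attempting to disconnect a configured board is rejected by the model.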
Hot-plug boards and modules have special connectors that supply electrical power to the board or module before the data pins make contact. Boards and devices that do not have hot-plug connectors cannot be inserted or removed while the system is running.
I/O boards and CPU/memory boards used in the Sun Fire 15K server are hot-plug devices. Some devices, such as the peripheral power supply, are not hot-plug modules and cannot be removed while the system is running.
The Sun Fire 15K server can be divided into dynamic system domains, which consist of logical and physical groupings of system board slots. Each domain is electrically isolated into hardware partitions, which ensures that a problem in one domain does not affect other domains.
Domain configuration is determined by the domain configuration table in the platform configuration database (PCD), which resides on the SC. The domain table controls how system board slots are logically partitioned into domains. The domain configuration represents the intended domain configuration. Thus, the configuration can include empty slots and occupied slots.
The number of slots available to a given domain is controlled by an available component list (ACL) that is maintained on the system controller. (Refer to the System Management Services (SMS) 1.2 Administrator Guide for more information about the ACL.) After a slot has been assigned to a domain, it becomes visible to that domain and unavailable and invisible to any other domain. Conversely, you must disconnect and unassign a slot from its domain before you can assign and connect it to another domain.
The logical domain is the set of slots that belong to the domain. The physical domain is the set of boards that are physically interconnected. A slot can be a member of a logical domain and not be part of a physical domain.
After a domain is booted, the system boards and empty slots can be assigned to (or unassigned from) a logical domain; however, they cannot become a part of the physical domain until the operating environment requests it.
System boards or slots that are not assigned to a domain are available to all domains in whose ACLs they are listed. These boards can be assigned to a domain by the platform administrator. Or, an ACL can be set up on the system controller to allow users with appropriate privileges to assign available boards to a domain.
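The relationship between ACLs and slot assignment can be modeled briefly. This Python sketch is a toy model, not the SMS implementation or the PCD format; the Platform class, its methods, and the slot names are hypothetical. It captures three rules from the text: a slot can be assigned only to a domain whose ACL lists it, an assigned slot is unavailable to every other domain, and a slot must be unassigned before it can be assigned elsewhere.

```python
from typing import Dict, Iterable, List, Set


class Platform:
    """Toy model of per-domain available component lists and slot ownership."""

    def __init__(self, acls: Dict[str, Iterable[str]]) -> None:
        self.acls: Dict[str, Set[str]] = {d: set(s) for d, s in acls.items()}
        self.owner: Dict[str, str] = {}  # slot -> domain it is assigned to

    def available_to(self, domain: str) -> List[str]:
        """Unassigned slots that are listed in the domain's ACL."""
        acl = self.acls.get(domain, set())
        return sorted(slot for slot in acl if slot not in self.owner)

    def assign(self, slot: str, domain: str) -> None:
        """Assign a slot to a domain if the ACL permits it and the slot is free."""
        if slot not in self.acls.get(domain, set()):
            raise PermissionError(f"{slot} is not in the ACL of domain {domain}")
        if slot in self.owner:
            raise ValueError(f"{slot} is already assigned to domain {self.owner[slot]}")
        self.owner[slot] = domain

    def unassign(self, slot: str) -> None:
        """Release a slot so that it becomes available again."""
        if slot not in self.owner:
            raise ValueError(f"{slot} is not assigned")
        del self.owner[slot]
```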
You can use DR to configure or to unconfigure several types of components.
You must use caution when you add or remove I/O boards to which devices are attached. Before you can remove an I/O board with devices attached, all of its devices must be closed and all of its file systems must be unmounted.
If you need to remove an I/O board with attached devices from a domain temporarily and then re-add it before any other boards with I/O devices are added, reconfiguration is not necessary. In this case, device paths to the board devices remain unchanged.
All I/O devices must be closed before they are unconfigured. If you encounter a problem with an I/O device, the following list can help you to overcome the problem.
Use the fuser(1M) command to see which processes have the device open.
Run showdevices(1M) on the SC to determine the state and usage of the device.
If disk mirroring is being used to access a device connected to the board, reconfigure the device so that it is accessible by way of controllers on other system boards.
Unmount file systems.
Remove multipathing databases from board-resident partitions. The location of multipathing databases is explicitly chosen by the user and can be changed.
Refer to the Solaris 9 Sun on Sun Hardware Release Notes Supplement for special instructions for I/O devices.
Remove any private regions used by volume managers. By default, volume managers use a private region on each device that they control. Such devices must be removed from volume manager control before they can be detached.
Take any RSM 2000 controllers offline by using the rm6 or rdacutil commands.
If a detach-unsafe device is present on the board, close all instances of the device and use modunload(1M) to unload the driver.
Caution - Unmounting file systems may affect NFS client systems.
Either kill any process that directly opens a device or raw partition, or direct it to close the open device on the board.
Before you can delete a board, the operating environment must vacate the memory on that board. Vacating a board entails flushing the contents of its non-permanent memory to swap space, and copying the contents of its permanent memory (that is, the kernel and OpenBoot PROM software) to another memory board.
To relocate permanent memory, the operating environment on a domain must be temporarily quiesced. The length of the quiescence depends on the domain I/O configuration and the running workloads.
Detaching a board with permanent memory is the only time when the operating environment is quiesced; therefore, you should know where permanent memory resides so that you can avoid impacting the operation of the domain significantly. To display the size of permanent memory, use the cfgadm(1M) command with its -v option.
To vacate a board that has permanent memory, the operating environment must find a sufficiently large block of available memory, called target memory, on which to copy the current contents of permanent memory, which is referred to as source memory.
A DR memory operation can be disallowed if the target domain does not have enough memory to hold the contents currently stored in permanent memory.
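The target memory requirement amounts to finding a free block at least as large as the source memory. The following Python sketch models that search. It is not the algorithm that the operating environment uses; picking the smallest block that fits is an assumption made here for illustration, and the function name and size units are hypothetical.

```python
from typing import Optional, Sequence


def choose_target_block(source_size: int, free_blocks: Sequence[int]) -> Optional[int]:
    """Return the index of the smallest free block that can hold the source memory.

    Returns None when no block is large enough, which corresponds to the
    case where the DR memory operation is disallowed.
    """
    candidates = [(size, index) for index, size in enumerate(free_blocks)
                  if size >= source_size]
    if not candidates:
        return None
    # min() compares by size first, then by index for equal sizes.
    return min(candidates)[1]
```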
Correctable memory errors indicate that the memory on a system board (that is, one or more of its Dual Inline Memory Modules (DIMMs), or portions of the hardware interconnect) may be faulty and need replacement. When the SC detects correctable memory errors, it initiates a record-stop dump to save the diagnostic data, which can interfere with a DR operation.
When a record-stop occurs from a correctable memory error, allow the record-stop dump to complete before you initiate a DR operation.
If the faulty component causes repeated reporting of correctable memory errors, the SC performs multiple record-stop dumps. If this happens, temporarily disable the dump-detection mechanism on the SC, allow the current dump to finish, and then initiate the DR operation. After the DR operation finishes, re-enable the dump detection.
DR lets you disconnect and then re-connect system circuit boards without bringing the system down. You can use DR to add or remove system resources while the system continues to operate.
To understand the dynamic reconfiguration of system resources, consider the following Sun Fire system configuration, as depicted in the diagram that follows: Domain A contains system boards 0 and 2, and I/O board 2. Domain B contains system boards 1 and 3, and I/O boards 1, 3, and 4.
To assign system board 4 and I/O board 0 to Domain A, and to move I/O board 4 from Domain B to Domain A, you can use the Sun Management Center software's GUI. Alternatively, you can perform the following steps manually on the command line in each domain:
1. Enter the following configuration command on the command line in Domain B to disconnect I/O board 4 from Domain B:
2. Then, enter the following single command on the command line in Domain A, which assigns, connects, and configures system board 4 and I/O boards 0 and 4 into Domain A:
The following system configuration is the result. Only the way in which the boards are connected has changed, not the physical layout of the boards within the cabinet.
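The two steps above can be sketched as argument lists for cfgadm(1M). This Python sketch is not a reproduction of the commands from the original procedure. It assumes only the general "cfgadm -c function ap_id" form and the logical names IO4, SB4, and IO0 for the boards in the example; any hardware-specific options that the real procedure passes (for example, to unassign the board from Domain B) are deliberately not shown. The function name is hypothetical, and the sketch builds the arguments without executing anything.

```python
from typing import List, Sequence

# State-change functions used in this chapter's discussion of board operations.
_FUNCTIONS = ("connect", "configure", "unconfigure", "disconnect")


def cfgadm_argv(function: str, ap_ids: Sequence[str]) -> List[str]:
    """Build an argument list of the form: cfgadm -c <function> <ap_id>..."""
    if function not in _FUNCTIONS:
        raise ValueError(f"unsupported function: {function}")
    if not ap_ids:
        raise ValueError("at least one attachment point is required")
    return ["cfgadm", "-c", function, *ap_ids]


# Step 1, run in Domain B (assumed attachment point name IO4).
step1 = cfgadm_argv("disconnect", ["IO4"])

# Step 2, run in Domain A (assumed attachment point names SB4, IO0, IO4).
step2 = cfgadm_argv("configure", ["SB4", "IO0", "IO4"])
```

Building argument lists rather than a single command string avoids shell quoting problems if the lists are later passed to a process launcher.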
Copyright © 2002, Sun Microsystems, Inc. All rights reserved.