CHAPTER 9
Dual Control Board Handling
A platform can be configured with dual control boards for redundancy purposes. One of the control boards is identified as the primary control board and the other control board is considered the spare. The switchover from the primary control board to the spare when a failure occurs is called control board failover. This failover is done automatically. If necessary, you can also force a control board failover.
This chapter explains how control boards function in a dual configuration and how control board failover works.
The control board executive (CBE) runs on the control board and facilitates communication between the SSP and the platform.
When power is applied, both control boards boot from the main SSP. After the CBE is booted, it waits for the control board server and the fod (failover) daemon running on the SSP to establish a connection. The connections between the fod daemon and the control board facilitate SSP and control board failover.
A failover task within CBE enables the main and spare SSP to establish connections for monitoring failover conditions. This task listens for and accepts TCP/IP connections from the fod daemons running on the main and spare SSP. The failover task also reads and transmits heartbeat messages to the fod daemons on both the main and spare SSP.
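The heartbeat bookkeeping described above can be pictured with a small Python sketch. This is only an illustrative model, not the CBE or fod implementation: the class name, the peer labels, and the 5-second timeout are all hypothetical, and the real message format and timing are not described in this chapter.

```python
import time


class HeartbeatTracker:
    """Track the last heartbeat seen from each peer (hypothetical model)."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        # timeout_s is an illustrative value, not an SSP default.
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}

    def record(self, peer):
        """Note that a heartbeat from `peer` arrived just now."""
        self.last_seen[peer] = self.clock()

    def stale_peers(self):
        """Return the peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(peer for peer, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)
```

A monitor built this way would call record() each time a heartbeat arrives on a connection and poll stale_peers() periodically to find links that have gone quiet.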
When the control board server running on the SSP connects to the CBE running on a control board, the CBE asserts the control board as the primary control board. The primary control board is responsible for the JTAG interface, which enables control board components to communicate with other Sun Enterprise 10000 system components so that the Sun Enterprise 10000 system can be monitored and configured. The primary control board also provides the system clock, which synchronizes and controls the speed of the centerplane, CPU clock, and system boards.
After the SSP is booted, the control board server (CBS) is started automatically, as are several other daemons, including the fod daemon. The CBS is responsible for all nonfailover communication between the SSP and the primary control board.
The CBS attempts to connect only to the primary control board identified in the control board configuration file.
Note - Do not manually modify the control board configuration file. Use the ssp_config(1M) command to change the control board configuration.
The format of the control board configuration file is as follows:
platform_name is the name assigned by the system administrator.
platform_type is Ultra-Enterprise-10000.
cb0_hostname is the host name for control board 0, if available.
status0 indicates whether control board 0 is the primary control board (P indicates primary; any other value indicates non-primary).
cb1_hostname is the host name for control board 1, if available.
status1 indicates whether control board 1 is the primary control board, using the same convention as status0.
This example indicates that there are two control boards in the xf2 platform: xf2-cb0 and xf2-cb1. xf2-cb0 is specified as the primary. See the cb_config(4) man page for more information.
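To make the field descriptions concrete, the following Python sketch reads one configuration line and reports which control board is marked as primary. It rests on assumptions: the colon separator and the sample line in the usage note are hypothetical, chosen only to match the six fields listed above, so consult cb_config(4) for the authoritative format. The sketch only reads the data; changes should still go through ssp_config(1M).

```python
from typing import NamedTuple, Optional


class CbConfig(NamedTuple):
    platform_name: str
    platform_type: str
    cb0_hostname: str
    status0: str
    cb1_hostname: str
    status1: str


def parse_cb_config(line: str, sep: str = ":") -> CbConfig:
    """Split one configuration line into its six fields.

    The separator is an assumption; missing trailing fields become "".
    """
    fields = line.strip().split(sep)
    fields += [""] * (6 - len(fields))  # pad if trailing fields are absent
    return CbConfig(*fields[:6])


def primary_hostname(cfg: CbConfig) -> Optional[str]:
    """Return the host name of the board marked P, or None if neither is."""
    if cfg.status0 == "P":
        return cfg.cb0_hostname
    if cfg.status1 == "P":
        return cfg.cb1_hostname
    return None
```

For a hypothetical line such as "xf2:Ultra-Enterprise-10000:xf2-cb0:P:xf2-cb1:", primary_hostname() returns xf2-cb0.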
The communication port used between the control board server and the control board executive is specified in /tftpboot/xxxxxxxx.cb_port, where xxxxxxxx is the control board IP address represented in hexadecimal format.
The main SSP is the boot server for the control board. Two files are downloaded by the control board boot PROM during boot time: the image of CBE and the port number specification file. These files are located in /tftpboot on the SSP and the naming conventions are:
where xxxxxxxx is the control board IP address in hex format.
For example, if the IP address of xf2-cb0 is 129.153.3.19, the files for control board xf2-cb0 are:
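The hexadecimal portion of these file names can be derived from the dotted-decimal IP address by converting each octet to two hexadecimal digits. The Python sketch below shows the conversion and builds the .cb_port path. The function names are hypothetical, and the use of uppercase hexadecimal digits is an assumption (it makes no difference for the example address, which yields digits only).

```python
def ip_to_hex(ip: str) -> str:
    """Convert a dotted-decimal IPv4 address to eight hexadecimal digits."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a dotted-decimal IPv4 address: {ip!r}")
    # Uppercase digits are an assumption, not confirmed by this chapter.
    return "".join(f"{o:02X}" for o in octets)


def cb_port_path(ip: str) -> str:
    """Build the path of the port number specification file."""
    return f"/tftpboot/{ip_to_hex(ip)}.cb_port"
```

For 129.153.3.19, the octets 129, 153, 3, and 19 become 81, 99, 03, and 13, so ip_to_hex() returns 81990313 and cb_port_path() returns /tftpboot/81990313.cb_port.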
Control board failover is automatically enabled upon SSP installation or upgrade. The fod daemon monitors the control boards and other failover components. If the primary control board is not functioning properly, the fod daemon triggers an automatic failover to the spare control board. A control board failure can be caused by any of the following:
A clock failure
When a clock failure occurs, all active domains arbstop simultaneously and a control board failover is automatically triggered. Both the system clock and JTAG interface are automatically moved to the spare control board. When the new control board is started, normal EDD recovery actions reboot the Sun Enterprise 10000 domains.
A JTAG interface failure
If the SSP cannot communicate with the JTAG interface, the SSP determines that the control board failed and automatically triggers a control board failover.
Failure of the Ethernet interface on the control board
Failure of the control board processor
Disconnected cable between the control board and the hub
Failure of the hub connected to the control board
Disconnected cable between the main SSP and the hub
Failure of the SSP network interface card (NIC) for the control board network
User error caused by disabling the NIC for the control board network
Note that under certain failure conditions the fod daemon can disable a control board failover. For a detailed description of the failure conditions and a summary of the failover actions performed, see Chapter 10.
A control board failover can be either partial or complete, depending on whether domains are running:
If domains are active and a control board failure condition is detected, a partial failover occurs.
In a partial failover, the JTAG interface is moved from the primary control board to the spare. However, the system clock source remains on the failed primary control board. You must complete the control board failover so that both the JTAG interface and system clock source are managed by the same control board. For details, see To Force a Complete Control Board Failover.
If no domains are running and a control board failure condition is detected, a complete failover occurs.
In a complete control board failover, both the JTAG interface and the system clock source are moved from the primary control board to the spare.
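The partial and complete cases above can be summarized in a short Python model. It is only a sketch of the rule stated in this section (active domains lead to a partial failover, no running domains lead to a complete one); the type and function names and the board labels are hypothetical, and the model ignores the conditions under which the fod daemon disables failover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CbState:
    jtag: str   # board that manages the JTAG interface
    clock: str  # board that provides the system clock source


def fail_over(state: CbState, spare: str, domains_active: bool) -> CbState:
    """Return the state after an automatic failover to `spare`."""
    if domains_active:
        # Partial failover: only JTAG moves; the clock stays where it was.
        return CbState(jtag=spare, clock=state.clock)
    # Complete failover: JTAG and the clock source both move to the spare.
    return CbState(jtag=spare, clock=spare)


def is_partial(state: CbState) -> bool:
    """True when JTAG and the clock are managed by different boards."""
    return state.jtag != state.clock
```

Starting from CbState("cb0", "cb0"), a failover to cb1 with domains active leaves the clock on cb0, so is_partial() is true and a forced complete failover is still required.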
You can enable, disable, or force a control board failover as explained in the following procedures. Use the setfailover(1M) command on the main SSP to manage the failover state. For example, after a control board failover occurs, you must use the setfailover(1M) command to re-enable the control board failover capability.
To disable control board failover, as user ssp on the main SSP, type:
Control board failover remains disabled until you enable it. To determine whether control board failover was disabled, use the showfailover(1M) command to verify the failover state, as explained in Obtaining Control Board Failover Information.
To enable control board failover, as user ssp on the main SSP, type:
Control board failover is activated when all the connection links are functioning properly. If any failed connections exist, control board failover is not enabled. You can use the showfailover(1M) command to verify that control board failover is enabled and review the connection status.
1. If any domains are running, shut down those domains using the standard shutdown(1M) command.
2. Log in to the main SSP as user ssp.
3. To ensure that domains do not arbstop, do the following:
a. Stop event detection monitoring.
b. Power off all of the system boards.
c. Power on all of the system boards.
d. Start event detection monitoring.
4. Type the following to force the control board failover:
5. Issue the bringup(1M) command for all domains.
6. Re-enable control board failover as described in To Enable Control Board Failover.
Use the showfailover(1M) command on the main SSP to obtain the SSP or control board failover state and the status of the private connection links. The output also lists the names of the SSPs and control boards and identifies the control boards responsible for the JTAG interface and system clock. For details on the failover information displayed, see Obtaining Failover Status Information.
The following example shows the information displayed for a control board failover in which the primary control board failed.
You can also use Hostview to verify the type of control board failover (complete or partial). When you use Hostview to verify a control board, the "J" (JTAG) and "C" (system clock source) characters indicate which control board manages the JTAG interface and system clock.
FIGURE 9-1 shows an example Hostview window after a partial control board failover. One control board handles the JTAG interface, while the other serves as the system clock source.
After a control board failover occurs, you must perform certain recovery tasks:
Identify the failure point or condition that caused the failover and determine how to correct the failure.
For example, if a control board failover occurred due to a faulty control board, you must determine whether you need to replace the failed control board.
Use the showfailover(1M) command to review the failover state and verify which control board is responsible for the JTAG interface and system clock. Review the connection map in the showfailover output and the summary of the failover detection points in Chapter 10.
You can also examine the platform log file for other error conditions and determine the corrective action needed to reactivate the failed components.
If a partial failover occurred, resynchronize the JTAG and system clock interfaces so that they are managed by the same control board.
To resynchronize the JTAG and system clock interfaces, perform a complete control board failover as described in To Force a Complete Control Board Failover. The first domain that is brought up resynchronizes the system clock and the JTAG interface on the primary control board.
Once you have resolved the control board failure, re-enable control board failover (for details, see To Enable Control Board Failover).
Copyright © 2002, Sun Microsystems, Inc. All rights reserved.