CHAPTER 9

Dual Control Board Handling

A platform can be configured with dual control boards for redundancy. One control board is identified as the primary control board, and the other is considered the spare. The switchover from the primary control board to the spare when a failure occurs is called control board failover. Failover happens automatically, but you can also force a control board failover if necessary.

This chapter explains how control boards function in a dual configuration and how control board failover works.

Note - You can have dual control boards in a single SSP configuration, as well as in a dual SSP configuration (main and spare SSP). Control board failover works the same in either a single or dual SSP configuration.

Control Board Executive

The control board executive (CBE) runs on the control board and facilitates communication between the SSP and the platform.

When power is applied, both control boards boot from the main SSP. After the CBE is booted, it waits for the control board server and the fod (failover) daemon running on the SSP to establish a connection. The connections between the fod daemon and the control board facilitate SSP and control board failover.

A failover task within CBE enables the main and spare SSP to establish connections for monitoring failover conditions. This task listens for and accepts TCP/IP connections from the fod daemons running on the main and spare SSP. The failover task also reads and transmits heartbeat messages to the fod daemons on both the main and spare SSP.

Primary Control Board

When the control board server running on the SSP connects to the CBE running on a control board, the CBE asserts the control board as the primary control board. The primary control board is responsible for the JTAG interface, which enables control board components to communicate with other Sun Enterprise 10000 system components so that the Sun Enterprise 10000 system can be monitored and configured. The primary control board also provides the system clock, which synchronizes and controls the speed of the centerplane, CPU clock, and system boards.

Control Board Server

After the SSP is booted, the control board server (CBS) is started automatically, as are several other daemons, including the fod daemon. The CBS is responsible for all nonfailover communication between the SSP and the primary control board.

The CBS attempts to connect only to the primary control board identified in the control board configuration file.

Note - Do not manually modify the control board configuration file. Use the ssp_config(1M) command to change the control board configuration.

The format of the control board configuration file is as follows:

platform_name:platform_type:cb0_hostname:status0:cb1_hostname:status1

where:

platform_name is the name assigned by the system administrator.

platform_type is Ultra-Enterprise-10000.

cb0_hostname is the host name for control board 0, if available.

status0 indicates whether control board 0 is the primary control board (P indicates primary, and anything else indicates non-primary).

cb1_hostname is the host name for control board 1, if available.

status1 indicates whether control board 1 is the primary control board (P indicates primary, and anything else indicates non-primary).

For example:

xf2:Ultra-Enterprise-10000:xf2-cb0:P:xf2-cb1:

This example indicates that there are two control boards in the xf2 platform: xf2-cb0 and xf2-cb1. xf2-cb0 is specified as the primary. See the cb_config(4) man page for more information.
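As an illustration of the field layout described above, the following Python sketch splits one configuration line into its six colon-separated fields and reports which board is marked P. The names ControlBoardConfig and parse_cb_config are hypothetical and are not part of the SSP software. The sketch only reads a line; it does not modify the file, which should still be changed only with ssp_config(1M).

```python
from typing import NamedTuple, Optional


class ControlBoardConfig(NamedTuple):
    """One line of the control board configuration file (hypothetical helper)."""
    platform_name: str
    platform_type: str
    cb0_hostname: str
    status0: str
    cb1_hostname: str
    status1: str

    def primary(self) -> Optional[str]:
        """Return the host name of the board whose status is P, or None."""
        if self.status0 == "P":
            return self.cb0_hostname
        if self.status1 == "P":
            return self.cb1_hostname
        return None


def parse_cb_config(line: str) -> ControlBoardConfig:
    """Split a configuration line into its six colon-separated fields."""
    fields = line.strip().split(":")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    return ControlBoardConfig(*fields)


cfg = parse_cb_config("xf2:Ultra-Enterprise-10000:xf2-cb0:P:xf2-cb1:")
print(cfg.primary())
```

For the example line shown earlier, the trailing colon yields an empty status1 field, so only xf2-cb0 is reported as primary.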

The port used for communication between the control board server and the control board executive is specified in /tftpboot/xxxxxxxx.cb_port, where xxxxxxxx is the control board IP address represented in hexadecimal format.

Control Board Executive Image and Port Specification Files

The main SSP is the boot server for the control board. Two files are downloaded by the control board boot PROM during boot time: the image of CBE and the port number specification file. These files are located in /tftpboot on the SSP and the naming conventions are:

/tftpboot/xxxxxxxx for the CBE image

/tftpboot/xxxxxxxx.cb_port for the port number

where xxxxxxxx is the control board IP address in hex format.

For example, if the IP address of xf2-cb0 is 129.153.3.19, the files for control board xf2-cb0 are:

/tftpboot/81990313

/tftpboot/81990313.cb_port
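The naming convention can be checked with a short Python sketch that converts a dotted IPv4 address into the eight-digit hexadecimal form. The function name tftpboot_names is hypothetical. The example above contains only the digits 0 through 9, so it does not show which letter case is used for hex digits A through F; the sketch assumes uppercase.

```python
import ipaddress
from typing import Tuple


def tftpboot_names(ip: str) -> Tuple[str, str]:
    """Return the (CBE image, port file) paths for a control board IP address."""
    # Each of the four address bytes becomes two hex digits.
    # Uppercase is an assumption; the source example does not show letter digits.
    hex_name = ipaddress.IPv4Address(ip).packed.hex().upper()
    image = "/tftpboot/" + hex_name
    return image, image + ".cb_port"


image, port_file = tftpboot_names("129.153.3.19")
print(image)
print(port_file)
```

Here 129, 153, 3, and 19 map to 81, 99, 03, and 13, which matches the file names listed for xf2-cb0.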

Automatic Failover to the Spare Control Board

Control board failover is automatically enabled upon SSP installation or upgrade. The fod daemon performs failover monitoring of the control boards and other failover components. If the primary control board is not functioning properly, the fod daemon triggers an automatic failover to the spare control board. A control board failure can be caused by any of the following:

A clock failure

When a clock failure occurs, all active domains arbstop simultaneously and a control board failover is automatically triggered. Both the system clock and JTAG interface are automatically moved to the spare control board. When the new control board is started, normal EDD recovery actions reboot the Sun Enterprise 10000 domains.

A JTAG interface failure

If the SSP cannot communicate with the JTAG interface, the SSP determines that the control board failed and automatically triggers a control board failover.

Failure of the Ethernet interface on the control board
Failure of the control board processor
Disconnected cable between the control board and the hub
Failure of the hub connected to the control board
Disconnected cable between the main SSP and the hub
Failure of the SSP network interface card (NIC) for the control board network
User error caused by disabling the NIC for the control board network

Note that under certain failure conditions the fod daemon can disable a control board failover. For a detailed description of the failure conditions and a summary of the failover actions performed, see Chapter 10.

A control board failover can be either partial or complete, depending on whether domains are running:

If domains are active and a control board failure condition is detected, a partial failover occurs.

In a partial failover, the JTAG interface is moved from the primary control board to the spare. However, the system clock source remains on the failed primary control board. You must complete the control board failover so that both the JTAG interface and system clock source are managed by the same control board. For details, see To Force a Complete Control Board Failover.

If no domains are running and a control board failure condition is detected, a complete failover occurs.

In a complete control board failover, both the JTAG interface and the system clock source are moved from the primary control board to the spare.
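The rule above can be summarized in a small Python sketch. It models only the documented distinction between partial and complete failover based on whether domains are running; it does not reproduce the fod daemon's actual decision logic, and the names FailoverOutcome and failover_outcome are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverOutcome:
    kind: str       # "partial" or "complete"
    jtag_on: str    # board that manages the JTAG interface afterward
    clock_on: str   # board that provides the system clock afterward


def failover_outcome(domains_active: bool, primary: str, spare: str) -> FailoverOutcome:
    """Describe where JTAG and the system clock end up after a failover."""
    if domains_active:
        # Partial: JTAG moves to the spare, the clock stays on the old primary.
        return FailoverOutcome("partial", jtag_on=spare, clock_on=primary)
    # Complete: both JTAG and the system clock move to the spare.
    return FailoverOutcome("complete", jtag_on=spare, clock_on=spare)


print(failover_outcome(True, "xf2-cb0", "xf2-cb1"))
print(failover_outcome(False, "xf2-cb0", "xf2-cb1"))
```

A partial outcome is the case in which jtag_on and clock_on differ, which is the condition that the forced complete failover procedure is meant to resolve.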

Managing Control Board Failover

You can enable, disable, or force a control board failover as explained in the following procedures. Use the setfailover(1M) command on the main SSP to manage the failover state. For example, after a control board failover occurs, you must use the setfailover(1M) command to re-enable the control board failover capability.

To Disable Control Board Failover

As user ssp on the main SSP, type:

ssp% setfailover -t cb off

Control board failover remains disabled until you enable it. To confirm that control board failover is disabled, use the showfailover(1M) command to check the failover state, as explained in Obtaining Control Board Failover Information.

To Enable Control Board Failover

As user ssp on the main SSP, type:

ssp% setfailover -t cb on

Control board failover is activated when all the connection links are functioning properly. If any failed connections exist, control board failover is not enabled. You can use the showfailover(1M) command to verify that control board failover is enabled and review the connection status.

To Force a Complete Control Board Failover

Note - To force a complete control board failover, in which both the JTAG connection and the system clock source are moved from the primary control board to the spare, you must shut down any running domains and then power off and power on all system boards before you switch control boards. If you do not shut down all the domains, a partial control board failover occurs: the JTAG connection is moved to the spare control board, but the system clock source remains on the former primary control board.

1. If any domains are running, shut down those domains using the standard shutdown(1M) command.

2. Log in to the main SSP as user ssp.

3. To ensure that domains do not arbstop, do the following:

a. Stop event detection monitoring.

ssp% edd_cmd -x stop

b. Power off all of the system boards.

ssp% power -off -all

c. Power on all of the system boards.

ssp% power -on -all

d. Start event detection monitoring.

ssp% edd_cmd -x start

4. Type the following to force the control board failover:

ssp% setfailover -t cb force

5. Issue the bringup(1M) command for all domains.

6. Re-enable control board failover as described in To Enable Control Board Failover.

Obtaining Control Board Failover Information

Use the showfailover(1M) command on the main SSP to obtain the SSP and control board failover states and the status of the private connection links. The output also lists the names of the SSPs and control boards and identifies the control boards responsible for the JTAG interface and system clock. For details on the failover information displayed, see Obtaining Failover Status Information.

The following example shows the information displayed for a control board failover in which the primary control board failed.

ssp% showfailover

Failover State:

     SSP Failover: Active

     CB Failover:  Failed

Failover Connection Map:

     Main SSP to Spare SSP thru Main Hub:   GOOD

     Main SSP to Spare SSP thru Spare Hub:  GOOD

     Main SSP to Primary Control Board:     FAILED

     Main SSP to Spare Control Board:       GOOD

     Spare SSP to Main SSP thru Main Hub:   GOOD

     Spare SSP to Main SSP thru Spare Hub:  GOOD

     Spare SSP to Primary Control Board:    FAILED

     Spare SSP to Spare Control Board:      GOOD

SSP/CB Host Information

     Main SSP:                              xf12-ssp

     Spare SSP:                             xf12-ssp2

     Primary Control Board (JTAG source):   xf12-cb1

     Spare Control Board:                   xf12-cb0

     System Clock source:                   xf12-cb1
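Output like the example above can be checked by a script. The following Python sketch assumes the label and value layout shown in the example and treats every line that has text after the first colon as a label and value pair. The function names are hypothetical, and SAMPLE is an abbreviated copy of the example output rather than output captured from a live system.

```python
SAMPLE = """\
Failover State:
     SSP Failover: Active
     CB Failover:  Failed
Failover Connection Map:
     Main SSP to Primary Control Board:     FAILED
     Main SSP to Spare Control Board:       GOOD
SSP/CB Host Information
     Primary Control Board (JTAG source):   xf12-cb1
     Spare Control Board:                   xf12-cb0
     System Clock source:                   xf12-cb1
"""


def parse_showfailover(text):
    """Collect label/value pairs; section headings carry no value and are skipped."""
    info = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            info[label.strip()] = value.strip()
    return info


def is_partial_failover(info):
    """True when JTAG and the system clock are on different control boards."""
    return info["Primary Control Board (JTAG source)"] != info["System Clock source"]


def failed_links(info):
    """Return the labels of connection links reported as FAILED."""
    return [label for label, value in info.items() if value == "FAILED"]


info = parse_showfailover(SAMPLE)
print(is_partial_failover(info))
print(failed_links(info))
```

In the example, the JTAG source and the system clock source are both xf12-cb1, so the check reports that the failover was not partial.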

You can also use Hostview to verify the type of control board failover (complete or partial). When you use Hostview to verify a control board, the "J" (JTAG) and "C" (system clock source) characters indicate which control board manages the JTAG interface and system clock.

FIGURE 9-1 shows an example Hostview window after a partial control board failover. One control board handles the JTAG interface, while the other serves as the system clock source.

FIGURE 9-1 Example Hostview Window After a Partial Control Board Failover

After Control Board Failover

After a control board failover occurs, you must perform certain recovery tasks:

Identify the failure point or condition that caused the failover and determine how to correct the failure.

For example, if a control board failover occurred due to a faulty control board, you must determine whether you need to replace the failed control board.

Use the showfailover(1M) command to review the failover state and verify which control board is responsible for the JTAG interface and system clock. Review the connection map in the showfailover output and the summary of the failover detection points in Chapter 10.

You can also check the platform log file for other error conditions and determine the corrective action needed to reactivate the failed components.

If a partial failover occurred, resynchronize the JTAG and system clock interfaces so that they are managed by the same control board.

To resynchronize the JTAG and system clock interfaces, perform a complete control board failover as described in To Force a Complete Control Board Failover. The first domain that is brought up resynchronizes the system clock and the JTAG interface on the primary control board.

Once you have resolved the control board failure, re-enable control board failover (for details, see To Enable Control Board Failover).