InfoDoc ID   Synopsis   Date
26286   Sun Fire[TM] 3800-6800: System Controller failover functionality   18 Dec 2002

Status Issued

Description
The Sun Fire[TM] System Controller (SSC) provides management functionality and 
clock to the Sun Fire platform.

When the master SSC (SSC0) fails and the system includes two SSCs, clock and 
management functionality must fail over to the spare SC.

For Firmware 5.13.X

Starting with 5.13.X, Sun Fire 6800/4810/4800/3800 systems can be configured with two
system controllers for high availability. In a high-availability system controller (SC) 
configuration, one SC serves as the main SC, which manages all the system resources,
while the other SC serves as a spare. When certain conditions cause the main SC to 
fail, a switch over or failover from the main SC to the spare is triggered automatically, 
without operator intervention. The spare SC assumes the role of the main and takes over 
all system controller responsibilities.

New commands have been added to manage this functionality: setfailover and showfailover.

setfailover -- set automatic/manual SC failover

Usage: setfailover [-y|-n] off|on|force
       setfailover -h

    off -- This option prevents a failover until the failover feature is reenabled.
    on  -- Enables failover for systems that previously had failover
           disabled due to a failover or an operator request.
    force -- Causes a forced failover to the spare SC.
    -y -- Do not prompt for confirmation
    -n -- Do not execute command if confirmation is requested 
    -h -- Display the help message for this command
    
This command enables you to control automatic or manual SC failover. Be aware that
if you force a failover using this command, SC failover is disabled after the manual 
failover occurs and must be re-enabled manually using the command setfailover on.

showfailover -- monitor the state of the SC and clock failover

Usage: showfailover [-v]
       showfailover -h

    -v -- Verbose mode. Displays all available command information, which includes
          both SC and clock failover status.
    -h -- Display this help message
    

The SC failover state can be one of the following:  
 
enabled and active - SC failover is enabled and functioning normally.

disabled - SC failover has been disabled due to an operator request (setfailover off)
           or because a failover has occurred. 
             
enabled but not active - SC failover is enabled, but certain components, such as the
			 spare SC or the centerplane between the main and spare, are 
			 not in a failover-ready state (available and responding). 

The clock failover state can be one of the following:

enabled - Clock failover is enabled.

disabled - Clock failover has been automatically disabled due to a hardware problem.
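
The states listed above can be checked by a script that reads captured showfailover output. The following is a minimal Python sketch, not a Sun-provided tool. It assumes the output lines look like those in the example transcript in this document ("SC Failover: ..." and "Clock failover ..."); real output may contain additional lines. The function names parse_showfailover and failover_ready are hypothetical.

```python
import re

# SC failover states as listed in this document.
SC_STATES = ("enabled and active", "enabled but not active", "disabled")


def parse_showfailover(text):
    """Return the SC and clock failover states found in showfailover output.

    Assumption: line formats match the transcript in this document.
    """
    result = {"sc_failover": None, "clock_failover": None}
    for raw in text.splitlines():
        line = raw.strip().rstrip(".")  # drop the trailing period, if any
        m = re.search(r"SC Failover:\s*(.+)$", line)
        if m and m.group(1) in SC_STATES:
            result["sc_failover"] = m.group(1)
            continue
        m = re.search(r"Clock failover:?\s*(enabled|disabled)$", line)
        if m:
            result["clock_failover"] = m.group(1)
    return result


def failover_ready(parsed):
    """True only when SC failover is both enabled and active."""
    return parsed["sc_failover"] == "enabled and active"


# Sample text copied from the forced-failover transcript in this document.
sample = """SC: SSC1
Spare System Controller
SC Failover: enabled and active.
Clock failover enabled.
"""
state = parse_showfailover(sample)
print(state, failover_ready(state))
```

A state of "enabled but not active" or "disabled" makes failover_ready return False, which matches the check in step 2 of the example below.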
                        
Example of how to force a manual failover:

1. Connect to the System Controller (failover can be initiated from either SC).

System Controller 'sunfire12-sc1':

    Type  0  for Platform Shell

    Input: 0

Platform Shell - Spare System Controller

sunfire12-sc1:sc> 

2. Verify that failover is enabled and active.

sunfire12-sc1:sc> showfailover -v

SC: SSC1  
Spare System Controller
SC Failover: disabled <---failover is disabled, so we must enable it.


sunfire12-sc1:sc> setfailover on
Dec 12 16:06:51 sunfire12-sc1 Platform.SC: SC Failover: enabled but not active.
SC Failover: enabled but not active.

sunfire12-sc1:sc> Dec 12 16:07:09 sunfire12-sc1 Platform.SC: SC Failover: enabled and active.

sunfire12-sc1:sc> showfailover
SC Failover: enabled and active. <---now failover is enabled and active.

3. Manually force the Spare System Controller to become the Main System Controller.

sunfire12-sc1:sc> setfailover force

SC: SSC1  
Spare System Controller
SC Failover: enabled and active.
Clock failover enabled.

This will abruptly interrupt operations on the other System Controller.
This System Controller will become the main System Controller.

Do you want to continue? [no] yes
Dec 12 16:10:18 sunfire12-sc1 Platform.SC: SC Failover: becoming main SC ...
sunfire12-sc1:sc> Dec 12 16:10:19 sunfire12-sc1 Platform.SC: SC Failover: disabled
Dec 12 16:10:26 sunfire12-sc1 Platform.SC: Chassis is in single partition mode.

4. Verify the status of the system controller and re-enable failover.

sunfire12-sc1:SC> showfailover
SC Failover: disabled  <---failover is disabled after a failover has occurred and must be
			   re-enabled.

sunfire12-sc1:SC> setfailover on
Dec 12 16:14:14 sunfire12-sc1 Platform.SC: SC Failover: enabled but not active.
SC Failover: enabled but not active.
sunfire12-sc1:SC> Dec 12 16:14:18 sunfire12-sc1 Platform.SC: SC Failover: enabled and active.                        
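
The sequence in steps 1 through 4 can be summarized as a small state model. The Python sketch below is purely illustrative and does not talk to a system controller. The class FailoverModel and its spare_ready parameter are hypothetical. It also simplifies one detail: in the transcript, setfailover on first reports "enabled but not active" and changes to "enabled and active" a few seconds later, whereas the model returns the final state directly.

```python
class FailoverModel:
    """Toy model of the SC failover states described in this document."""

    def __init__(self):
        # Assumption: SSC0 starts as main and SSC1 as spare.
        self.main, self.spare = "SSC0", "SSC1"
        self.state = "enabled and active"

    def setfailover(self, arg, spare_ready=True):
        """Apply the effect of 'setfailover off|on|force' to the model."""
        if arg == "off":
            self.state = "disabled"
        elif arg == "on":
            # Failover becomes active only when the spare SC is
            # available and responding.
            self.state = ("enabled and active" if spare_ready
                          else "enabled but not active")
        elif arg == "force":
            # The spare becomes main; failover is disabled afterwards
            # and must be re-enabled with 'setfailover on'.
            self.main, self.spare = self.spare, self.main
            self.state = "disabled"
        else:
            raise ValueError("expected off, on, or force")
        return self.state


model = FailoverModel()
print(model.setfailover("force"), model.main)  # state after a forced failover
print(model.setfailover("on"))                 # state after re-enabling
```

The point of the model is the last transition: after a forced failover, the state stays "disabled" until setfailover on is issued.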
                        

For Firmware 5.11.X

In ScApp Release 5.11.4, management functionality failover is NOT 
implemented. This means that if SSC0 fails, you cannot perform tasks
that require ScApp, such as powering off system boards, checking the temperature, or 
changing system board assignments.

During the boot sequence of the SSC, you will get messages indicating whether
clock failover is enabled or disabled.

	Apr 02 05:38:27 SunFire45 Chassis-Port.SC: Clock failover disabled.
	Apr 02 05:38:50 SunFire45 Chassis-Port.SC: Clock failover enabled.
	
So when you reboot an SSC or an SSC fails, the console of the remaining SSC reports
that clock failover is disabled. Once the rebooted SSC is available again
(finished rebooting) and the second clock source is available again, you should
see the following message on the console of the remaining SSC:

	Apr 02 05:38:50 SunFire45 Chassis-Port.SC: Clock failover enabled.
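
To find the current clock failover status in a saved console log, a script can scan for these messages and keep the most recent one. The following is a minimal Python sketch. It assumes the log contains lines in the format shown above, and the function name latest_clock_failover is hypothetical.

```python
import re

# Matches the console messages shown in this document.
CLOCK_MSG = re.compile(r"Clock failover (enabled|disabled)\.")


def latest_clock_failover(lines):
    """Return 'enabled' or 'disabled' from the last matching line, else None."""
    status = None
    for line in lines:
        m = CLOCK_MSG.search(line)
        if m:
            status = m.group(1)  # later messages override earlier ones
    return status


# Sample lines copied from this document.
log = [
    "Apr 02 05:38:27 SunFire45 Chassis-Port.SC: Clock failover disabled.",
    "Apr 02 05:38:50 SunFire45 Chassis-Port.SC: Clock failover enabled.",
]
print(latest_clock_failover(log))
```

A result of "disabled" as the last status suggests that the second clock source is not yet available again.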
	

If you have two SSCs in your Sun Fire system with ScApp version 5.11.4, and 
the master SSC0 fails, you CANNOT replace the failing SSC0 without taking the 
entire platform down. Adding SSC1 can be done during normal platform operation, and
there is no need to take any domain down.

INTERNAL SUMMARY:

Check the ReadMe of Patch 800054 and/or the Release Notes of future firmware versions for SSC failover changes.

For more information about System Controller failover in 5.13.X firmware, check the following resources:

Sun Fire 6800/4810/4800/3800 Systems Platform Administration Manual (Firmware Version 5.13.0), http://docs-pdf.sun.com/816-2970-10/816-2970-10.pdf

Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual (Firmware Version 5.13.0), http://docs-pdf.sun.com/816-2971-10/816-2971-10.pdf

SUBMITTER: Peter Gonscherowski
PATCH ID: 800054
APPLIES TO: Hardware/Sun Fire
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.