InfoDoc ID   Synopsis   Date
45890   Sun Fire[TM] 3800-6800: Firmware revision 5.13.x issues and workarounds   19 Jul 2002

Status Issued

Description

The new Sun Fire[TM] 3800-6800 firmware, revision 5.13.0, provides full SC failover functionality and fixes various outstanding bugs. As before, the firmware is distributed in a patch, in this case patch ID 112494-02.

Refer to the README in the patch for a full list of bug fixes.

Read the install.info and release_notes files in patch 112494-02.

The following information will continue to be valid for all 5.13.x variants unless specifically noted.

Below are a few issues and "gotchas" that you may encounter.

Essential - Always upgrade SSC1 first!

Failure to follow this instruction will result in problems such as crashed domains, lost configuration information, and inaccessible domains.

The remainder of this document discusses the following issues:

  1. What to do if SSC0 is upgraded first
  2. Hot-Plugging SCs with old revs of firmware into a 5.13.x platform
  3. Hot-Plugging SCs with 5.13.x into a platform with older revs of firmware
  4. Replacing System Boards and I/O Boards with different revs of firmware
  5. SC Clock Failover Issues
  6. SC Communications Issues after SC Failover

Problems that do not fit any of the issues listed above should be directed to the GSCC (http://gscc) in your GEO for analysis and escalation to CPRE as appropriate.

1. What to do if SSC0 is upgraded first

If SSC0 is upgraded first, you will end up with two spare system controllers. DO NOT try to recover by pressing reset buttons or re-flashing; doing so will almost certainly crash any running domains on your platform.

A recovery procedure exists for this situation. Engage your local GSCC (http://gscc) and ask for it. The procedure is not published because it uses undocumented commands.

Here's an example of what happens if SSC0 is updated first.

# telnet 4800-sc0

            System Controller '4800-sc0':

                Type  0  for Platform Shell

                Type  1  for domain A console
                Type  2  for domain B console
                Type  3  for domain C console
                Type  4  for domain D console

                Input: 0

            Platform Shell

            4800-sc0:SC> flashupdate -f ftp://172.29.3.44/pub/112494-01 all

            As part of this update, the system controller will automatically reboot.
            RTOS will be upgraded automatically during the next boot. 
            ScApp will be upgraded automatically during the next boot. 

            After this update you must reboot each active domain that was upgraded.

            Do you want to continue? [no] yes

            Retrieving: ftp://172.29.3.44/pub/112494-01/sgcpu.flash
            Validating  ............. Done

            Current firmware version: 5.12.6
            New firmware version: 5.13.0

            Programming /N0/SB2 PROM 0
            Erasing          ............. Done
            Programming  ............. Done
            Verifying        ............. Done
            .
            .
            .
            Flashupdate

            Connecting to 172.29.3.44...
            Transferring sgrtos.flash via FTP : 679648 
            Comparing image and flash...
            Image and flash are different. Proceeding with update.
            Erasing flashprom sectors at address 0x20000000: 11/11 = 100%
            Programming: 11/11 = 100%

            Connecting to 172.29.3.44...
            Transferring sgsc.flash via FTP : 5548663 
            Comparing image and flash...
            Image and flash are different. Proceeding with update.
            Erasing flashprom sectors at address 0x36000000: 85/85 = 100%
            Programming: 85/85 = 100%
            .
            .
            .
            Copyright 2001-2002 Sun Microsystems, Inc.  All rights reserved.
                      Use is subject to license terms.

            Sun Fire 3800-6800 System Firmware
            RTOS version: 23
            ScApp version: 5.13.0 
            SC POST diag level: off

            The date is Thursday, May 23, 2002, 11:29:30 AM GMT+01:00.

            May 23 11:29:31 4800-sc0 Platform.SC: Boot: ScApp 5.13.0, RTOS 23
            May 23 11:29:36 4800-sc0 Platform.SC: Clock Source: 75MHz
            May 23 11:29:38 4800-sc0 Platform.SC: SC Failover Monitor: enabled
            May 23 11:30:08 4800-sc0 Platform.SC: Spare System Controller
            May 23 11:30:08 4800-sc0 Platform.SC: SC Failover: enabled but not active.
             

            System Controller '4800-sc0':

                Type  0  for Platform Shell

                Input: 0

            Platform Shell - Spare System Controller

            4800-sc0:sc> 
            --------------------------------------------------------------------------
            # telnet 4800-sc1

            System Controller '4800-sc1':

                Type  0  for Platform Shell

                Input: 0

            Platform Shell - Slave System Controller

            4800-sc1:SC>             

2. Hot-Plugging SCs with old revs of firmware into a 5.13.x platform

5.13 firmware does not mix with 5.11 and 5.12 firmware. If SC1 is to be replaced in a platform running 5.13.x, and the replacement has 5.11 or 5.12 firmware loaded, recovery is simple and is outlined below.

If SC0 is to be replaced in a platform running 5.13.x, and the replacement has 5.11 or 5.12 firmware loaded, the replacement will not boot, as shown below. Recovery is to remove it and install an SC that is already at 5.13.x.

Warning: Before removing SC0, be sure to issue the following command from SC1 or you may crash any running domains:

	poweroff ssc0            

If SC0 is to be replaced in a 5.13 platform, ensure the replacement has 5.13.0 firmware loaded on it. Double check with the control room that this is the case.
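The slot-dependent behaviour described above can be summarised in a small lookup helper. This is only an illustrative sketch: the function names and the outcome strings are invented for this note, and the helper knows nothing beyond the three release families (5.11, 5.12, 5.13) mentioned in this document.

```python
import re

# Release families that the text says do not mix with 5.13.
OLD_FAMILIES = {(5, 11), (5, 12)}


def release_family(version):
    """Return (major, minor) from a version string such as '5.11.9'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized firmware version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def hotplug_outcome(slot, replacement_version):
    """Expected result of hot-plugging an SC into a platform running 5.13.x.

    The outcome strings are hypothetical labels, not firmware messages.
    """
    if slot not in ("SSC0", "SSC1"):
        raise ValueError(f"unknown SC slot: {slot!r}")
    family = release_family(replacement_version)
    if family == (5, 13):
        return "ok: replacement matches the 5.13.x platform"
    if family in OLD_FAMILIES:
        if slot == "SSC1":
            # The old SC boots, but failover is not available until it is upgraded.
            return "boots without failover: upgrade SSC1 to 5.13.x"
        # An old SC in slot SSC0 does not boot on a 5.13.x platform.
        return "does not boot: swap in an SC already at 5.13.x"
    return "not covered by this document"


print(hotplug_outcome("SSC1", "5.11.9"))
print(hotplug_outcome("SSC0", "5.12.6"))
```

The two calls correspond to the two examples that follow: an old SC in slot SSC1 and an old SC in slot SSC0.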

Example - Hot-Plugging SC with old rev of firmware in slot SSC1

Output from SSC0:

             sc0-4800a:SC> poweroff ssc1

             SSC1: powered off 

             sc0-4800a:SC>  

             May 31 10:34:45 sc0-4800a Platform.SC: Clock failover disabled. 

             May 31 10:37:07 sc0-4800a Platform.SC: SSC1 removed 
             May 31 10:37:37 sc0-4800a Platform.SC: SSC1 inserted 

             sc0-4800a:SC>  

             sc0-4800a:SC>  

             May 31 10:39:57 sc0-4800a Platform.SC: SC Failover: the other SC is 
             running an old version of firmware which is not compatible with failover. 
             You need to upgrade this firmware as soon as possible. 

             sc0-4800a:SC>  
             sc0-4800a:SC>              

Output from SSC1:

             Hardware Reset... 

             @(#) SYSTEM CONTROLLER(SC) POST 18 2001/06/14 11:20 
             PSR = 0x044010e5 
             PCR = 0x04004000 

                     SelfTest running at DiagLevel:0x20 

             SC Boot PROM                          Test  
             BootPROM CheckSum                     Test  
             . 
             . 
             . 

             Console Bus Hub          Test  
             CBH Register Access             Test 
             POST Complete. 
             ERI Device Present 
             Getting MAC address for SSC1 
             MAC address is 8:0:20:d8:ab:64 
             Using DHCP to configure network interface 
             Attached TCP/IP interface to eri unit 0 
             Attaching interface lo0...done 
             interrupt: 100 Mbps full duplex link up 
             Initiating DHCP negotiations for eri0 
             dhcpcBind() failed: errno = 0xd0003 

             Adding 2851 symbols for standalone. 

                     Copyright 2001 Sun Microsystems, Inc.  All rights reserved. 

             RTOS version: 18 
             ScApp version: 5.11.9 
             SC POST diag level: min 

             The date is Friday, May 31, 2002, 3:39:42 AM PDT. 

             SbbcAsic.showResetReason: SBBC reset status=0160 POR 
             PowerOn or Invalid magic: Initializing the SC SRAM 
             May 31 03:39:46 noname Chassis-Port.SC: Backing up Static ID Info to NVCI 
             May 31 03:39:46 noname Chassis-Port.SC: Clock source: 75MHz 
             May 31 03:39:48 noname Chassis-Port.SC: Starting Slave Thread 
               

             System Controller 'noname.example.com': 

                 Type  0  for Platform Shell 

                 Input: 0

             Platform Shell 

             noname:SC> showsc 

             SC: SSC1  

             SC date: Fri May 31 03:39:56 PDT 2002 
             SC uptime: 25 seconds  

             ScApp version: 5.11.9 
             RTOS version: 18 

             noname:SC>              

Example - Hot-Plugging SC with old rev of firmware in slot SSC0

Output from SSC1:

             sc1-4800a:SC> poweroff ssc0 

             SSC0: powered off 

             sc1-4800a:SC>  

             May 31 10:48:28 sc1-4800a Platform.SC: SSC0 removed 
             May 31 10:49:02 sc1-4800a Platform.SC: SSC0 inserted 

             sc1-4800a:SC>  
             sc1-4800a:SC>  

             May 31 10:50:25 sc1-4800a Platform.SC: SC Failover: the other SC is 
             running an old version of firmware. It cannot be booted on this platform. 
             Contact your support organization. 

             sc1-4800a:SC>  
             sc1-4800a:SC>  
             sc1-4800a:SC>              

Output from SSC0:

             Hardware Reset... 
               

             @(#) SYSTEM CONTROLLER(SC) POST 18 2001/06/14 11:20 
             PSR = 0x044010e5 
             PCR = 0x04004000 

                     SelfTest running at DiagLevel:0x20 

             SC Boot PROM             Test  
                     BootPROM CheckSum               Test  
             . 
             . 
             . 
             Console Bus Hub         Test  
                     CBH Register Access                 Test 
             POST Complete. 
             ERI Device Present 
             Getting MAC address for SSC0 
             MAC address is 8:0:20:d8:ab:63 
             Using DHCP to configure network interface 
             Attached TCP/IP interface to eri unit 0 
             Attaching interface lo0...done 
             Timeout waiting for network driver (flags=0x8062) 

             Adding 2851 symbols for standalone.             

SSC0 is unusable at this point. Recovery is to remove it and install an SC that is already at 5.13.0.

3. Hot-Plugging SCs with 5.13.x into a platform with older revs of firmware

Plugging an SC with 5.13.0 firmware into a 5.12.6 platform, slot SSC0

Remember, the platform will have had to be powered off to effect this FRU replacement. The state the system controllers end up in depends on which one boots first, which in turn depends largely on the SCPOST levels and the SC network settings. For example, an SC from logistics should be at default settings, which means SCPOST level min and the network configured for DHCP.

If SSC1 boots first, it will put out a heartbeat (since it is at 5.12.6), and this will cause SSC0 to assume the role of spare.

            System Controller 'noname.example.com':

                Type  0  for Platform Shell

                Input: 0

            Platform Shell - Spare System Controller 

            noname:sc>             

This is not a problem.

If SSC0 boots first, the SC may become confused. Ignore this.

Flashupdate SSC0 with 5.12.6 firmware, and power-cycle the platform.

Plugging an SC with 5.13.0 firmware into a 5.12.6 platform, slot SSC1

Again, the platform will have had to be powered off to effect this FRU replacement.

If SSC0 boots first, it will be the main and SSC1 the spare. Flashupdate SSC1 with 5.12.6 firmware, and power-cycle the platform.

If SSC1 boots first, you will get a message on SSC1:

	Platform.SC: SC Failover: the other SC is running an old version of
	firmware. It cannot be booted on this platform. Contact your support            

SSC0 will hang at the point where the RTOS finishes loading. Ignore SSC0, flashupdate SSC1 with 5.12.6 firmware, and power-cycle the platform. You will now be back at SSC0 as main and SSC1 as spare.
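The four cases in this section (which slot received the 5.13.0 SC, and which SC boots first) can be laid out as a lookup table. The sketch below is illustrative only: the names are invented for this note, the symptom strings paraphrase the text above, and it assumes that the "flashupdate SSC0" step applies to both boot orders for slot SSC0.

```python
# Keys: (slot holding the 5.13.0 SC, SC that boots first).
# Values: (what to expect, SC to flashupdate back to 5.12.6).
OUTCOMES = {
    ("SSC0", "SSC1"): ("SSC0 comes up as spare; this is not a problem", "SSC0"),
    ("SSC0", "SSC0"): ("the SC may become confused; ignore this", "SSC0"),
    ("SSC1", "SSC0"): ("SSC0 is main and SSC1 is spare", "SSC1"),
    ("SSC1", "SSC1"): ("SSC1 logs an old-firmware message and SSC0 hangs", "SSC1"),
}


def recovery_plan(new_sc_slot, boots_first):
    """Return the expected symptom and recovery steps for one of the four cases."""
    try:
        symptom, target = OUTCOMES[(new_sc_slot, boots_first)]
    except KeyError:
        raise ValueError(f"unknown case: {new_sc_slot!r}, {boots_first!r}") from None
    return [
        f"expect: {symptom}",
        f"flashupdate {target} with 5.12.6 firmware",
        "power-cycle the platform",
    ]


for step in recovery_plan("SSC1", "SSC1"):
    print(step)
```

In every case the recovery is the same pair of actions; only the SC to be flashed and the symptom seen beforehand differ.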

4. Replacing System Boards (SBs) and I/O Boards (IBs) with different revs of firmware

If you are going to replace a system board or I/O assembly, be aware that the replacement board firmware must be compatible with the system controller firmware. To check the firmware compatibility for each board, use the showboards command with the "-p version" or "-v" option.

If the firmware of the replacement board is not compatible with the firmware for the system controller, you must upgrade or downgrade the firmware on the replacement board accordingly, using flashupdate -c. It is recommended that replacement boards run the same revision of firmware as the other boards in the system.
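Once the board revisions have been read off the showboards output, the comparison itself is simple bookkeeping. The sketch below assumes the revisions have been copied by hand into a dictionary; it does not parse showboards output, the board names and the function name are illustrative, and it treats "same revision as the reference" as the target, following the recommendation above. Actual compatibility rules are in the patch release notes.

```python
def boards_to_reflash(board_versions, reference_version):
    """Return the boards whose firmware revision differs from the reference.

    board_versions: mapping of board name to firmware revision string,
    entered by hand (hypothetical input format).
    """
    return sorted(
        board
        for board, version in board_versions.items()
        if version != reference_version
    )


# Illustrative data only.
boards = {
    "/N0/SB0": "5.13.0",
    "/N0/SB2": "5.13.0",
    "/N0/IB6": "5.12.6",
}
print(boards_to_reflash(boards, "5.13.0"))
```

Any board returned by the helper is a candidate for flashupdate -c before it is put into service.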

5. SC Clock Failover Issues

The SC clock failover mechanism is separate from the SC failover mechanism, and the two do not happen at the same time. When the system is up and running with no problems, all the boards use a clock signal from the main system controller. When SC failover occurs, the main SC and the spare SC swap roles, but the boards within the system continue to use the same clock they were using prior to the failover.

Workaround:

Power off the other system controller. The "poweroff sscX" command powers off the "other" system controller, not the one where the command is being typed, and automatically attempts to switch all the boards over to the clock supplied by "this" SC (i.e. the SC that is not being powered off).
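The difference between the two mechanisms can be shown with a toy state model. This is purely a sketch of the behaviour described above, not of the firmware: the class and method names are invented, the clock switch is assumed to always succeed (the real command only attempts it), and issuing the poweroff from the current main SC is an assumption.

```python
SLOTS = ("SSC0", "SSC1")


def other(slot):
    """Return the name of the other SC slot."""
    return SLOTS[1 - SLOTS.index(slot)]


class PlatformModel:
    """Tracks which SC is main and which SC supplies the board clock."""

    def __init__(self):
        self.main = "SSC0"
        self.clock_source = "SSC0"

    def sc_failover(self):
        # Roles swap, but the boards keep the clock they were already using.
        self.main = other(self.main)

    def poweroff_other(self, issuing_sc):
        # The command acts on the SC where it is NOT typed; the boards are
        # switched to the clock of the issuing SC (assumed to succeed here).
        self.clock_source = issuing_sc
        return other(issuing_sc)


platform = PlatformModel()
platform.sc_failover()
print(platform.main, platform.clock_source)   # main has moved, clock has not
powered_off = platform.poweroff_other(platform.main)
print(powered_off, platform.clock_source)     # clock now follows the issuing SC
```

After the failover the model shows the main role and the clock source on different SCs; the poweroff step is what brings them back together.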

6. SC Communication Issues After SC Failover

When the system is running normally and failover is enabled, the spare SC and the main SC communicate status and configuration changes with each other. If a failover occurs and the main SC transfers its responsibilities to the spare SC, failover between the two SCs becomes disabled. With failover disabled, no data is shared between the two SCs, and the most up-to-date configuration and status information is not passed between the two SCs. Failover must be manually re-enabled.

If the chassis of the system is then power-cycled, the roles of the main SC and the spare SC may not necessarily be the same as they were prior to the power cycle. It is possible for the system to boot using the previously spare SC (with a possibly outdated state configuration) as the new main SC.

Workaround:

If failover becomes disabled, manually re-enable failover as soon as possible so the configurations can be re-synchronized.

If this is not possible, do a dumpconfig as outlined in the Sun Fire 3800 - 6800 Platform Administration Guide. Then, if the power is cycled and SSC0 assumes the role of main, you can restore the setup to SSC0 using restoreconfig. Note that you will have to copy <sc1_hostname>.tod and <sc1_hostname>.nvci to <sc0_hostname>.tod and <sc0_hostname>.nvci for this workaround.
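The file copy in this workaround can be scripted on the server that holds the dumpconfig output. The sketch below is a minimal example under stated assumptions: the dump directory and the hostnames are placeholders, the function name is invented, and only the two file suffixes named above (.tod and .nvci) are handled.

```python
import shutil
from pathlib import Path


def clone_sc_dump(dump_dir, sc1_hostname, sc0_hostname):
    """Copy SC1's .tod and .nvci dump files to the names SC0 will look for."""
    dump_dir = Path(dump_dir)
    pairs = [
        (dump_dir / f"{sc1_hostname}{suffix}", dump_dir / f"{sc0_hostname}{suffix}")
        for suffix in (".tod", ".nvci")
    ]
    # Check both sources first so a missing file does not leave a half-copied set.
    for source, _ in pairs:
        if not source.is_file():
            raise FileNotFoundError(f"missing dump file: {source}")
    for source, target in pairs:
        shutil.copy2(source, target)
    return [target for _, target in pairs]


# Example call with placeholder values:
# clone_sc_dump("/export/sc-dumps", "4800-sc1", "4800-sc0")
```

The originals are left in place, so the same dump can still be restored to SC1 later.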

Keywords: firmware, 5.13.x, 3800, 6800, sunfire

INTERNAL SUMMARY: SUBMITTER: Michele Whittaker PATCH ID: 112494-02 APPLIES TO: Hardware/Sun Fire /3800, Hardware/Sun Fire /4800, Hardware/Sun Fire /4810, Hardware/Sun Fire /6800 ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.