InfoDoc ID | Synopsis | Date |
45890 | Sun Fire[TM] 3800-6800: Firmware revision 5.13.x issues and workarounds | 19 Jul 2002 |
Status | Issued |
Description |
The new Sun Fire[TM] 3800-6800 firmware, revision 5.13.0, provides full SC failover functionality and fixes various outstanding bugs. As before, the firmware is distributed in a patch, in this case patch id:
Refer to the README in the patch for a full list of bug fixes.
Read the install.info and release_notes files in the patch.
The following information will continue to be valid for all 5.13.x variants unless specifically noted.
Below are a few issues and "gotchas" that you may encounter.
Essential - Always upgrade SSC1 first!
Failure to follow this instruction will result in problems such as crashed domains, lost configuration information, and inaccessible domains.
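The remainder of this document discusses the following issues:
1. What to do if SSC0 is upgraded first
2. Hot-Plugging SCs with old revs of firmware into a 5.13.0 platform
3. Hot-Plugging SCs with 5.13.0 into a platform with older revs of firmware
4. Replacing System Boards (SBs) and I/O Boards (IBs) with different revs of firmware
5. SC Clock Failover Issues
6. SC Communication Issues After SC Failover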
Problems that do not fit any of the issues listed above should be directed to the GSCC (http://gscc) in your GEO for analysis and escalation to CPRE as appropriate.
1. What to do if SSC0 is upgraded first
If SSC0 is upgraded first, the result is a platform with two spare system controllers. DO NOT try to recover by pressing reset buttons or re-flashing. You will almost certainly crash any running domains on your platform.
A recovery procedure exists for this situation. Engage your local GSCC (http://gscc) and ask for the recovery procedure. This procedure is not published because it uses undocumented commands.
Here's an example of what happens if SSC0 is updated first.
# telnet 4800-sc0
System Controller '4800-sc0':
Type 0 for Platform Shell
Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console
Input: 0
Platform Shell
4800-sc0:SC> flashupdate -f ftp://172.29.3.44/pub/112494-01 all
As part of this update, the system controller will automatically reboot.
RTOS will be upgraded automatically during the next boot.
ScApp will be upgraded automatically during the next boot.
After this update you must reboot each active domain that was upgraded.
Do you want to continue? [no] yes
Retrieving: ftp://172.29.3.44/pub/112494-01/sgcpu.flash
Validating ............. Done
Current firmware version: 5.12.6
New firmware version: 5.13.0
Programming /N0/SB2 PROM 0
Erasing ............. Done
Programming ............. Done
Verifying ............. Done
. . .
Flashupdate
Connecting to 172.29.3.44...
Transferring sgrtos.flash via FTP : 679648
Comparing image and flash...
Image and flash are different. Proceeding with update.
Erasing flashprom sectors at address 0x20000000: 11/11 = 100%
Programming: 11/11 = 100%
Connecting to 172.29.3.44...
Transferring sgsc.flash via FTP : 5548663
Comparing image and flash...
Image and flash are different. Proceeding with update.
Erasing flashprom sectors at address 0x36000000: 85/85 = 100%
Programming: 85/85 = 100%
. . .
Copyright 2001-2002 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Sun Fire 3800-6800 System Firmware
RTOS version: 23
ScApp version: 5.13.0
SC POST diag level: off
The date is Thursday, May 23, 2002, 11:29:30 AM GMT+01:00.
May 23 11:29:31 4800-sc0 Platform.SC: Boot: ScApp 5.13.0, RTOS 23
May 23 11:29:36 4800-sc0 Platform.SC: Clock Source: 75MHz
May 23 11:29:38 4800-sc0 Platform.SC: SC Failover Monitor: enabled
May 23 11:30:08 4800-sc0 Platform.SC: Spare System Controller
May 23 11:30:08 4800-sc0 Platform.SC: SC Failover: enabled but not active.
System Controller '4800-sc0':
Type 0 for Platform Shell
Input: 0
Platform Shell - Spare System Controller
4800-sc0:sc>
--------------------------------------------------------------------------
# telnet 4800-sc1
System Controller '4800-sc1':
Type 0 for Platform Shell
Input: 0
Platform Shell - Slave System Controller
4800-sc1:SC>
2. Hot-Plugging SCs with old revs of firmware into a 5.13.0 platform
5.13 firmware does not mix with 5.11 and 5.12 firmware. If SC1 is to be replaced in a platform running 5.13.x, and the replacement has 5.11 or 5.12 firmware loaded, recovery is simple and is outlined below.
If SC0 is to be replaced in a platform running 5.13.x, and the replacement has 5.11 or 5.12 firmware loaded, the replacement will not boot, as outlined below. Recovery is to remove it and install an SC that is already at 5.13.x.
Warning: Before removing SC0, be sure to issue the following command from SC1 or you may crash any running domains:
poweroff ssc0
If SC0 is to be replaced in a 5.13 platform, ensure the replacement has 5.13.0 firmware loaded on it. Double check with the control room that this is the case.
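The compatibility rule in this section can be written down as a small check. The following Python sketch is illustrative only. It assumes that console output has been captured as text and contains a line of the form "ScApp version: 5.11.9", as printed by showsc in the examples in this document. The function names parse_scapp_version and replacement_outcome are hypothetical and are not part of any Sun tool.

```python
import re


def parse_scapp_version(console_text):
    """Return the ScApp version as a tuple of ints, or None if not found.

    Assumes the text contains "ScApp version: X.Y.Z" as in the showsc
    output reproduced in this document.
    """
    match = re.search(r"ScApp version:\s*(\d+)\.(\d+)\.(\d+)", console_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def replacement_outcome(slot, replacement_version):
    """Summarize this section for a platform that is running 5.13.x.

    slot is "ssc0" or "ssc1"; replacement_version is a tuple such as
    (5, 11, 9). The returned strings are paraphrases of this document,
    not firmware messages.
    """
    slot = slot.lower()
    if slot not in ("ssc0", "ssc1"):
        raise ValueError("slot must be 'ssc0' or 'ssc1'")
    family = tuple(replacement_version[:2])
    if family == (5, 13):
        return "compatible"
    if family not in ((5, 11), (5, 12)):
        # Only 5.11 and 5.12 replacements are described in this document.
        return "not covered by this document"
    if slot == "ssc1":
        return "boots; upgrade its firmware as soon as possible"
    return "will not boot; replace with an SC already at 5.13.x"


if __name__ == "__main__":
    sample = "ScApp version: 5.11.9 RTOS version: 18"
    version = parse_scapp_version(sample)
    print(version, "->", replacement_outcome("ssc0", version))
```

A check like this is only useful before the replacement is inserted, for example when the firmware level of the spare part is known from its paperwork or from a bench test.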
Example - Hot-Plugging SC with old rev of firmware in slot SSC1
Output from SSC0:
sc0-4800a:SC> poweroff ssc1
SSC1: powered off
sc0-4800a:SC>
May 31 10:34:45 sc0-4800a Platform.SC: Clock failover disabled.
May 31 10:37:07 sc0-4800a Platform.SC: SSC1 removed
May 31 10:37:37 sc0-4800a Platform.SC: SSC1 inserted
sc0-4800a:SC>
sc0-4800a:SC>
May 31 10:39:57 sc0-4800a Platform.SC: SC Failover: the other SC is running an old version of firmware which is not compatible with failover. You need to upgrade this firmware as soon as possible.
sc0-4800a:SC>
sc0-4800a:SC>
Output from SSC1:
Hardware Reset...
@(#) SYSTEM CONTROLLER(SC) POST 18 2001/06/14 11:20
PSR = 0x044010e5
PCR = 0x04004000
SelfTest running at DiagLevel:0x20
SC Boot PROM Test
BootPROM CheckSum Test
. . .
Console Bus Hub Test
CBH Register Access Test
POST Complete.
ERI Device Present
Getting MAC address for SSC1
MAC address is 8:0:20:d8:ab:64
Using DHCP to configure network interface
Attached TCP/IP interface to eri unit 0
Attaching interface lo0...done
interrupt: 100 Mbps full duplex link up
Initiating DHCP negotiations for eri0
dhcpcBind() failed: errno = 0xd0003
Adding 2851 symbols for standalone.
Copyright 2001 Sun Microsystems, Inc. All rights reserved.
RTOS version: 18
ScApp version: 5.11.9
SC POST diag level: min
The date is Friday, May 31, 2002, 3:39:42 AM PDT.
SbbcAsic.showResetReason: SBBC reset status=0160 POR PowerOn or Invalid magic: Initializing the SC SRAM
May 31 03:39:46 noname Chassis-Port.SC: Backing up Static ID Info to NVCI
May 31 03:39:46 noname Chassis-Port.SC: Clock source: 75MHz
May 31 03:39:48 noname Chassis-Port.SC: Starting Slave Thread
System Controller 'noname.example.com':
Type 0 for Platform Shell
Input: 0
Platform Shell
noname:SC> showsc
SC: SSC1
SC date: Fri May 31 03:39:56 PDT 2002
SC uptime: 25 seconds
ScApp version: 5.11.9
RTOS version: 18
noname:SC>
Example - Hot-Plugging SC with old rev of firmware in slot SSC0
Output from SSC1:
sc1-4800a:SC> poweroff ssc0
SSC0: powered off
sc1-4800a:SC>
May 31 10:48:28 sc1-4800a Platform.SC: SSC0 removed
May 31 10:49:02 sc1-4800a Platform.SC: SSC0 inserted
sc1-4800a:SC>
sc1-4800a:SC>
May 31 10:50:25 sc1-4800a Platform.SC: SC Failover: the other SC is running an old version of firmware. It cannot be booted on this platform. Contact your support organization.
sc1-4800a:SC>
sc1-4800a:SC>
sc1-4800a:SC>
Output from SSC0:
Hardware Reset...
@(#) SYSTEM CONTROLLER(SC) POST 18 2001/06/14 11:20
PSR = 0x044010e5
PCR = 0x04004000
SelfTest running at DiagLevel:0x20
SC Boot PROM Test
BootPROM CheckSum Test
. . .
Console Bus Hub Test
CBH Register Access Test
POST Complete.
ERI Device Present
Getting MAC address for SSC0
MAC address is 8:0:20:d8:ab:63
Using DHCP to configure network interface
Attached TCP/IP interface to eri unit 0
Attaching interface lo0...done
Timeout waiting for network driver (flags=0x8062)
Adding 2851 symbols for standalone.
SSC0 is unusable at this point. Recovery is to remove it and install an SC that is already at 5.13.0.
3. Hot-Plugging SCs with 5.13.0 into a platform with older revs of firmware
Plugging an SC with 5.13.0 firmware into a 5.12.6 platform, slot SSC0
Remember, the platform will have had to be powered off to effect this FRU replacement. The state the system controllers end up in depends on which one boots first, which is largely determined by the SCPOST levels and the SC network settings. For example, an SC from logistics should be at default settings, which means SCPOST level min and the network configured for DHCP.
If SSC1 boots first, it will put out a heartbeat (since it is at 5.12.6), and this will cause the SSC0 to assume the role of spare.
System Controller 'noname.example.com':
Type 0 for Platform Shell.
Input: 0
Platform Shell - Spare System Controller
noname:sc>
This is not a problem.
If SSC0 boots first, the SC may become confused. Ignore this.
Flashupdate SSC0 with 5.12.6 firmware, and power-cycle the platform.
Plugging an SC with 5.13.0 firmware into a 5.12.6 platform, slot SSC1
Again, the platform will have had to be powered off to effect this FRU replacement.
If SSC0 boots first, it will be the main and SSC1 the spare. Flashupdate SSC1 with 5.12.6 firmware, and power-cycle the platform.
If SSC1 boots first, you will get a message on SSC1:
Platform.SC: SC Failover: the other SC is running an old version of firmware. It cannot be booted on this platform. Contact your support
SSC0 will hang at the point where the RTOS finishes loading. Ignore SSC0, flashupdate SSC1 with 5.12.6 firmware, and power-cycle the platform. You will now be back at SSC0 as main and SSC1 as spare.
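The four cases in this section (which slot received the 5.13.0 SC, and which SC boots first) can be collected into a lookup table. The Python sketch below is a hedged summary for quick reference, not a procedure. The names Outcome, SCENARIOS and recovery_steps are hypothetical, and the symptom strings are paraphrases of this document. It also assumes that the "Flashupdate SSC0" instruction applies to both boot orders for slot SSC0, which is one reading of the text above.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    symptom: str             # what this document says you will observe
    flashupdate_target: str  # SC to flashupdate back to 5.12.6


# Keys are (slot that received the 5.13.0 SC, SC that boots first).
SCENARIOS = {
    ("ssc0", "ssc1"): Outcome("SSC0 assumes the role of spare", "ssc0"),
    ("ssc0", "ssc0"): Outcome("the SC may become confused; ignore this", "ssc0"),
    ("ssc1", "ssc0"): Outcome("SSC0 is main and SSC1 is spare", "ssc1"),
    ("ssc1", "ssc1"): Outcome(
        "SSC1 reports old firmware on the other SC and SSC0 hangs; ignore SSC0",
        "ssc1",
    ),
}


def recovery_steps(replaced_slot, boots_first):
    """Return the observed symptom and recovery steps as a list of strings."""
    outcome = SCENARIOS[(replaced_slot.lower(), boots_first.lower())]
    return [
        f"Observed: {outcome.symptom}",
        f"Flashupdate {outcome.flashupdate_target.upper()} with 5.12.6 firmware",
        "Power-cycle the platform",
    ]


if __name__ == "__main__":
    for step in recovery_steps("ssc1", "ssc1"):
        print(step)
```

In every case the SC that is flashed back to 5.12.6 is the newly inserted one, never the SC that was already part of the platform.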
4. Replacing System Boards (SBs) and I/O Boards (IBs) with different revs of firmware
If you are going to replace a system board or I/O assembly, be aware that the replacement board firmware must be compatible with the system controller firmware. To check the firmware compatibility for each board, use the showboards command with the "-p version" or "-v" option.
If the firmware of the replacement board is not compatible with the firmware for the system controller, you must upgrade or downgrade the firmware on the replacement board accordingly, using flashupdate -c. It is recommended that replacement boards run the same revision of firmware as the other boards in the system.
5. SC Clock Failover Issues
The SC clock failover mechanism is different from the SC failover mechanism, and the two do not happen at the same time. When the system is up and running with no problems, all the boards use a clock signal from the main system controller. When SC failover occurs, the main SC and the spare SC swap roles, but the boards within the system continue to use the same clock they were using prior to the failover, that is, the clock from the SC that is now the spare.
Workaround:
Power off the other system controller using the "poweroff sscX" command. This command powers off the "other" system controller, not the one where the command is being typed, and automatically attempts to switch all the boards over to the clock supplied by "this" SC (i.e. the SC that is not being powered off).
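The difference between the two mechanisms can be illustrated with a toy state model. The Python sketch below is purely an illustration of the behaviour described in this section. It does not talk to any hardware, the class and method names are hypothetical, and it assumes that the clock switch attempted by "poweroff sscX" succeeds, which the firmware does not guarantee.

```python
class PlatformModel:
    """Toy model of SC roles and the clock source used by the boards."""

    def __init__(self):
        self.main = "ssc0"
        self.spare = "ssc1"
        # With no problems, all boards use the main SC's clock.
        self.board_clock = "ssc0"
        self.powered = {"ssc0": True, "ssc1": True}

    def sc_failover(self):
        """Main and spare swap roles; the board clock source is unchanged."""
        self.main, self.spare = self.spare, self.main

    def poweroff_other(self, typed_on):
        """Model "poweroff sscX" typed on SC typed_on.

        Boards are switched to the clock of the SC where the command is
        typed (assumed to succeed), then the other SC is powered off.
        """
        other = "ssc1" if typed_on == "ssc0" else "ssc0"
        self.board_clock = typed_on
        self.powered[other] = False
        return other


if __name__ == "__main__":
    platform = PlatformModel()
    platform.sc_failover()
    print("main:", platform.main, "board clock:", platform.board_clock)
    platform.poweroff_other(platform.main)
    print("main:", platform.main, "board clock:", platform.board_clock)
```

After sc_failover the model shows the mismatch this section describes: the main SC is SSC1 while the boards still take their clock from SSC0. The poweroff step is what brings the two back in line.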
6. SC Communication Issues After SC Failover
When the system is running normally and failover is enabled, the spare SC and the main SC communicate status and configuration changes with each other. If a failover occurs and the main SC transfers its responsibilities to the spare SC, failover between the two SCs becomes disabled. With failover disabled, no data is shared between the two SCs, and the most up-to-date configuration and status information is not passed between the two SCs. Failover must be manually re-enabled.
If the chassis of the system is then power-cycled, the roles of the main SC and the spare SC may not necessarily be the same as they were prior to the power cycle. It is possible for the system to boot using the previously spare SC (with a possibly outdated state configuration) as the new main SC.
Workaround:
If failover becomes disabled, manually re-enable failover as soon as possible so the configurations can be re-synchronized.
If this is not possible, do a dumpconfig as outlined in the Sun Fire 3800 - 6800 Platform Administration Guide. Then, if the power is cycled and SSC0 assumes the role of main, you can restore the setup to SC0 using restoreconfig. Note that for this workaround you will have to copy <sc1_hostname>.tod and <sc1_hostname>.nvci to <sc0_hostname>.tod and <sc0_hostname>.nvci.
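The file copy in this workaround can be scripted on the server that holds the dumpconfig output. The following Python sketch is one possible way to do it, under stated assumptions: the two dump files sit in a single directory on a local filesystem, and they are named exactly <sc1_hostname>.tod and <sc1_hostname>.nvci. The function name copy_dumpconfig_files and the example path and hostnames are hypothetical.

```python
import shutil
from pathlib import Path

# File suffixes named in this document for the dumpconfig output.
SUFFIXES = (".tod", ".nvci")


def copy_dumpconfig_files(dump_dir, sc1_hostname, sc0_hostname):
    """Copy <sc1_hostname>.tod/.nvci to <sc0_hostname>.tod/.nvci.

    Both source files are checked before anything is copied, so a
    missing file does not leave a half-completed copy behind.
    Returns the list of paths that were written.
    """
    dump_dir = Path(dump_dir)
    sources = [dump_dir / f"{sc1_hostname}{suffix}" for suffix in SUFFIXES]
    missing = [str(path) for path in sources if not path.is_file()]
    if missing:
        raise FileNotFoundError("missing dumpconfig files: " + ", ".join(missing))

    written = []
    for source, suffix in zip(sources, SUFFIXES):
        target = dump_dir / f"{sc0_hostname}{suffix}"
        shutil.copy2(source, target)  # copies contents and timestamps
        written.append(target)
    return written


if __name__ == "__main__":
    # Hypothetical directory and hostnames, for illustration only.
    for path in copy_dumpconfig_files("/export/scdump", "4800-sc1", "4800-sc0"):
        print("wrote", path)
```

The originals are left in place, so the SSC1 copies remain available if SSC1 ends up as main after the power cycle.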
Keywords: firmware, 5.13.x, 3800, 6800, sunfire