CHAPTER 2
System Management Services 1.2 Release Notes
This chapter contains the release notes for System Management Services 1.2 on Sun Fire 15K servers and covers the following topics:
This section contains known limitations that involve SMS on the Sun Fire 15K system.
DR is not supported on I/O boards. However, you can hot-plug hPCI cards on I/O boards to reconfigure I/O capacity dynamically. Do not use the psradm (1M) command concurrently with a hot-swap operation on the same domain.
Do not attempt to perform DR operations on a MaxCPU system board.
smsversion does not automatically implement SMS 1.2 features, such as IPv6, on domains. This must be done manually. If you return to SMS 1.1 from SMS 1.2, smsversion does not automatically restore domain configuration settings. This must be done manually. Refer to BugId 4484851.
This section contains general issues that involve SMS on Sun Fire 15K systems.
Each system controller (SC) must be configured for the TCP/IP network to which it is attached. Refer to the System Administration Guide: Resource Management and Network Services of the Solaris 9 System Administrator Collection for details on planning and configuring a TCP/IP-based network. SMS supports both IPv4 and IPv6 configurations.
In this release, the SC supports network connections through the RJ45 jacks on the faceplate of each SC. These correspond to the network interfaces hme0 and eri1 under Solaris software for each SC. You must configure hme0 or eri1 on each SC with appropriate information for your TCP/IP network. Using this configuration, each SC is known to external network applications by a separate IP hostname and address.
Each SC operates in one of two mutually exclusive modes: main or spare. The SC that is in main mode is the SC that controls the machine. The SC that is in spare mode acts as a spare that automatically takes over if the main SC fails. It is important to know which system controller is the main SC and which is the spare SC. To determine the SC role, log in to the SC and use the following command:
External network-based applications such as Sun Management Center, telnet, and others must be given the IP hostname of the main system controller. In the case of an SC failover, these applications need to be restarted with the IP address of the new main SC.
Note - Any changes made to the network configuration on one SC using smsconfig -m must be made to the other SC as well. Network configuration is not automatically propagated.
Disks intended to be used on a Sun Fire 15K system must be installed using a Sun Fire 15K machine. Policy placed in /etc/inet/inetd.conf must be added manually to /etc/inet/ipsecinit.conf as well.
Whenever policy is taken out of /etc/inet/inetd.conf it must be removed manually from /etc/inet/ipsecinit.conf also.
When a board breaker is turned off and the board is ready to be taken out of the system, I2C timeout errors will be seen. These messages are notifications and do not indicate that an error has occurred. They can be ignored.
Software documentation for this release is provided, in PDF form, at the following location:
/cdrom/cdrom0/System_Management_Services_1.2/Docs
These PDF files are named by part number. For your convenience, here are the associated document titles:
816-3267-10.pdf - System Management Services (SMS) 1.2 Administration Guide
816-3268-10.pdf - System Management Services (SMS) 1.2 Reference Manual
816-3269-10.pdf - System Management Services (SMS) 1.2 Installation Guide and Release Notes
816-3285-10.pdf - Sun Fire 15K Software Overview Guide
816-4279-10.pdf - System Management Services (SMS) 1.2 Dynamic Reconfiguration User Guide
Due to a late arrival in the software, you may see slight differences between the screen snapshots shown in the Installation Guide and what appears on your screen during installation.
The System Management Services (SMS) 1.2 Reference Manual contains corrected text for each of the following man pages, but the man pages themselves do not.
The list of valid console escape characters given for the -e option is incorrect. You can use any characters other than those listed.
The enablecomponent and disablecomponent manpages do not contain support for Paroli modules on wPCI boards.
The following operand is supported:
The following paroli_link forms are valid:
sc0:sms-user:> disablecomponent IO7/PAR0
sc0:sms-user:> showcomponent
Component PARS at IO7/PAR0 is disabled <no reason given>
Domain Down is missing from the list of domain statuses. Domain Down indicates that the domain is down and setkeyswitch is set to ON, DIAG or SECURE. To restore the domain use:
For more information on showplatform, refer to Chapter 7 in the System Management Services (SMS) 1.2 Administrator Guide .
The smsconfig man page command synopsis does not list options for adding domain users or removing platform users. The -a and -r options need to be added to each list:
smsconfig -a|-r -u username -G admn|oper|svc platform
smsconfig -a|-r -u username -G admn|rcfg domain_id
This section contains bugs fixed since SMS 1.1.
If setkeyswitch is already running for a domain, and you try to run it again, an error message is printed, but the return code is 0. A non-zero result would indicate failure.
When failover occurs, pcd receives poweron events from esmd . pcd clears the test status field of those boards being reported as powered on by esmd (even though in reality they are not being powered on).
A console session does not connect if dxs/dca are not running.
If a domain does not perform an environmental shutdown quickly enough, dsmd may leave it off. esmd is not sending a recover event to dsmd .
frad messages in the message log files sometimes contain a bad string in place of the FRUID. This does not crash the daemon and nothing needs to be done.
The following commands should not be executable by platsvc:
disablecomponent , enablecomponent , flashupdate , poweron , poweroff , resetsc , setbus , setfailover .
showdate privileges are incorrect and allow all users access to both the platform and the domains. showdate should be executable as follows:
Platform administrator, operator and service can only run showdate for the platform. Domain administrator and configurator can only run showdate on the domain for which they have privileges.
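The intended access rule can be sketched as a small check. This is an illustrative Python model only, not SMS code: the function name, the role labels, and the `domains` argument are assumptions made for the example.

```python
# Illustrative model of the intended showdate access rule (not SMS source code).
# The role labels below are hypothetical names chosen for this sketch.
PLATFORM_ROLES = {"platform_admin", "platform_operator", "platform_service"}
DOMAIN_ROLES = {"domain_admin", "domain_configurator"}


def may_run_showdate(role, target, domains=()):
    """Return True if `role` may run showdate against `target`.

    target  -- "platform" or a domain letter such as "A"
    domains -- domain letters for which a domain role holds privileges
    """
    if role in PLATFORM_ROLES:
        # Platform roles may query the platform only.
        return target == "platform"
    if role in DOMAIN_ROLES:
        # Domain roles may query only the domains they hold privileges for.
        return target != "platform" and target in domains
    # Any other role is refused.
    return False
```

Under this model, a platform operator asking for domain A is refused, while a domain administrator for A is allowed on A but not on B or on the platform.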
The usage message for the showcomponent command does not match the man page. The usage message needs to be updated to follow the accepted nomenclature.
Only the platform admin can run showkeyswitch for a domain.
Workaround : The platoper or platsvc will need to run showplatform -d domain_id to see the keyswitch state of domains.
esmd calculates available power based on how many power supplies it has probed. At startup time, esmd registers a power supply failure because it has not yet probed all the power supplies. It then logs an incorrect message about available power.
When running SMS operations (like setkeyswitch, for example) on machines with many domains (greater than 10) you see failures due to "lock acquisition failures".
Currently, smsconnectsc supports the "-q" command line option, which suppresses all messages to stdout, including prompts, and you will not get the tip console.
If POST is already running on several domains, setkeyswitch may appear to hang before starting POST. It can take 50 minutes or more to finish.
When both CSBs overheat simultaneously, esmd does not gracefully shut down the domain.
The internal network fails when Sun Management Center starts up. The domain can be reached by the external network but not by the internal network.
A library routine is trying to get status. This does not affect the operation, only the return code.
After starting SMS but before the SC has become main, hwad and fomd error messages are printed in the platform log. These error messages stop once the SC has become main.
After running setfailover force, the desired new main sometimes has problems becoming main. pcd repeatedly fails to start up. The SC eventually gives up and remains in an UNKNOWN state until it is either reset or SMS is cycled. The old main comes back up, does not detect interrupts, and then assumes the main role.
Files under the /etc directory are not backed up by the smsbackup command. These include but are not limited to: /etc/hosts , /etc/nsswitch.conf , /etc/group and /etc/hostname.* Consequently, an smsrestore does not restore a system to its previous working state completely.
dsmd attempts to clear recordstops after the hardware state dump is taken. The recordstop may not be cleared if the lowest numbered expander board is unconfigured. This causes dsmd to continue taking recordstop dumps indefinitely.
After starting SMS on main and spare, the platform message file does not get copied to spare. Other files in /var/opt/SUNWSMS/adm/A...R get copied once when starting failover, but never again. pcd files get propagated but other files don't.
It is not clear when SMS is loaded and ready for use.
Workaround : Use the showfailover command. When it completes, SMS is ready.
The pcd database and checkpoint files failed to propagate to the other SC before the failover occurs.
Both SCs' clocks are phase locked when SMS is running. This creates a failover without the benefit of having SMS phase locking the system clocks. This, naturally, leads to a DStop.
Cannot specify IPv6 addresses.
smsconfig should set the following IP ndd variables:
to false using ndd . These settings should be configured to persist across reboots (add them to the appropriate rc script).
The following are known SMS 1.2 software bugs.
setkeyswitch may hang after you send a control-c (SIGINT) signal.
Workaround : In the event control-c doesn't work you can regain the prompt by killing the process using kill -9 .
esmd logs all environmental events that affect one or more domains to the platform log but not the domain log.
Workaround : None. Refer to the platform log where the messages are logged.
After a failover, kmd does not delete security associations on the domain. The security associations (SAs) are associated with socket connections between DCA to and from DCS or DXS to and from CVCD. The SAs for the SC which failed over are the ones which should have been deleted.
The SAs would be useful only for a client on the failed over SC with sockets bound to the ports in the SAs.
Workaround : Use the Solaris ipseckey (1M) command on the domain to delete SAs which have the IP address of the failed over SC.
After a failover/takeover, the following errors are sometimes seen when failover is activated and file propagation begins:
" /var/opt/SUNWSMS/data/.failover/chkpt/chkpt.list " failed - "rcmd: socket: Cannot assign requested address."
This prevents file propagation from working.
Workaround : None. File propagation will take place eventually.
If a failover occurs while dsmd is performing a domain recovery, dsmd may not complete the domain recovery.
Workaround : Complete the recovery manually using setkeyswitch off , setkeyswitch on and, if necessary, booting the domain.
If a failover occurs in the middle of a rcfgadm operation, the operation fails when restarted after the failover.
Whenever you turn an hPCI board on and off, esmd logs messages indicating that its cassettes were inserted/removed.
Booting 8 domains in parallel to the OS level could result in a failover when the SC runs out of memory.
Workaround : Do not boot 8 domains in parallel.
dsmd distinguishes two types of domain reboot. A domain reboot to recover from software failures, such as a domain panic or heartbeat stop, is performed with the minimal POST. A reboot to recover from hardware failures, such as a domain stop, or from repeated software failure is performed with the regular POST. Currently, the dsmd-invoked POSTs always use the hpost level specified in the .postrc file, and this hpost level does not change between ASR retries. dsmd should handle such boot failures by retrying the ASR reboot, with POST invoked at a higher hpost level.
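The desired retry behavior can be sketched roughly as follows. This is a hypothetical Python model, not dsmd code: the numeric levels, the step size, and the function name are placeholders and do not correspond to real hpost level values.

```python
# Hypothetical model of the ASR retry policy described above (not dsmd source).
# The numeric levels and the step are placeholders, not real hpost values.
MINIMAL_LEVEL = 1   # stand-in for the minimal POST
REGULAR_LEVEL = 2   # stand-in for the regular POST
LEVEL_STEP = 1      # assumed increase applied on each ASR retry


def post_level(failure, attempt, repeated_software_failure=False):
    """Pick a POST level for an automatic reboot.

    failure -- "software" (panic, heartbeat stop) or "hardware" (domain stop)
    attempt -- 0 for the first reboot, 1 and up for ASR retries
    """
    if failure not in ("software", "hardware"):
        raise ValueError("failure must be 'software' or 'hardware'")
    if failure == "hardware" or repeated_software_failure:
        base = REGULAR_LEVEL
    else:
        base = MINIMAL_LEVEL
    # Raise the level on every retry instead of reusing the same one.
    return base + LEVEL_STEP * attempt
```

The point of the sketch is the last line: each retry runs at a higher level than the previous attempt, which is the behavior the current release lacks.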
When the system brings up a large number of domains, commands like showplatform will not display all domain nodenames at once. It can take several iterations to complete the display.
Workaround : Wait until dsmd finishes.
The SC should set the frame name which is displayed on the LCD of the Frame Manager.
The SC should signal faults with itself, the other SC or the system it is monitoring on the Frame Manager's amber LEDs.
The following messages have been seen in the platform log:
Workaround : Ignore the messages.
If -o unassign -c disconnect is used, the unassign is passed as an option to a domain function. In this case, the unassign is performed with the domain administrator privileges even if the user on the SC has platform administrator privileges.
Workaround : For the following example there are two possible workarounds.
sc0:sms-svc:>rcfgadm -da -v -c disconnect -o unassign SB0
This fails because SB0 is not in Domain A's available component list.
Add SB0 to the available component list of domain a.
You must have both domain and platform administrator privileges then run rcfgadm twice. First to disconnect SB0 (using domain privileges) and then unassign it (using platform privileges).
The following message has been seen in the domain log:
dxs[8753]-C(): [4911 12439774264309 ERR ConsoleService.cc 506] DXS - maximum number of connected consoles reached
This indicates that the maximum number of console processes has been reached.
Workaround : Close some open consoles. If that does not work, kill (1) the console process.
Whenever any of the power converters on the SC are powered off, the SMS poweron command will show that board as off, even though it is up and running. Sometimes showboards -v will show that the spare SC is off when the SC is on and failover is active.
Workaround : Make sure all power converters are on. poweroff and poweron the spare SC.
When some of the domains are trying to recover from a failure and dsmd core dumps, dsmd can lose the recovery state.
Workaround : Reboot the domain using setkeyswitch off , setkeyswitch on .
smsconnectsc asks the user if they want to power on the other SC, then it does the poweron and exits without printing any further instructions or information. It should automatically connect to the SC after it has powered it on and not prompt.
If a wPCI ASIC overheats, you may lose the ASIC.
The comment in the kmd_policy.cf file is inaccurate. It states that specific domains should be identified using an integer in 0 - 17. The file should state a letter in A - R should be used to identify a domain.
Workaround : Use domain letters in the kmd_policy.cf file rather than numbers to identify specific domains.
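If existing notes or scripts refer to domains by number, a small conversion helper may be useful when editing the file. The sketch below is illustrative Python, and it assumes that 0 corresponds to A and 17 to R; the release notes give only the two ranges, not the mapping, and the function name is hypothetical.

```python
import string

# Domains are identified by the letters A through R (18 domains).
DOMAIN_LETTERS = string.ascii_uppercase[:18]


def domain_letter(index):
    """Convert a 0-17 domain index to its letter.

    Assumes index 0 maps to "A" and index 17 maps to "R".
    """
    if not 0 <= index < len(DOMAIN_LETTERS):
        raise ValueError("domain index must be in the range 0-17")
    return DOMAIN_LETTERS[index]
```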
SMS 1.2 software supports disabling and enabling Paroli modules on wPCI boards. The man pages do not list paroli_link as a valid form.
Workaround : See disablecomponent and enablecomponent Missing Paroli Link Operand for examples on blacklisting Paroli modules. Refer to the System Management Services (SMS) 1.2 Reference Manual for the corrected text.
esmd detects a voltage condition but fails to turn the Paroli off.
smsrestore for 1.2 restores an incompatible version of the MAN.cf to SMS 1.1. Switching back from 1.2 to 1.1 once the new MAN features have been enabled is not supported.
Workaround : Rerun smsconfig after installation and smsversion to 1.2.
You can lose clock source, causing domains to DSTOP.
The absolute paths in the listed crontab entries are incorrect.
Workaround : These entries are not implemented in this release. Remove the following crontab entries:
10 4 1 * * /var/opt/SUNWSMS/bin/codlogrotate # SUNWSMSop
0 10 * * 1 /var/opt/SUNWSMS/bin/audithotspares # SUNWSMSop
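Removing the entries by hand in an editor is sufficient. For sites that prefer to script the change, the filtering step might look like the following Python sketch. The function name is hypothetical, and the sketch only transforms crontab text that you supply; it does not read or install a crontab itself.

```python
# Commands named in the two crontab entries that should be removed.
OBSOLETE_COMMANDS = (
    "/var/opt/SUNWSMS/bin/codlogrotate",
    "/var/opt/SUNWSMS/bin/audithotspares",
)


def drop_obsolete_entries(crontab_text):
    """Return crontab_text without the lines that run the obsolete commands."""
    kept = [
        line
        for line in crontab_text.splitlines()
        if not any(command in line for command in OBSOLETE_COMMANDS)
    ]
    # Preserve a trailing newline when any entries remain.
    return "\n".join(kept) + ("\n" if kept else "")
```

One possible use is to save the current crontab to a file, pass its contents through this function, review the result, and then install the reviewed file.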
Issuing the reboot command on a domain, issuing the boot command after shutdown on a domain and some dsmd ASR reboots will cause a domain to panic.
Workaround : Install the patch associated with this BugId. The patch is available at: http://sunsolve.sun.com. Until the patch can be installed, you can use setkeyswitch standby , setkeyswitch on, to reboot the domain.
This can cause setkeyswitch to hang.
Workaround : Restart tmd and dsmd .
The list of valid escape characters is invalid. The only characters you cannot use are: # @ ^ & ? * = . |
Workaround : Use any character other than those listed. Refer to the System Management Services (SMS) 1.2 Reference Manual for the corrected text.
This will only occasionally happen.
During complex operations, for example, setkeyswitch (1M), it is possible that the pcd on the Spare SC can get out-of-sync with the pcd on the Main SC. If this happens when a failover occurs then the new Main SC may not recognize a given domain. This will leave the domain unmonitored thus disabling console access and domain logging from the SC.
Workaround : Execute a command such as addtag (1M) after the setkeyswitch completes. This has the effect of updating the pcd and, thus, propagating it. The other option is to use setdatasync (1M)'s backup option to propagate it. However, the platform message logs on the Spare SC will be overwritten by the Main SC's. This is bug:
4619939 setdatasync backup overwrites platform message logs on SPARE SC
Depending on which CP is degraded, DStops may not be handled.
Once in a while the domain console hangs.
A 1 is returned instead of a 0 when showplatform (1M) successfully completes.
When DSMD recovers a domain after a platform power failure, POST may fail on the domain one or more times, but DSMD will retry POST until it is able to restart the domain.
"Domain Down" is missing from the domain status list in the showplatform man page.
Workaround : See showplatform Missing Domain Down for an explanation of "Domain Down ." Refer to the System Management Services (SMS) 1.2 Reference Manual for the corrected text.
If esmd detects a hot sensor within a minute after it starts up, it may decrease fan speeds in spite of the sensor.
The syntax for smsconfig is incorrect. The -a option only shows platform users and the -r option only shows domain users. Both options need their complement added.
Workaround : See smsconfig Options for Adding and Removing Users Incomplete for an example of the correct syntax. Refer to the System Management Services (SMS) 1.2 Reference Manual for the corrected text.
Normally, if failover happens in the middle of a cmdsync command execution, the new main continues and completes the commands before it disables failover. Sometimes, however, failover is disabled before the commands have finished running and they do not complete.
Workaround : Rerun the commands manually.
The platform administrator does not have access to the /etc/opt/SUNWSMS/config/ domain_id / directories. Domain-specific blacklist and postrc entries will not be visible to the administrator if he only has platadmn privileges.
Workaround : None for a platadmn, but obtaining domain administrator privileges would allow you to view the domain-specific files.
A thread in fomd can get caught in a loop and use a lot of CPU cycles.
Workaround : Stop and restart SMS.
osdTimeDeltas does not get propagated to the Spare SC. This may throw off the time-of-day for all domains.
Workaround : Use setdatasync (1M) to propagate the file.
The failure could happen when 18 domains are trying to boot.
Workaround : Stop and restart SMS.
Boards that are not powered-off after all domains are brought down may cause a DStop when the domains are brought back up. Certain conditions must be met for this to occur.
Workaround : Power off all boards after all domains have been brought down.
Some of the contents of an smsbackup from SMS1.1 are not compatible with SMS1.2. If an smsrestore is performed in SMS1.2 using an SMS1.1 backup file, SMS will not start up.
/etc/opt/SUNWSMS/SMS/config/esmd_tuning.txt
/etc/opt/SUNWSMS/SMS/config/fomd.cf
/etc/opt/SUNWSMS/SMS/startup/ssd_start
/etc/opt/SUNWSMS/SMS/startup/sms_env.sh
prior to restoring a backup file created by SMS1.1.
After running smsrestore , replace the restored files with the ones saved above.
If a relative path name is passed to the command, the backup will fail. In two cases, . and ./, the command prints:
/opt/SUNWSMS/bin/smsbackup.
smsbackup: Backup to tape succeeded: ./sms_backup.1.2.cpio
SMS backup complete.
This is incorrect. No file is generated to the current directory.
Workaround : Use absolute path name.
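When the backup is driven from a wrapper script, the target directory can be normalized before it is handed to the command. The following is a minimal Python sketch of that step only; the function name is hypothetical, and the sketch does not invoke smsbackup.

```python
import os


def backup_directory(path):
    """Return an absolute form of `path`, resolved against the current directory."""
    # os.path.abspath turns "." or "./" into the full current working directory.
    return os.path.abspath(path)
```

For example, `backup_directory(".")` yields the full path of the current working directory, which can then be passed in place of the relative name.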
This section contains the synopses and Sun BugID numbers of the more important bugs that have been discovered regarding MAN. This list does not include all bugs.
While net booting a domain using the SC as the install server, and going over the MAN, the following error is displayed while the Solaris software is coming up:
ifconfig: setifflags: SIOCSLIFFLAGS: eri1: Cannot assign requested address
If sys-unconfig is run on a domain preconfigured with Solaris software, the /etc/hostname.dman0 files are lost. They are not recreated on a reconfiguration boot, and the MAN network between the SC and the domain does not come up.
Workaround : Refer to Unconfigured Domains .
If a boot disk that was installed on another domain is used to boot a domain, the dman0 interface on the domain will be configured with the wrong IP address.
Workaround : Refer to Unconfigured Domains .
If there are already installed domains and you have changed the MAN I1 network configuration using smsconfig -m then you will need to configure the MAN network information on the already installed domains by hand.
Workaround : Refer to Unconfigured Domains .
For certain cases there may be a delay in the start up of the I1 network.
Workaround : Run ifconfig (1).
You must be logged in as superuser on the SC.
This section contains bugs fixed since SMS 1.2.
Volume Manager can not cope with some formats of CDROM
Holding proc_t->p_lock while allocating memory leads to hung clock() & heartbeat.
xntpd on the domain should gradually adjust the clock to sync with the sc clock. Instead, a message appears about a half hour after starting xntpd :
sun15-b xntpd[1324]: [ID 774427] time reset (slew) -54.206802 s
The amount printed is the amount of difference between the sc and domain clocks, but the clocks are never in sync.
When the external network is configured so that there are two communities, with hme0 in one community and eri1 in the other, IPMP fails the path group with eri1 in it.
This section contains the synopses and Sun BugID number of the more important bugs that have been discovered regarding the Sun Fire 15K system. This list does not include all bugs.
ip_rput_dlpi(fcip0): DL_ERROR_ACK error message on boot from cd image
The following error messages are seen when doing an installation to set the SC as the install server.
This indicates that the IP over Fibre Channel device instance 0 does not exist.
The Sun Fire 15K server does not currently support USB devices. Due to interaction with the corresponding software device driver in Solaris, users may experience significant delays while booting SCs and domains. In addition, messages similar to the following might be seen in the console output during boot or in system log files:
Since USB devices are not yet supported on the Sun Fire 15K, there is no workaround that will enable them. However, adding the following line to the /etc/system file on the SC and on each domain will eliminate the unnecessary boot delays and warning messages:
Copyright © 2002, Sun Microsystems, Inc. All rights reserved.