CHAPTER 7
Domain Status
Status functions return measured values that characterize the state of the server hardware or software. As such, these functions are used both to provide values for status displays and input to monitoring software that periodically polls status functions and verifies that the values returned are within normal operational limits. Monitoring and event detection functions that use the status functions are described in this chapter.
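The polling pattern described above can be sketched roughly as follows. This is an illustration only, not SMS code: the `Limit` class, the `check_readings` function, and the sample value names and ranges are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Limit:
    """Inclusive normal operating range for one measured value (hypothetical)."""
    low: float
    high: float


def check_readings(readings: Dict[str, float], limits: Dict[str, Limit]) -> List[str]:
    """Return the names of readings that fall outside their normal range.

    Readings with no configured limit are ignored rather than flagged.
    """
    out_of_range = []
    for name, value in readings.items():
        limit = limits.get(name)
        if limit is not None and not (limit.low <= value <= limit.high):
            out_of_range.append(name)
    return out_of_range


# Hypothetical values, as one polling cycle of a monitor might see them.
limits = {"cpu_temp_c": Limit(5.0, 85.0), "core_voltage_v": Limit(1.6, 1.8)}
readings = {"cpu_temp_c": 91.0, "core_voltage_v": 1.7}
violations = check_readings(readings, limits)
```

A monitor built this way would call `check_readings` once per polling interval and hand any violations to its event handling code.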
The software state consists of status information provided by the software running in a domain. The identity of the software component currently running (for example, POST, OpenBoot PROM, or Solaris software) is available. Additional status information is available (booting, running, panicking).
SMS software provides the following commands to display the status of the software, if any, currently running in a domain.
showboards (1M) displays the assignment information and status of the DCUs. These include the following: Location, Power, Type of board, Board status, Test status and Domain.
If no options are specified, showboards displays all DCUs to the platform administrator, including those that are assigned or available. For the domain administrator or configurator, showboards displays only the DCUs in domains for which the user has privileges, including those boards that are assigned or available and in the domain's available component list.
If domain_id | domain_tag is specified, this command displays which DCUs are assigned or available to the given domain. If the -a option is used, showboards displays all boards including DCUs.
For examples and more information, see To Obtain Board Status and refer to the showboards man page.
showdevices (1M) displays configured physical devices on system boards and the resources made available by these devices. Usage information is provided by applications and subsystems that are actively managing system resources. The predicted impact of a system board DR operation can optionally be displayed by performing an offline query of managed resources.
showdevices gathers device information from one or more Sun Fire 15K domains. The command uses dca (1M) as a proxy to gather the information from the domains.
For examples and more information, see To Obtain Device Status and refer to the showdevices man page.
showenvironment (1M) displays environmental data including: Location, Device, Sensor, Value, Unit, Age, and Status. For fan trays, Power, Speed, and Fan Number are displayed. For bulk power, Power, Value, Unit, and Status are shown.
If a domain (domain_id | domain_tag) is specified, environmental data relating to that domain is displayed, provided that the user has domain privileges for that domain. If a domain is not specified, all domain data that the user is permitted to view is displayed.
DCUs (for example, CPU, I/O) belong to a domain, and you must have domain privileges to view their status. Environmental data relating to such things as fan trays, bulk power, or other boards is displayed without domain permissions. You can also specify individual reports for temperatures, voltages, currents, faults, bulk power status, and fan tray status with the -p option. If the -p option is not present, all reports are shown.
For examples and more information, see Environmental Status and refer to the showenvironment man page.
showobpparams (1M) displays OpenBoot PROM bringup parameters. showobpparams allows a domain administrator to display the virtual NVRAM and REBOOT parameters passed to OpenBoot PROM by setkeyswitch (1M).
For examples and more information, see Setting the OpenBoot PROM Variables and refer to the showobpparams man page.
showplatform (1M) displays the available component list and domain state of each domain.
A domain is identified by a domain_tag if one exists. Otherwise it is identified by the domain_id , a letter in the set A - R. The letter set is case insensitive. The Solaris hostname is displayed if one exists. If a hostname has not been assigned to a domain, Unknown is printed.
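The identification rules above can be expressed as a short sketch. The function names here are hypothetical, and normalizing the domain_id to upper case is an assumption made for the example; the text states only that the letter set is case insensitive.

```python
from typing import Optional

VALID_DOMAIN_IDS = "ABCDEFGHIJKLMNOPQR"  # the letters A through R


def normalize_domain_id(domain_id: str) -> str:
    """Return the upper-case form of a domain_id, or raise ValueError."""
    candidate = domain_id.strip().upper()
    if len(candidate) != 1 or candidate not in VALID_DOMAIN_IDS:
        raise ValueError(f"domain_id must be a single letter A-R, got {domain_id!r}")
    return candidate


def domain_label(domain_id: str, domain_tag: Optional[str] = None) -> str:
    """Prefer the domain_tag when one exists; otherwise use the domain_id."""
    return domain_tag if domain_tag else normalize_domain_id(domain_id)


def hostname_label(hostname: Optional[str]) -> str:
    """Show the Solaris hostname, or 'Unknown' when none has been assigned."""
    return hostname if hostname else "Unknown"
```

For example, `domain_label("c")` yields `C`, while `domain_label("c", "eng2")` yields the hypothetical tag `eng2`.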
The following is a list of domain statuses:
Unknown
The domain state could not be determined. For Ethernet addresses, it indicates that the domain idprom image file does not exist. Contact your Sun service representative.
Powered Off
Keyswitch Standby
Running Domain POST
Loading OBP
Booting OBP
Running OBP
In OBP Callback
The domain has been halted and has returned to the OpenBoot PROM.
Loading Solaris
Booting Solaris
Domain Exited OBP
OBP Failed
OBP in sync Callback to OS
The OpenBoot PROM is in sync callback to the Solaris software.
Exited OBP
In OBP Error Reset
The domain is in OpenBoot PROM due to an error reset condition.
Solaris Halted in OBP
Solaris software is halted and the domain is in OpenBoot PROM.
OBP Debugging
Environmental Domain Halt
Booting Solaris Failed
Loading Solaris Failed
Running Solaris
Solaris Quiesce In-Progress
Solaris Quiesced
Solaris Resume In-Progress
Solaris Panic
Solaris Panic Debug
Solaris Panic Continue
Solaris Panic Dump
Solaris Halt
Solaris Panic Exit
Environmental Emergency
Debugging Solaris
Solaris Exited
Domain Down
The domain is down and the keyswitch is in the ON, DIAG, or SECURE position.
In Recovery
Domain status reflects two cases. The first is that dsmd is busy trying to recover the domain and the second is that dsmd has given up trying to recover the domain. In the second case you will always see "Domain Down." In the first case you will either see "Domain Down" or some other status. To recover from a "Domain Down" in either case, use:
For examples and more information, see To Obtain Domain Status and refer to the showplatform man page.
showxirstate (1M) displays CPU dump information after sending a reset pulse to the processors. This save state dump can be used to analyze the cause of abnormal domain behavior. showxirstate creates a list of all active processors in that domain and retrieves the save state information for each processor.
showxirstate data resides, by default, in /var/opt/SUNWSMS/adm/domain_id/dump.
For examples and more information, refer to the showxirstate man page.
During normal operation, the Solaris environment produces a periodic heartbeat indicator readable from the SC. dsmd detects the absence of heartbeat updates for a running Solaris system as a hung Solaris. Hangs are not detected for any software components other than the Solaris software.
Note - The Solaris software heartbeat should not be confused with the SC-to-SC (hardware) heartbeat or the heartbeat network, both used to determine the health of failover. For more information, see SC Heartbeats.
The Solaris heartbeat is visible only indirectly: dsmd acts on it when heartbeat updates have been absent long enough to indicate that the Solaris software is hung. Upon detecting a Solaris software hang, dsmd conducts an ASR.
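The hang detection logic can be illustrated with a small sketch. It is not the dsmd implementation: the `HeartbeatMonitor` class, its method names, and the 30-second timeout are assumptions chosen for the example. It models only what the text states, namely that a hang is inferred when a running Solaris system stops updating its heartbeat for long enough.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks the last time a domain's heartbeat value changed (hypothetical)."""
    timeout_s: float            # assumed threshold; the real duration is not given here
    last_value: int = -1
    last_change_s: float = 0.0

    def observe(self, value: int, now_s: float) -> None:
        """Record a heartbeat sample taken at time now_s."""
        if value != self.last_value:
            self.last_value = value
            self.last_change_s = now_s

    def is_hung(self, now_s: float, solaris_running: bool) -> bool:
        """Report a hang only for a running Solaris whose heartbeat is stale."""
        if not solaris_running:
            return False
        return (now_s - self.last_change_s) >= self.timeout_s


monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.observe(value=1, now_s=0.0)
monitor.observe(value=2, now_s=10.0)   # heartbeat advanced
monitor.observe(value=2, now_s=25.0)   # no change since t=10
```

Passing `solaris_running` explicitly reflects the statement that hangs are not detected for software components other than the Solaris software.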
The hardware status functions report information about the hardware configuration, hardware failures detected, and platform environmental state.
The following hardware configuration status is available from the Sun Fire 15K system management software:
Hardware components physically present on each board (as detected by POST)
Hardware components not in use because they failed POST
Presence or absence of all hot-pluggable units (HPUs) (for example, system boards)
Hardware components not in use because they were on the blacklist when POST was invoked (see Power-On Self-Test (POST))
Contents of the SEEPROM for each FRU including the part number and serial number
The hardware configuration supported by the functions described in this section excludes I/O adaptors and I/O devices. showboards displays all hardware components that are present.
As described in Blacklist Editing, the current contents of the component blacklist(s) can always be viewed and altered.
The following hardware environmental measurements are available:
Temperatures
Power voltage and amperage
Fan status (stopped, low-speed, high-speed, failed)
Power status
Faults
showenvironment displays every environmental measurement that can be taken within the Sun Fire 15K rack.
Platform administrators can view any environment status on the entire platform. Domain administrators can see the environment status only for those domains for which they have privileges.
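The visibility rules for environmental status can be sketched as a simple filter. The `Reading` type, the `visible_readings` function, and the sample locations are hypothetical. The sketch assumes that each reading is tagged either with the domain that owns its board or with no domain at all (for items such as fan trays and bulk power, which are displayed without domain permissions).

```python
from typing import List, NamedTuple, Optional, Set


class Reading(NamedTuple):
    location: str
    domain: Optional[str]   # owning domain_id, or None for platform-level items


def visible_readings(readings: List[Reading], is_platform_admin: bool,
                     privileged_domains: Set[str]) -> List[Reading]:
    """Return the readings that a user is allowed to see."""
    if is_platform_admin:
        # Platform administrators can view status for the entire platform.
        return list(readings)
    # Other users see platform-level items plus their own domains' boards.
    return [r for r in readings
            if r.domain is None or r.domain in privileged_domains]


# Sample data for illustration only.
readings = [
    Reading("SB0", "A"),
    Reading("IO4", "B"),
    Reading("FT0", None),   # fan tray: not tied to a domain
]
domain_a_view = visible_readings(readings, False, {"A"})
```

Boards that are available but not yet assigned to a domain are not modeled in this sketch.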
As described in HPU LEDs, the operating indicator LEDs on Sun Fire 15K HPUs show which HPUs are powered on, and the OK to remove indicator LEDs show which HPUs can be unplugged.
dsmd monitors the Sun Fire 15K hardware operational status and reports errors. The occurrence of some errors is reported directly to the SC (for example, the error registers in every ASIC propagate to the SBBC on the SC, which provides an error summary register). Although the occurrence of some errors is indicated by an interrupt delivered to the SC, other error states may require the SC to monitor hardware registers for error indications. When a hardware error is detected, esmd follows the established procedures for collecting and clearing the hardware error state.
The following types of errors can occur on Sun Fire 15K hardware:
Domain stops, fatal hardware errors that terminate all hardware operations in a domain
Record stops that cause the hardware to stop collecting transaction history when a data transfer error (for example, parity) occurs
SPARC processor error conditions such as RED_state/watchdog reset
Nonfatal ASIC-detected hardware failures
Hardware error status is generally not reported as a status. Rather, event handling functions perform various actions when hardware errors occur, such as logging errors, initiating ASR, and so forth. These functions are discussed in Chapter 9.
Note - As described in HPU LEDs, the fault LEDs, after POST completion, identify Sun Fire 15K HPUs in which faults have been discovered since they were last powered on or submitted to a power-on reset.
Proper operation of SMS depends upon proper operation of the hardware and the Solaris software on the SC. The ability to support automatic failover from the main to the spare system controller requires properly functioning hardware and software on the spare. SMS software running on the main system controller must either be functioning sufficiently to diagnose a software or hardware failure in a manner that can be detected by the spare or it must fail in a manner that can be detected by the spare.
SC-POST determines the status of system controller hardware. It tests and configures the system controller at power-on or power-on reset.
The SC will not boot if the SC fails to function.
If the control board fails to function, the SC will normally boot, but without access to the control board devices. The level of hardware functionality required to boot the system controller is essentially the same as that required for a standalone SC.
SC-POST writes diagnostic output to the SC console serial port (TTY-A). Additionally, SC-POST leaves a brief diagnostics status summary message in an NVRAM buffer that can be read by a Solaris driver and logged and/or displayed when the Solaris software boots.
SC firmware and software display information to identify and service SC hardware failures.
SC firmware and software provide a software interface that verifies that the system controller hardware is functional. This verification is used to select a working system controller as main in a high-availability SC configuration.
The system controller LEDs provide visible status regarding power and detected hardware faults as described in HPU LEDs.
Solaris software provides a level of self-diagnosis and automatic recovery (panic and reboot). Solaris software utilizes the SC hardware watchdog logic to trap hang conditions and force an automatic recovery reboot.
There are three hardware paths of communication between the SCs: two Ethernet connections, which make up the heartbeat network, and one SC-to-SC heartbeat signal. In the high-availability SC configuration, each SC uses these paths to detect hangs or failures on the other SC.
SMS practices self-diagnosis and institutes automatic failure recovery procedures, even in non-high-availability SC configurations.
Upon recovery, SMS software either takes corrective actions as necessary to restore the platform hardware to a known, functional configuration or reports the inability to do so.
SMS software records and logs sufficient information to allow engineering diagnosis of single-occurrence software failures in the field.
SMS software takes a noticeable interval to initialize itself and become fully functional. The user interfaces behave predictably during this interval. Any rejections of user commands are clearly identified as due to system initialization with advice to try again after a suitable interval.
SMS software implementation uses a distributed client/server architecture. Any errors encountered during SMS initialization, due to attempts to interact with a process that has not yet completed initialization, are dealt with silently.
Copyright © 2002, Sun Microsystems, Inc. All rights reserved.