CHAPTER 9
Domain Events
Event monitoring periodically checks the domain and hardware status to detect conditions that need to be acted upon. The action taken is determined by the condition and can involve reporting the condition or initiating automated procedures to deal with it. This chapter describes the events that are detected by monitoring and the requirements with respect to actions taken in response to detected events.
SMS logs all significant events (those that require actions other than logging or updating user monitoring displays) in message files on the SC. The log includes information to support subsequent servicing of the hardware or software.
SMS writes log messages for significant hardware events to the platform log file, /var/opt/SUNWSMS/adm/platform/messages.
The actions taken in response to events that crash domain software systems include automatic system recovery (ASR) reboots of all affected domain(s), provided that the domain hardware (or a bootable subset thereof) meets the requirements for safe and correct operation.
SMS logs all significant actions taken in response to an event (those other than logging or updating user monitoring displays). Log messages for significant domain software events and their response actions are written to the message log file for the affected domain, /var/opt/SUNWSMS/adm/domain_id/messages.
For significant hardware events that can visibly affect one or more domains, SMS also writes log messages to /var/opt/SUNWSMS/adm/domain_id/messages for each affected domain.
SMS also logs domain console, syslog, POST, and dump information, and manages sms_core files.
SMS software maintains SC-resident copies of all server information that it logs. Use the showlogs(1M) command to access log information.
The platform message log file can be accessed only by administrators for the platform, using the showlogs command.
SMS maintains separate log files for each domain. SMS log information relevant to a configured domain can be accessed only by administrators for that domain, using the showlogs command.
SMS maintains copies of domain syslog files on the SC in /var/opt/SUNWSMS/adm/domain_id/syslog. The syslog information can be accessed only by administrators for that domain, using the showlogs command.
Solaris console output logs are maintained to provide insight into what happened before a domain crashed. Console output is available on the SC for a crashed domain in /var/opt/SUNWSMS/adm/domain_id/console. The console information can be accessed only by administrators for that domain, using the showlogs command.
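The directory layout described above can be summarized in a short Python sketch. The helper name sc_log_path and the DOMAIN_LOGS set are hypothetical and exist only for illustration; the paths themselves are the ones given in this chapter.

```python
from pathlib import PurePosixPath

# Base directory for SMS logs on the SC, as given in this chapter.
SMS_ADM = PurePosixPath("/var/opt/SUNWSMS/adm")

# Per-domain log names mentioned in this chapter (hypothetical lookup set).
DOMAIN_LOGS = {"messages", "syslog", "console", "dump"}


def sc_log_path(log_type, domain_id=None):
    """Return the SC-resident path for a log type.

    With no domain_id, this sketch only knows the platform messages log.
    """
    if domain_id is None:
        if log_type != "messages":
            raise ValueError("this sketch only knows the platform messages log")
        return SMS_ADM / "platform" / "messages"
    if log_type not in DOMAIN_LOGS:
        raise ValueError(f"unknown domain log type: {log_type}")
    return SMS_ADM / domain_id / log_type
```

For example, sc_log_path("console", "A") yields /var/opt/SUNWSMS/adm/A/console, where "A" stands in for a domain_id. The sketch only builds path names; access control remains with SMS and the showlogs command.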
XIR state dumps, generated by the reset command, can be displayed using showxirstate. For more information, refer to the showxirstate man page.
Domain POST logs are for service diagnostic purposes and are not displayed by showlogs or any other SMS CLI.
The /var/tmp/sms_core.daemon files are binary files and are not viewable.
The availability of various log files on the SC supports analysis and correction of problems that prevent a domain or domains from booting. For more information refer to the showlogs man page.
Note - Panic dumps for panicked domains are available in the /var/crash logs on the domain, not on the SC.
The following table lists the SMS log information types, and their descriptions.
SMS manages the log files, as necessary, to keep the SC disk utilization within acceptable limits.
The message log daemon (mld) checks message log size, file count per directory, and age every 10 minutes. mld acts on whichever limit is reached first.
Assuming 20 directories, the defaults represent approximately 4 Gbytes of stored logs.
When a log message file reaches the size limit, mld does the following:
Starting with the oldest message file x.X, it moves that file to x.X+1, except when the oldest message file is messages.9 or the oldest core file is sms_core.daemon.1, in which case it starts with x.X-1.
For example, messages becomes messages.0, messages.0 becomes messages.1, and so on up to messages.9. When messages reaches 2.5 Mbytes, messages.9 is deleted, all files are bumped up by one, and a new empty messages file is created.
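The rotation described above can be sketched in Python as follows. This is a minimal illustration, not the mld implementation: the function name rotate and its parameters are hypothetical, and the sketch assumes the caller has already decided that the size limit was reached.

```python
from pathlib import Path

MAX_SUFFIX = 9  # messages.0 .. messages.9, per the example above


def rotate(log_dir, base="messages", max_suffix=MAX_SUFFIX):
    """Shift base -> base.0 -> ... -> base.max_suffix, then start a new base."""
    log_dir = Path(log_dir)
    oldest = log_dir / f"{base}.{max_suffix}"
    if oldest.exists():
        oldest.unlink()  # the oldest file is deleted to make room
    # Work from the oldest surviving file down so nothing is overwritten.
    for n in range(max_suffix - 1, -1, -1):
        src = log_dir / f"{base}.{n}"
        if src.exists():
            src.rename(log_dir / f"{base}.{n + 1}")
    current = log_dir / base
    if current.exists():
        current.rename(log_dir / f"{base}.0")
    current.touch()  # new empty messages file
```

Renaming from the highest suffix downward is the design choice that matters here: it guarantees that no file is overwritten before it has been moved.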
When a log file reaches the file count limit, mld does the following:
When messages or sms_core.daemon reaches its count limit, the oldest message or core file is deleted.
When a log file reaches the age limit, mld does the following:
When any message file reaches x days, it is deleted.
Note - By default, the age limit (*_log_keep_days) is set to zero and is not used.
When a post date.time.sec.log or a dump_name.date.time.sec file reaches the file size, count, or age limit, mld deletes the oldest file in the directory.
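The count and age limits can be illustrated with a small pure function. This is a sketch under stated assumptions: the name files_to_delete, its parameters, and the choice to treat the count limit as "more than max_count files" are hypothetical; only the rules (oldest file goes first, an age limit of zero is ignored) come from the text above.

```python
SECONDS_PER_DAY = 86400


def files_to_delete(files, max_count, keep_days, now):
    """Return the names that would be pruned.

    files: list of (name, mtime_seconds) tuples for one directory.
    keep_days: age limit in days; zero means the age limit is not used.
    """
    doomed = []
    survivors = sorted(files, key=lambda f: f[1])  # oldest first
    if keep_days > 0:
        cutoff = now - keep_days * SECONDS_PER_DAY
        doomed += [name for name, mtime in survivors if mtime <= cutoff]
        survivors = [f for f in survivors if f[1] > cutoff]
    # Assumption: the count limit is enforced by dropping the oldest files
    # until no more than max_count remain.
    while len(survivors) > max_count:
        doomed.append(survivors.pop(0)[0])
    return doomed
```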
Note - Post files are provided for service diagnostic purposes and are not intended for display.
For more information, refer to the mld and showlogs man pages, and see Message Logging Daemon.
SMS monitors domain software status (see Software Status) to detect domain reboot events.
Since the domain software is incapable of rebooting itself, SMS software controls the initial sequence for all domain reboots. In consequence, SMS is always aware of domain reboot initiation events.
SMS software logs the initiation of each reboot and the passage through each significant stage of booting a domain to the domain-specific log file.
SMS software detects all domain reboot failures.
Upon detecting a domain reboot failure, SMS logs the reboot failure event to the domain-specific message log.
SC-resident per-domain log files are available for failure analysis. In addition to the reboot failure logs, SMS can maintain duplicates of important domain-resident logs and transcripts of domain console output on the SC, as described in Log File Maintenance.
Domain reboot failures are handled as follows:
The first attempt to recover a domain from software failure uses a quick reboot procedure. The response to reboot or reset requests is always a fast bringup procedure.
The first attempt to recover a domain from hardware failure uses the reboot procedure. The POST default diagnostic level is used in the reboot procedure.
If the domain recovery fails during the POST run, dsmd retries POST at the default diagnostic level for up to four consecutive domain recovery failures after the first recovery attempt fails.
If the domain recovery fails during the IOSRAM layout, OpenBoot PROM download and jump, OpenBoot PROM run, or Solaris software boot, dsmd reruns POST at the default diagnostic level. dsmd retries domain recovery at the default level for up to four attempts after the first recovery attempt fails. (In total, dsmd attempts domain recovery at most five times.)
Once the system has been recovered and Solaris software has been booted, any domain failure within four hours is treated as a repeated domain failure and is recovered by running POST at the default level.
If there are no domain failures within four hours of Solaris software running, then the domain is considered successfully recovered and healthy.
A subsequent domain hardware failure is handled by the reboot procedure.
A subsequent domain software failure is handled by the quick reboot procedure, and a reboot or reset request is handled by the fast bringup procedure.
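The recovery rules above can be condensed into a small decision function. This is a sketch of one reading of those rules, not the dsmd logic: the function name, the cause labels ("software", "hardware", "request"), and the returned strings are hypothetical, and returning None after five failed attempts is an assumption, because the text does not say what happens next.

```python
MAX_ATTEMPTS = 5           # first attempt plus up to four retries
HEALTHY_WINDOW = 4 * 3600  # four hours of Solaris software running


def choose_procedure(cause, failed_attempts=0, seconds_since_boot=None):
    """Pick the recovery procedure for one domain event.

    failed_attempts: recovery attempts that have already failed.
    seconds_since_boot: time since Solaris booted, or None if not tracked.
    """
    if cause == "request":
        return "fast bringup"  # reboot and reset requests always get this
    if failed_attempts >= MAX_ATTEMPTS:
        return None  # assumption: no further automatic attempts
    repeated = (seconds_since_boot is not None
                and seconds_since_boot < HEALTHY_WINDOW)
    if failed_attempts > 0 or repeated or cause == "hardware":
        return "reboot with POST at default level"
    return "quick reboot"
```

A software failure on a domain that has been up for more than four hours returns "quick reboot", while the same failure one hour after boot is treated as a repeated failure and returns "reboot with POST at default level".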
SMS tries all ASR methods at its disposal to boot a domain that has failed booting. All recovery attempts are logged in the domain-specific message log.
When a domain panics, it informs dsmd so that a recovery reboot can be initiated. The panic is reported as a domain software status change (see Software Status).
dsmd is informed when the Solaris software on a domain panics.
Upon detecting a domain panic, dsmd logs the panic event, including related information, to the domain-specific message log.
SC-resident per-domain log files are available to assist in domain panic analysis. In addition to the panic logs, SMS can maintain duplicates of important domain-resident logs and transcripts of domain console output on the SC, as described in Log File Maintenance.
In general, after an initial panic where there has been no prior indication of hardware errors, SMS requests that a fast reboot be tried to bring up the domain. For more information, see Fast Boot.
After a panic event, dsmd tries the ASR reboot on the panicked domain. This recovery action is logged in the domain-specific message log.
The Solaris panic dump logic has been redesigned to minimize the possibility of hangs at panic time. In a panic situation, Solaris software may operate differently, either because normal functions are shut down or because they are disabled by the panic. An ASR reboot of a panicked Solaris domain is eventually started, even if the panicked domain hangs before it can request a reboot.
Because the normal heartbeat monitoring (see Solaris Software Hang Events) of a panicked domain may not be appropriate or sufficient to detect situations where a panicked Solaris domain will not proceed to request an ASR reboot, dsmd takes special measures as necessary to detect a domain panic hang event.
Upon detecting a panic hang event, dsmd logs each occurrence, including related information, to the domain-specific message log.
Upon detecting a domain panic hang, SMS aborts the domain panic (see Domain Abort/Reset) and initiates an ASR reboot of the domain. dsmd logs these recovery actions in the domain-specific message log.
SC-resident log files are available to assist in panic hang analysis. In addition to the panic hang event logs, dsmd maintains duplicates of important domain-resident logs and transcripts of domain console output on the SC, as described in Log File Maintenance.
If a second domain panic is detected shortly after recovering from a panic event, dsmd classifies the domain panic as a repeated domain panic event.
In addition to the standard logging actions that occur for any panic, the following action is taken when attempting to reboot after the repeated domain panic event.
With each successive repeated domain panic event, SMS attempts a full-test-level boot against the next untried administrator-specified degraded configuration (see ).
After all degraded configurations have been tried, successive repeated domain panic events will continue full-test-level boots using the last specified degraded configuration.
Upon determining that a repeated domain panic event has occurred, dsmd tries the ASR methods at its disposal to boot a stable domain software environment. dsmd logs all recovery attempts in the domain-specific message log.
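The way successive repeated panics walk through the administrator-specified degraded configurations can be sketched as follows. The function name and the representation of configurations as a list are hypothetical; returning None when no degraded configuration was specified is also an assumption.

```python
def config_for_repeated_panic(degraded_configs, repeat_count):
    """Select the degraded configuration for a full-test-level boot.

    degraded_configs: administrator-specified configurations, in order.
    repeat_count: 1 for the first repeated panic, 2 for the next, and so on.
    """
    if repeat_count < 1:
        raise ValueError("repeat_count starts at 1")
    if not degraded_configs:
        return None  # assumption: nothing was specified to fall back on
    # Once every configuration has been tried, keep using the last one.
    index = min(repeat_count, len(degraded_configs)) - 1
    return degraded_configs[index]
```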
dsmd monitors the Solaris heartbeat described in Solaris Software Heartbeat in each domain while Solaris software is running (see ). When the heartbeat indicator is not updated for a period of time, a Solaris software hang event occurs.
dsmd detects Solaris software hangs.
Upon detecting a Solaris hang, dsmd logs the hang event, including related information, to the domain-specific message log.
Upon detecting a Solaris hang, dsmd requests the domain software to panic in order to obtain a core image for analysis of the Solaris hang (see Domain Abort/Reset). SMS logs this recovery action in the domain-specific message log.
dsmd checks whether the domain software satisfies the request to panic. Upon determining noncompliance with the panic request, dsmd aborts the domain (see Domain Abort/Reset) and initiates an ASR reboot. dsmd logs these recovery actions in the domain-specific message log.
Although the core image taken as a result of the panic will only be available for analysis from the domain, SC resident log files are available to assist in domain hang analysis. In addition to the Solaris hang event logs, dsmd can maintain duplicates of important domain-resident logs and transcripts of domain console output on the SC.
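The escalation described above (missed heartbeat, then a panic request, then an abort and ASR reboot) can be sketched as a single decision function. The function name, the parameters, and the returned action strings are hypothetical, and both timeouts must be supplied by the caller because this chapter does not state their values. If the domain does panic as requested, the normal panic handling takes over, which this sketch does not model.

```python
def next_hang_action(heartbeat_age, hang_timeout, panic_timeout, panic_wait=None):
    """Return the next action for one domain.

    heartbeat_age: seconds since the heartbeat indicator last changed.
    hang_timeout: seconds without an update that count as a hang.
    panic_timeout: seconds to wait for the domain to honor a panic request.
    panic_wait: seconds since the panic was requested, or None if not requested.
    """
    if panic_wait is None:
        if heartbeat_age < hang_timeout:
            return "none"  # heartbeat is still being updated
        return "request panic"  # hang detected: ask for a core image
    if panic_wait < panic_timeout:
        return "wait"  # give the domain time to comply
    return "abort and ASR reboot"  # noncompliance with the panic request
```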
Changes to the hardware configuration status are considered hardware configuration events. esmd detects the following hardware configuration events on the Sun Fire 15K system.
The insertion of a hot-pluggable unit (HPU) is a hot-plug event. The following actions take place:
SMS detects HPU insertion events and logs each event and additional information to a platform message log file.
If the inserted HPU is a system board in the logical configuration for a domain, SMS also logs its arrival in the domain's message log file.
The removal of a hot-pluggable unit (HPU) is a hot-unplug event. The following actions take place:
Upon occurrence of a hot-unplug event, SMS makes a log entry recording the removal of the HPU to the platform message log file.
If the hot-unplug event is the removal of a system board from the logical configuration for a domain, SMS also logs the removal in that domain's message log file.
POST can run against different server components at different times due to domain-related events such as reboots and dynamic reconfigurations. As described in Hardware Configuration, SMS includes status from POST identifying failed-test components. Consequently, changes in the POST status of a component are considered to be hardware configuration events. SMS logs POST-initiated hardware configuration changes to the platform message log.
In general, environmental events are detected when hardware status measurements exceed normal operational limits. Acceptable operational limits depend upon the hardware and the server configuration.
esmd verifies that measurements returned by each sensor are within acceptable operational limits. esmd logs all sensor measurements outside of acceptable operational limits as environmental events to the platform log file.
esmd also logs significant actions taken in response to an environmental event (those beyond logging information or updating user displays) to the platform log file.
esmd logs significant environmental event response actions that affect one or more domain(s) to the log file(s) of the affected domain(s).
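The range check that turns sensor measurements into environmental events can be sketched as follows. The function name and the dictionary layout are hypothetical; the actual limits depend on the hardware and the server configuration and are not given in this chapter.

```python
def out_of_range(readings, limits):
    """Return (sensor, value) pairs that fall outside their limits.

    readings: {sensor_name: measured_value}
    limits:   {sensor_name: (low, high)} acceptable operational limits
    """
    events = []
    for sensor, value in readings.items():
        low, high = limits[sensor]
        if not (low <= value <= high):
            events.append((sensor, value))  # would be logged as an event
    return events
```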
esmd handles environmental events by removing from operation the hardware that has experienced the event (and any other hardware dependent upon the disabled component). Hardware can be left in service, however, if continued operation of the hardware will not harm the hardware or cause hardware functional errors.
The options for handling environmental events are dependent upon the characteristics of the event. All events have a time frame during which the event must be handled. Some events kill the domain software; some do not. Event response actions are such that esmd responds within the event time frame.
esmd can make a number of responses to environmental events, such as increasing fan speeds. In response to a detected environmental event that requires a power off, esmd undertakes one of the following corrective actions:
esmd uses immediate power off if there is no other option that meets the time constraints.
If the environmental event does not require immediate power off and the component is a MaxCPU board, esmd attempts to DR the endangered board out of the running domain and power it off.
If the environmental event does not require immediate power off and the component is a centerplane support board (CSB), esmd attempts to reconfigure the bus traffic to use only the other CSB and power the component off.
Where possible, if the environmental event does not require immediate power off and the component is any type of board other than a MaxCPU or CSB, esmd notifies dsmd of the environmental condition and dsmd sends an "orderly shutdown" request to the domain. The domain flushes uncommitted memory buffers to physical storage.
If the software is still running and a viable domain configuration remains after the affected hardware is removed, a remote DR operation to remove the hardware from the domain allows it to continue running in degraded mode.
If either of the last two options takes longer than the allotted time for the given environmental condition, esmd will immediately power off the component regardless of the state of the domain software.
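The choice among these corrective actions can be sketched as a decision function. The function name, the component labels, and the step strings are hypothetical; the deadline check is applied only to the shutdown and remote DR path, following the "last two options" wording above.

```python
def plan_power_off(component, immediate_required, elapsed=0.0, allotted=float("inf")):
    """Return the ordered steps for an event that requires a power off.

    component: "MaxCPU", "CSB", or any other board label.
    elapsed / allotted: seconds spent so far and the event's time frame.
    """
    if immediate_required:
        return ["power off immediately"]
    if component == "MaxCPU":
        return ["DR board out of running domain", "power off board"]
    if component == "CSB":
        return ["move bus traffic to the other CSB", "power off CSB"]
    # Any other board: orderly shutdown or remote DR, bounded by the deadline.
    if elapsed > allotted:
        return ["power off immediately"]
    return ["notify dsmd", "request orderly shutdown or remote DR", "power off board"]
```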
SMS illuminates the fault indicator LED on any hot-pluggable unit that can be identified as the cause of an environmental event.
So long as the environmental event response actions do not include shutdown of the system controller(s), all domain(s) whose software operations were terminated by an environmental event or the ensuing response actions are subject to ASR reboot as soon as possible.
ASR reboot begins immediately if there is a bootable set of hardware that can be operated in accordance with constraints imposed by the Sun Fire 15K system to assure safe and correct operation.
The following provides a little more detail about each type of environmental event that can occur on the Sun Fire 15K system.
esmd monitors temperature measurements from Sun Fire 15K hardware for values that are too high. There is a critical temperature threshold that, if exceeded, is handled as quickly as possible by powering off the affected hardware. High, but not critical, temperatures are handled by attempting slower recovery actions.
There is very little opportunity to do anything when a full power failure occurs. The entire platform, domains as well as SCs, is shut off when the plug is pulled, without the benefit of a graceful shutdown. The ultimate recovery action occurs when power is restored (see Power-On Self-Test (POST)).
Sun Fire 15K power voltages are monitored to detect out-of-range events. The handling of out-of-range voltages follows the general principles outlined at the beginning of Environmental Events.
Adequate power is checked before any boards are powered on, as mentioned in Power Control. In addition, the failure of a power supply could leave the server inadequately powered. The system is equipped with power supply redundancy in the event of failure. esmd does not take any action (other than logging) in response to a bulk power supply hardware failure. The handling of underpower events follows the general principles outlined at the beginning of Environmental Events.
esmd monitors fans for continuing operation. Should a fan fail, a fan failure event occurs. The handling of fan failures follows the general principles outlined at the beginning of Environmental Events.
As described in Hardware Error Status, the occurrence of Sun Fire 15K hardware errors is recognized at the SC by more than one mechanism. Of the errors that are directly visible to the SC, some are reported directly by PCI interrupt to the UltraSPARC IIi processor on the SC, and others are detected only through monitoring of the Sun Fire 15K hardware registers.
Other hardware errors are detected by the processors running in a domain. Domain software detects the occurrence of those errors and reports them to the SC. Like the mechanism by which the SC becomes aware of the occurrence of a hardware error, the error state retained by the hardware after a hardware error is dependent upon the specific error.
dsmd implements the mechanisms necessary to detect all SC-visible hardware errors.
dsmd implements domain software interfaces to accept reports of domain-detected hardware errors.
dsmd collects hardware error data and clears the error state.
dsmd logs the hardware error and related information as required, to the platform message log.
dsmd logs the hardware error to the domain message log file for all affected domain(s).
Data collected in response to a hardware error that is not suitable for inclusion in a log file may be saved in uniquely named file(s) in /var/opt/SUNWSMS/adm/domain_id/dump on the SC.
SMS illuminates the fault indicator LED on any hot-pluggable unit that can be identified as the cause of a hardware error.
The actions taken in response to hardware errors (other than collecting and logging information as described above) are twofold. First, it may be possible to eliminate the further occurrence of certain types of hardware errors by eliminating from use the hardware identified to be at fault.
Second, all domains that either crashed as a result of a hardware error or were shut down as a consequence of the first type of action are subject to ASR reboot actions.
In response to each detected hardware error and each domain-software-reported hardware error, dsmd undertakes corrective actions.
ASR reboot with full POST verification will be initiated for each domain brought down by a hardware error or subsequent actions taken in response to that error.
Note - Problems with the ASR reboot of a domain after a hardware error are detected as domain boot failure events and are subject to the recovery actions described in Domain Boot Failure.
dsmd logs all significant actions taken in response to a hardware error (those beyond logging information or updating user displays) in the platform log file. When a hardware error affects one or more domains, dsmd logs the significant response actions in the message log files of the affected domain(s).
The following sections summarize the types of hardware errors expected to be detected/handled on the Sun Fire 15K system.
Domain stops are uncorrectable hardware errors that immediately terminate the affected domain(s). Hardware state dumps are taken before dsmd initiates an ASR reboot of the affected domain(s). These files are located in /var/opt/SUNWSMS/adm/domain_id/dump. dsmd logs the event in the domain log file.
A RED_state or Watchdog reset traps to low-level domain software (OpenBoot PROM or kadb), which reports the error and requests initiation of ASR reboot of the domain.
An XIR signal (reset -x) also traps to low-level domain software (OpenBoot PROM or kadb), which retains control of the software. The domain must be rebooted manually.
Correctable data transmission errors (for example, parity errors) can stop the normal transaction history recording feature of Sun Fire 15K ASICs. SMS reports a transmission error as a record stop. SMS dumps the transaction history buffers of the Sun Fire 15K ASICs and re-enables transaction history recording when a record stop is handled. dsmd records record stops in the domain log file.
ASIC-detected hardware failures other than domain stop or record stop include console bus errors, which may or may not impact a domain. The hardware itself will not abort any domain, but the domain software may not survive the impact of the hardware failure and may panic or hang. dsmd logs the event in the domain log file.
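The hardware error types above and their responses can be collected into a lookup table. This is only a summary sketch: the keys, the step strings, and the function names are hypothetical labels for the behavior described in this section.

```python
# Hypothetical labels summarizing the responses described in this section.
HARDWARE_ERROR_RESPONSES = {
    "domain_stop": ("take hardware state dump", "ASR reboot"),
    "red_state_or_watchdog": ("ASR reboot",),  # requested by domain software
    "xir": ("manual reboot",),                 # low-level software keeps control
    "record_stop": ("dump transaction history", "re-enable history recording"),
    "other_asic_failure": ("log event",),      # hardware aborts no domain
}


def response_for(error_kind):
    """Return the ordered response steps for a hardware error type."""
    try:
        return HARDWARE_ERROR_RESPONSES[error_kind]
    except KeyError:
        raise ValueError(f"unknown hardware error type: {error_kind}") from None


def reboots_automatically(error_kind):
    """True if the response includes an ASR reboot of the domain."""
    return "ASR reboot" in response_for(error_kind)
```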
SMS monitors the main SC hardware and running software status as well as the hardware and running software of the spare SC, if present. In a high-availability SC configuration, SMS handles failures of the hardware or software on the main SC or failures detected in the hardware control paths (for example, console bus, or internal network connections) to the main SC by an automatic SC failover process. This cedes main responsibilities to the spare SC and leaves the former main SC as a (possibly crippled) spare.
SMS monitors the hardware of the main and spare SCs for failures.
SMS logs the hardware failure and related information to the platform message log.
SMS illuminates the fault indicator LED on a system controller with an identified hardware failure.
For more information, see SC Failover.
Copyright © 2002, Sun Microsystems, Inc. All rights reserved.