SRDB ID | Synopsis | Date |
48458 | Sun Fire[TM] 15K: Domain heartbeat failed | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: The domain heartbeat fails and dsmd issues an XIR to recover the domain.

- Symptoms: With this type of domain heartbeat failure, the "Forcing OS to panic" routine fails, the dsmd recovery action produces little useful information, and the dsmd recovery action only runs level 7 hpost diagnostics. The following messages relating to the failure appear in the domain messages file:

    Apr 29 11:56:17 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain heartbeat failed in state (K:20307: 1: 0).
    Apr 29 11:56:17 2002 f15k1sc1-hme0 dsmd[17690]-K(): Forcing OS to panic
    Apr 29 11:56:27 2002 f15k1sc1-hme0 dsmd[17690]-K(): Put Mailbox Message failed 1141
    Apr 29 11:56:37 2002 f15k1sc1-hme0 dsmd[17690]-K(): Force OS panic timed out.
    Apr 29 11:56:37 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain K OS is hung, aborting and rebooting domain.
    Apr 29 11:56:49 2002 f15k1sc1-hme0 dsmd[17690]-K(): Sending XIR to every CPU in domain, rc = 0
    Apr 29 11:56:50 2002 f15k1sc1-hme0 dsmd[17690]-K(): Taking CPU registers and IOSRAM domain data dump.
    Apr 29 11:56:50 2002 f15k1sc1-hme0 dsmd[17690]-K(): XIR dump: /var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.dump.020429.1156.50
    Apr 29 11:56:51 2002 f15k1sc1-hme0 dsmd[17690]-K(): Taking hardware configuration dump. Dump file: -D/var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.hwconfig.020429.1156.50
    Apr 29 11:59:47 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain K running OS - Solaris hostname is leo.

  As noted in the messages, the mailbox communication to force the bootproc to panic times out. Because dsmd determines that the domain is hung, it sends an XIR to the domain processors. A dsmd.dump file and a dsmd.hwconfig file are created.

  Running

    showxirstate -f dsmd.dump.020429.1156.50

  produces a cpu register dump of all the processors in the domain, but no immediately useful information is available. Using redx to examine the dsmd.hwconfig.020429.1156.50 file reports "0 errors occurred while creating this dump." and "No components would be failed based on this state."

SOLUTION SUMMARY:
- Troubleshooting: Ideally, this failure occurs fairly consistently. The purpose of this procedure is to gather additional data (a corefile and cpu signature states) that is relevant to the failure.

  A. Disable the dsmd recovery action for the domain:

     - Copy $SMSETC/config/dsmd_tuning.txt to $SMSETC/config/[A-R]/dsmd_tuning.txt. The owner and permissions must be maintained in order for the dsmd daemon to initialize the changed parameters. The permissions of the $SMSETC/config/[A-R] directory must be:

         drwxrwx---+ 2 root bin 512 Mar 27 17:09 K

       The owner and permissions of the dsmd_tuning.txt file must be:

         -rw-r--r-- 1 root bin 1326 Dec 1 01:22 dsmd_tuning.txt

     - Make the following changes in the domain dsmd_tuning.txt file:

         obp_heartbeat_time = 1200
         os_heartbeat_time = 1200
         domain_asr = 0

     At this point, you must stop and restart SMS on the MAINSC (as root user):

         /etc/init.d/sms stop
         /etc/init.d/sms start

     NOTE: When you disable asr (domain_asr=0), a reset from obp, whether manual or automatic, will not come back to obp. A Solaris reboot will not work either. You must setkeyswitch off/on.

  B. Download showcpusig from http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/showcpusig.html. Install it on the MAINSC and apply execute permissions.

  C. Make sure dumpadm is set up to capture a core file (enough disk space, etc.).

  D. On the next occurrence, the domain should noticeably hang and remain in that state. It is now important to inform the SSE or customer to execute the showcpusig script on the MAINSC. The showcpusig output should be useful in determining which cpu is not responding.

         sms-svc> ./showcpusig -d K

  E. Interrupt the domain console in order to force a panic and a corefile from OBP. This can be done by entering the key sequence '~#' in the console window and typing sync at the OK> prompt.
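The three dsmd_tuning.txt changes from step A can be sanity-checked before SMS is restarted. The following is only a minimal Python sketch, not an SMS tool: the function names are hypothetical, and it assumes the file consists of "name = value" lines as shown in step A and that "#" starts a comment, which the article does not confirm.

```python
# Settings the article asks for in the per-domain dsmd_tuning.txt.
EXPECTED = {
    "obp_heartbeat_time": "1200",
    "os_heartbeat_time": "1200",
    "domain_asr": "0",
}


def parse_tuning(text):
    """Parse 'name = value' lines into a dict.

    Assumption: '#' starts a comment and blank lines are ignored;
    the real dsmd_tuning.txt syntax may differ.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def check_tuning(text, expected=EXPECTED):
    """Return {name: (found, wanted)} for each setting that does not match."""
    found = parse_tuning(text)
    return {
        name: (found.get(name), wanted)
        for name, wanted in expected.items()
        if found.get(name) != wanted
    }


if __name__ == "__main__":
    sample = (
        "obp_heartbeat_time = 1200\n"
        "os_heartbeat_time = 1200\n"
        "domain_asr = 1\n"
    )
    # Reports domain_asr as a mismatch because the article requires 0.
    print(check_tuning(sample))
```

An empty result means all three parameters carry the values listed in step A. The sketch does not check the owner or permissions of the file or directory, which must still be verified as described above.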
- Resolution: This is a temporary workaround that changes the behaviour of the dsmd daemon in order to gather additional data and aid in resolving the problem listed in the problem statement. Once this data is gathered and the problem is understood, you are required to remove the dsmd_tuning.txt settings and restart SMS on the MAINSC.

  The resolution to the domain heartbeat failed problem will require analysis of the showcpusig output and the vmcore file. That is beyond the scope of this article. However, see the references section below for the meaning of known cpu signature states.

- References and bug IDs
- Summary of part number and patch ID's

  Bug 4658538: reboot "fails" if ASR=0 for domain/platform

  Additional information regarding the showcpusig program can be found at the URL: http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/showcpusig.html. The output shows the signature state of the domain (as reported by the showplatform command), the heartbeat of the domain, and the individual cpu signature states.

  The signature state is decoded as follows:

    4f530100    Solaris / Run / Null

  The first 4 digits, starting from the left, are decoded in the following table:

    4f42 = OBP
    4f53 = Solaris
    4442 = Debug

  The next 2 digits are decoded in the following table:

    00 = Non
    01 = Run
    02 = Exit
    03 = Prerun
    04 = Domain Stop
    05 = Reset
    06 = Power Off
    07 = Detached
    08 = Callback
    09 = Offline
    10 = Booting
    11 = Unknown
    12 = Error Reset
    13 = Error Reset Sync
    14 = Quiesced
    15 = Quiesce In Progress
    16 = Resume In Progress
    17 = Init
    18 = Loading

  The last 2 digits are decoded in the following table:

    00 = Null
    01 = Halt
    02 = Environment
    03 = Reboot
    04 = Panic
    05 = Panic Con
    06 = Hung
    07 = Watch
    08 = Panic Reboot
    09 = Error Reset Reboot
    10 = OBP Reset
    11 = Debug
    12 = Dump
    13 = Failed
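The signature tables above can be applied mechanically. The following is a minimal Python sketch of such a decoder, not part of showcpusig: the function and table names are hypothetical, and the two-digit codes are matched exactly as they are written in the tables, since the article does not say whether they are hexadecimal or decimal.

```python
# Lookup tables copied from the article's signature-state tables.
OS_TYPE = {"4f42": "OBP", "4f53": "Solaris", "4442": "Debug"}

STATE = {
    "00": "Non", "01": "Run", "02": "Exit", "03": "Prerun",
    "04": "Domain Stop", "05": "Reset", "06": "Power Off",
    "07": "Detached", "08": "Callback", "09": "Offline",
    "10": "Booting", "11": "Unknown", "12": "Error Reset",
    "13": "Error Reset Sync", "14": "Quiesced",
    "15": "Quiesce In Progress", "16": "Resume In Progress",
    "17": "Init", "18": "Loading",
}

SUBSTATE = {
    "00": "Null", "01": "Halt", "02": "Environment", "03": "Reboot",
    "04": "Panic", "05": "Panic Con", "06": "Hung", "07": "Watch",
    "08": "Panic Reboot", "09": "Error Reset Reboot",
    "10": "OBP Reset", "11": "Debug", "12": "Dump", "13": "Failed",
}


def decode_signature(sig):
    """Split an 8-digit signature into (OS, state, sub-state) names.

    Codes that are not in the article's tables are reported as
    'unknown (<code>)' rather than guessed.
    """
    sig = sig.strip().lower()
    if sig.startswith("0x"):
        sig = sig[2:]
    if len(sig) != 8:
        raise ValueError("expected 8 digits, got %r" % sig)
    os_code, state_code, sub_code = sig[:4], sig[4:6], sig[6:8]
    return (
        OS_TYPE.get(os_code, "unknown (%s)" % os_code),
        STATE.get(state_code, "unknown (%s)" % state_code),
        SUBSTATE.get(sub_code, "unknown (%s)" % sub_code),
    )


if __name__ == "__main__":
    # The article's own example signature.
    print(" / ".join(decode_signature("4f530100")))
```

For the article's example, 4f530100, the sketch yields Solaris / Run / Null, matching the decoding shown above.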
- Additional background information: Some other information from the dsmd.dump file may be useful in determining the type of hang experienced and other appropriate actions to take. For example, check the context of esc 536982. From the dsmd.dump file, it was hypothesized that the domain had hung on a clock thread, resulting in the heartbeat failure. By enabling the deadman kernel, in addition to the actions listed above, a corefile was obtained and the problem was root caused to a missing SRM patch. See the escalation for enabling the deadman kernel.

    sms-svc> showxirstate -f dsmd.dump.020429.1156.50 |more
    ver      : 003E0015.21000507 US3+_2.1 EPIC6cu
    tba      : 00000000.10000000
    pil      : 0xB      <<-----------------!!! PROCESSOR INTERRUPT LEVEL
    y        : 00000000.00000000
    afsr     : 00000000.00000000
    afar     : 00000402.CA001F00
    afsr2    : 00000000.00000000
    afar2    : 00000402.CA001F00
    pcontext : 00000000.00000000
    scontext : 00000000.00000000
    dcu      : 00000200.00000000
    dcr      : 00000000.0000103F
    pcr      : 00000000.00000000
    gsr      : 00000000.00000000
    softint  : 0x0400   <<-----------------!!! INTERRUPTS PENDING REGISTER

- Meta-Data/Problem categorization:
    Product/Platform: SF15K
    Category:

- Keywords: 15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K, hang, heartbeat, XIR, dsmd, deadman, showxirstate
INTERNAL SUMMARY:
SUBMITTER: Gino Valencia
BUG REPORT ID: 4658538
APPLIES TO: Hardware/Sun Fire /15000
ATTACHMENTS: