SRDB ID   Synopsis   Date
48458   Sun Fire[TM] 15K: Domain heartbeat failed   31 Oct 2002

Status Issued

Description
- Problem Statement:

  Domain heartbeat failed and DSMD issues an XIR to recover
     
- Symptoms:

  With this type of domain heartbeat failure, the "forced OS to panic" routine fails,
  the dsmd recovery action produces little useful information, and the recovery
  action only runs level 7 hpost diagnostics.
  
  These relevant messages will appear in the domain messages file relating to the failure:

Apr 29 11:56:17 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain heartbeat failed in state (K:20307: 1: 0).
Apr 29 11:56:17 2002 f15k1sc1-hme0 dsmd[17690]-K(): Forcing OS to panic
Apr 29 11:56:27 2002 f15k1sc1-hme0 dsmd[17690]-K(): Put Mailbox Message failed 1141
Apr 29 11:56:37 2002 f15k1sc1-hme0 dsmd[17690]-K(): Force OS panic timed out.
Apr 29 11:56:37 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain K OS is hung, aborting and rebooting domain.
Apr 29 11:56:49 2002 f15k1sc1-hme0 dsmd[17690]-K(): Sending XIR to every CPU in domain, rc = 0
Apr 29 11:56:50 2002 f15k1sc1-hme0 dsmd[17690]-K(): Taking CPU registers and IOSRAM domain data dump.
Apr 29 11:56:50 2002 f15k1sc1-hme0 dsmd[17690]-K(): XIR dump: /var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.dump.020429.1156.50
Apr 29 11:56:51 2002 f15k1sc1-hme0 dsmd[17690]-K(): Taking hardware configuration dump. Dump file:
-D/var/opt/SUNWSMS/SMS1.2/adm/K/dump/dsmd.hwconfig.020429.1156.50
Apr 29 11:59:47 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain K running OS - Solaris hostname is leo.

  As the messages show, the mailbox communication that forces the bootproc to panic times out.
  dsmd then determines that the domain is hung and sends an XIR to the domain processors.
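
  The message sequence above can be checked offline with a short script. The following
  is a minimal Python sketch, not an SMS tool: the line layout in LINE_RE is inferred
  from the sample messages in this article only, and the names parse_dsmd_line and
  outage_seconds are hypothetical. It measures the time from the heartbeat failure to
  the "running OS" message.

```python
import re
from datetime import datetime

# Layout inferred from the sample messages in this article (an assumption,
# not taken from an SMS specification).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) dsmd\[(?P<pid>\d+)\]-(?P<domain>[A-R])\(\): (?P<msg>.*)$"
)

# Four of the messages quoted above.
SAMPLE = [
    "Apr 29 11:56:17 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain heartbeat failed in state (K:20307: 1: 0).",
    "Apr 29 11:56:37 2002 f15k1sc1-hme0 dsmd[17690]-K(): Force OS panic timed out.",
    "Apr 29 11:56:49 2002 f15k1sc1-hme0 dsmd[17690]-K(): Sending XIR to every CPU in domain, rc = 0",
    "Apr 29 11:59:47 2002 f15k1sc1-hme0 dsmd[17690]-K(): Domain K running OS - Solaris hostname is leo.",
]


def parse_dsmd_line(line):
    """Return (timestamp, domain, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # %b expects English month abbreviations (C/English locale).
    ts = datetime.strptime(m.group("ts"), "%b %d %H:%M:%S %Y")
    return ts, m.group("domain"), m.group("msg")


def outage_seconds(lines):
    """Seconds from the first heartbeat failure to the next 'running OS' message."""
    start = end = None
    for line in lines:
        parsed = parse_dsmd_line(line)
        if parsed is None:
            continue
        ts, _domain, msg = parsed
        if start is None and msg.startswith("Domain heartbeat failed"):
            start = ts
        elif start is not None and "running OS" in msg:
            end = ts
            break
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    print(outage_seconds(SAMPLE))
```

  Lines that do not match the assumed layout (for example, the wrapped dump file path)
  are skipped rather than treated as errors.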
 
  A dsmd.dump and a dsmd.hwconfig file are created. 

  Using showxirstate -f dsmd.dump.020429.1156.50 produces a cpu register dump of all the 
  processors in the domain. No immediately useful information is available.
  
  Using redx to examine the dsmd.hwconfig.020429.1156.50 file reports "0 errors occurred 
  while creating this dump." and "No components would be failed based on this state."      
SOLUTION SUMMARY:
- Troubleshooting:

  This procedure is most useful when the failure recurs fairly consistently. Its purpose
  is to gather additional data (a corefile and cpu signature states) that is relevant to
  the failure.
  
   A. Disable the dsmd recovery action for the domain:
   
     - Copy $SMSETC/config/dsmd_tuning.txt to $SMSETC/config/[A-R]/dsmd_tuning.txt,
       where [A-R] is the domain ID. The owner and permissions must be preserved
       so that the dsmd daemon can initialize the changed parameters. The
       permissions of the $SMSETC/config/[A-R] directory must be:
       
       drwxrwx---+  2 root     bin          512 Mar 27 17:09 K
       
       Also, the owner and permissions of the dsmd_tuning.txt file must be:
       
       -rw-r--r--   1 root     bin         1326 Dec  1 01:22 dsmd_tuning.txt

     - Make the following changes in the domain dsmd_tuning.txt file:
     
        obp_heartbeat_time      = 1200 
        os_heartbeat_time       = 1200
        domain_asr = 0
        
        Then stop and restart SMS on the MAINSC (as root user):
        
        /etc/init.d/sms stop
        /etc/init.d/sms start

      NOTE: With ASR disabled (domain_asr = 0), a reset from OBP, whether
      manual or automatic, will not return to OBP, and a Solaris reboot
      will not work either (see bug 4658538). You must use
      setkeyswitch off/on instead.
        
   B. Download showcpusig from http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/showcpusig.html.
      Install on the MAINSC and apply execute permissions. 
      
   C. Make sure dumpadm is set up on the domain to capture a core file and that
      enough disk space is available to save it.

   D. On the next occurrence, the domain should noticeably hang and remain in that state. 
      At that point, have the SSE or customer run the showcpusig script on the MAINSC.
      The showcpusig output should help determine which cpu is not responding.
      
      sms-svc> ./showcpusig -d K 
      
      
   E. Interrupt the domain console to force a panic and a corefile from OBP:
      enter the key sequence '~#' in the console window, then type sync at the
      ok prompt.
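
   The dsmd_tuning.txt changes in step A can also be prepared with a script. The
   following is a minimal Python sketch that assumes the file consists of
   "key = value" lines, as in the three settings shown in step A; the function name
   apply_overrides and the sample file content are hypothetical, and the sample
   values are placeholders, not the shipped defaults. It only transforms text, so
   the owner and permissions required in step A must still be set separately.

```python
import re

# Settings from step A of this article.
OVERRIDES = {
    "obp_heartbeat_time": 1200,
    "os_heartbeat_time": 1200,
    "domain_asr": 0,
}

# Placeholder content for illustration only (not the shipped defaults).
SAMPLE = """\
# example tuning file
obp_heartbeat_time      = 10
os_heartbeat_time       = 10
"""

# Assumed line format: optional indent, key, '=', value, optional trailer.
SETTING_RE = re.compile(r"^(\s*)([A-Za-z_]\w*)(\s*=\s*)(\S+)(.*)$")


def apply_overrides(text, overrides):
    """Return text with matching 'key = value' lines updated.

    Keys that are not already present are appended at the end.
    Comment lines and unrelated settings are left unchanged.
    """
    remaining = dict(overrides)
    out = []
    for line in text.splitlines():
        m = SETTING_RE.match(line)
        if m and m.group(2) in remaining:
            value = remaining.pop(m.group(2))
            line = f"{m.group(1)}{m.group(2)}{m.group(3)}{value}{m.group(5)}"
        out.append(line)
    for key, value in remaining.items():
        out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(apply_overrides(SAMPLE, OVERRIDES), end="")
```

   Keeping the original spacing around "=" makes the edited file easy to diff against
   the copy in $SMSETC/config when the settings are removed again.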
        

- Resolution:

  This is a temporary workaround to change the behaviour of the dsmd daemon in order to gather 
  additional data and aid in resolving the problem listed in the problem statement. Once this
  data is gathered and the problem is understood, you are required to remove the dsmd_tuning.txt 
  settings and restart SMS on the MAINSC.

  The resolution to the domain heartbeat failed problem will require analysis of the showcpusig
  output and the vmcore file. That is beyond the scope of this article.  However, see the references 
  section below for the meaning of known cpu signature states. 

- References and bug IDs       

- Summary of part number and patch ID's

  Bug 4658538: reboot "fails" if ASR=0 for domain/platform

  Additional information regarding the showcpusig program can be found at the URL:
  http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/showcpusig.html. The output reports the
  signature state of the domain (as reported by the showplatform command), the 
  heartbeat of the domain, and the individual cpu signature states. The signature state
  is decoded as follows:
  
  4f530100    Solaris / Run / Null 

  The first 4 digits from the left are
  decoded in the following table:
  4f42 = OBP 
  4f53 = Solaris 
  4442 = Debug

  The next 2 digits are decoded in the
  following table: 
  00 = None 
  01 = Run 
  02 = Exit 
  03 = Prerun
  04 = Domain Stop 
  05 = Reset
  06 = Power Off 
  07 = Detached 
  08 = Callback 
  09 = Offline
  10 = Booting
  11 = Unknown 
  12 = Error Reset
  13 = Error Reset Sync
  14 = Quiesced 
  15 = Quiesce In Progress 
  16 = Resume In Progress
  17 = Init 
  18 = Loading 

  The last 2 digits are decoded in the
  following table:
  00 = Null        
  01 = Halt
  02 = Environment
  03 = Reboot
  04 = Panic
  05 = Panic Con
  06 = Hung
  07 = Watch
  08 = Panic Reboot
  09 = Error Reset Reboot
  10 = OBP Reset
  11 = Debug
  12 = Dump
  13 = Failed      
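
  The three tables above can be applied mechanically. The following is a minimal
  Python sketch (decode_signature is a hypothetical name, not part of showcpusig).
  It looks up each field by the characters exactly as printed in the tables; whether
  codes above 09 appear in real output in decimal or in hex is not settled by this
  article, so that is an assumption to verify. The three codes in the first table
  are also the ASCII values of "OB", "OS" and "DB".

```python
SOFTWARE = {"4f42": "OBP", "4f53": "Solaris", "4442": "Debug"}

STATE = {
    "00": "None", "01": "Run", "02": "Exit", "03": "Prerun",
    "04": "Domain Stop", "05": "Reset", "06": "Power Off",
    "07": "Detached", "08": "Callback", "09": "Offline",
    "10": "Booting", "11": "Unknown", "12": "Error Reset",
    "13": "Error Reset Sync", "14": "Quiesced",
    "15": "Quiesce In Progress", "16": "Resume In Progress",
    "17": "Init", "18": "Loading",
}

SUBSTATE = {
    "00": "Null", "01": "Halt", "02": "Environment", "03": "Reboot",
    "04": "Panic", "05": "Panic Con", "06": "Hung", "07": "Watch",
    "08": "Panic Reboot", "09": "Error Reset Reboot", "10": "OBP Reset",
    "11": "Debug", "12": "Dump", "13": "Failed",
}


def decode_signature(sig):
    """Split an 8-character signature into (software, state, substate) names.

    Lookups use the codes exactly as printed in the article's tables
    (an assumption; see the note on decimal versus hex above).
    """
    sig = sig.strip().lower()
    if sig.startswith("0x"):
        sig = sig[2:]
    if len(sig) != 8:
        raise ValueError(f"expected 8 hex digits, got {sig!r}")
    software, state, substate = sig[0:4], sig[4:6], sig[6:8]
    return (
        SOFTWARE.get(software, f"unknown ({software})"),
        STATE.get(state, f"unknown ({state})"),
        SUBSTATE.get(substate, f"unknown ({substate})"),
    )


if __name__ == "__main__":
    print(" / ".join(decode_signature("4f530100")))
```

  Unknown codes are reported as "unknown (..)" with the raw characters so that
  nothing from the showcpusig output is silently dropped.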
- Additional background information:

Other information in the dsmd.dump file may help determine the type of hang experienced and
the appropriate actions to take. For an example, see the context of esc 536982. In that
escalation, the dsmd.dump file led to the hypothesis that the domain had hung on a clock
thread, resulting in the heartbeat failure. By enabling the deadman kernel in addition to
the actions listed above, a corefile was obtained and the problem was root caused to a
missing SRM patch. See the escalation for how to enable the deadman kernel. The registers
of interest are marked in the excerpt below:
  
sms-svc> showxirstate -f dsmd.dump.020429.1156.50 |more

ver     : 003E0015.21000507     US3+_2.1  EPIC6cu
tba     : 00000000.10000000
pil     : 0xB                           <<-----------------!!! PROCESSOR INTERRUPT LEVEL
y       : 00000000.00000000
afsr    : 00000000.00000000  afar    : 00000402.CA001F00
afsr2   : 00000000.00000000  afar2   : 00000402.CA001F00
pcontext: 00000000.00000000  scontext: 00000000.00000000
dcu     : 00000200.00000000
dcr     : 00000000.0000103F
pcr     : 00000000.00000000
gsr     : 00000000.00000000
softint : 0x0400                        <<-----------------!!! INTERRUPTS PENDING REGISTER
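
The two marked registers can be compared directly. The following is a minimal Python
sketch with hypothetical function names. It assumes the UltraSPARC SOFTINT layout in
which bits 1 through 15 correspond to pending interrupt levels 1 through 15, and the
rule that a level is only taken when it is higher than the current PIL; both
assumptions should be checked against the processor manual. With the values above,
level 10 is pending while PIL is 11, so it cannot be delivered, which would fit the
clock thread hypothesis if the clock interrupt runs at level 10 on this kernel.

```python
import re

# The two marked lines from the excerpt above (annotations removed).
SAMPLE = """\
pil     : 0xB
y       : 00000000.00000000
softint : 0x0400
"""


def read_register(dump_text, name):
    """Return the value of a 'name : 0x...' line as an int, or None if absent."""
    pattern = rf"^\s*{re.escape(name)}\s*:\s*(0x[0-9A-Fa-f]+)"
    m = re.search(pattern, dump_text, re.MULTILINE)
    return int(m.group(1), 16) if m else None


def pending_levels(softint):
    """Interrupt levels 1..15 whose SOFTINT bit is set (assumed bit layout)."""
    return [level for level in range(1, 16) if softint & (1 << level)]


def blocked_levels(softint, pil):
    """Pending levels that are masked because they are not above PIL."""
    return [level for level in pending_levels(softint) if level <= pil]


if __name__ == "__main__":
    pil = read_register(SAMPLE, "pil")
    softint = read_register(SAMPLE, "softint")
    print("pil:", pil)
    print("pending:", pending_levels(softint))
    print("blocked:", blocked_levels(softint, pil))
```

The same check can be repeated for each processor in the showxirstate output to see
whether every cpu shows the same pending level.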
        
        
- Meta-Data/Problem categorization:

Product/Platform: SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K,
hang, heartbeat, XIR, dsmd, deadman, showxirstate      

INTERNAL SUMMARY:

SUBMITTER: Gino Valencia
BUG REPORT ID: 4658538
APPLIES TO: Hardware/Sun Fire /15000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.