InfoDoc ID   Synopsis   Date
41244   Steps for un-hanging Sun Enterprise [TM] 10000 domain.   6 Mar 2002

Status Issued

Description

There are several methods for un-hanging a Sun Enterprise [TM] 10000 domain. Try the options below in the order listed: start with Option 1, move on to Option 2, and then to Option 3 if necessary. Option 1 is the least disruptive of the three, but it is often not usable, since it requires a working netcon connection to the domain.

Before proceeding with any of the options, first verify that the domain is truly hung. Make sure it is unreachable from other nodes on the same subnet (confirm this with ping as well as rlogin or telnet). Once the domain is confirmed "hung" or unreachable, proceed with the following steps. Whenever possible, try to capture a core file so that the source of the hang can be determined.
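The reachability check described above can be partly scripted from another node on the same subnet. The following is a minimal sketch, not part of the original procedure: it only tests whether the standard telnet (23) and rlogin (513) TCP ports accept a connection. The function names are illustrative, and ping is left to the operator. A refused or timed-out connection suggests, but does not prove, that the domain is hung.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def looks_unreachable(host, ports=(23, 513), timeout=5.0):
    """Return True only if none of the given ports accepts a connection.

    Defaults are the standard telnet (23) and rlogin (513) ports.
    """
    return not any(tcp_port_open(host, p, timeout) for p in ports)
```

If looks_unreachable() returns False, at least one service is still answering and the domain is probably slow rather than hung.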

Option 1-Using a netcon session to un-hang a domain:

  1. From the primary SSP, try to open a netcon session to the domain in question.
  2. If you can connect to the domain through netcon, check for any messages that indicate what is happening, or see whether you can check the domain's process table. This helps confirm whether the domain is truly hung or merely slow. If the netcon session is showing messages, the domain is not hung; it is running slowly or is being degraded by some process. If possible, kill the process or application that is "hanging" the domain.
  3. If you can netcon into the domain but cannot kill the hung application or process, force the domain down to OBP. From the netcon session, issue the command ~#. The domain will drop to OBP.
  4. ok> sync (This attempts to dump a core file, which may or may not help in determining why the domain hung in the first place.)
  5. ok> boot

Option 2-If a netcon connection is unavailable, try these steps next:

  1. ssp# sigbcmd -p <boot_proc> obp (Proceed to step 2 if this command fails to bring the domain to OBP. If this sends the domain to OBP, proceed to step 4.)
  2. ssp# sigbcmd -p <different_proc> obp (Proceed to step 3 if this does not work. If this sends the domain to OBP, proceed to step 4.)
  3. ssp# sigbcmd -p <boot_proc> panic (Proceed to Option 3 if this does not work. If this sends the domain to OBP, proceed to step 4.)
  4. ok> sync (This attempts to dump a core file, which may or may not help in determining why the domain hung in the first place.)
  5. ok> boot
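The escalation order in Option 2 can be sketched as a small driver. This is only an illustration under stated assumptions: the function names are hypothetical, the processor numbers are placeholders for <boot_proc> and <different_proc>, and both running a command and deciding whether the domain reached OBP are supplied by the caller (in practice, the operator watching the console).

```python
def option2_commands(boot_proc, different_proc):
    """Return the Option 2 SSP commands in the order they should be tried."""
    return [
        ["sigbcmd", "-p", str(boot_proc), "obp"],
        ["sigbcmd", "-p", str(different_proc), "obp"],
        ["sigbcmd", "-p", str(boot_proc), "panic"],
    ]


def escalate(commands, run, reached_obp):
    """Try each command in turn and stop at the first that reaches OBP.

    run(argv) executes one command; reached_obp() reports whether the
    domain is now at the ok> prompt. Both are caller-supplied assumptions.
    Returns the argv that worked, or None if every command failed.
    """
    for argv in commands:
        run(argv)
        if reached_obp():
            return argv
    return None
```

A return value of None corresponds to moving on to Option 3; any other result corresponds to continuing with sync and boot at the ok> prompt.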

Option 3-Last resort if the other options do not work:

  1. ssp# hostint -p <boot_proc> (If this is successful, the domain will panic, dump core, and then reboot. If it is not successful, proceed to the next step.)
  2. ssp# hostreset (The environment variable SUNW_HOSTNAME must be set to the name of the domain.) This should reset the domain in question. It does not automatically initiate a reboot, so it may be necessary to issue a sync and then a boot. In the worst case, you may have to bring up the domain after issuing the hostreset command.
  3. ssp# bringup -A off (Again, make sure that the variable SUNW_HOSTNAME is set to the name of the domain in question.) This will bring the domain to OBP. If you are forced to use the bringup procedure, no core file will be dumped, which limits the ability to troubleshoot the cause of the hang.
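Because hostreset and bringup act on whichever domain SUNW_HOSTNAME names, it can help to build the environment explicitly before running them. The sketch below is an assumption-laden illustration, not an SSP tool: the function names and the domain name used in the comment are hypothetical, and the commands are only assembled here, not executed.

```python
import os


def domain_env(domain, base_env=None):
    """Return a copy of the environment with SUNW_HOSTNAME set to domain."""
    env = dict(os.environ if base_env is None else base_env)
    env["SUNW_HOSTNAME"] = domain
    return env


def option3_commands(boot_proc):
    """Return the Option 3 SSP commands in the order they should be tried."""
    return [
        ["hostint", "-p", str(boot_proc)],
        ["hostreset"],
        ["bringup", "-A", "off"],
    ]


# Hypothetical use on the SSP, one command at a time:
#   subprocess.run(argv, env=domain_env("domain_a"))
```

Passing the environment per command avoids accidentally resetting a different domain because of a stale SUNW_HOSTNAME in the login shell.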

INTERNAL SUMMARY:

SUBMITTER: Joshua Freeman
APPLIES TO: Hardware/Ultra Enterprise, Hardware/Ultra Enterprise/Servers/Enterprise 10000, AFO Vertical Team Docs/HAS
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.