InfoDoc ID   Synopsis   Date
46780   Instructions on how to gather data from a hung Sun Fire[TM] domain.   13 Dec 2002

Status Issued

Description
Instructions on how to gather data from a hung Sun Fire[TM] domain.

1. Ensure that the domain is actually hung:

        - Can you ping the domain?
        - Can you telnet to the domain?
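
The telnet check above can be scripted. This is a minimal sketch, not part of the original procedure: the function name tcp_port_open and the default port 23 (telnet) are assumptions made for illustration. A successful connection only shows that the domain still answers TCP on that port, and a failure can also be caused by a network problem, so treat the result as a hint rather than proof of a hang.

```python
import socket


def tcp_port_open(host, port=23, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: port 23 is assumed because the procedure asks
    whether you can telnet to the domain.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, tcp_port_open("domain-a.example.com") returns False when nothing answers on the telnet port; the hostname here is a placeholder.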

2. Ensure that the SC (System Controller) is not hung: log in to the SC and
   obtain a platform shell.

        A. If you get to the platform shell, run the following commands:

                SCname:SC> showlogs
                SCname:SC> showplatform

        B. If the SC is hung, try the secondary SC (if available).
           Since only the primary SC has access to the domains,
           this leaves few debugging options:

                - Obtain the output of showlogs.

                - If possible (this depends on the SC firmware), issue a
                  reset to the primary SC.

        C. If the SC is still hung, the reset button on the primary SC
           must be pushed. Then go back to step 2A.

3. Once in the platform shell, attempt to get a domain shell:

        console -d <domainID>

   - If the command appears to hang, send a break signal to the domain:

        - If you are using telnet:
          Press: CTRL-]
          At the telnet> prompt, type: send break

        - If you are connected to the SC via tip, type:
          ~#

     At this point you should have a domain shell prompt. If so, continue
     with the following commands; otherwise, continue to step 4.

   - If you get the domain shell, run the following commands:
        SCname:A> showdomain -p status
        SCname:A> showlogs
     Then type break to get to the OBP. If this takes you to the ok prompt,
     type sync to force a core file.

4. If you were not able to get to the ok prompt, then the system is really
   hung and we will need to send an XIR (externally initiated reset) to
   the domain.

        From the domain shell, type: reset

   The behavior of this command depends on the setting of the OBP
   variable error-reset-recovery:

        - sync: the system attempts to take a core file.
        - boot: the system just reboots, as if the boot command had been
          issued at the ok prompt.
        - none: the system should drop you to the ok prompt.

   At the ok prompt, run the following commands. The '#' sign represents
   the CPU that took the XIR; use that number in the cbuf command. If
   possible, run this command on each of the CPUs (some commands depend
   on the firmware level of the SC):

        {#} ok dump-sigblock
        {#} ok # cbuf
        {#} ok .xir-state-all

   - If you were not able to return to the ok prompt but have a domain
     shell prompt, type the following command:

        SCname:A> showresetstate
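
The effect of error-reset-recovery described in step 4 can be summarized as a small lookup. This is an illustrative sketch only: the names RESET_BEHAVIOR, expected_reset_behavior, and can_collect_obp_data are hypothetical, and the descriptions simply restate the text above rather than any firmware interface.

```python
# What step 4 says happens after the XIR for each error-reset-recovery value.
RESET_BEHAVIOR = {
    "sync": "attempts to take a core file",
    "boot": "reboots as if boot had been typed at the ok prompt",
    "none": "should drop to the ok prompt",
}


def expected_reset_behavior(setting):
    """Return the documented outcome for a given error-reset-recovery value."""
    key = setting.strip().lower()
    if key not in RESET_BEHAVIOR:
        raise ValueError(f"unknown error-reset-recovery value: {setting!r}")
    return RESET_BEHAVIOR[key]


def can_collect_obp_data(setting):
    """True only when the XIR is expected to leave the domain at the ok prompt,
    which is where dump-sigblock, cbuf, and .xir-state-all are run."""
    return setting.strip().lower() == "none"
```

Checking the variable before sending the XIR tells you in advance whether to expect a core file, a reboot, or the ok prompt.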

5. If none of these tactics work, you may be forced to power off the
   domain. In that case, do a setkeyswitch off for the domain.

Keywords: StarCat, Star, Cat, SC, SunFire, kernel, XIR      
INTERNAL SUMMARY:
                    
Author: Christine Perrigo
        Kernel Technical Support Engineer
        Sun Enterprise Services
        (MS) UBRM04-125
        500 Eldorado Blvd.
        Broomfield, CO. 80021
        E-mail: christine.perrigo@sun.com
        Phone: 303.464.4521                                        
SUBMITTER: Chris Wagner
APPLIES TO: AFO Vertical Team Docs/Kernel
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.