SRDB ID: 18300
SYNOPSIS: Troubleshooting hung E10K domains
DATE: 23 Jan 1999
I think my domain is hung. How can I tell?
Are there any commands I should run that will help collect information
on the reason for the hang?
SOLUTION SUMMARY:
Hung Domains
------------------------------------------------------------
To recover from a hung domain, you must be logged into the SSP as user ssp
with two log-in sessions. Both log-in sessions must have their environment pointed
at the correct domain. Use the 'domain_status' and 'domain_switch' commands
to set up the environment. Make one session the system console for that
domain with the 'netcon' command. Use the other for all SSP commands.
To Determine if a Domain is Hung
1. Verify that 'netcon' will respond to a carriage return and that
you can ping the domain from the SSP.
If you cannot perform these functions, you have either a system problem
or a hung domain.
a. System problems can be confirmed by checking the power status
and 'hostview' warnings.
b. A hung domain can be confirmed by issuing a 'telnet' to the
domain.
2. If the domain still responds, use Unix commands to determine the cause
of sluggish behavior. Use 'ps -elf' to look for slow processes, 'df -lk' to
check file system usage, and 'who' to see who is currently logged in. Match
those users against the 'ps' output to see what processes they are running.
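The three commands in step 2 can be captured in one file for later review.
The following is a minimal sketch only: the function name snapshot_domain and
the "== command ==" log layout are illustrative assumptions, not part of
Solaris or the SSP software.

```shell
#!/bin/sh
# Sketch: capture the output of the step 2 commands in a single file.
# snapshot_domain and its log layout are illustrative, not a Sun-supplied tool.
snapshot_domain() {
    out=$1
    : > "$out" || return 1
    for cmd in "ps -elf" "df -lk" "who"; do
        echo "== $cmd ==" >> "$out"
        # Record a failure instead of aborting, so one bad command
        # does not lose the output of the others.
        $cmd >> "$out" 2>&1 || echo "(command failed: $cmd)" >> "$out"
    done
}
```

Run it on the domain, for example as 'snapshot_domain /tmp/snapshot.txt',
and keep the file with the rest of the data collected for the hang.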
Useful 'netcon' toggles
~? = Show status and communication path
~= = toggle the communication path between network and JTAG
~. = exit out of netcon
~# = L1-A or Stop-A (i.e., drop to OBP)
~@ = get write permission
~* = kill all netcon sessions but yours
To Recover from a Hung Domain
------------------------------------------------------------
1. Issue the following SSP commands and save the output:
'domain_status '
'check_host -v '
'hostinfo -h '
'hostinfo -S '
Run the last command three times, waiting a few seconds between
runs.
NOTES: 'hostinfo -h' shows which boards the platform has.
'hostinfo -S' shows the "heartbeat" from each processor's
"Signature Block". Looking only at the boards in your domain
and only at the processors currently configured, you should see
the "heartbeat" number increment over time. If it does not change,
that domain is dead.
2. Attempt to force the domain into OBP by typing:
'sigbcmd -f -p[processor id] obp '
See NOTE 1 below for how to use the -p option.
Observe the netcon session activity. If you see the OBP ok> prompt, then
the sigbcmd was successful. Allow a few minutes for this command to
run. If it does not work, repeat step 2 with other processors from
the hung domain.
a. If the 'sigbcmd' worked, issue the following sequence of commands.
When finished, save the contents of the window buffer to a file, or
cut and paste them to a file. This data is useful in analyzing the cause
of the hang condition.
'ctrace'
This gives you a trace from before the drop to OBP. Symbols will not be
available if the kernel is a non-debug kernel.
'.registers '
This command gives you a global register dump from the time of entering OBP.
'.locals '
This command gives you a local register dump from the time of entering OBP.
'sync '
sync issues a callback to the kernel to get a core dump. The system
should dump core and reboot after you issue this OBP command.
b. If the 'sigbcmd' command did not work, attempt to force a system
panic with the 'hostint' command on the SSP.
c. If the 'hostint' command does not work, try the 'sigbcmd panic' command.
This is a more forceful version of the 'hostint' command.
3. When all else fails, issue a 'bringup' command to restore the domain to
operation.
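The data collection in step 1 can be scripted so that the output is saved
the same way every time. The following is a sketch only, meant to be run as
user ssp in the SSP command session. The function name collect_hang_data, the
log layout, and the SAMPLE_DELAY default of 5 seconds are assumptions. The
commands and options are the ones listed in step 1.

```shell
#!/bin/sh
# Sketch: run the step 1 commands and append their output to one log file.
# collect_hang_data, the log layout, and SAMPLE_DELAY are illustrative names.
SAMPLE_DELAY=${SAMPLE_DELAY:-5}   # seconds between 'hostinfo -S' runs (assumed)

collect_hang_data() {
    log=$1
    {
        echo "== domain_status =="
        domain_status
        echo "== check_host -v =="
        check_host -v
        echo "== hostinfo -h =="
        hostinfo -h
        for n in 1 2 3; do
            echo "== hostinfo -S (sample $n) =="
            hostinfo -S
            # Pause between samples so the heartbeat counters can change.
            if [ "$n" -lt 3 ]; then
                sleep "$SAMPLE_DELAY"
            fi
        done
    } >> "$log" 2>&1
}
```

After running 'collect_hang_data /tmp/hang.log', compare the three
'hostinfo -S' samples by eye for the boards in the hung domain. The script
only gathers the output and does not interpret the heartbeat values.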
==========
NOTE 1: The -p option selects a processor. You must choose a processor from
the hung domain, NOT from an active domain. It is also best to try a
processor other than the boot processor.
To get the board numbers, run 'domain_status':
DOMAIN   TYPE                     PLATFORM   OS    SYSBDS
sun      Ultra-Enterprise-10000   test       2.6   0 2 4
Example of processor numbers for a domain with boards 0, 2, and 4:
Processors for board 0: 0 1 2 3
Processors for board 2: 8 9 10 11
Processors for board 4: 16 17 18 19
board# x 4 = starting proc for that board
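The board-to-processor rule can be checked with a few lines of shell. This is
a sketch only: board_procs is an illustrative helper, not an SSP command, and
it assumes four processors per board, as in the example above.

```shell
#!/bin/sh
# Sketch: list the processor IDs on a system board, using the rule
# "board# x 4 = starting proc" and four processors per board.
# board_procs is an illustrative helper, not an SSP command.
board_procs() {
    start=$(( $1 * 4 ))
    echo "$start $(( start + 1 )) $(( start + 2 )) $(( start + 3 ))"
}

# Print the processors for each board in the example domain (0, 2, and 4).
for board in 0 2 4; do
    echo "Processors for board $board: $(board_procs "$board")"
done
```

Any ID it prints for a board in the hung domain is a candidate for the -p
option, subject to the cautions in this note.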
==========
==========
ADDITIONAL NOTES:
All commands are enclosed by ' '
It is recommended that you read the man pages for the commands noted.
==========
INTERNAL SUMMARY:
This information was taken from the "Enterprise 10000 Advanced System Service
Manual", May, 1998. Some parts were expanded to better explain what is
presented here.
SUBMITTER: Stephen Taylor
APPLIES TO: Hardware/Ultra Enterprise/Servers/Enterprise 10000, AFO Vertical Team Docs, AFO Vertical Team Docs/Kernel
ATTACHMENTS:
Copyright (c) 1997-2003 Sun Microsystems, Inc.