Document intinfodoc/27538

InfoDoc ID		Synopsis		Date
27538		How to gather information at the OK prompt		7 Nov 2001

Status

Issued

Description

OVERVIEW:

Occasionally a system is found at the 'ok' prompt unexpectedly. Most of the time we simply have the customer type sync at the ok prompt to generate a corefile and reboot the system. However, sync is sometimes unable to generate a corefile, and there are many other things that can be done at the ok prompt to gather more information. This document outlines the steps to follow should you find a system at the ok prompt. It only provides guidelines for gathering the data; it does not attempt to teach the reader how to interpret it. A call should be placed to the Customer Care Center to have the results interpreted.

BASIC REQUIREMENTS:

The most important capability to have when gathering data at the ok prompt is logging. A lot of data will be captured, and it is not feasible to write it down by hand. Customers have even used a digital camera to photograph screens for engineering, although the preferred method is a terminal or terminal program attached to the serial port.

Also, to obtain as much information as possible, you should load the obpsym module into memory. Loading this module makes kernel symbols accessible at the OpenBoot PROM (OBP) level. However, while this module is loaded, the system will not reboot automatically if it panics or experiences a problem that would normally cause a reboot. Instead, it will drop to the ok prompt and stay there until someone types go at the ok prompt.

The best method to load obpsym is via /etc/system with the following entry:

set obpdebug=1

Keep in mind that the purpose of this Infodoc is to explain how to gather data at the ok prompt for systems that are found at the ok prompt or that drop to it from a hang. While you are troubleshooting these types of issues, a system can still panic for unrelated reasons, and in that case we would want it to reboot automatically. To restore automatic reboots after "normal" panics, add the following line to /etc/system:

set nopanicdebug=1

Finally, you can load the obpsym module from the command line to avoid a reboot; the nopanicdebug parameter, however, cannot be set this way. The obpsym module is architecture specific (sun4m, sun4d and sun4u), and there are two versions for sun4u platforms: 32-bit and 64-bit. The following command loads the appropriate version (the -p option provides that functionality):

modload -p misc/obpsym

For more information on obpsym, please see Infodoc 15876.
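
As a quick check before the next reboot, the /etc/system entries described above can be verified with a small script. The following Python sketch is only an illustration: the function name find_settings and the sample text are hypothetical, and it assumes the usual /etc/system convention that lines beginning with '*' are comments.

```python
def find_settings(system_text, names=("obpdebug", "nopanicdebug")):
    """Return {name: value} for 'set name=value' lines found in /etc/system text."""
    found = {}
    for raw in system_text.splitlines():
        line = raw.strip()
        # Assumption: a leading '*' marks a comment line in /etc/system.
        if not line or line.startswith("*"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] != "set":
            continue
        name, sep, value = parts[1].partition("=")
        if sep and name.strip() in names:
            found[name.strip()] = value.strip()
    return found


# Hypothetical file contents: obpdebug is active, nopanicdebug is commented out.
sample = """\
* debugging aids for ok-prompt data collection
set obpdebug=1
* set nopanicdebug=1
"""
print(find_settings(sample))
```

With the sample text, only obpdebug is reported, which would mean the system stays at the ok prompt after any panic until someone types go.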

COMMANDS TO RUN AT THE OK PROMPT:

There are three major tasks you may be faced with when the system has dropped to the ok prompt and you missed any error messages. In ascending order of complexity, those tasks are:

1. Extract a stack backtrace and information about the current state of the system

2. Get the message buffers, i.e. information about what last happened on this system

3. Conduct a true "debugging session" at the ok prompt. This is beyond the scope of this document.

COMMANDS TO EXTRACT CURRENT STATE OF SYSTEM:

None of these commands requires obpsym to be loaded; however, several yield more information when it is. Some of these commands only run on certain architectures; see below for a list of commands to run on various platforms. Here is a list of the commands and a description of the information each gathers.

* ctrace - This command will dump the stack of the thread responsible for bringing the system down to the ok prompt.

To get the $c equivalent of a "long" stacktrace, enter the following:

0 w begin .locals %o7 .adr cr (+w) key? or until

NOTE: This does not work on Serengeti platforms because the (+w) word is unknown on those platforms.

On Sun Fire systems, you can attempt to generate the same information by looping through the register windows. To do so, use this command:

0 w 20 0 do i w .locals %o7 .adr cr loop

* wd-dump - This utility will dump the information needed to determine the reason for dropping to ok prompt on sun4d systems.

* .errors - This command will dump quite a bit of information that would need to be examined by engineering.

* .xir-state - This command also dumps quite a bit of information.

* .error(s) - This command will tell you what trap occurred that brought system to ok prompt.

* .registers - This command will dump the global registers. g0 will always be zero, and g7 will be the current thread pointer. There are many columns in the sun4u output.

* .trap-registers - This command will dump information about all five trap levels on the system at the time we dropped to the ok prompt.

* .pstate - This displays the processor state register (PSTATE) of the cpu that handled going down to ok. It is used on sun4u platforms.

* .psr - This displays the processor state register (PSR) of the cpu that handled going down to ok. It is used on sun4m and sun4d platforms.

Here are the commands to run on various platforms:

sun4m

.psr
.registers
ctrace
0 w begin .locals %o7 .adr cr (+w) key? or until

sun4d

wd-dump
.psr
.registers
ctrace
0 w begin .locals %o7 .adr cr (+w) key? or until

sun4u (U1 - U450, SunBlade)

.pstate
.errors
.trap-registers
.cpu-afsr
.registers
ctrace
0 w begin .locals %o7 .adr cr (+w) key? or until

sun4u (Ex000, Ex500, E10k)

.xir-state
.pstate
.trap-registers
.registers
ctrace
0 w begin .locals %o7 .adr cr (+w) key? or until

sun4u (Ex800, E15k)

.pstate
.trap-registers
.registers
ctrace
20 0 do i w .locals %o7 .adr cr loop

HOW TO DUMP MESSAGES FROM THE OK PROMPT:

You can dump the data that would eventually make its way to /var/adm/messages. This can be very helpful because the data is stored in the kernel in a ring buffer and can be overwritten or lost, for example if the crashdump fails or the system must be power cycled to be brought up again. We definitely want to look at the messages in the message buffer up to the point where the system dropped to the ok prompt. The process for dumping this data differs between Solaris 2.6 and below and Solaris 7 and above. This section outlines how to retrieve the information in each case.

Solaris 2.6 and below

For 2.6 and below, obpsym does not need to be loaded, because msgbuf can be accessed by the virtual or physical address of the ring buffer. It is much easier to dump the msgbuf if obpsym is loaded, however, because you do not have to know its hex address. Below is the syntax to dump the msgbuf when obpsym is loaded:

msgbuf 18 + .cstr

If obpsym is not loaded, you will have to dump the msgbuf by address. There are a couple of ways to do this. You can determine in advance which address is used to store msgbuf: the following command, run as root while the system is running in single-user or multi-user mode, reports where the msgbuf is on a given system.

echo 'msgbuf/A' | adb -k

Assuming you do not apply any patches between the time the above command was run and the time the system is at the ok prompt, you can use the address returned by that command. However, if you are at the ok prompt and do not know the address, here is a table of the possible commands to dump the message buffer, broken out by platform:

Platform Command

sun4u

50002000 2000 dump
60002000 2000 dump
70002000 2000 dump

sun4m

f0002000 2000 dump
2000 2000 dump

sun4d

e0002000 2000 dump

The syntax of the above commands is <address> <length in hex> dump. It is worth dumping a few hundred bytes first to be sure the data looks like msgbuf data; if so, dump the entire buffer. Because the msgbuf is a circular buffer, you will have to dump all of it and examine it to determine whether there are any messages indicating why the system dropped to the ok prompt. That is, the last messages displayed will not necessarily be the last messages placed in the msgbuf.
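
Two points above are easy to get wrong: the length argument is hexadecimal, and a circular buffer is not displayed in chronological order. The following Python sketch illustrates both under stated assumptions. The helper names dump_command and unwrap_ring are hypothetical, and unwrap_ring assumes you already know the offset at which the next byte would have been written; this document does not describe how to find that offset.

```python
def dump_command(address, length):
    """Format '<address> <length in hex> dump' for typing at the ok prompt."""
    return f"{address:x} {length:x} dump"


def unwrap_ring(buf, write_index):
    """Return ring-buffer contents oldest-first.

    write_index is the offset where the next byte would be written, so the
    oldest data starts there and the newest data ends just before it.
    """
    return buf[write_index:] + buf[:write_index]


# Preview 0x100 (256 decimal) bytes before dumping the full 0x2000 (8192) bytes.
print(dump_command(0x50002000, 0x100))
print(dump_command(0x50002000, 0x2000))

# Toy 15-byte ring: "trap. " was written last but appears first in the dump.
ring = b"trap. boot ok. "
print(unwrap_ring(ring, 6))
```

In the toy buffer the newest text appears at the start of the raw dump, which is why the whole buffer has to be examined rather than only its final lines.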

Solaris 7 and above

The msgbuf in Solaris 7 and above is implemented via STREAMS, so it cannot be dumped as text the way it can in Solaris 2.6 and below. The downside is that obpsym must be loaded to dump the msgbuf, and the command to dump it is significantly longer. The upside is that the dumped msgbuf is formatted very much like the output in /var/adm/messages, and the messages are dumped in the correct order, with the last messages in msgbuf displayed last. There are two different commands, depending on whether the system is running in 32-bit or 64-bit mode.

Here is the command to run in 32-bit mode:

log_recent l@ 4 + l@ begin dup dup . 8 + l@ c + l@ .cstr l@ dup 0 = key? or until

Here is the command to run in 64-bit mode:

log_recent x@ 8 + x@ begin dup dup . 10 + x@ 18 + x@ .cstr x@ dup 0 = key? or until

NOTE: l@ (in the 32-bit command) and x@ (in the 64-bit command) are each a single word; there is no space between the letter and the @.
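
To make the two commands above easier to read, the following Python sketch mimics the same pointer walk on a made-up memory image. It is only an illustration, not a tool for a live system: the walk_log function, the toy addresses, and the dictionaries standing in for memory are all hypothetical. The offsets come straight from the commands (one, two and three pointer widths); reading them as a link to the previous message, a link to a continuation block, and a pointer to the text is an interpretation that this document does not confirm.

```python
def walk_log(mem, strings, log_recent, ptr_size=8):
    """Mimic the pointer walk performed by the ok-prompt command.

    mem maps an address to the pointer-sized value stored there;
    strings maps an address to the text stored there.
    The key? test (stop on keypress) is omitted.
    """
    prev_off = 1 * ptr_size  # "8 +" in the 64-bit command, "4 +" in the 32-bit one
    cont_off = 2 * ptr_size  # "10 +" (hex) or "8 +"
    text_off = 3 * ptr_size  # "18 +" (hex) or "c +"

    lines = []
    node = mem[mem[log_recent] + prev_off]      # log_recent x@ 8 + x@
    while True:                                 # begin
        cont = mem[node + cont_off]             #   10 + x@
        text = strings[mem[cont + text_off]]    #   18 + x@ .cstr
        lines.append((node, text))              #   "." prints the node address
        node = mem[node]                        #   x@ fetches the next node
        if node == 0:                           #   dup 0 = ... until
            break
    return lines


# Toy 64-bit memory image (all addresses invented for this example).
mem = {
    0x1000: 0x3000,                  # log_recent holds a pointer to the newest node
    0x2000: 0x3000, 0x2010: 0x2100,  # older node: next link, continuation link
    0x2118: 0x5000,                  # older continuation block: text pointer
    0x3000: 0, 0x3008: 0x2000,       # newest node: no next, link back to older node
    0x3010: 0x3100,                  # newest node: continuation link
    0x3118: 0x5100,                  # newest continuation block: text pointer
}
strings = {0x5000: "first message", 0x5100: "second message"}

for addr, text in walk_log(mem, strings, 0x1000):
    print(f"{addr:x} {text}")
```

The walk starts one step back from the address stored at log_recent and follows the first field of each node until it reads zero, so the oldest message in this toy image is printed first and the newest last.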

INTERNAL SUMMARY:

ACKNOWLEDGEMENTS:

I wish to thank Frank Hofmann for all his help.

SUBMITTER: Peter Shoults

APPLIES TO: AFO Vertical Team Docs/Kernel

ATTACHMENTS: