Gathering Troubleshooting Data

Answer the following questions to help isolate the system problem. Use "Troubleshooting a System Crash Checklist" for gathering troubleshooting data for a crashed system.

Table 26-1 Identifying System Crash Data

Question	Description
Can you reproduce the problem?	A reproducible test case is often essential for debugging difficult problems. If the problem can be reproduced, the service provider can build kernels with special instrumentation to trigger, diagnose, and fix the bug.
Are you using any third-party drivers?	Drivers run in the same address space as the kernel, with all the same privileges, so they can cause system crashes if they have bugs.
What was the system doing just before it crashed?	If the system was doing anything unusual like running a new stress test or experiencing higher-than-usual load, that might have led to the crash.
Were there any unusual console messages right before the crash?	Sometimes the system will show signs of distress before it actually crashes; this information is often useful.
Did you add any tuning parameters to the `/etc/system` file?	Tuning parameters can sometimes cause a crash, for example, when shared memory segments are increased so that the system tries to allocate more memory than it has.
Did the problem start recently?	If so, did the onset of problems coincide with any changes to the system, for example, new drivers, new software, a different workload, a CPU upgrade, or a memory upgrade?
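
To answer the question about tuning parameters, it can help to list only the active lines of the /etc/system file. The following is a minimal sketch, not a supported tool. The function name active_system_settings is hypothetical, and the sketch assumes that comment lines begin with an asterisk (lines that begin with a hash mark are also skipped, as an assumption).

```shell
# Hypothetical helper: print the non-comment, non-blank lines of a
# system(4)-style file. $1 is the file to read (default: /etc/system).
# Assumption: comment lines start with "*" (a leading "#" is skipped too).
active_system_settings() {
    grep -v '^[[:space:]]*[*#]' "${1:-/etc/system}" | grep -v '^[[:space:]]*$'
}
```

Running active_system_settings with no argument reads /etc/system. Any line that it prints is a setting that was in effect at the last boot and is worth reporting to the service provider.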

Troubleshooting a System Crash Checklist

Use this checklist when gathering system data for a crashed system.

Item	Your Data
Is a system crash dump available?
Identify the operating system release and appropriate software application release levels.
Identify system hardware. Include `prtdiag` output for sun4u systems. Include Explorer output for other systems.
Are patches installed? If so, include `showrev -p` output.
Is the problem reproducible?
Does the system have any third-party drivers?
What was the system doing before it crashed?
Were there any unusual console messages right before the system crashed?
Did you add any parameters to the `/etc/system` file?
Did the problem start recently?
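
Several checklist items can be collected with commands that the checklist already names. The following sketch gathers them into labeled sections. The function names run_if_present and collect_crash_data are hypothetical, and the sketch assumes that showrev and prtdiag are in the search path. On some systems prtdiag is installed in a platform-specific directory, so the sketch reports a missing command instead of failing.

```shell
# Hypothetical helper: print a labeled section, then run the command
# if it is installed; otherwise note that it is missing.
run_if_present() {
    label=$1
    shift
    echo "=== $label ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(command exited with status $?)"
    else
        echo "($1 not found on this system)"
    fi
}

# Hypothetical wrapper: collect the release, patch, and hardware items
# from the checklist. Assumes the commands are in PATH.
collect_crash_data() {
    run_if_present "OS release" uname -a
    run_if_present "Installed patches" showrev -p
    run_if_present "Hardware (sun4u)" prtdiag
}
```

For example, collect_crash_data > crash-data.txt saves the output in a file (the file name is a placeholder) that can accompany the completed checklist.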

Viewing System Messages

System messages display on the console device. The text of most system messages looks like this:

[ID msgid facility.priority]

For example:

[ID 672855 kern.notice] syncing file systems...

If the message originated in the kernel, the kernel module name is displayed. For example:

Oct 1 14:07:24 mars ufs: [ID 845546 kern.notice] alloc: /: file system full
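
The bracketed fields can be separated with standard text tools, which is useful when you sort or count messages. The following is a minimal sketch that assumes the message format shown above. The function name parse_msgid is hypothetical.

```shell
# Hypothetical helper: read message lines on standard input and print
# "msgid facility priority" for each line that has an [ID ...] field.
# Lines without that field are skipped.
parse_msgid() {
    sed -n 's/.*\[ID \([0-9][0-9]*\) \([a-z0-9]*\)\.\([a-z]*\)\].*/\1 \2 \3/p'
}
```

Piping the ufs message above through parse_msgid prints the three fields 845546, kern, and notice on one line.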

When a system crashes, it might display a message on the system console like this:

panic: error message

Less frequently, this message might be displayed instead of the panic message:

Watchdog reset !

The error logging daemon, syslogd, automatically records various system warnings and errors in message files. By default, many of these system messages are displayed on the system console and are stored in the /var/adm directory. You can direct where these messages are stored by setting up system message logging. For more information, see "How to Customize System Message Logging". These messages can alert you to system problems, such as a device that is about to fail.

The /var/adm directory contains several message files. The most recent messages are in the /var/adm/messages file (and in messages.*), and the oldest are in the messages.3 file. After a period of time (usually every ten days), a new messages file is created. The messages.0 file is renamed messages.1, messages.1 is renamed messages.2, and messages.2 is renamed messages.3. The current /var/adm/messages.3 file is deleted.
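
The rename sequence can be sketched as a shell function. This is an illustration of the order of operations only, not the script that the system runs. The function name rotate_messages is hypothetical, and the sketch assumes that the current messages file becomes messages.0 before a new, empty messages file is created. Try it in a scratch directory, not in /var/adm.

```shell
# Hypothetical illustration of the rotation order described above.
# $1 is the directory that holds the messages files.
rotate_messages() {
    dir=$1
    # The oldest file is deleted first.
    rm -f "$dir/messages.3"
    # Shift the remaining numbered files up, oldest first.
    for n in 2 1 0; do
        next=$((n + 1))
        if [ -f "$dir/messages.$n" ]; then
            mv "$dir/messages.$n" "$dir/messages.$next"
        fi
    done
    # Assumption: the current file becomes messages.0.
    if [ -f "$dir/messages" ]; then
        mv "$dir/messages" "$dir/messages.0"
    fi
    # Create the new, empty messages file.
    : > "$dir/messages"
}
```

Working from the oldest file to the newest ensures that no file is overwritten before it has been renamed.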

Because the /var/adm directory stores large files containing messages, crash dumps, and other data, this directory can consume lots of disk space. To keep the /var/adm directory from growing too large, and to ensure that future crash dumps can be saved, you should remove unneeded files periodically. You can automate this task by using the crontab file. For more information on automating this task, see "How to Delete Crash Dump Files" and Chapter 18, Scheduling System Tasks (Tasks).
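
Before removing anything, it is safer to list the candidate files and review them. The following sketch uses the find command to print regular files that were last modified more than a given number of days ago. The function name list_old_files and the 30-day default are assumptions chosen for illustration.

```shell
# Hypothetical helper: list regular files under a directory that were
# last modified more than N days ago.
# $1 is the directory; $2 is the age in days (default: 30, an arbitrary value).
list_old_files() {
    dir=$1
    days=${2:-30}
    find "$dir" -type f -mtime +"$days" -print
}
```

For example, list_old_files /var/adm 30 prints the files in /var/adm that have not changed in the last 30 days. A similar find command could be scheduled from a crontab file after you have confirmed that the listed files are no longer needed.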

How to View System Messages

Display recent messages generated by a system crash or reboot by using the dmesg command.

$ dmesg

Or, use the more command to display one screen of messages at a time.

$ more /var/adm/messages

For more information, see dmesg(1M).
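
When a messages file is long, filtering on the priority field narrows the output to the more severe entries. The following is a minimal sketch. The function name severe_messages is hypothetical, and the sketch assumes a grep command that supports the -E option (on some systems this is a separate grep in an XPG4 directory).

```shell
# Hypothetical helper: print only messages whose priority is err or
# more severe (emerg, alert, crit, err).
# $1 is the file to search (default: /var/adm/messages).
severe_messages() {
    grep -E '\[ID [0-9]+ [a-z0-9]+\.(emerg|alert|crit|err)\]' "${1:-/var/adm/messages}"
}
```

With no argument, severe_messages searches /var/adm/messages. Pass a rotated file such as /var/adm/messages.0 to search older entries.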

Example--Viewing System Messages

The following example shows output from the dmesg command.

$ dmesg
Jan  3 08:44:41 starbug genunix: [ID 540533 kern.notice] SunOS Release 5.9 ...
Jan  3 08:44:41 starbug genunix: [ID 913631 kern.notice] Copyright 1983-2002 ...
Jan  3 08:44:41 starbug genunix: [ID 678236 kern.info] Ethernet address ...
Jan  3 08:44:41 starbug unix: [ID 389951 kern.info] mem = 131072K (0x8000000)
Jan  3 08:44:41 starbug unix: [ID 930857 kern.info] avail mem = 121888768
Jan  3 08:44:41 starbug rootnex: [ID 466748 kern.info] root nexus = Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz)
Jan  3 08:44:41 starbug rootnex: [ID 349649 kern.info] pcipsy0 at root: UPA 0x1f0x0
Jan  3 08:44:41 starbug genunix: [ID 936769 kern.info] pcipsy0 is /pci@1f,0
Jan  3 08:44:41 starbug pcipsy: [ID 370704 kern.info] PCI-device: pci@1,1, simba0
Jan  3 08:44:41 starbug genunix: [ID 936769 kern.info] simba0 is /pci@1f,0/pci@1,1
Jan  3 08:44:41 starbug pcipsy: [ID 370704 kern.info] PCI-device: pci@1, simba1
Jan  3 08:44:41 starbug genunix: [ID 936769 kern.info] simba1 is /pci@1f,0/pci@1
Jan  3 08:44:57 starbug simba: [ID 370704 kern.info] PCI-device: ide@3, uata0
Jan  3 08:44:57 starbug genunix: [ID 936769 kern.info] uata0 is /pci@1f,0/pci@1,1/ide@3
Jan  3 08:44:57 starbug uata: [ID 114370 kern.info] dad0 at pci1095,6460
.
.
.

Customizing System Message Logging

You can capture additional error messages that are generated by various system processes by modifying the /etc/syslog.conf file. By default, the /etc/syslog.conf file directs many system process messages to the /var/adm/messages files. Crash and boot messages are stored here as well. To view the messages in the /var/adm directory, see "How to View System Messages".

The /etc/syslog.conf file has two columns separated by tabs:

facility.level ... action

facility.level	The facility is the system source of the message or condition. It can be a comma-separated list of facilities. Facility values are listed in Table 26-2. The level indicates the severity or priority of the condition being logged. Priority levels are listed in Table 26-3.
action	The action field indicates where the messages are forwarded.

The following example shows sample lines from a default /etc/syslog.conf file.

user.err                                        /dev/sysmsg
user.err                                        /var/adm/messages
user.alert                                      `root, operator'
user.emerg                                      *

This means the following user messages are automatically logged:

User errors are printed to the console and also are logged to the /var/adm/messages file.
User messages requiring immediate action (alert) are sent to the root and operator users.
User emergency messages are sent to all users who are logged in (the * action).
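
To check where a given message would be sent, the selector matching can be sketched in a few lines of shell and awk. This is an illustration under stated assumptions, not the logic that syslogd uses internally. The function name actions_for is hypothetical. The sketch assumes that the two fields are separated by tabs, that a selector matches its own level and every more severe level, and that any m4 ifdef lines in the file can be ignored.

```shell
# Hypothetical helper: print the action of every line in a
# syslog.conf-style file that would apply to one facility and level.
# $1 = file, $2 = facility (for example, user), $3 = level (for example, err).
actions_for() {
    conf=$1
    fac=$2
    lvl=$3
    case $lvl in
        emerg|alert|crit|err|warning|notice|info|debug) ;;
        *) echo "unknown level: $lvl" >&2; return 1 ;;
    esac
    awk -v fac="$fac" -v lvl="$lvl" '
    BEGIN {
        FS = "\t+"
        # Levels from most severe (1) to least severe (8).
        n = split("emerg alert crit err warning notice info debug", names, " ")
        for (i = 1; i <= n; i++) rank[names[i]] = i
    }
    /^#/ || NF < 2 { next }          # skip comments and lines with no action
    {
        matched = 0
        ns = split($1, sels, ";")    # selectors are separated by semicolons
        for (s = 1; s <= ns; s++) {
            dot = index(sels[s], ".")
            if (dot == 0) continue
            facs = substr(sels[s], 1, dot - 1)
            sel_lvl = substr(sels[s], dot + 1)
            nf = split(facs, f, ",") # facilities are separated by commas
            for (k = 1; k <= nf; k++) {
                if (f[k] != fac && f[k] != "*") continue
                if (sel_lvl == "none") matched = 0
                else if ((sel_lvl in rank) && rank[lvl] <= rank[sel_lvl]) matched = 1
            }
        }
        if (matched) print $2
    }' "$conf"
}
```

With the four sample lines above stored in a tab-separated file, a user message at the err level matches only the two user.err lines, so the function prints /dev/sysmsg and /var/adm/messages. A user message at the emerg level matches all four lines.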

The most common error condition sources are shown in the following table. The most common priorities are shown in Table 26-3 in order of severity.

Table 26-2 Source Facilities for syslog.conf Messages

Source	Description
`kern`	The kernel
`auth`	Authentication
`daemon`	All daemons
`mail`	Mail system
`lp`	Spooling system
`user`	User processes

Note - The number of syslog facilities that can be activated in the /etc/syslog.conf file is unlimited.

Table 26-3 Priority Levels for syslog.conf Messages

Priority	Description
`emerg`	System emergencies
`alert`	Errors requiring immediate correction
`crit`	Critical errors
`err`	Other errors
`info`	Informational messages
`debug`	Output used for debugging
`none`	This setting doesn't log output

