SRDB ID | Synopsis | Date |
17479 | Resolving Hardhang problems on Ultra[TM] Servers | 13 Dec 2002 |
Status | Issued |
Description |
Problem Definition
------------------

None of the terminals respond, the console does not respond, ping/telnet gets no response, Stop-A does not break to OBP, and "send break" from a tip line does not break into OBP. If all of the above have been tried and none of them breaks out of the hang, the system is hard hung. It is almost impossible for Engineering to determine the cause of the hang if there is no core file to analyze.

Keywords: kernel, hang, hard, core, obp, ok

SOLUTION SUMMARY:
Solutions
---------

These steps neither provide a final solution nor identify the cause of the hardhang, but they help in obtaining a core file with which to analyze the problem. In all the cases listed below, once you are in OBP, type "sync" to get the core dump. If the system was booted with kadb, do some initial analysis first and then type $q to enter OBP.

NOTE: This document was written specifically for sun4u architecture systems. While many of these instructions will be applicable to other architectures, some will not. XIR is only available on Ultra Enterprise[TM] systems.

Options
-------

1. Enable Deadman
2. Set Breakpoint
3. Install Hardhang Kernel
4. XIR

1. Enable Deadman
-----------------

Deadman code is enabled by setting snooping in /etc/system. Make the following entries in the /etc/system file:

    set snooping = 1
    set snoop_interval = 9000000

The snooping=1 entry enables the deadman code. The snoop_interval=9000000 entry causes the deadman to trigger after 90 seconds (against the default of 500 seconds) of system inactivity (no clock interrupts).
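
Before rebooting, it can be worth confirming that the entries really are active in /etc/system and not commented out. The following is a minimal sketch in Python, not a Sun-supplied tool: the function names and the sample text are hypothetical, and it assumes only that settings use the "set variable = value" form and that lines beginning with "*" are comments.

```python
def parse_system_settings(text):
    """Return {variable: value} for the 'set' lines in /etc/system-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '*' (comments) are ignored.
        if not line or line.startswith("*"):
            continue
        # Only 'set <variable> = <value>' lines are of interest here.
        if not line.startswith("set"):
            continue
        body = line[3:]
        if not body[:1].isspace():
            continue  # some other keyword that merely begins with "set"
        name, sep, value = body.partition("=")
        if not sep:
            continue  # no '=' present, so this is not an assignment
        settings[name.strip()] = value.strip()
    return settings


def deadman_enabled(settings):
    """True if the parsed settings contain snooping set to 1."""
    return settings.get("snooping") == "1"


# Hypothetical sample mirroring the entries shown above.
SAMPLE = """\
* Enable the deadman timer
set snooping = 1
set snoop_interval = 9000000
"""

if __name__ == "__main__":
    parsed = parse_system_settings(SAMPLE)
    print("deadman enabled:", deadman_enabled(parsed))
    print("snoop_interval:", parsed.get("snoop_interval"))
```

To check a real system, read the contents of /etc/system and pass them to parse_system_settings() in place of SAMPLE.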

Reboot the system with kadb:

    ok boot kadb

When the next hang occurs, hopefully the deadman timer will be triggered, and the system will drop into kadb:

    # ~stopped at 0xfbd01028: ta 0x7d
    kadb[0]:

At this point, any specific debugger commands can be run to examine the current state of the system. Of particular interest are:

    $r          dump the registers
    $c          dump the current stack backtrace
    freemem/D   see how much memory is free

When kadb debugging is complete, attempt to take a core dump by doing:

    kadb[0]: $q
    ok sync

As of Solaris[TM] 8, the system no longer drops to the ok prompt. Instead, it initiates a panic sequence that sets the panic string to "deadman: timed out after %d seconds of clock inactivity", creates the core image, and resets the system.

Pros: Easy to enable.
Cons: Requires a system reboot.
      Cannot break to kadb/OBP if the level 14 interrupt is blocked.
      Cannot break a hang caused by a device other than the CPU seizing a system bus.

2. Set Breakpoint
-----------------

The system must have been booted with kadb. After the system comes up, get into kadb (Stop-A/"send break") and set a breakpoint in system_high_handler(). This function is invoked only on level 15 interrupts and is associated with fan failures and system board detection. To set the breakpoint in kadb:

    kadb: (type return)
    kadb[0]: system_high_handler:b
    kadb[0]: :c

When the system hardhangs again, follow the procedure described in the section "Generating a Level 15 Interrupt".

Pros: Will succeed in some instances where 'snooping' does not.
Cons: Requires a reboot if kadb is not enabled.
      Requires a free system board slot.
      Cannot break a hang caused by a device other than the CPU seizing a system bus.
      Will fail if level 15 interrupts have been masked out.

3. Install Hardhang Kernel
--------------------------

A special kernel needs to be built and installed at the customer site. Additionally, the breakpoint in system_high_handler() should be set through kadb (see the above section "Set Breakpoint"). The system is then set up to break out of the hang. Should the system hardhang, follow the procedure described in the section "Generating a Level 15 Interrupt".

Pros: Will succeed even if all the interrupts are masked.
Cons: Requires a custom kernel.
      Will fail if all the CPUs have PSTATE_IE = 0.
      Requires a free system board slot.
      Cannot break a hang caused by a device other than the CPU seizing a system bus.

4. XIR
------

This is the last resort in case the interrupts have been disabled. XIR is a non-maskable interrupt and will definitely break the system out of the hang. Unfortunately, this method also clears memory, and hence a core dump cannot be taken. It does, however, provide some information about the CPU state at the time of the hang.
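
The limitations listed for the options can be collected into a small decision helper. The following Python sketch is illustrative only: the class, field, and function names are hypothetical, and it encodes nothing beyond the "Cons" entries in this document plus one labeled assumption, namely that PSTATE_IE = 0 on every CPU also defeats the interrupt-driven deadman and breakpoint options. It does not model the restriction that XIR is only available on Ultra Enterprise[TM] systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HangState:
    """Hypothetical description of a hung system; field names are illustrative."""
    level14_blocked: bool = False       # level 14 interrupt is blocked
    level15_masked: bool = False        # level 15 interrupts are masked out
    all_cpus_ie_clear: bool = False     # every CPU has PSTATE_IE = 0
    bus_seized_by_device: bool = False  # a non-CPU device has seized a system bus
    free_board_slot: bool = True        # a slot is free for board insertion
    hardhang_kernel: bool = False       # the special hardhang kernel is installed


def options_that_can_break(state):
    """Return the options not ruled out by a limitation listed in the document."""
    usable = []
    # Assumption: with PSTATE_IE = 0 on every CPU no interrupt is taken, so the
    # deadman and breakpoint options are excluded too (the document lists this
    # limitation only for the hardhang kernel).
    interrupts_possible = not (state.all_cpus_ie_clear or state.bus_seized_by_device)
    if interrupts_possible and not state.level14_blocked:
        usable.append("deadman")
    if interrupts_possible and state.free_board_slot:
        if state.hardhang_kernel:
            # The hardhang kernel works even when interrupts are masked.
            usable.append("hardhang kernel")
        elif not state.level15_masked:
            usable.append("breakpoint")
    # XIR is non-maskable, but no useful core file can be taken afterwards.
    usable.append("xir")
    return usable


if __name__ == "__main__":
    state = HangState(level14_blocked=True, level15_masked=True, hardhang_kernel=True)
    print(options_that_can_break(state))
```

XIR is always listed last because it is the only option here that does not yield a core file.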

The remote Externally Initiated Reset (XIR) command, although limited in its current form, can be used to aid software debugging of hung systems. Currently XIR stores the following information for each CPU:

    TL      (Trap Level)
    TT      (Trap Type)
    TPC     (Program Counter)
    TNPC    (Next Program Counter)
    TSTATE  (Trap State Register)

This information is then gathered by typing .xir-state-all in the OBP. (You may need to Stop-A/"send break" to stop the machine from rebooting in order to issue this command.)

There are 2 methods for initiating the XIR:

Method 1:

Press the XIR pin on the clock board, which is at the rear of the E4000 (the FE handbook notes the location of the XIR switch). To the right side of the XIR switch is the POR switch; DO NOT press it, it will cycle power. When XIR is pressed, the system will come to the "ok" prompt (or wait until it comes to the "ok" prompt). This method is easier than entering the key sequences noted in method 2.

Method 2:

    Press Return key (twice)
    Press ~ key (once, possibly twice)
    Press Control-Shift-X keys (together)

This key sequence should reboot the system. At this point, you'll need to do a Stop-A/"send break" to get to the ok prompt.

Once the system is at the OBP prompt, get the CPU state info:

    ok .xir-state-all

NOTE: This information must be manually copied. Then go to the following website:

    http://otis.uk/cgi-bin/xir-cgi.tcl

to get the details on what to do with this information.

Pros: Will break out of the hang.
Cons: Will not be able to get a useful core file.

Generating a Level 15 Interrupt
-------------------------------

On a sun4u architecture system, a level 15 interrupt is generated when a system board is inserted. This interrupt is also generated by a fan failure, on both the sun4u and sun4d architectures, but since the fans are not easily accessible, board insertion is the method described here.
If, however, the system in question is a sun4d, then disconnecting a fan will be the only method available for generating a level 15 interrupt.

When the system hangs, insert a system board into a free slot. This will generate a level 15 interrupt, which should trigger the breakpoint in kadb. Once in kadb, debugger commands can be run to examine the current state of the system. Of particular interest are:

    $r          dump the registers
    $c          dump the current stack backtrace
    freemem/D   see how much memory is free

When kadb debugging is complete, attempt to take a core dump by doing:

    kadb[0]: $q
    ok sync

WARNING: If a non-forced level 15 interrupt should occur on the system while the breakpoint is set or the debug kernel is in place, then the system will break to the OBP/kadb prompt. The system cannot be used until control is returned to the kernel, by typing "go" at the OBP, or :c at the kadb prompt.

SUBMITTER: Nancy A LeBlanc
APPLIES TO: Hardware/Ultra Enterprise/Servers, Operating Systems/Solaris/Solaris 2.5.1, AFO Vertical Team Docs, AFO Vertical Team Docs/Kernel
ATTACHMENTS: