InfoDoc ID | Synopsis | Date |
22145 | Sun Enterprise[TM] 10000: Sample steps to troubleshoot the host reset dump file | 23 Sep 2002 |
Status | Issued |
Description |
In this example, the processor number, shown as (Proc XX), has been added next to each line of output to make the data easier to read. It does not normally appear in the grep output.
Step 1: Locate the signature block.
The first two bytes of the signature block give you the program identifier. In this case, the identifier is 4f42=OB (ASCII "O" and "B"). This means that all 8 procs are sitting at OBP.
The next byte is the State Byte. In this case, 01=SIGBST_RUN on six of the procs and 0e=SIGBST_XIR on Procs 24 and 28.
The last byte is the CPU sub state. In this case, 00=EXIT_NULL on the procs in SIGBST_RUN and 0b=EXIT_EXTERN_INIT_RESET on Procs 24 and 28. There is a complete table in the Advanced Sun Enterprise[TM] 10000 Maintenance manual on page 11-229 for all the bytes of the signature.
# cat hostresetfile | grep sig
signature 0x4f420100 (Proc 27)
signature 0x4f420100 (Proc 26)
signature 0x4f420100 (Proc 25)
signature 0x4f420e0b (Proc 24)
signature 0x4f420100 (Proc 31)
signature 0x4f420100 (Proc 30)
signature 0x4f420100 (Proc 29)
signature 0x4f420e0b (Proc 28)
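The byte layout described in this step can be checked mechanically. The following Python sketch splits a 32-bit signature word into its three fields. The function name decode_signature and the two lookup tables are illustrative only; the tables hold just the four codes quoted in this document, not the complete table from the maintenance manual.

```python
# Only the codes mentioned in this document; the full table is in the manual.
STATE_NAMES = {0x01: "SIGBST_RUN", 0x0E: "SIGBST_XIR"}
SUBSTATE_NAMES = {0x00: "EXIT_NULL", 0x0B: "EXIT_EXTERN_INIT_RESET"}


def decode_signature(word):
    """Split a 32-bit signature word into (program id, state, sub state)."""
    # Top two bytes: program identifier as two ASCII characters.
    raw_id = bytes([(word >> 24) & 0xFF, (word >> 16) & 0xFF])
    program = raw_id.decode("ascii", errors="replace")
    # Third byte: state. Fourth byte: CPU sub state.
    state = (word >> 8) & 0xFF
    substate = word & 0xFF
    return (
        program,
        STATE_NAMES.get(state, "unknown (0x%02x)" % state),
        SUBSTATE_NAMES.get(substate, "unknown (0x%02x)" % substate),
    )


print(decode_signature(0x4F420100))  # ('OB', 'SIGBST_RUN', 'EXIT_NULL')
print(decode_signature(0x4F420E0B))  # ('OB', 'SIGBST_XIR', 'EXIT_EXTERN_INIT_RESET')
```

Codes that are not in the two small tables are reported as "unknown" with their hex value, so they can be looked up in the manual.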
Step 2: Check the heartbeat signatures.
The heartbeats should be the same for all procs in the domain. If they are not, the proc with the differing heartbeat may point to the source of the problem.
# cat hostresetfile | grep heart
heartbeat 0x1b0b8000 (Proc 27)
heartbeat 0x1b0b8000 (Proc 26)
heartbeat 0x1b0b8000 (Proc 25)
heartbeat 0x1b0b8000 (Proc 24)
heartbeat 0x1b0b8000 (Proc 31)
heartbeat 0x1b0b8000 (Proc 30)
heartbeat 0x1b0b8000 (Proc 29)
heartbeat 0x1b0b8000 (Proc 28)
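With many procs, comparing the heartbeat values by eye is error prone. The following Python sketch flags any proc whose heartbeat differs from the rest. The function name heartbeat_outliers is hypothetical, and treating the most common value as the reference is an assumption of this sketch, not something the dump file defines.

```python
from collections import Counter


def heartbeat_outliers(heartbeats):
    """Return {proc: value} for procs whose heartbeat differs from the most common one."""
    if not heartbeats:
        return {}
    # Assumption: the value shared by most procs is the "expected" heartbeat.
    common, _count = Counter(heartbeats.values()).most_common(1)[0]
    return {proc: hb for proc, hb in heartbeats.items() if hb != common}


# Values from the example output in this step.
sample = {proc: 0x1B0B8000 for proc in (27, 26, 25, 24, 31, 30, 29, 28)}
print(heartbeat_outliers(sample))  # {}
```

An empty result, as in this example, means the heartbeats agree and the problem has to be found elsewhere.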
Step 3: Determine which proc belongs to which line.
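The following command shows which proc each line of output belongs to: the first line of any grep output corresponds to the first proc listed here, the second line to the second proc, and so on. This is how the (Proc XX) labels were mapped throughout this document.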
# cat hostresetfile | grep Proc
Proc 27: offset: 0x07200 (Proc 27)
Proc 26: offset: 0x07000 (Proc 26)
Proc 25: offset: 0x07200 (Proc 25)
Proc 24: offset: 0x07000 (Proc 24)
Proc 31: offset: 0x07200 (Proc 31)
Proc 30: offset: 0x07000 (Proc 30)
Proc 29: offset: 0x07200 (Proc 29)
Proc 28: offset: 0x07000 (Proc 28)
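The positional mapping can also be done in a script instead of by hand. The Python sketch below pairs the Nth "Proc" line with the Nth line for a given field. The function name label_values is hypothetical, and the sample text is a made-up two-proc excerpt that only assumes the dump contains lines shaped like the grep output shown in this document; the real file layout may differ.

```python
import re


def label_values(dump_text, field):
    """Pair the Nth 'Proc N:' line with the Nth '<field> 0x...' line."""
    procs = [int(n) for n in re.findall(r"^[ \t]*Proc (\d+):", dump_text, re.MULTILINE)]
    pattern = r"^[ \t]*%s[ \t]+(0x[0-9a-fA-F]+)" % re.escape(field)
    values = re.findall(pattern, dump_text, re.MULTILINE)
    # Assumption: one value per proc, in the same order as the Proc lines.
    if len(procs) != len(values):
        raise ValueError("found %d procs but %d %s lines" % (len(procs), len(values), field))
    return dict(zip(procs, values))


# Hypothetical excerpt, not a real dump file.
sample_text = """\
Proc 27: offset: 0x07200
afsr 0x0000000101000010
Proc 26: offset: 0x07000
afsr 0x0000000000000000
"""
print(label_values(sample_text, "afsr"))
```

The length check is deliberate: if the number of field lines does not match the number of procs, positional pairing is not safe and the mapping should be done manually.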
Step 4: Locate the determining factor.
In this case, the afsr (Asynchronous Fault Status Register) was the determining factor. The afsr value for every proc should be all zeroes.
# cat hostresetfile | grep afsr
afsr 0x0000000101000010 (Proc 27) <---- Not all zeros
afsr 0x0000000000000000 (Proc 26)
afsr 0x0000000000000000 (Proc 25)
afsr 0x0000000000000000 (Proc 24)
afsr 0x0000000000000000 (Proc 31)
afsr 0x0000000000000000 (Proc 30)
afsr 0x0000000000000000 (Proc 29)
afsr 0x00000001806000ff (Proc 28) <---- Not all zeroes
INTERNAL SUMMARY:
Use the following web site and tools to debug the afsr registers:
http://cpre-amer.west/esg/hsg/starfire/tools/
or
http://kernel.central/tools/spitfire.html
(Proc 27)
---------
AFSR = 0x0000000101000010 decodes to:
Bit  SYM   Description
---  ---   ------------
24:  CP    Copy-out (intervention) Parity error
32:  ME    Multiple Error of same type occurred
1 parity syndrome bits

(Proc 28)
---------
AFSR = 0x00000001806000ff decodes to:
Bit  SYM   Description
---  ---   ------------
21:  UE    Uncorrectable ECC error (E_SYND in SDB)
22:  EDP   Data Parity error from Ecache SRAMs
31:  PRIV  Privileged code access error(s) has occurred
32:  ME    Multiple Error of same type occurred
8 parity syndrome bits
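The decoding above can be reproduced with a few lines of code when the web tools are not reachable. The Python sketch below is not the tool referenced above; it is a minimal illustration whose bit table contains only the five bits that appear in this document. The name decode_afsr is hypothetical, and the assumption that the parity syndrome occupies the low 16 bits should be checked against the processor documentation.

```python
# Bit names taken only from the decoder output quoted in this document.
AFSR_BITS = {
    21: "UE",    # Uncorrectable ECC error
    22: "EDP",   # Data Parity error from Ecache SRAMs
    24: "CP",    # Copy-out (intervention) Parity error
    31: "PRIV",  # Privileged code access error
    32: "ME",    # Multiple Error of same type
}
SYNDROME_MASK = 0xFFFF  # assumption: parity syndrome is in the low 16 bits


def decode_afsr(value):
    """Return (list of known flag names, number of parity syndrome bits set)."""
    flags = [AFSR_BITS[bit] for bit in sorted(AFSR_BITS) if (value >> bit) & 1]
    syndrome_bits = bin(value & SYNDROME_MASK).count("1")
    return flags, syndrome_bits


print(decode_afsr(0x0000000101000010))  # (['CP', 'ME'], 1)
print(decode_afsr(0x00000001806000FF))  # (['UE', 'EDP', 'PRIV', 'ME'], 8)
print(decode_afsr(0x0000000000000000))  # ([], 0)
```

Bits that are set but missing from the small table are silently ignored by this sketch, so a full decode still needs the complete bit definitions.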
Using the above information and the reference on the web page, you should determine that the error reported for Proc 28 (which is CPU 0 on system board 7) is:
CPUxx UE Error: Ecache Copyout on CPUyy
Therefore, CPUyy is the implicated FRU.
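To locate the FRU, the proc number has to be translated into a system board and CPU position. The Python sketch below assumes four processors per system board, which agrees with the Proc 28 example above (CPU 0 on system board 7). The function name proc_to_board_cpu is hypothetical.

```python
CPUS_PER_BOARD = 4  # assumption: four processors per system board


def proc_to_board_cpu(proc):
    """Return (system board, CPU on that board) for a proc number."""
    return divmod(proc, CPUS_PER_BOARD)


print(proc_to_board_cpu(28))  # (7, 0)
print(proc_to_board_cpu(27))  # (6, 3)
```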
SUBMITTER: Douglas Donahue
APPLIES TO: Hardware/Ultra Enterprise/Servers/Enterprise 10000
ATTACHMENTS: