InfoDoc ID   Synopsis   Date
22145   Sun Enterprise[TM] 10000: Sample steps to troubleshoot the host reset dump file   23 Sep 2002

Status Issued

Description

In this example, the proc number is appended to each line of output to make the data easier to read. It does not normally appear in the grep output.

Step 1: Locate the signature block.

The first two bytes of the signature block give you the program identifier. In this case, the identifier is 4f42=OB (ASCII "OB"). This means that all 8 procs are sitting at OBP.

The next byte is the state byte. In this case, procs 24 and 28 show 0e=SIGBST_XIR and the other six procs show 01=SIGBST_RUN.

The last byte is the CPU sub state. In this case, procs 24 and 28 show 0b=EXIT_EXTERN_INIT_RESET and the other six procs show 00=EXIT_NULL. A complete table of all the signature bytes is in the Advanced Sun Enterprise[TM] 10000 Maintenance manual on page 11-229.

#cat hostresetfile | grep sig
signature   0x4f420100 (Proc 27)
signature   0x4f420100 (Proc 26)
signature   0x4f420100 (Proc 25)
signature   0x4f420e0b (Proc 24)
signature   0x4f420100 (Proc 31)
signature   0x4f420100 (Proc 30)
signature   0x4f420100 (Proc 29)
signature   0x4f420e0b (Proc 28)      
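The byte layout described in this step can be checked mechanically. The following Python sketch is illustrative only and is not a Sun tool: the function name decode_signature and both lookup tables are hypothetical, and the tables hold only the four codes quoted in this document. Any other code is reported as unknown and should be looked up in the manual's table.

```python
# Hypothetical helper; the tables hold only the codes quoted in this document.
STATE_NAMES = {0x01: "SIGBST_RUN", 0x0E: "SIGBST_XIR"}
SUBSTATE_NAMES = {0x00: "EXIT_NULL", 0x0B: "EXIT_EXTERN_INIT_RESET"}


def decode_signature(word):
    """Split a 32-bit signature word into (program id, state, sub state)."""
    value = int(word, 16)
    # The top two bytes are ASCII characters naming the program (0x4f42 is "OB").
    program = bytes([(value >> 24) & 0xFF, (value >> 16) & 0xFF]).decode(
        "ascii", "replace"
    )
    state = (value >> 8) & 0xFF      # third byte: state byte
    substate = value & 0xFF          # last byte: CPU sub state
    return (
        program,
        STATE_NAMES.get(state, "unknown (0x%02x)" % state),
        SUBSTATE_NAMES.get(substate, "unknown (0x%02x)" % substate),
    )


# The two distinct signature words from the grep output above.
for word in ("0x4f420100", "0x4f420e0b"):
    print(word, decode_signature(word))
```
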

Step 2: Check the heartbeat signatures.

The heartbeats should be the same for all procs in the domain. If they are not, the proc whose heartbeat differs may point to the source of the problem.

#cat hostresetfile | grep heart
heartbeat   0x1b0b8000 (Proc 27)
heartbeat   0x1b0b8000 (Proc 26)
heartbeat   0x1b0b8000 (Proc 25)
heartbeat   0x1b0b8000 (Proc 24)
heartbeat   0x1b0b8000 (Proc 31)
heartbeat   0x1b0b8000 (Proc 30)
heartbeat   0x1b0b8000 (Proc 29)
heartbeat   0x1b0b8000 (Proc 28)      
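With many procs, comparing the heartbeat values by eye is error-prone. As a minimal sketch (the function name heartbeat_outliers is hypothetical, and it assumes each line has the form "heartbeat <value>" as in the output above), the following Python picks the most common value and lists any line that differs from it:

```python
from collections import Counter


def heartbeat_outliers(lines):
    """Return (most common value, [(position, value), ...] for lines that differ)."""
    values = [line.split()[1] for line in lines if line.startswith("heartbeat")]
    if not values:
        return None, []
    majority, _count = Counter(values).most_common(1)[0]
    # Positions are 0-based and follow the order of the grep output.
    outliers = [(i, v) for i, v in enumerate(values) if v != majority]
    return majority, outliers


# All eight heartbeats in this example match, so the outlier list is empty.
sample = ["heartbeat   0x1b0b8000"] * 8
print(heartbeat_outliers(sample))
```

The positions returned are line positions, not proc numbers. They still have to be mapped to procs using the order of the procs in the dump file.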

Step 3: Determine which proc belongs to which line.

The following command shows which proc corresponds to which line of output. This is how the (Proc XX) labels were assigned throughout this document.

#cat hostresetfile | grep Proc
Proc 27: offset: 0x07200 (Proc 27)
Proc 26: offset: 0x07000 (Proc 26)
Proc 25: offset: 0x07200 (Proc 25)
Proc 24: offset: 0x07000 (Proc 24)
Proc 31: offset: 0x07200 (Proc 31)
Proc 30: offset: 0x07000 (Proc 30)
Proc 29: offset: 0x07200 (Proc 29)
Proc 28: offset: 0x07000 (Proc 28)      
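The labelling can also be done by a script instead of by hand. The sketch below is hypothetical (proc_order and label_field are illustrative names) and rests on one assumption taken from this step: the Nth occurrence of a field in the dump file belongs to the Nth "Proc N:" line. If the counts do not match, it refuses to guess.

```python
import re


def proc_order(dump_lines):
    """Return proc numbers in the order their 'Proc N:' lines appear."""
    order = []
    for line in dump_lines:
        match = re.match(r"\s*Proc\s+(\d+):", line)
        if match:
            order.append(int(match.group(1)))
    return order


def label_field(dump_lines, field):
    """Pair each '<field> <value>' line with a proc number by position."""
    values = [l.split()[1] for l in dump_lines if l.split()[:1] == [field]]
    procs = proc_order(dump_lines)
    if len(values) != len(procs):
        raise ValueError("number of %s lines does not match number of procs" % field)
    return list(zip(procs, values))


# Shortened sample built from lines shown in this document.
sample = [
    "Proc 27: offset: 0x07200",
    "signature   0x4f420100",
    "Proc 24: offset: 0x07000",
    "signature   0x4f420e0b",
]
print(label_field(sample, "signature"))
```
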

Step 4: Locate the determining factor.

In this case, the afsr (Asynchronous Fault Status Register) values were the determining factor. Every afsr value should be all zeroes.

#cat hostresetfile | grep afsr
afsr        0x0000000101000010 (Proc 27) <---- Not all zeroes
afsr        0x0000000000000000 (Proc 26)
afsr        0x0000000000000000 (Proc 25)
afsr        0x0000000000000000 (Proc 24)
afsr        0x0000000000000000 (Proc 31)
afsr        0x0000000000000000 (Proc 30)
afsr        0x0000000000000000 (Proc 29)
afsr        0x00000001806000ff (Proc 28) <---- Not all zeroes
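The check for non-zero afsr values is easy to automate. A minimal Python sketch, assuming each line has the form "afsr <hex value>" as in the output above (the function name nonzero_afsr is hypothetical):

```python
def nonzero_afsr(lines):
    """Return (position, value) for each afsr line whose value is not zero."""
    hits = []
    position = 0  # 0-based position among the afsr lines only
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "afsr":
            if int(fields[1], 16) != 0:
                hits.append((position, fields[1]))
            position += 1
    return hits


# The eight afsr values from the grep output above, in the same order.
sample = (
    ["afsr        0x0000000101000010"]
    + ["afsr        0x0000000000000000"] * 6
    + ["afsr        0x00000001806000ff"]
)
print(nonzero_afsr(sample))
```

In this example the first and last lines are flagged, which correspond to Proc 27 and Proc 28 in the proc ordering of this dump file.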
      
INTERNAL SUMMARY:

Use the following web site and tools to debug the afsr registers:

http://cpre-amer.west/esg/hsg/starfire/tools/

or

http://kernel.central/tools/spitfire.html

(Proc 27)
---------
AFSR = 0x0000000101000010 decodes to:

Bit   SYM      Description
---   ---      ------------
24:   CP       Copy-out (intervention) Parity error
32:   ME       Multiple Error of same type occurred

1 parity syndrome bit

(Proc 28)
---------
AFSR = 0x00000001806000ff decodes to:

Bit   SYM      Description
---   ---      ------------
21:   UE       Uncorrectable ECC error (E_SYND in SDB)
22:   EDP      Data Parity error from Ecache SRAMs
31:   PRIV     Privileged code access error(s) has occurred
32:   ME       Multiple Error of same type occurred

8 parity syndrome bits      
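If the web tools are not available, the two decodes above can be reproduced with a short script. The Python sketch below is not one of the tools referenced above; decode_afsr and its table are hypothetical names, and the table holds only the five bits named in this document. It also assumes the parity syndrome occupies the low 16 bits of the register, which is consistent with the syndrome bit counts shown above. Set bits above bit 15 that are not in the table are returned by number so they can be looked up separately.

```python
# Only the AFSR bits named in this document; not a complete register map.
AFSR_BITS = {
    21: "UE",    # Uncorrectable ECC error
    22: "EDP",   # Data parity error from Ecache SRAMs
    24: "CP",    # Copy-out (intervention) parity error
    31: "PRIV",  # Privileged code access error
    32: "ME",    # Multiple errors of the same type
}


def decode_afsr(word):
    """Return (named bits set, other bit numbers set, count of syndrome bits set)."""
    value = int(word, 16)
    named = [AFSR_BITS[b] for b in sorted(AFSR_BITS) if (value >> b) & 1]
    other = [
        b for b in range(16, 64) if (value >> b) & 1 and b not in AFSR_BITS
    ]
    # Assumption: bits 15:0 hold the parity syndrome; count how many are set.
    syndrome_bits = bin(value & 0xFFFF).count("1")
    return named, other, syndrome_bits


for word in ("0x0000000101000010", "0x00000001806000ff"):
    print(word, decode_afsr(word))
```
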

Using the above information and the reference on the web page, you should be able to classify the error on Proc 28 (which is cpu 0 on system board 7) as:

CPUxx UE Error: Ecache Copyout on CPUyy      

Therefore, CPUyy is the implicated FRU.
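To translate a proc number into a physical location when identifying the FRU, the following Python sketch may help. It assumes proc numbers are assigned as (board number x 4) + cpu number, four CPUs per system board, which matches the statement above that Proc 28 is cpu 0 on system board 7. The function name proc_location is hypothetical.

```python
CPUS_PER_BOARD = 4  # assumption: four CPU slots per Enterprise 10000 system board


def proc_location(proc):
    """Return (system board, cpu on that board) for a proc number."""
    board, cpu = divmod(proc, CPUS_PER_BOARD)
    return board, cpu


# The two procs with non-zero afsr values in this example.
for proc in (27, 28):
    board, cpu = proc_location(proc)
    print("Proc %d: system board %d, cpu %d" % (proc, board, cpu))
```
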

SUBMITTER: Douglas Donahue
APPLIES TO: Hardware/Ultra Enterprise/Servers/Enterprise 10000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.