SRDB ID   Synopsis   Date
47266   Sun Fire[TM] 12K/15K: CPU Event on CPU### at TL=0   25 Sep 2002

Status Issued

Description

The domain's messages file or console reports errors of the following type. They can be accompanied by a corresponding panic of the domain with the panic string "UE Error(s)".

WARNING: [AFT1] CPU Event on CPU129 at TL=0, errID 0x00000021.893c7a80 
        AFSR 0x00000080<CPU>.000001a1 AFAR 0x00000001.e6b25200 
        Fault_PC 0x1164608 Esynd 0x01a1 /N0/SB4/P1/E0 J5400 
[AFT1] errID 0x00000021.893c7a80 More than four Bits were in error 
[AFT2] errID 0x00000021.893c7a80 PA=0x00000001.e6b25200 
        E$tag 0x00000003.cd000003 E$state_0 Owner 
[AFT2] E$Data (0x00) 0x96080000.96100000 0x96100000.96100000 ECC 0x0fc *Bad* Esynd=0x1a1 
[AFT2] E$Data (0x10) 0x96100000.96100000 0x96100000.96100000 ECC 0x0fc 
[AFT2] E$Data (0x20) 0x96100000.96100000 0x96100000.96100000 ECC 0x0fc 
[AFT2] E$Data (0x30) 0x96100000.96100000 0x96100000.96100000 ECC 0x0fc 
[AFT2] D$ data not available 
[AFT2] I$ data not available 
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU128 Privileged Instruction Access at TL=0, errID 0x00000021.893c7580 
        AFSR 0x00100004<PRIV,UE>.00000071 AFAR 0x00000001.e6b25200 
        Fault_PC 0x300053a3200 Esynd 0x0071  /N0/SB4/P0/B0 
[AFT1] errID 0x00000021.893c7580 Two Bits in error, likely from E$ WDU/CPU 
[AFT2] errID 0x00000021.893c7580 PA=0x00000001.e6b25200 
        E$tag 0x00000003.cd000001 E$state_0 Shared 
[AFT2] E$Data (0x00) 0x56080000.96100000 0x96100000.96100000 ECC 0x15d *Bad* Esynd=0x071 
[AFT2] E$Data (0x10) 0x96100000.96100000 0x96100000.96100000 ECC 0x0fc 
[AFT2] E$Data (0x20) 0x96100000.96100000 0x96100000.96100000 ECC 0x0fc 
[AFT2] E$Data (0x30) 0x96100000.96100000 0x96100000.96100000 ECC 0x0fc 
[AFT2] I$ data not available 
unix: [ID 836849 kern.notice] 
panic[cpu128]/thread=30003a6b7e0: 
[AFT1] errID 0x00000021.893c7580 UE Error(s) 
        See previous message(s) for details                               

SOLUTION SUMMARY:

Explanation:

The "CPU" bit reported in the AFSR (AFSR 0x00000080<CPU> in the first message above) represents an uncorrectable E$ ECC error detected during a copyout.

A CPU error (the name of the error type, not to be confused with the term for a processor) occurs when a snoop request from the victim cpu causes the source cpu to access a line in its own E$ that contains a multi-bit error. The source cpu sets the CPU bit and still writes the data to the system bus, but with a special syndrome of 0x071 to indicate that the data is bad.

In this example, note that CPU129, the source cpu, first suffered a CPU error and signaled to the victim cpu (CPU128) that the data was bad by writing it out to the bus with the special syndrome. Therefore, when CPU128 accessed the data it saw a UE and was able to recognize that this was the result of a previous error, hence the "likely from E$ WDU/CPU" message.
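The syndrome check described above can be sketched in a few lines of Python. This is a minimal illustration, not a supported tool: the function names and the regular expression are hypothetical and were written only against the two sample messages in this document, and the value 0x071 is taken from the explanation above.

```python
import re

# Special syndrome quoted in the explanation above (assumption: taken from
# this document only, not from a processor manual).
BAD_DATA_SYNDROME = 0x071

# Hypothetical pattern; matches both "Esynd 0x0071" and "Esynd=0x071".
ESYND_RE = re.compile(r"Esynd[ =](0x[0-9a-fA-F]+)")

# Two lines copied from the sample messages in this document.
CPU_EVENT_LINE = "Fault_PC 0x1164608 Esynd 0x01a1 /N0/SB4/P1/E0 J5400"
UE_EVENT_LINE = "Fault_PC 0x300053a3200 Esynd 0x0071  /N0/SB4/P0/B0"


def esynd_values(message):
    """Return every Esynd value found in the message text as an integer."""
    return [int(value, 16) for value in ESYND_RE.findall(message)]


def marked_bad_by_source(message):
    """True if the message carries the special 'data is bad' syndrome."""
    return BAD_DATA_SYNDROME in esynd_values(message)


if __name__ == "__main__":
    print(marked_bad_by_source(CPU_EVENT_LINE))
    print(marked_bad_by_source(UE_EVENT_LINE))
```

A UE message whose syndrome matches is a hint that the reporting cpu is the victim and that the source should be sought in an earlier message.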

Action:

The UE panic doesn't always implicate the correct source of the error. Sometimes the source and the victim are in fact the same, but this isn't always the case. Confirm that the panicking cpu really is the source by checking whether the UE message contains the text "likely from E$ WDU/CPU". If it does, the source of the error may not be the cpu identified in the UE panic; it may instead be a different proc implicated in an earlier WDU/CPU error. Correctly identifying the source of the failure is crucial, because it determines which FRU needs to be replaced.

In the above example, the panicking proc is cpu128. cpu128 had a UE error as the result of a CPU event which occurred in E$ 0 on cpu129. So the source of the error is cpu129's E$ 0, not the cpu named in the UE panic. Ultimately, the FRU for this example is the same, as cpu128 and cpu129 are on the same system board (and on Sun Fire 12K/15K platforms, the FRU for E$ DIMMs and cpus is actually the system board). However, this type of failure could occur across different system boards, so it is crucial to track down the source of the error rather than the victim.
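The triage described above can be sketched as a small Python script. It is an illustration under stated assumptions, not a supported diagnostic: all names and patterns are hypothetical and written only against the sample messages in this document, and pairing the UE with the CPU event by identical AFAR is an assumption drawn from the example, where both messages report AFAR 0x00000001.e6b25200.

```python
import re

# Hypothetical patterns, written against the sample messages only.
EVENT_RE = re.compile(r"WARNING: \[AFT1\] (?P<kind>.*?Event) on CPU(?P<cpu>\d+)")
AFAR_RE = re.compile(r"AFAR (0x[0-9a-f]+\.[0-9a-f]+)")
BOARD_RE = re.compile(r"/N\d+/(SB\d+)/P\d+")
HINT = "likely from E$ WDU/CPU"

# Abridged copy of the messages shown in the Description section.
SAMPLE_LOG = """\
WARNING: [AFT1] CPU Event on CPU129 at TL=0, errID 0x00000021.893c7a80
        AFSR 0x00000080<CPU>.000001a1 AFAR 0x00000001.e6b25200
        Fault_PC 0x1164608 Esynd 0x01a1 /N0/SB4/P1/E0 J5400
[AFT1] errID 0x00000021.893c7a80 More than four Bits were in error
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU128 Privileged Instruction Access at TL=0, errID 0x00000021.893c7580
        AFSR 0x00100004<PRIV,UE>.00000071 AFAR 0x00000001.e6b25200
        Fault_PC 0x300053a3200 Esynd 0x0071  /N0/SB4/P0/B0
[AFT1] errID 0x00000021.893c7580 Two Bits in error, likely from E$ WDU/CPU
"""


def parse_events(log_text):
    """Collect one record per AFT1 WARNING, filled from the lines that follow it."""
    events = []
    for line in log_text.splitlines():
        start = EVENT_RE.search(line)
        if start:
            events.append({"cpu": int(start.group("cpu")), "kind": start.group("kind"),
                           "afar": None, "board": None, "hint": False})
            continue
        if not events:
            continue
        current = events[-1]
        afar = AFAR_RE.search(line)
        if afar:
            current["afar"] = afar.group(1)
        board = BOARD_RE.search(line)
        if board and current["board"] is None:
            current["board"] = board.group(1)
        if HINT in line:
            current["hint"] = True
    return events


def likely_source(events):
    """Return (victim, source) for the first hinted UE with a matching CPU event, else None."""
    for victim in events:
        if not victim["hint"]:
            continue
        for other in events:
            # Assumption: the same AFAR ties the CPU event to the UE.
            if other["kind"] == "CPU Event" and other["afar"] == victim["afar"]:
                return victim, other
    return None


if __name__ == "__main__":
    pair = likely_source(parse_events(SAMPLE_LOG))
    if pair:
        victim, source = pair
        print("victim: cpu%d on %s" % (victim["cpu"], victim["board"]))
        print("likely source: cpu%d on %s" % (source["cpu"], source["board"]))
```

When the two boards differ, the board reported for the source, not the one for the panicking cpu, is the candidate FRU. A result of None only means the sketch found no pairing; the messages should still be reviewed by hand.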

Keywords: event, TL=0, uncorrectable, ecc

INTERNAL SUMMARY:

SUBMITTER: Joshua Freeman
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.