SRDB ID   Synopsis   Date
47292   Sun Fire[TM] 12K/15K: WDU Event on CPU### at TL=0   25 Sep 2002

Status Issued

Description

The following error messages appear in the domain's messages file. The domain may ultimately panic because of the UE error that often accompanies the WDU event.

WARNING: [AFT1] WDU Event on CPU34 at TL=0, errID 0x00000050.ae7062b0 
    AFSR 0x00000020<WDU>.0000017a AFAR 0x00000000.eca68000 
    Fault_PC 0x1205e68 Esynd 0x017a /N0/SB1/P2/E0 J6400 
[AFT1] errID 0x00000050.ae7062b0 Two Bits were in error 
[AFT2] errID 0x00000050.ae7062b0 E$tag PA=0x00000000.00268000 does not match AFAR=0x00000000.eca68000 
[AFT2] errID 0x00000050.ae7062b0 PA=0x00000000.00268000 
    E$tag 0x00000000.00000000 E$state_0 Invalid 
[AFT2] E$Data (0x00) 0x0ecce80d.ff250000 0x0eccf00d.ff250000 ECC 0x0f5 *Bad* Esynd=0x17a 
[AFT2] E$Data (0x10) 0x0eccf00d.ff250000 0x0eccf00d.ff250000 ECC 0x0f5 
[AFT2] E$Data (0x20) 0x0eccf00d.ff250000 0x0eccf00d.ff250000 ECC 0x0f5 
[AFT2] E$Data (0x30) 0x0eccf00d.ff250000 0x0eccf00d.ff250000 ECC 0x0f5 
[AFT2] errID 0x00000050.ae7062b0 E$tag PA=0x00000000.00668000 does not match AFAR=0x00000000.eca68000 
[AFT2] errID 0x00000050.ae7062b0 PA=0x00000000.00668000 
    E$tag 0x00000000.01000000 E$state_0 Invalid 
[AFT2] E$Data (0x00) 0x00000000.00000000 0x00000300.0e5f8000 ECC 0x033 
[AFT2] E$Data (0x10) 0x00000300.0e5f8000 0x00000300.0e5f8000 ECC 0x0c3 
[AFT2] E$Data (0x20) 0x00000300.0e5f8000 0x00000000.00000000 ECC 0x0f0 
[AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000 
[AFT2] D$ data not available 
[AFT2] I$ data not available 
NOTICE: Scheduling clearing of error on page 0x00000000.eca68000 

WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU34 Privileged Data Access at TL=0, errID 0x00000050.ae7eba40 
    AFSR 0x00100004<PRIV,UE>.00000071 AFAR 0x00000000.eca68000 
    Fault_PC 0x784ea66c Esynd 0x0071  /N0/SB1/P2/B0 
[AFT1] errID 0x00000050.ae7eba40 Two Bits in error, likely from E$ WDU/CPU 
[AFT2] errID 0x00000050.ae7eba40 PA=0x00000000.eca68000 
    E$tag 0x00000003.b2000002 E$state_0 Exclusive 
[AFT2] E$Data (0x00) 0xcecce80d.ff250000 0x0eccf00d.ff250000 ECC 0x18f *Bad* Esynd=0x071 
[AFT2] E$Data (0x10) 0x0eccf00d.ff250000 0x0eccf00d.ff250000 ECC 0x0f5 
[AFT2] E$Data (0x20) 0x0eccf00d.ff250000 0x0eccf00d.ff250000 ECC 0x0f5 
[AFT2] E$Data (0x30) 0x0eccf00d.ff250000 0x0eccf00d.ff250000 ECC 0x0f5 
[AFT2] D$ data not available 

panic[cpu6]/thread=30010556960: [AFT1] errID 0x00000050.ae7eba40 UE Error(s) 
    See previous message(s) for details                         

***NOTE*** Use the 12K/15K CPU decoder to determine which CPU the decimal number in the messages above refers to, so that the correct CPU is implicated. Using the decoder, you will see that cpu34 corresponds to proc 2 on SB1.
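
The mapping in the note can be approximated in a few lines. This is a minimal sketch, not the official decoder. It assumes that system-board CPUs are numbered as 32 x board + processor, which agrees with the cpu34 example above (SB1, proc 2) but has not been checked against any other case. The function name decode_cpu_id is hypothetical.

```python
def decode_cpu_id(cpu_id):
    """Split a decimal CPU ID into (system board, processor).

    ASSUMPTION: cpu_id = 32 * board + proc, with proc in 0..3 for CPUs on a
    system board. This matches the cpu34 -> SB1, proc 2 example in the note
    above. Use the real 12K/15K CPU decoder for anything authoritative.
    """
    if cpu_id < 0:
        raise ValueError("CPU ID must be non-negative")
    board, proc = divmod(cpu_id, 32)
    if proc > 3:
        # Refuse rather than guess for IDs outside the assumed layout.
        raise ValueError(f"cpu{cpu_id} is outside the range this sketch covers")
    return board, proc


board, proc = decode_cpu_id(34)
print(f"cpu34 -> SB{board}, proc {proc}")  # prints "cpu34 -> SB1, proc 2"
```

The sketch raises an error instead of returning a guess for any ID that does not fit the assumed layout, so it should only be used as a quick cross-check of the decoder's answer.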

SOLUTION SUMMARY:

Explanation:

A WDU event is an uncorrectable E$ ECC error detected during a writeback (victimization). It occurs when the E$ line being written back contains a multi-bit error. The error is logged and the data is flagged as having an uncorrectable error, which may lead to a subsequent UE panic of the domain.

As part of the victimization process of a writeback operation, the target line is read from the E$ and its ECC is checked. If a multi-bit error is detected in the line, a WDU is signaled. The processor writes the data to the system bus with a special syndrome of 0x071 to indicate that the E$ data is corrupt.
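
The special syndrome can be checked mechanically from the AFSR field printed in the log. The following is a minimal sketch under one stated assumption: the syndrome is taken as the low 9 bits of the lower AFSR word, which agrees with both Esynd values printed in the messages above (0x017a and 0x0071). The helper names are hypothetical.

```python
import re

# Special syndrome that marks data deliberately flagged as bad (from the text above).
WDU_FLAGGED_SYNDROME = 0x071


def esynd_from_afsr(afsr_text):
    """Return the ECC syndrome from an AFSR field as printed in the log.

    ASSUMPTION: the syndrome is the low 9 bits of the lower 32-bit word.
    This matches the Esynd values shown next to each AFSR in the messages.
    """
    # Drop the decoded flag list, e.g. "<PRIV,UE>", then keep the lower word.
    cleaned = re.sub(r"<[^>]*>", "", afsr_text)
    low_word = int(cleaned.split(".")[1], 16)
    return low_word & 0x1FF


def is_flagged_bad(afsr_text):
    """True if the AFSR carries the special 'data is corrupt' syndrome."""
    return esynd_from_afsr(afsr_text) == WDU_FLAGGED_SYNDROME


print(hex(esynd_from_afsr("0x00000020<WDU>.0000017a")))  # prints 0x17a
print(is_flagged_bad("0x00100004<PRIV,UE>.00000071"))     # prints True
```

In the messages above, the WDU event itself carries syndrome 0x17a, while the later UE carries 0x071, which is what marks the data as deliberately flagged bad.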

In this particular case, the error handling code tried to remove the data from use by scheduling the page containing the error to be checked and cleaned. However, before that could occur, another access was made to the same memory location and triggered a UE. Note that the system recognized that the data had been deliberately flagged bad as the result of an earlier error, hence the "likely from E$ WDU/CPU" message.

Therefore, a panic can occur as a result of a WDU event. The panic, however, will often be a UE event and will normally report the victim of the WDU. Sometimes this means the panic will be on an entirely different proc than the source. You should identify the source of the WDU to identify the FRU. In the above example, the source and victim are the same. Thus, it is easy to identify the FRU in this case: CPU34 (ultimately SB1).
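
Finding the source of a WDU for a given UE can be sketched as a small log scan. This is an illustration only: it assumes that the UE and the WDU that caused it report the same AFAR, as they do in the messages above, and it looks only at the [AFT1] lines shown there. The function names and the abbreviated sample text are hypothetical.

```python
import re

EVENT_RE = re.compile(r"WARNING: \[AFT1\] (?P<kind>.+?) Event on CPU(?P<cpu>\d+)")
AFAR_RE = re.compile(r"AFAR (0x[0-9a-f]+\.[0-9a-f]+)")

SAMPLE = """\
WARNING: [AFT1] WDU Event on CPU34 at TL=0, errID 0x00000050.ae7062b0
    AFSR 0x00000020<WDU>.0000017a AFAR 0x00000000.eca68000
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU34 Privileged Data Access at TL=0, errID 0x00000050.ae7eba40
    AFSR 0x00100004<PRIV,UE>.00000071 AFAR 0x00000000.eca68000
[AFT1] errID 0x00000050.ae7eba40 Two Bits in error, likely from E$ WDU/CPU
"""


def parse_events(log_text):
    """Collect one record per [AFT1] WARNING, with its AFAR and WDU hint."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.search(line)
        if m:
            events.append({"kind": m.group("kind"), "cpu": int(m.group("cpu")),
                           "afar": None, "from_wdu": False})
            continue
        if not events:
            continue
        current = events[-1]
        a = AFAR_RE.search(line)
        if a and current["afar"] is None:
            current["afar"] = a.group(1)
        if "likely from E$ WDU/CPU" in line:
            current["from_wdu"] = True
    return events


def wdu_sources(events):
    """Return (victim_cpu, source_cpu) pairs for UEs that trace back to a WDU.

    ASSUMPTION: a UE and its originating WDU share the same AFAR.
    """
    pairs = []
    for i, ev in enumerate(events):
        if "UE" not in ev["kind"] or not ev["from_wdu"]:
            continue
        for earlier in reversed(events[:i]):
            if earlier["kind"] == "WDU" and earlier["afar"] == ev["afar"]:
                pairs.append((ev["cpu"], earlier["cpu"]))
                break
    return pairs


print(wdu_sources(parse_events(SAMPLE)))  # prints [(34, 34)]
```

For the sample, the victim and source are both CPU34, which matches the conclusion above. When the two numbers differ, the second one is the CPU to decode when identifying the FRU.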

Action:

The panic most likely will not implicate the correct source of the error. Do not assume that the CPU named in the panic is the source. Check whether the UE is the result of an earlier error by looking for the message "likely from E$ WDU/CPU". If that message is present, the failed component is the source of the WDU, which is not necessarily the CPU that reported the UE, so identify the source of the WDU to identify the FRU.

In the above example, the source and victim are the same. Thus, it is easy to identify the FRU in this case: CPU34, which is E$0, proc 2, on SB1.

Because the error is uncorrectable, the system board should be swapped on the first occurrence of this error. The Ecache SRAM is the implicated failed component, but the Ecache SRAM is not a FRU. The entire system board on a 12K/15K is the FRU for E$ DIMM failures.

INTERNAL SUMMARY:

SUBMITTER: Joshua Freeman
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.