SRDB ID   Synopsis   Date
47290   Sun Fire[TM] 12K/15K: Uncorrectable system bus (UE) Event on CPU### User Data Access at TL=0   24 Sep 2002

Status Issued

Description

The following message appears in /var/adm/messages on a Sun Fire 12K/15K domain, or in console messages, or in the core file's message buffer. What does this mean?

WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU User Data Access at TL=0, errID 0x00000021.83c15da0 
        AFSR 0x00000004<UE>.00000164 AFAR 0x00000001.e6324000 
        Fault_PC 0x14984 Esynd 0x0164  /N0/SB4/P0/B0 
[AFT1] errID 0x00000021.83c15da0 Two Bits were in error 
[AFT2] errID 0x00000021.83c15da0 PA=0x00000001.e6324000 
        E$tag 0x00000003.cc000002 E$state_0 Exclusive 
[AFT2] E$Data (0x00) 0x0eccf00d.ff260030 0x0eccf00d.ff260000 ECC 0x024 *Bad* Esynd=0x164 
[AFT2] E$Data (0x10) 0x0eccf00d.ff260000 0x0eccf00d.ff260000 ECC 0x024 
[AFT2] E$Data (0x20) 0x0eccf00d.ff260000 0x0eccf00d.ff260000 ECC 0x024 
[AFT2] E$Data (0x30) 0x0eccf00d.ff260000 0x0eccf00d.ff260000 ECC 0x024 
[AFT2] D$Tag 0x001e6325 D$state Valid D$utag 0x98 D$snp 0x001e6324 
[AFT2] PAtag 0x001.e6324000 PAsnp 0x001.e6324000 VAutag 0x260000 
[AFT2] D$Data (0x00) 0x0eccf00d.ff260030 0x0eccf00d.ff260000 
[AFT2] D$Data (0x10) 0x0eccf00d.ff260000 0x0eccf00d.ff260000 
NOTICE: Scheduling clearing of error on page 0x00000001.e6324000 
[AFT3] errID 0x00000021.83c15da0 Above Error is in User Mode and is fatal: will reboot 
[AFT1] initiating reboot due to above error in pid 305 (mtst)                                           

***NOTE*** The above example is from a Sun Fire system, not a 12K/15K. The error string is the same on a 12K/15K, but the implicated DIMMs will be numbered differently. For the purposes of this document, these differences are irrelevant.

SOLUTION SUMMARY:

Explanation:

The [AFT1] message indicates that a multi-bit error was detected in memory that was being used by a userland process rather than by the kernel. This is why the first [AFT1] line states "User Data Access" rather than "Privileged Data Access". The second [AFT1] line states that two bits were in error. The final [AFT1] line names the process that was affected (in this case 'mtst').
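
As a rough illustration of how these fields can be read out of a log, the following Python sketch pulls the access mode, the bit count, and the affected process from the [AFT1] lines. The function name summarize_aft1 and the regular expressions are assumptions made for this example. They are based only on the sample message shown in this document, and other releases may format these lines differently.

```python
import re

# Patterns derived only from the sample message in this document (assumption).
EVENT_RE = re.compile(
    r"\[AFT1\] (?P<event>.+?) Event on CPU\S* "
    r"(?P<mode>User|Privileged) \S+ Access at TL=(?P<tl>\d+), "
    r"errID (?P<errid>0x[0-9a-fA-F]+\.[0-9a-fA-F]+)"
)
BITS_RE = re.compile(r"\[AFT1\] errID \S+ (?P<bits>.+?) were in error")
PROC_RE = re.compile(
    r"\[AFT1\] initiating reboot due to above error in pid "
    r"(?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def summarize_aft1(log_text):
    """Return a dict of the fields found in the [AFT1] lines (hypothetical helper)."""
    summary = {}
    event = EVENT_RE.search(log_text)
    if event:
        summary["event"] = event.group("event")
        summary["mode"] = event.group("mode")  # "User" or "Privileged"
        summary["trap_level"] = int(event.group("tl"))
        summary["errid"] = event.group("errid")
    bits = BITS_RE.search(log_text)
    if bits:
        summary["bits_in_error"] = bits.group("bits")
    proc = PROC_RE.search(log_text)
    if proc:
        summary["pid"] = int(proc.group("pid"))
        summary["process"] = proc.group("name")
    return summary


# The three [AFT1] lines from the example message above.
SAMPLE = (
    "WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU User Data "
    "Access at TL=0, errID 0x00000021.83c15da0\n"
    "[AFT1] errID 0x00000021.83c15da0 Two Bits were in error\n"
    "[AFT1] initiating reboot due to above error in pid 305 (mtst)\n"
)

if __name__ == "__main__":
    for key, value in summarize_aft1(SAMPLE).items():
        print(key, "=", value)
```

A "mode" value of "User" corresponds to the userland case described above. "Privileged" would indicate a kernel access.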

Action:

Confirm that this error is not the result of a WDU or CPU event. Such an event is indicated in the message by a "special syndrome of 0x071".
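
One way to make this check less error-prone is to extract the Esynd values from the message text and compare them with 0x071. The Python sketch below does this. The helper names esynd_values and has_special_syndrome are hypothetical, and the pattern assumes the two spellings seen in the example above ("Esynd 0x0164" and "Esynd=0x164").

```python
import re

# Syndrome value quoted in this document for the special case.
SPECIAL_SYNDROME = 0x071

# Matches both "Esynd 0x0164" and "Esynd=0x164" as seen in the sample message.
ESYND_RE = re.compile(r"Esynd[ =](0x[0-9a-fA-F]+)")


def esynd_values(log_text):
    """Return the set of distinct Esynd values found in the text, as integers."""
    return {int(value, 16) for value in ESYND_RE.findall(log_text)}


def has_special_syndrome(log_text):
    """True if any reported Esynd equals the special syndrome 0x071."""
    return SPECIAL_SYNDROME in esynd_values(log_text)


if __name__ == "__main__":
    sample = (
        "Fault_PC 0x14984 Esynd 0x0164  /N0/SB4/P0/B0\n"
        "[AFT2] E$Data (0x00) 0x0eccf00d.ff260030 0x0eccf00d.ff260000 "
        "ECC 0x024 *Bad* Esynd=0x164\n"
    )
    print(sorted(hex(v) for v in esynd_values(sample)))
    print(has_special_syndrome(sample))
```

In the example message the syndrome is 0x164, so the check reports that the special syndrome is not present.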

This is an uncorrectable error, and as a result the domain reboots itself, as stated in the last [AFT1] message. Examine the corresponding rstops or dstops to determine whether the fault lies in the CPU implicated by the message or in the memory specified.

This error could result in the replacement of memory or of the system board (SB). If the processor is confirmed bad, the FRU is the SB, not the processor.

Keywords: UE, uncorrectable, system, bus, event

INTERNAL SUMMARY:
SUBMITTER: Joshua Freeman
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.