SRDB ID   Synopsis   Date
48026   Sun Fire[TM] server: CE errors are not always caused by bad memory   25 Oct 2002

Status Issued

Description

A Sun Fire[TM] 6800 panics on cpu 22 (which is on SB5) because of a kernel stack overflow (trap level 2). The server is also logging a large number of CE errors against seemingly random memory DIMMs on SB1. The panic stack is relatively small and all cpus are running the idle thread, including cpu 22, which is the one that panics. This is very likely a hardware problem, but which part needs to be replaced: the memory on SB1, cpu 22 (on SB5), or SB1 itself?

SOLUTION SUMMARY:

This panic matches bug ID 4462509, "Recursive CE errors cause kernel stack overflow on Cheetah". The bug is fixed in patch 108528-17, but you still need to replace the failing component, because the CE errors can eventually become UE errors and panic the machine.

The immediate reaction would be to replace either the memory on SB1 or SB1 itself, because every CE error in the msgbuf reports "Data Bit 69 was in error and corrected". One might assume this means that SB1 is reading the data wrong for every DIMM, since it is unlikely that every DIMM is bad. However, the correct answer is to replace cpu 22 (on SB5). From an OS standpoint, the ECC error is reported by the cpu that initiated the transfer: e.g., cpu 22 reads memory from SB1. The corruption itself, however, could have occurred anywhere along the path:

DIMM -> DCDS -> L1DX -> L2DX -> L1DX -> DCDS -> CPU

To tell which component is generating the CE errors, look at the showlogs output of the failed domain and check whether any DX reports an ECC error.

Here is an example from a system where a DX does report an ECC error:

4800-1:A> showlogs

Jan 11 22:27:57 4800-1 Domain-A.SC: [ID 542255 local0.error] /N0/SB0 reported
ECC error

Jan 11 22:27:58 4800-1 Domain-A.SC: [ID 941184 local0.error] Status: 0xc0810000

Syndrome from DX2: 0x00000000

Syndrome from DX3: 0x000b0000
      

Note that you do NOT see any DX type messages in the showlogs on the failing machine. What you do see is an error message pointing to a failing component on SB5:

Oct 14 17:43:26 vcore01-sc0 Platform.SC: Failed component found:
 /N0/SB5/P2      

The following is a note from engineering that explains what this means.


Usually, when a DX on a Serengeti system board detects that a
memory transfer passing through it has bad ECC, it will trigger
a message on the SC. So if the data was bad on SB1, we would
expect to see an SC message triggered by the DX's on SB1 and
SB5 as the data with bad ECC moved between SB1 -> SB5.

But we don't see any DX type messages, so we could infer that
the problem must be on the SB5 system board, somewhere between
the DX and the CPUs. (Unless of course there is some other
reason why we don't get the DX messages).

Also, even though the system will not panic if the patch is
installed, the hardware should still be replaced as the
likelihood of getting a fatal system error is significantly
increased.
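
The reasoning in the note above can be sketched as a small log check. This is only an illustration in Python, not a Sun tool: the function names `scan_showlogs` and `suggest` are hypothetical, the patterns are taken from the message wording in the showlogs excerpts in this document, and treating a non-zero syndrome as "this DX saw bad ECC" is an assumption.

```python
import re

# Patterns follow the message wording in the showlogs excerpts in this document.
DX_SYNDROME_RE = re.compile(r"Syndrome from (DX\d+): (0x[0-9a-fA-F]+)")
FAILED_RE = re.compile(r"Failed component found:\s*(\S+)")


def scan_showlogs(text):
    """Return (DXs with a non-zero syndrome, components the SC marked as failed)."""
    # Assumption: a non-zero syndrome means that DX saw bad ECC.
    dx_hits = [(dx, syn) for dx, syn in DX_SYNDROME_RE.findall(text)
               if int(syn, 16) != 0]
    failed = FAILED_RE.findall(text)
    return dx_hits, failed


def suggest(dx_hits, failed):
    """Turn the scan result into a hedged hint, mirroring the engineering note."""
    if dx_hits:
        names = ", ".join(dx for dx, _ in dx_hits)
        return "DX messages present (" + names + "): data was already bad when it passed a DX"
    if failed:
        return "no DX messages: suspect " + ", ".join(failed) + " (between the DX and the cpus)"
    return "no DX messages and no failed component reported: inconclusive"


# Excerpt with DX syndromes (the 4800-1 example).
example_log = """\
Jan 11 22:27:57 4800-1 Domain-A.SC: [ID 542255 local0.error] /N0/SB0 reported ECC error
Jan 11 22:27:58 4800-1 Domain-A.SC: [ID 941184 local0.error] Status: 0xc0810000
Syndrome from DX2: 0x00000000
Syndrome from DX3: 0x000b0000
"""

# Excerpt from the failing machine: no DX messages at all.
failing_log = """\
Oct 14 17:43:26 vcore01-sc0 Platform.SC: Failed component found:
 /N0/SB5/P2
"""

print(suggest(*scan_showlogs(example_log)))
print(suggest(*scan_showlogs(failing_log)))
```

As the note itself cautions, the absence of DX messages is only an inference; there may be some other reason why they were not logged, so the hint should be read as a starting point rather than a diagnosis.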
      

Below are the msgbuf from the panic, a list of CE error counts per DIMM, and finally the showlogs from the failing domain.

fm3(vmcore.0):0> msgbuf
[AFT0] errID 0x0000974c.f449c7c0 Corrected Memory Error on /N0/SB1/P0/B0/D0 J13300 is Persistent
[AFT0] errID 0x0000974c.f449c7c0 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f449c7c0 PA=0x00000000.33eb1a00
E$tag 0x00000000.cf000002 E$state_0 Exclusive
[AFT2] E$Data (0x00) 0x00000300.0f4997d8 0x00000310.15dc7fa0 ECC 0x0d4
[AFT2] E$Data (0x10) 0x00000310.15eb1a60 0x00000310.15eb19a0 ECC 0x0f0
[AFT2] E$Data (0x20) 0x00000310.15eb1a00 0x00000310.15eb1a00 ECC 0x15e
[AFT2] E$Data (0x30) 0x006001f7.55e30000 0x00000000.00010000 ECC 0x0a5
[AFT2] D$Tag 0x00033eb1 D$state Valid D$utag 0xac D$snp 0x00033eb0
[AFT2] PAtag 0x000.33eb1a00 PAsnp 0x000.33eb1a00 VAutag 0x2b1a00
[AFT2] D$Data (0x00) 0x00000300.0f4997d8 0x00000310.15dc7fa0
[AFT2] D$Data (0x10) 0x00000310.15eb1a60 0x00000310.15eb19a0
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f44bf770
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.33eb1ad0
Fault_PC 0x100968fc Esynd 0x013c /N0/SB1/P3/B0/D0 J16300
[AFT0] errID 0x0000974c.f44bf770 Corrected Memory Error on /N0/SB1/P3/B0/D0 J16300 is Persistent
[AFT0] errID 0x0000974c.f44bf770 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f44bf770 PA=0x00000000.33eb1ac0
E$tag 0x00000000.cf000514 E$state_3 Exclusive
[AFT2] E$Data (0x00) 0x00000300.0f4997d8 0x00000310.15dc8060 ECC 0x0e1
[AFT2] E$Data (0x10) 0x00000310.15eb1b20 0x00000310.15eb1a60 ECC 0x161
[AFT2] E$Data (0x20) 0x00000310.15eb1ac0 0x00000310.15eb1ac0 ECC 0x0b2
[AFT2] E$Data (0x30) 0x006001f7.55d70000 0x00000000.00010000 ECC 0x181
[AFT2] D$Tag 0x00033eb1 D$state Valid D$utag 0xac D$snp 0x00033eb0
[AFT2] PAtag 0x000.33eb1ac0 PAsnp 0x000.33eb1ac0 VAutag 0x2b1ac0
[AFT2] D$Data (0x00) 0x00000300.0f4997d8 0x00000310.15dc8060
[AFT2] D$Data (0x10) 0x00000310.15eb1b20 0x00000310.15eb1a60
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f44e1e60
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.33eb1b40
Fault_PC 0x1009690c Esynd 0x013c /N0/SB1/P1/B1/D0 J14301
[AFT0] errID 0x0000974c.f44e1e60 Corrected Memory Error on /N0/SB1/P1/B1/D0 J14301 is Persistent
[AFT0] errID 0x0000974c.f44e1e60 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f44e1e60 PA=0x00000000.33eb1b40
E$tag 0x00000000.cf012800 E$state_5 Exclusive
[AFT2] E$Data (0x00) 0x00000310.15eb1b20 0x00000310.15eb1b20 ECC 0x069
[AFT2] E$Data (0x10) 0x006001f7.55d10000 0x00000000.00010000 ECC 0x12b
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
[AFT2] E$Data (0x30) 0x010e3ffe.03000000 0x00000000.00000000 ECC 0x1a9
[AFT2] D$ data not available
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f45049b0
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.33eb1b90
Fault_PC 0x10096904 Esynd 0x013c /N0/SB1/P2/B1/D0 J15301
[AFT0] errID 0x0000974c.f45049b0 Corrected Memory Error on /N0/SB1/P2/B1/D0 J15301 is Persistent
[AFT0] errID 0x0000974c.f45049b0 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f45049b0 PA=0x00000000.33eb1b80
E$tag 0x00000000.cf4a0000 E$state_6 Exclusive
[AFT2] E$Data (0x00) 0x00000300.0f4997d8 0x00000310.15dc8180 ECC 0x14a
[AFT2] E$Data (0x10) 0x00000310.1bbe1be0 0x00000310.1c6dff40 ECC 0x04b
[AFT2] E$Data (0x20) 0x00000310.15eb1b80 0x00000310.15eb1b80 ECC 0x1c8
[AFT2] E$Data (0x30) 0x006001f7.55c50000 0x00000000.00010000 ECC 0x18d
[AFT2] D$Tag 0x00033eb1 D$state Valid D$utag 0xac D$snp 0x00033eb0
[AFT2] PAtag 0x000.33eb1b80 PAsnp 0x000.33eb1b80 VAutag 0x2b1b80
[AFT2] D$Data (0x00) 0x00000300.0f4997d8 0x00000310.15dc8180
[AFT2] D$Data (0x10) 0x00000310.1bbe1be0 0x00000310.1c6dff40
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f4590870
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.33ebfdc0
Fault_PC 0x1009690c Esynd 0x013c /N0/SB1/P3/B1/D0 J16301
[AFT0] errID 0x0000974c.f4590870 Corrected Memory Error on /N0/SB1/P3/B1/D0 J16301 is Persistent
[AFT0] errID 0x0000974c.f4590870 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f4590870 PA=0x00000000.33ebfdc0
E$tag 0x00000000.cf4a2522 E$state_7 Exclusive
[AFT2] E$Data (0x00) 0x00000310.15ebfda0 0x00000310.15ebfda0 ECC 0x0f3
[AFT2] E$Data (0x10) 0x006001f6.a8240000 0x00000000.00010000 ECC 0x089
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
[AFT2] E$Data (0x30) 0x010e425a.03000000 0x00000000.00000000 ECC 0x0e1
[AFT2] D$ data not available
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f4655c10
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.33ed5540
Fault_PC 0x1009690c Esynd 0x013c /N0/SB1/P1/B1/D0 J14301
[AFT0] errID 0x0000974c.f4655c10 Corrected Memory Error on /N0/SB1/P1/B1/D0 J14301 is Persistent
[AFT0] errID 0x0000974c.f4655c10 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f4655c10 PA=0x00000000.33ed5540
E$tag 0x00000000.cf012914 E$state_5 Exclusive
[AFT2] E$Data (0x00) 0x00000310.15ed5520 0x00000310.15ed5520 ECC 0x191
[AFT2] E$Data (0x10) 0x006001e8.9cdb0000 0x00000000.00010000 ECC 0x185
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
[AFT2] E$Data (0x30) 0x010e45ee.03000000 0x00000000.00000000 ECC 0x0c1
[AFT2] D$ data not available
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f4678b20
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.33ed5590
Fault_PC 0x10096904 Esynd 0x013c /N0/SB1/P2/B1/D0 J15301
[AFT0] errID 0x0000974c.f4678b20 Corrected Memory Error on /N0/SB1/P2/B1/D0 J15301 is Persistent
[AFT0] errID 0x0000974c.f4678b20 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f4678b20 PA=0x00000000.33ed5580
E$tag 0x00000000.cf4a0000 E$state_6 Exclusive
[AFT2] E$Data (0x00) 0x00000300.0ccaf190 0x00000000.00000000 ECC 0x105
[AFT2] E$Data (0x10) 0x00000310.15ed55e0 0x00000310.15ed5520 ECC 0x01e
[AFT2] E$Data (0x20) 0x00000310.15ed5580 0x00000310.15ed5580 ECC 0x030
[AFT2] E$Data (0x30) 0x006001e8.9cd50000 0x00000000.00010000 ECC 0x03b
[AFT2] D$ data not available
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f46be530
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.33eda040
Fault_PC 0x1009690c Esynd 0x013c /N0/SB1/P1/B0/D0 J14300
[AFT0] errID 0x0000974c.f46be530 Corrected Memory Error on /N0/SB1/P1/B0/D0 J14300 is Persistent
[AFT0] errID 0x0000974c.f46be530 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f46be530 PA=0x00000000.33eda040
E$tag 0x00000000.cf000012 E$state_1 Exclusive
[AFT2] E$Data (0x00) 0x00000310.15eda020 0x00000310.15eda020 ECC 0x13b
[AFT2] E$Data (0x10) 0x006001f5.facc0000 0x00000000.00010000 ECC 0x14c
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
[AFT2] E$Data (0x30) 0x010e46b6.03000000 0x00000000.00000000 ECC 0x19b
[AFT2] D$ data not available
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f46e13a0
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.33eda090
Fault_PC 0x10096904 Esynd 0x013c /N0/SB1/P2/B0/D0 J15300
[AFT0] errID 0x0000974c.f46e13a0 Corrected Memory Error on /N0/SB1/P2/B0/D0 J15300 is Persistent
[AFT0] errID 0x0000974c.f46e13a0 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f46e13a0 PA=0x00000000.33eda080
E$tag 0x00000000.cf0004a0 E$state_2 Exclusive
[AFT2] E$Data (0x00) 0x00000300.0f4999d8 0x00000310.16489080 ECC 0x1a2
[AFT2] E$Data (0x10) 0x00000310.15eda0e0 0x00000310.15eda020 ECC 0x0b4
[AFT2] E$Data (0x20) 0x00000310.15eda080 0x00000310.15eda080 ECC 0x09a
[AFT2] E$Data (0x30) 0x006001f5.fac60000 0x00000000.00010000 ECC 0x1d0
[AFT2] D$Tag 0x00033edb D$state Valid D$utag 0xb6 D$snp 0x00033eda
[AFT2] PAtag 0x000.33eda080 PAsnp 0x000.33eda080 VAutag 0x2da080
[AFT2] D$Data (0x00) 0x00000300.0f4999d8 0x00000310.16489080
[AFT2] D$Data (0x10) 0x00000310.15eda0e0 0x00000310.15eda020
[AFT2] I$ data not available
NOTICE: [AFT0] Corrected system bus (CE) Event on CPU23 at TL=0, errID 0x0000974c.f51493b0
AFSR 0x00000002<CE>.0000013c AFAR 0x00000000.34037840
Fault_PC 0x1009690c Esynd 0x013c /N0/SB1/P1/B0/D0 J14300
[AFT0] errID 0x0000974c.f51493b0 Corrected Memory Error on /N0/SB1/P1/B0/D0 J14300 is Persistent
[AFT0] errID 0x0000974c.f51493b0 Data Bit 69 was in error and corrected
[AFT2] errID 0x0000974c.f51493b0 PA=0x00000000.34037840
E$tag 0x00000000.d0000012 E$state_1 Exclusive
[AFT2] E$Data (0x00) 0x00000310.16037820 0x00000310.16037820 ECC 0x063
[AFT2] E$Data (0x10) 0x006001db.9d090000 0x00000000.00010000 ECC 0x067
[AFT2] E$Data (0x20) 0x00000000.00000000 0x00000310.097cb338 ECC 0x1bc
[AFT2] E$Data (0x30) 0x010e80f6.03010000 0x00000001.00000000 ECC 0x0f7
[AFT2] D$ data not available
[AFT2] I$ data not available
panic: ptl1 trap reason 0x2
TL=0x1 TT=0x68 TICK=0x1e954f96362
TPC=0x1014a978 TnPC=0x1014a97c TSTATE=0x4480001606
TL=0x2 TT=0x68 TICK=0x1e954f9635c
TPC=0x10007098 TnPC=0x1000709c TSTATE=0x9180001507

panic[cpu22]/thread=2a10045fd20: Kernel panic at trap level 2
000000001040c1f0 unix:sys_tl1_panic+8 (2a10045e2c0, 30005c1a0b0, 3b0, 43452000, 81010100, 2a10045e670)
%l0-3: 0000000000000006 0000000000001400 0000004480001606 000000001000723c
%l4-7: 000000000000ff00 0000000001010000 000000000000000f 000000001040c2a0
000000001040c340 genunix:errorq_dispatch+88 (30000954dd0, 1, 30005ce0a98, 0, 81010100, 2a10045e2c0)
%l0-3: 00000000000003b0 0000030000954dd0 0000000000000000 0000000200000000
%l4-7: 000002a100965d20 00000000104cee40 0000030008a5b2d0 000002a100eabaf0
000002a10045e0a0 SUNW,UltraSPARC-III+:cpu_queue_one_event+104 (0, 0, 0, 0, 0, 0)
%l0-3: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

simm error counts:
intr  pers  sticky  name
   5     8       3  /N0/SB1/P0/B0/D0 J13300
   0    10       0  /N0/SB1/P1/B1/D0 J14301
   3     7       2  /N0/SB1/P3/B1/D0 J16301
   6     6       4  /N0/SB1/P3/B0/D0 J16300
   0     5       2  /N0/SB1/P2/B0/D0 J15300
  21     8       6  /N0/SB1/P2/B1/D0 J15301
  10     8       3  /N0/SB1/P1/B0/D0 J14300
   0     5      10  /N0/SB1/P0/B1/D0 J13301
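
The pattern that makes this case misleading (the same data bit failing on many different DIMMs) can be checked mechanically. The following Python sketch is illustrative only: `summarize_ce` and `looks_like_common_path_fault` are hypothetical helper names, and the threshold of three DIMMs is an arbitrary assumption, not a Sun rule. It tallies the `[AFT0]` lines in the form they take in the msgbuf above.

```python
import re
from collections import Counter

# Patterns match the [AFT0] lines as they appear in the msgbuf above.
DIMM_RE = re.compile(r"Corrected Memory Error on (\S+) (J\d+)")
BIT_RE = re.compile(r"Data Bit (\d+) was in error and corrected")


def summarize_ce(msgbuf_text):
    """Count how often each DIMM path and each data bit is named in CE messages."""
    dimms = Counter(m.group(1) for m in DIMM_RE.finditer(msgbuf_text))
    bits = Counter(int(m.group(1)) for m in BIT_RE.finditer(msgbuf_text))
    return dimms, bits


def looks_like_common_path_fault(dimms, bits, min_dimms=3):
    """Heuristic (an assumption): one data bit failing across several DIMMs
    hints at a shared component on the data path rather than at the DIMMs."""
    return len(bits) == 1 and len(dimms) >= min_dimms


sample = """\
[AFT0] errID 0x0000974c.f449c7c0 Corrected Memory Error on /N0/SB1/P0/B0/D0 J13300 is Persistent
[AFT0] errID 0x0000974c.f449c7c0 Data Bit 69 was in error and corrected
[AFT0] errID 0x0000974c.f44bf770 Corrected Memory Error on /N0/SB1/P3/B0/D0 J16300 is Persistent
[AFT0] errID 0x0000974c.f44bf770 Data Bit 69 was in error and corrected
[AFT0] errID 0x0000974c.f44e1e60 Corrected Memory Error on /N0/SB1/P1/B1/D0 J14301 is Persistent
[AFT0] errID 0x0000974c.f44e1e60 Data Bit 69 was in error and corrected
"""

dimms, bits = summarize_ce(sample)
print(sorted(dimms))
print(dict(bits))
print(looks_like_common_path_fault(dimms, bits))
```

A positive result does not say which component on the path is at fault; it only suggests that replacing individual DIMMs is unlikely to help and that the showlogs should be examined next.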

4800-1:A> showlogs
Sep 20 02:48:46 vcore01-sc0 Platform.SC: 48 VDC 0 Temp. 0 value: 22 Degrees C
Sep 20 02:48:46 vcore01-sc0 Platform.SC: PS1, sensor status, under limit (7,2,0x604010b00030000)
Oct 06 18:38:30 vcore01-sc0 Platform.SC: /N0/SB5 reported first ECC error
Oct 06 18:38:30 vcore01-sc0 Platform.SC: Undiagnosed ECC error affecting
 /N0/SB1/P1
Oct 06 18:38:30 vcore01-sc0 Platform.SC: Failed component found:
 /N0/SB5/P2
Oct 06 18:38:36 vcore01-sc0 Platform.SC: /N0/SB1 reported first ECC error
Oct 06 18:38:36 vcore01-sc0 Platform.SC: Bad data read from a DIMM or cache controlled by /N0/SB1/P3
Oct 06 18:38:55 vcore01-sc0 Platform.SC: /N0/SB1 reported first ECC error
Oct 06 18:38:55 vcore01-sc0 Platform.SC: Bad data read from a DIMM or cache controlled by /N0/SB1/P3
Oct 06 19:38:39 vcore01-sc0 Platform.SC: /N0/SB1 reported first ECC error
Oct 06 19:38:39 vcore01-sc0 Platform.SC: Bad data read from a DIMM or cache controlled by /N0/SB1/P3
Oct 06 19:38:52 vcore01-sc0 Platform.SC: /N0/SB1 reported first ECC error

INTERNAL SUMMARY:

SUBMITTER: Mike Jaffee
BUG REPORT ID: 4462509
PATCH ID: 108528-17
APPLIES TO: AFO Vertical Team Docs/Kernel
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.