SRDB ID | Synopsis | Date
47430 | Sun Fire[TM] 12K/15K: PCI SERR PCI error occurred on device #x | 29 Oct 2002 |
Status | Issued |
Description:
A Sun Fire[TM] 12K/15K domain panics with the following PCI SERR panic string:

WARNING: pcisch-19: PCI fault log start:
PCI SERR PCI error occurred on device #0 dwordmask=0 bytemask=0
pcisch-19: PCI primary error (0):
pcisch-19: PCI secondary error (0):
pcisch-19: PBM AFAR 0.00000000:
WARNING: pcisch-19: PCI config space CSR=0xc2a0<signaled-system-error,detected-parity-error>
pcisch-19: PCI fault log end.
panic[cpu128]/thread=2a10001fd20: pcisch-19: PCI bus 3 error(s)!

000002a10001bea0 pcisch:pbm_error_intr+148 (30000b643d8, 2772, 30000b84548, 3, 30000b643d8, 3)
  %l0-3: 00000300008b9860 0000000000004000 0000000000000000 0000030000b86584
  %l4-7: 00000300009978c8 0000030008d03ea8 0000000000000000 0000030008d03ed0
000002a10001bf50 unix:current_thread+44 (0, ffffffffffffffff, 0, 300335b3528, 0, 1044f340)
  %l0-3: 0000000010007450 000002a10001f061 000000000000000e 0000000000000016
  %l4-7: 0000000000010000 00000300339922a8 000000000000000b 000002a10001f910
000002a10001f9b0 unix:disp_getwork+40 (1044e398, 0, 1044f340, 10457310, 2, 0)
  %l0-3: 000000001010e2d8 0000000010509e00 00000300335bd518 000002a100c37d20
  %l4-7: 000002a100cebd20 0000000002736110 0000000000000000 000002a10001f9c0
000002a10001fa60 unix:idle+a4 (0, 0, 80, 1044e398, 3000096d980, 0)
  %l0-3: 0000000010043d58 2030205b275d2076 616c20696e646578 000002a10011dd20
  %l4-7: 70636220290a2020 202e22202073703a 20222031205b275d 2076616c20696e64
NOTE: The stack trace itself may differ from case to case. What matters is the CSR value (specifically the "detected-parity-error" bit).
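For reference, the two names printed inside the angle brackets line up with the standard PCI status-register error bits (this is only an illustrative check, assuming the CSR value maps onto the standard PCI status-register layout). Converting the value to binary, for example with bc:

# echo 'obase=2; ibase=16; C2A0' | bc
1100001010100000

The two leading bits are detected-parity-error (0x8000) and signaled-system-error (0x4000); it is the parity-error bit that marks this class of panic.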
SOLUTION SUMMARY: See FIN I0852-1 for details.
From the FIN:
The pcisch driver may panic on Sun Fire 15000 domains due to a parity error on the PCI Bus. In most cases this is due to a faulty hardware component. However, in some cases the panic cannot be corrected by replacing a hardware FRU. This second scenario may result in multiple unexpected domain failures if not corrected. This FIN describes how to diagnose and correct this type of pcisch driver panic.
In every other panic of this nature, a hardware replacement has resolved the case. However, with one customer, repeated hardware replacements did not resolve the issue. That customer's issue has since been replicated on multiple machines in an engineering environment. Several unique factors appear to be needed to create this scenario:
A. To date, this problem has only been seen on 375-3030 (Crystal+) cards.
B. All the panics have been in slot 2 of the I/O Boat (slot 2 is the lower left position).
C. Schizo 2.3 seems to bring the problem out with more regularity.
D. Veritas software (specifically adding mirrors to volumes; see the example command below) seems to increase the likelihood of failure.
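For reference, the Veritas operation referred to in item D is an ordinary mirror attach. A hypothetical example (the disk group and volume names are placeholders, not taken from any reported case):

# vxassist -g datadg mirror vol01

The sustained I/O generated while the new plex synchronizes is presumably what increases the exposure.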
Action:
From the FIN:
To diagnose the pcisch panic from the above stack, follow these steps:
a) Use the /etc/path_to_inst file on the domain or the cfgadm/rcfgadm commands to isolate the slot. For example, using the two methods with the panic above (pcisch-19):
# grep pcisch /etc/path_to_inst
"/pci@3d,600000" 7 "pcisch"
"/pci@1c,700000" 0 "pcisch"
"/pci@3c,700000" 4 "pcisch"
"/pci@9d,600000" 19 "pcisch"   <----------
"/pci@9c,600000" 17 "pcisch"
"/pci@3c,600000" 5 "pcisch"
"/pci@5d,600000" 11 "pcisch"
"/pci@7d,600000" 15 "pcisch"
"/pci@1c,600000" 1 "pcisch"
"/pci@1d,600000" 3 "pcisch"
"/pci@5c,700000" 8 "pcisch"
"/pci@7c,700000" 12 "pcisch"
"/pci@7c,600000" 13 "pcisch"
"/pci@9c,700000" 16 "pcisch"
"/pci@9d,700000" 18 "pcisch"
"/pci@3d,700000" 6 "pcisch"
"/pci@5c,600000" 9 "pcisch"
"/pci@1d,700000" 2 "pcisch"
"/pci@7d,700000" 14 "pcisch"
"/pci@5d,700000" 10 "pcisch"
"/pci@11c,700000" 20 "pcisch"
"/pci@11c,600000" 21 "pcisch"
"/pci@11d,700000" 22 "pcisch"
"/pci@11d,600000" 23 "pcisch"
In this case, instance 19 is "/pci@9d,600000". To translate that into a slot location, convert the 9d into binary <10011101>, then add a space to obtain <100 1110 1>. That address breaks down to expander 4 (100); skip the middle section (1110); the final bit gives pci 1 (the PCI slot on the left).
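If you want to double-check the hex-to-binary step, bc can do the conversion (purely a convenience; any method works):

# echo 'obase=2; ibase=16; 9D' | bc
10011101

Splitting that as <100 1110 1> gives expander 4, which matches the e04 attachment point shown in the rcfgadm output below.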
The other option is to use the conversion which the dynamic reconfiguration interface provides:
# rcfgadm -d a -la | grep pcisch
pcisch0:e00b1slot1    pci-pci/hp   connected   configured     ok
pcisch10:e02b1slot3   unknown      connected   unconfigured   unknown
pcisch11:e02b1slot2   pci-pci/hp   connected   configured     ok
pcisch12:e03b1slot1   pci-pci/hp   connected   configured     ok
pcisch13:e03b1slot0   pci-pci/hp   connected   configured     ok
pcisch14:e03b1slot3   unknown      connected   unconfigured   unknown
pcisch15:e03b1slot2   pci-pci/hp   connected   configured     ok
pcisch16:e04b1slot1   unknown      connected   unconfigured   unknown
pcisch17:e04b1slot0   pci-pci/hp   connected   configured     ok
pcisch18:e04b1slot3   unknown      connected   unconfigured   unknown
--> pcisch19:e04b1slot2   unknown      empty       unconfigured   unknown
pcisch1:e00b1slot0    unknown      empty       unconfigured   unknown
pcisch20:e08b1slot1   unknown      empty       unconfigured   unknown
pcisch21:e08b1slot0   pci-pci/hp   connected   configured     ok
pcisch22:e08b1slot3   unknown      empty       unconfigured   unknown
pcisch23:e08b1slot2   unknown      empty       unconfigured   unknown
pcisch2:e00b1slot3    unknown      connected   unconfigured   unknown
pcisch3:e00b1slot2    pci-pci/hp   connected   configured     ok
pcisch4:e01b1slot1    pci-pci/hp   connected   configured     ok
pcisch5:e01b1slot0    unknown      empty       unconfigured   unknown
pcisch6:e01b1slot3    unknown      connected   unconfigured   unknown
pcisch7:e01b1slot2    pci-pci/hp   connected   configured     ok
pcisch8:e02b1slot1    pci-pci/hp   connected   configured     ok
pcisch9:e02b1slot0    unknown      connected   unconfigured   unknown
In this case, the issue is on expander 4 (ex4), I/O board (b1), slot 2.
b) Once you identify the correct location, there are three FRUs which could be causing the parity error: the hsPCI board (p/n 501-6302-03 or lower, or 501-5397-11 or lower; also called the I/O board), the 3.3V cassette (p/n 501-5599-07), or the adapter itself.
To narrow down the problem, employ standard hardware troubleshooting techniques and move or replace one hardware FRU at a time (CPRE recommends moving/replacing the adapter first, then the cassette, and finally the hsPCI board). If the problem follows a FRU (on a move) or the domain no longer panics (on a replacement), CPAS the offending FRU.
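Where the slot supports hot-plug, the adapter or cassette can be taken offline and brought back with dynamic reconfiguration rather than a full domain outage. A minimal sketch, reusing the attachment point from the example above (substitute the real ap_id; a slot that has already panicked may show up as unconfigured, and depending on the slot state a disconnect/connect step may also be needed):

# cfgadm -c unconfigure pcisch19:e04b1slot2      (run on the domain)
  ...move or replace the adapter or cassette...
# cfgadm -c configure pcisch19:e04b1slot2

The equivalent operations should also be possible from the SC via rcfgadm -d <domain>, as used in the listing above.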
In the event that you are unable to follow this process, it may become necessary to replace all three FRUs at once. However, this is not recommended, as it could impact FRU availability and will increase service costs to Sun.
c) Once you identify a failing FRU and have taken appropriate action, track the machine's availability for an appropriate amount of time. How long depends on the panic interval observed while identifying the failing FRU; the recommendation is to run the machine for twice as long as that interval. In some cases that is 1 hour, while in others it is 24 days. If the problem persists or shows up on another pcisch instance, the machine could be experiencing the problem reported in Bug 4699182. Please escalate to CPRE.
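A quick way to gauge both the original panic interval and how long the domain has stayed up since the FRU action is the domain's reboot history, for example:

# last reboot | head

Compare the time since the last panic-driven reboot against twice the previous panic interval before considering the FRU action successful.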
d) Once CPRE verifies that the customer is experiencing this issue, choose one of the two workaround options listed in the Corrective Action section below.
The root cause of pcisch driver panics that are unrelated to faulty hardware is still under investigation. There is no final fix at this time. In the meantime, use the workarounds recommended in the Corrective Action section below.
CORRECTIVE ACTION:
The following recommendation is provided as a guideline for authorized Enterprise Services Field Representatives who may encounter the above mentioned problem.
Troubleshoot pcisch driver panics on F15K domains as outlined above. If the problem is determined NOT to be caused by faulty hardware, implement one of the two workarounds below.
A. Replace the 375-3030 (Crystal+) cards with 375-3019 (Amber) cards (for slot 2 only). This has been shown to alleviate the issue after extensive testing. Since 375-3030 cards in slot 0 have not shown the problem to date, they do not need replacement.
OR
B. Move all 375-3030 cards to slot 0. This assumes there are enough I/O boats.
INTERNAL SUMMARY: