SRDB ID | Synopsis | Date |
48124 | Sun Fire[TM] 12K/15K: Rstop: Slot1 data parity bit 1 error | 30 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Rstop: Slot1 data parity bit 1 error

- Symptoms: 'wfail' output reports something similar to the following:

01 redxl> dumpf load dsmd.rstop.020915.1751.52
02 Created Sun Sep 15 17:51:53 2002
03 By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15 executing as pid=13269
04 On ssc name = starcat1-sc0.bestbuy.com
05 Domain = 5=F = ds01ux Platform = starcat1
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 0E000
08 Slot0[17:0]: 0E000
09 Slot1[17:0]: 0E000
10 -D option, -d
11 "DSMD RecordStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX13/S0: SDI is RStopped, requested by DARB.
15 SDI EX14/S0: SDI is RStopped, requested by DARB.
16 SDI EX14/S0 Recordstop0[31:0] = 04018001
17 Rstop0[16]: R 1E DARB texp request Recordstop (M)
18 Rstop0[26]: R Slot0 asserted EccErr, enabled to cause Rstop (M)
19 SDI EX15/S0 Master_Stop_Status0[31:0] = 80040008
20 MStop0[3]: SDI is Recordstopped
21 SDI EX15/S0 Recordstop0[31:0] = 00010001
22 Rstop0[16]: R DARB texp request Recordstop (M)
23 SDI EX15/S0 Recordstop1[31:0] = 00408040
24 Rstop1[22]: R 1E SDI Slave 1 requested all Recordstop
25 SDI EX15/S1 Master_Stop_Status0[31:0] = 00000008
26 MStop0[3]: SDI is Recordstopped
27 SDI EX15/S1 Recordstop0[31:0] = 00408040
28 Rstop0[22]: R 1E SDI internal Slot1 port request Recordstop
29 SDI EX15/S1 Slot1_Error1[31:0] = 2000A000 Mask = FFFF4FFF
30 S1Err1[29]: R 1E Slot1 data parity bit 1 error
31 slt1_datap[1:0], slt1_data[23:0] = 3 000020
32 FAIL Slot IO15: Dstop/Rstop detected by SDI EX15/S1
33 Primary service FRU is Slot IO15.
34 Secondary service FRU is EXB EX15.
35 DARB C0: enabled ports (expanders) [17:0]: 0EDFF
36 DARB C0: exps request Rstop [17:0]: 08000
37 DARB C0: other darb req Rstop for exps [17:0]: 08000
38 DARB C1: enabled ports (expanders) [17:0]: 0EDFF
39 DARB C1: exps request Rstop [17:0]: 08000
40 DARB C1: other darb req Rstop for exps [17:0]: 08000
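The register values that 'wfail' prints are hexadecimal bit masks, and the bracketed index on each decoded line (for example Rstop0[26] or S1Err1[29]) is the position of a bit that is set in the value above it. The following is a minimal Python sketch of that decoding, useful for checking a value by hand. The helper name set_bits is hypothetical and is not part of redx or hpost. It only lists bit positions; it does not know what each bit means, and 'wfail' itself annotates only some of the set bits.

```python
def set_bits(value, width=32):
    """Return the positions of the bits set in value, highest bit first."""
    return [bit for bit in range(width - 1, -1, -1) if (value >> bit) & 1]


# Values copied from the listing above (listing lines 16, 29, 35 and 36).
recordstop0_ex14 = 0x04018001   # SDI EX14/S0 Recordstop0[31:0]
slot1_error1 = 0x2000A000       # SDI EX15/S1 Slot1_Error1[31:0]
darb_enabled = 0x0EDFF          # DARB C0 enabled ports (expanders) [17:0]
darb_rstop_req = 0x08000        # DARB C0 exps request Rstop [17:0]

print(set_bits(recordstop0_ex14))        # includes 26 and 16, as wfail reports
print(set_bits(slot1_error1))            # includes 29, the parity bit 1 error
print(set_bits(darb_enabled, width=18))  # expander numbers enabled on the DARB
print(set_bits(darb_rstop_req, width=18))  # expander(s) requesting the Rstop
```

Applied to the 18-bit DARB fields, the same helper shows that the only expander requesting the Rstop is number 15, which matches the EX15 diagnosis in the 'wfail' output.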
SOLUTION SUMMARY:
- Troubleshooting:

The dump header tells us that this Rstop was generated by dsmd (lines 10,11) while a domain was active. This is also evident from the dump file name: dsmd.rstop files are created by dsmd as part of error capturing.

Walking the error chain:
- EX14 shows errors, but the first error is the DARB requesting the stop (lines 16-18).
- EX15/S0 (SDI0) reports a first error from slave SDI1 (line 24).
- EX15/S1 (SDI1) reports a first error of a parity bit 1 error from slot 1 (line 30).
- 'wfail' then recommends failing IO15 (line 32) to avoid the error.
- IO15 and EX15 are identified as the primary and secondary FRUs (lines 33,34).

When EX15/SDI1 detects the parity error, it implies that a bit error occurred in data in transit between the DXs on IO15 and SDI1 on EX15. The error therefore happened across an interconnect, and both IO15 and EX15 are suspect.

Data path parity for the SDIs is used only as an interconnect diagnostic tool. Because the underlying data is protected by ECC, and each SDI only has information on its own data slice (i.e., multi-bit errors are unknown to a single SDI), data is allowed to pass despite parity errors. However, the SDI records the parity error in case further diagnosis is needed. Since the data is allowed to pass, Solaris[TM] will also record an ECC event.
For example:

41 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 729643 kern.notice] NOTICE:
42 [AFT0] Corrected system bus (CE) Event on CPU448 at TL=0, errID 0x0000593b.a4ebc76b
43 Sep 15 17:51:52 ds01ux AFSR 0x00000002<CE>.000001b8 AFAR 0x00000400.fef171d0
44 Sep 15 17:51:52 ds01ux Fault_PC 0x103484dc Esynd 0x01b8 Not memory
45 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 317326 kern.notice] [AFT0] errID 0x0000593b.a4ebc76b
46 Data Bit 127 was in error and corrected
47 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 370641 kern.info] [AFT2] errID 0x0000593b.a4ebc76b
48 E$tag PA=0x000001e1.9af171c0 does not match AFAR=0x00000400.fef171c0
49 Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 684557 kern.info] [AFT2] errID 0x0000593b.a4ebc76b
50 PA=0x000001e1.9af171c0
51 Sep 15 17:51:52 ds01ux E$tag 0x00000786.6b000001 E$state_7 Invalid

In this example, Solaris corrects the single bit error. The Rstop file can pinpoint where in the interconnect the bit error occurred.

- Resolution:

First, the severity of the error must be judged. If the error was uncorrectable and resulted in a domain interruption, a component replacement is in order. If the error is correctable but repeats relatively often, a replacement may still be best to avoid a future interruption.

If a replacement is suitable, follow the suggestion of 'wfail'. Start with the IO board. During replacement, examine the interconnect for pin damage. If the IO board exchange does not correct the problem, the Expander is the secondary FRU.

- Summary of part number and patch IDs:
501-5179 Expander
http://infoserver.central.sun.com/data/sshandbook/Devices/I_O/IO_SunFire_15K_hsPCI_IO_Board.html

- References and bug IDs: SunSolve Article 48122

- Additional background information:

- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:

- Keywords: 15K, 12K, SF15K, SF12K, starcat, rstop, Slot1 data parity bit 1 error
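The Resolution step above depends on how often correctable errors repeat. One way to estimate that is to count the "Corrected system bus (CE) Event" messages per CPU in the domain's system log. The following is a minimal Python sketch under stated assumptions: the function name count_ce_events is hypothetical, each message is assumed to occupy a single line in the log file (the numbered example above shows it wrapped), and the sample lines are copied from that example.

```python
import re
from collections import Counter

# Matches the "[AFT0] Corrected system bus (CE) Event on CPU<n>" text
# seen in the example messages; the CPU number is captured.
CE_EVENT = re.compile(r"Corrected system bus \(CE\) Event on CPU(\d+)")


def count_ce_events(lines):
    """Return a Counter mapping CPU number -> count of CE event messages."""
    counts = Counter()
    for line in lines:
        match = CE_EVENT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Sample input: listing lines 41-42 joined into one line (an assumption
# about the on-disk format), plus one line that should not match.
sample = [
    "Sep 15 17:51:52 ds01ux SUNW,UltraSPARC-III+: [ID 729643 kern.notice] "
    "NOTICE: [AFT0] Corrected system bus (CE) Event on CPU448 at TL=0, "
    "errID 0x0000593b.a4ebc76b",
    "Sep 15 17:51:52 ds01ux AFSR 0x00000002<CE>.000001b8 "
    "AFAR 0x00000400.fef171d0",
]

for cpu, count in count_ce_events(sample).items():
    print(f"CPU{cpu}: {count} corrected event(s)")
```

In practice the list of lines would come from the domain's message log. What count qualifies as "relatively often" is a judgment call that this sketch does not make.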
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: