SRDB ID | Synopsis | Date
47420 | Sun Fire[TM] 12K/15K: Rstop: CP1 half data parity error | 22 Nov 2002
Status | Issued
Description
- Problem Statement/Title: SF15K Troubleshooting Article: Rstop: CP1 half data parity error
- Symptoms: 'wfail' output reports something similar to the following:

  01 redxl> dumpf load dsmd.rstop.020804.1527.51
  02 Created Sun Aug 4 15:27:51 2002
  03 By hpost v. 1.2 Generic 112488-05 May 8 2002 17:05:18 executing as pid=14770
  04 On ssc name = sms01-sc0.
  05 Domain = 1=B = cris01 Platform = ocesf15k1
  06 Boards in dump: master SC CPs/CSBs[1:0]: 3
  07 EXB[17:0]: 047E0
  08 Slot0[17:0]: 007E0
  09 Slot1[17:0]: 041E0
  10 -D option, -d
  11 "DSMD RecordStop Dump"
  12 0 errors occurred while creating this dump.
  13 redxl> wfail
  14 SDI EX05/S0: SDI is RStopped, requested by DARB.
  15 SDI EX06/S0: SDI is RStopped, requested by DARB.
  16 SDI EX07/S0: SDI is RStopped, requested by DARB.
  17 SDI EX08/S0: SDI is RStopped, requested by DARB.
  18 SDI EX09/S0 Master_Stop_Status0[31:0] = F0040008
  19 MStop0[3]: SDI is Recordstopped
  20 SDI EX09/S0 Recordstop0[31:0] = 00010001
  21 Rstop0[16]: R DARB texp request Recordstop (M)
  22 SDI EX09/S0 Recordstop1[31:0] = 00018001
  23 Rstop1[16]: R 1E SDI Slave 3 requested all Recordstop
  24 SDI EX09/S3 Master_Stop_Status0[31:0] = 00000008
  25 MStop0[3]: SDI is Recordstopped
  26 SDI EX09/S3 Recordstop0[31:0] = 00108010
  27 Rstop0[20]: R 1E SDI internal CP port request Recordstop
  28 SDI EX09/S3 CP_Error0[31:0] = 10009000 Mask = 7F3F67FF
  29 CPErr0[28]: R 1E CP1 half data parity error
  30 {cp1_datap,cp1_data[24:0]} = 126001C
  31 FAIL EXB EX9 with Data Bus C1: Dstop/Rstop detected by SDI EX9/S3.
  32 Primary service FRU is EXB EX9.
  33 Secondary service FRU is CSB C1 or the logic centerplane.
  34 SDI EX10/S0: SDI is RStopped, requested by DARB.
  35 SDI EX14/S0: SDI is RStopped, requested by DARB.
  36 DARB C0: enabled ports (expanders) [17:0]: 047E3
  37 DARB C0: exps request Rstop [17:0]: 00200
  38 DARB C0: other darb req Rstop for exps [17:0]: 00200
  39 DARB C1: enabled ports (expanders) [17:0]: 047E3
  40 DARB C1: exps request Rstop [17:0]: 00200
  41 DARB C1: other darb req Rstop for exps [17:0]: 00200
  42 DARB C1 Port 9 InterAsicStatus[31:0] = 00201015
  43 IAStat[12]: DMX D0 requests Recordstop for this exp
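The hexadecimal values in the output above are bit masks, and it can help to list which bit positions are set when cross-checking them against the decoded lines (for example, mask 00200 on lines 37-38 against expander 9, or CP_Error0 on line 28 against CPErr0[28] on line 29). The following is a minimal Python sketch for doing that by hand; the helper name set_bits is hypothetical and is not part of redx or SMS. Whether 'wfail' itself applies the displayed Mask with a bitwise AND is an assumption; the sketch only shows that doing so leaves bit 28 alone.

```python
def set_bits(value, width=32):
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]

# Values copied from the wfail output (lines 28, 36, 37).
print(set_bits(0x00200, 18))     # DARB "exps request Rstop [17:0]"   -> [9]
print(set_bits(0x047E3, 18))     # DARB "enabled ports (expanders)"   -> [0, 1, 5, 6, 7, 8, 9, 10, 14]
print(set_bits(0x10009000))      # SDI EX09/S3 CP_Error0[31:0]        -> [12, 15, 28]

# Assumption: the Mask shown on line 28 is ANDed with the register value.
print(set_bits(0x10009000 & 0x7F3F67FF))                              # -> [28]
```

Bits 5-10 and 14 of the enabled-ports mask line up with the expanders listed as RStopped (EX05 through EX10 and EX14); the meaning of bits 0 and 1 is not stated in this article.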
SOLUTION SUMMARY:
- Troubleshooting:

  The dump header shows that this Rstop dump was generated by dsmd (lines 10, 11) while a domain was active. The dumpf file name confirms this: dsmd.rstop files are created by dsmd to capture the error state.

  Walking the error chain:
  - The master SDI on EX9 is directed to Rstop by slave SDI 3 (line 23).
  - Slave SDI 3 on EX9 reports a parity error on data received from the high half (CP1) of the centerplane (lines 27-30).
  - EX9 using Data Bus C1 is FAILed from the configuration (line 31).
  - EX9 is named as the primary FRU, and CSB C1 or the logic centerplane as the secondary FRU (lines 32, 33).

  A parity error detected by EX9/SDI3 implies that a bit error occurred in data in transit between the DMXs on the high half of the centerplane and SDI3 on EX9. Since the error happened while crossing an interconnect, both EX9 and CP1's data bus are suspect.

  Data path parity on the SDIs is used only as an interconnect diagnostic tool. Because the underlying data is protected by ECC, and each SDI sees only its own data slice (so a single SDI cannot tell whether a multi-bit error occurred), data is allowed to pass despite parity errors. The SDI still records the parity error in case further diagnosis is needed.

- Resolution:

  Judge the frequency of the error. If it occurred only once, a component replacement is not warranted. If it repeats relatively often, a replacement may be best to avoid a future interruption. Also consider whether the implicated expander was recently installed or serviced.

  If a replacement is suitable, follow the suggestion of 'wfail': start with the expander board, and examine the interconnect for pin damage during the replacement. If exchanging the expander board does not correct the problem, the centerplane is the secondary FRU.
- Summary of part number and patch IDs

  http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html

- References and bug IDs

  SRDB48122

- Additional background information:

  Using the DMX history and SDI error capture information, it may be possible to determine which bit was in error. In this example, DMX 1.0 (DMX C1/D0) also flagged a Recordstop (line 43). Looking at its history register for expander 9:

  44 redxl> shdmx -s 1 0 h 0x00200
  45 Note: Data is displayed from the currently loaded dump file.
  46 DMX C1/D0 ECC-compressed output history[8:0] to SDIs.
  47 11 10 9 8 7 6
  48 OE P ECC OE P ECC OE P ECC OE P ECC OE P ECC OE P ECC entry
  49 0 1 52 0 old
  50 0 1 52 1
  51 0 1 52 2
  52 0 1 52 3
  53 0 1 52 4
  54 0 1 52 5
  55 0 1 52 6
  56 0 1 52 7
  57 0 1 52 8
  58 0 1 52 9
  59 0 1 52 10
  60 0 1 52 11
  61 0 1 52 12
  62 0 1 52 13
  63 0 1 52 14
  64 1 1 52 15
  65 1 0 1B 16
  66 1 0 50 17
  67 1 1 2A 18
  68 0 0 6A 19
  69 0 0 6A 20
  70 0 0 6A 21
  71 0 0 6A 22
  72 0 0 6A 23
  73 0 0 6A 24
  74 0 0 6A 25
  75 0 0 6A 26<
  76 0 0 6A 27 new

  Each row gives the OE, P and ECC values followed by the history entry number; only one expander column is populated. The ECC history entry indicated by "<" (line 75) can be compared to the SDI data capture (line 30) using 'parse dmxoh':

  77 redxl> parse dmxoh 126001C 6a
  78 SDI capture[24:0] = 126001C. Computed ecc = 52. DMX hist ecc = 6A.
  79 Could be a 1-bit error in bit 5 (as used to compute DMX oh ecc).

- Keywords

  15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K, Dstop, CP1 half data parity error
INTERNAL SUMMARY:
SUBMITTER: Joshua Freeman
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS: