SRDB ID | Synopsis | Date |
48204 | Sun Fire[TM] 12K/15K: Dstop: Data path command parity error detected by SDI(M) | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Dstop: Data path command parity error detected by SDI(M).

- Symptoms: redx wfail command output reports the following failure signature:

    redxl> dumpf load dsmd.dstop.020507.2037.16
    Created Tue May 7 20:37:18 2002
    By hpost v. 1.2 Generic 112488-04 Mar 18 2002 14:43:00 executing as pid=4959
    On ssc name = rasputin-sc0.SD_RASCAL.West.Sun.COM
    Domain = 0=A Platform = rasputin
    Boards in dump: master SC
        CPs/CSBs[1:0]: 3
        EXB[17:0]: 12100
        Slot0[17:0]: 12100
        Slot1[17:0]: 12100
    -D option, -d "DSMD DomainStop Dump"
    0 errors occurred while creating this dump.

    redxl> wfail
    SDI EX08/S0 Master_Stop_Status0[31:0] = B004000F
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
    SDI EX08/S0 Dstop0[31:0] = 00098008
        Dstop0[16]: D DARB texp requests all Dstop (M)
        Dstop0[19]: D 1E SDI internal core requested Dstop
    SDI EX08/S0 Core_Error0[31:0] = 00208020 Mask = 0051FFFF
        CoreErr0[21]: D 1E AXQ Data path command parity error (M)
            {dat_cmdp,dat_cmd[23:0]} = 0000001. {retired,half_used} = 3
            NOTE: Compare dat+par to AXQ out history to isolate 1-bit errors.
    FAIL EXB EX8: Dstop/Rstop detected by AXQ.
        Primary service FRU is EXB EX8.
    SDI EX13/S0: All SDI is DStopped and RStopped, requested by DARB.
    SDI EX16/S0: All SDI is DStopped and RStopped, requested by DARB.
    DARB C0: enabled ports (expanders) [17:0]: 16100
    DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00100
    DARB C1: enabled ports (expanders) [17:0]: 16100
    DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00100

    redxl> shsdi -e 8
    Note: Data is displayed from the currently loaded dump file.
    SDI EX08/S0 Component ID = 64317049
    Master_Stop_Status0[31:0] = B004000F
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
    Master_Stop_Status1[31:0] = E8E8000E
        0x08 CP1StopExp[4:0] MSS1[20:16]
        3 CP1StopSlot[0:1] MSS1[22:21] Dstop is 1st stop
        1 CP1StopInfoValid MSS1[23]
        0x08 CP0StopExp[4:0] MSS1[28:24]
        3 CP0StopSlot[0:1] MSS1[30:29] Dstop is 1st stop
        1 CP0StopInfoValid MSS1[31]
    Dstop0[31:0] = 00098008
        Dstop0[16]: D DARB texp requests all Dstop (M)
        Dstop0[19]: D 1E SDI internal core requested Dstop
    Dstop1[31:0] = 00000000
    Recordstop0[31:0] = 00018001
        Rstop0[16]: R 1E DARB texp request Recordstop (M)
    Recordstop1[31:0] = 00000000
    Core_Error0[31:0] = 00208020 Mask = 0051FFFF
        CoreErr0[21]: D 1E AXQ Data path command parity error (M)
            {dat_cmdp,dat_cmd[23:0]} = 0000001. {retired,half_used} = 3
            NOTE: Compare dat+par to AXQ out history to isolate 1-bit errors.
    Core_ErrData[4:2][31:0] = 00000000 00080700 00000060
    Core_ErrData[1:0][31:0] = 00000007 00001001
    Core_Error1[31:0] = 00000000 Mask = FFFFFFFF
    Sysreg_Error[31:0] = 00000000 Mask = 780377FF
    STB_Error[31:0] = 00000000 Mask = 7F00FFFF
    CP_Error0[31:0] = 00000000 Mask = 580067FF
    CP_Error1[31:0] = 00000000 Mask = 7FFCFFFF
    Slot0_Error0[31:0] = 00000000 Mask = 7000FFFF
    Slot0_Error1[31:0] = 00000000 Mask = 31444EBF
    Slot0_Error2[31:0] = 00000000 Mask = 7FFCFFFF
    Slot1_Error0[31:0] = 00000000 Mask = 3000FFFF
    Slot1_Error1[31:0] = 00000000 Mask = 31404EBF
    Slot1_Error2[31:0] = 00000000 Mask = 7FFCFFFF

    redxl> shaxq 8 h
    Note: Data is displayed from the currently loaded dump file.
    AXQ EX08 Ecc-compressed output history[6:0] to AMX, RMX, and SDI.
    <---- AMX ---->    RMX  SDI DpCmd  Sysreg
    1.1 1.0 0.1 0.0    0  1   OE Ecc   OE Ecc  entry
     15  15  15  15   05 05    1  68    0  49    0 old
     15  15  15  15   05 05    1  68    0  49    1
     15  15  15  15   05 05    1  68    0  49    2
     15  15  15  15   05 05    1  68    0  49    3
     15  15  15  15   05 05    1  68    0  49    4
     15  15  15  15   05 05    1  68    0  49    5
     15  15  15  15   05 05    1  68    0  49    6
     15  15  15  15   05 05    1  68    0  49    7
     15  15  15  15   05 05    1  68    0  49    8
     15  15  15  15   05 05    1  68    0  49    9
     15  15  15  15   05 05    1  68    0  49   10
     15  15  15  15<  05 05<   1  68    0  49   11
     15  15  15  15   05 05    1  68    0  49   12
     15  15  15  15   05 05    1  68    0  49   13
     15  15  15  15   05 05    1  68    0  49   14
     15  15  15  15   05 05    1  68    0  49   15
     15  15  15  15   05 05    1  68    0  49   16
     15  15  15  15   05 05    1  68    0  49   17
     15  15  15  15   05 05    1  68    0  49   18
     15  15  15  15   05 05    1  68    0  49   19
     15  15  15  15   05 05    1  68    0  49   20
     15  15  15  15   05 05    1  68    0  49   21
     15  15  15  15   05 05    1  68    0  49   22
     15  15  15  15   05 05    1  68    0  49   23
     15  15  15  15   05 05    1  68    0  49   24
     15  15  15  15   05 05    1  68    0  49   25
     15  15  15  15   05 05    1  68<   0  49   26
     15  15  15  15   05 05    1  68    0  49   27
     15  15  15  15   05 05    1  68    0  49   28
     15  15  15  15   05 05    1  68    0  49   29
     15  15  15  15   05 05    1  68    0  49   30
     15  15  15  15   05 05    1  68    0  49   31 new

    NOTE: If a parity error was detected by a receiving AMX, RMX, or SDI,
    the ecc history entry indicated by '<' in this display can be compared
    to the receiver's data capture to isolate 1-bit errors. Use the command
    "parse axqoh" to do this analysis. This assumes only a single error
    exists in the system; multiple errors can delay recordstop, causing the
    history of interest to be in an indeterminate older entry in the output
    history.

    redxl> parse axqoh d x0000001 x68
    SDI Dpath cmd capture[24:0] = 0000001. Computed ecc = 47. AXQ hist ecc = 68.
    Could be a 1-bit error in bit 0 (as used to compute AXQ oh ecc).
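The hexadecimal masks in the redx output above (for example EXB[17:0]: 12100 and Dstop0[31:0] = 00098008) can be expanded into bit positions by hand or with a few lines of script. The following is a minimal Python sketch, not part of redx; the helper name set_bits is made up for illustration. Reading bit n of an [17:0] mask as expander n is an assumption, although it agrees with the EX08, EX13 and EX16 boards named in the wfail output.

```python
def set_bits(value: int, width: int) -> list[int]:
    """Return the positions of the bits set in value, lowest bit first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


# Values copied from the dump above (all hexadecimal).
exb_in_dump = 0x12100      # Boards in dump: EXB[17:0]
darb_enabled = 0x16100     # DARB C0: enabled ports (expanders) [17:0]
darb_other_req = 0x00100   # DARB C0: other darb req Dstop+Rstop for exps[17:0]
dstop0 = 0x00098008        # SDI EX08/S0 Dstop0[31:0]
core_error0 = 0x00208020   # SDI EX08/S0 Core_Error0[31:0]

# Assumption: bit n of an [17:0] mask stands for expander n.
print("EXBs in dump:        ", set_bits(exb_in_dump, 18))     # [8, 13, 16]
print("DARB enabled ports:  ", set_bits(darb_enabled, 18))    # [8, 13, 14, 16]
print("Other DARB stop reqs:", set_bits(darb_other_req, 18))  # [8]

# redx names bits 16 and 19 of Dstop0 and bit 21 of Core_Error0.
# The meaning of the remaining set bits is not given in this document.
print("Dstop0 bits set:     ", set_bits(dstop0, 32))          # [3, 15, 16, 19]
print("Core_Error0 bits set:", set_bits(core_error0, 32))     # [5, 15, 21]
```

Under that reading, the single bit set in the "other darb req" mask points at expander 8, the same expander that wfail reports as the primary service FRU.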
SOLUTION SUMMARY:
- Troubleshooting: The dump header shows that this Dstop dump file was generated by dsmd while the domain was running. The dump file name shows the same thing: dsmd.dstop files are created by dsmd as part of an ASR.

  Note the two first-error (1E) flags, one in each of two error registers:

    Dstop0 - SDI internal core requested Dstop
    CoreErr0 - AXQ Data path command parity error (M)

  Note the line FAIL EXB EX8. This is what POST would choose to deconfigure, given the fault implied by this error, in order to recover the largest possible fault-free domain during the POST run.

  Note the recommendation for the FRU(s) to replace in order to remove the fault: Primary service FRU is EXB EX8.

  AXQ sends data commands and domain/record stop information to the SDI(M) over the 24-bit unidirectional data path command interface data_cmd_l. One command can be sent per cycle. Even parity is provided concurrently with the transfer and allows a quiescent state of all highs on the active-low bus.

  Using redx on the AXQ out history to isolate 1-bit errors (redxl> parse axqoh d x0000001 x68) shows a possible 1-bit error in bit 0.

- Resolution: This Data Path Command signal is on the Expander Board, running from AXQ to SDI(M). The service FRU is EXB 8.

- Summary of part number and patch IDs: http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html

- References and bug IDs: Specification for an ASIC - SDI.

- Additional background information:

- Meta-Data/Problem categorization: Product/Platform: SF12K/SF15K Category:

- Keywords: 15K, 12K, SF15K, SF12K, starcat, dstop, AXQ, SDI(M), axqoh, Data path Command parity error
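As a worked illustration of the even-parity check described under Troubleshooting, the captured value {dat_cmdp,dat_cmd[23:0]} = 0000001 can be tested with a short Python sketch. This is not how redx or the SDI performs the check. It assumes that dat_cmdp is bit 24 of the capture, as the concatenation order suggests, and that the capture shows logical values rather than pin levels. The ECC function behind "parse axqoh" is not given here, so the sketch does not try to reproduce the "Computed ecc" value.

```python
CAPTURE_WIDTH = 25  # dat_cmdp (assumed to be bit 24) plus dat_cmd[23:0]


def ones(word: int) -> int:
    """Count the 1 bits in word."""
    return bin(word).count("1")


def even_parity_ok(capture: int) -> bool:
    """True if command plus parity bit hold an even number of 1 bits."""
    mask = (1 << CAPTURE_WIDTH) - 1
    return ones(capture & mask) % 2 == 0


capture = 0x0000001               # value reported by wfail and shsdi
dat_cmd = capture & 0xFFFFFF      # 24-bit command field
dat_cmdp = (capture >> 24) & 1    # parity bit, under the stated assumption

print(f"dat_cmd = {dat_cmd:06X}, dat_cmdp = {dat_cmdp}")
print("parity ok as captured:  ", even_parity_ok(capture))      # False
print("parity ok, bit 0 undone:", even_parity_ok(capture ^ 1))  # True
```

Parity alone only shows that an odd number of bits is wrong; flipping any one of the 25 bits would restore even parity. That is why the capture is compared against the AXQ output history ecc to narrow the error to bit 0. With bit 0 cleared the capture is all zeros.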
INTERNAL SUMMARY:
SUBMITTER: Tong-Pheng Koh
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: