SRDB ID | Synopsis | Date |
47414 | Sun Fire[TM] 12K/15K: dstop; Timeout on command reissue transaction to Slot0 | 22 Oct 2002 |
Status | Issued |
Description |
During heavy I/O loads, a Sun Fire[TM] 12K/15K domain dstops. The following redx session shows example wfail and shaxq output for this failure:
redxl> wfail
SDI EX00/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX01/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX01/S0 Recordstop0[31:0] = 00818001
    Rstop0[16]: R 1E DARB texp request Recordstop (M)
    Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX02/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX03/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX04/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX04/S0 Recordstop0[31:0] = 00818001
    Rstop0[16]: R 1E DARB texp request Recordstop (M)
    Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX04/S0 Core_Error0[31:0] = 02008200 Mask = 0051FFFF
    CoreErr0[25]: D 1E Command pool timeout, non-split exp (M)
    valid_{slot_wr[1:0],read}_TO = 1 (rev 4+)
    {cmd_pool_loc[5:0],cmd4io,retired,half_used} = 020
SDI EX04/S0 STB_Error[31:0] = 00018001 Mask = 7F00FFFF
    STBErr[16]: D 1E STB entry timeout
    {loc[4:0],stb_full[1:0],retired,half_used,reord} = 03C
SDI EX05/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX05/S0 Recordstop0[31:0] = 00818001
    Rstop0[16]: R 1E DARB texp request Recordstop (M)
    Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX06/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX06/S0 Recordstop0[31:0] = 00818001
    Rstop0[16]: R 1E DARB texp request Recordstop (M)
    Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX07/S0 Master_Stop_Status0[31:0] = 4004004F
    MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX07/S0 Recordstop0[31:0] = 00818080
    Rstop0[16]: R DARB texp request Recordstop (M)
    Rstop0[23]: R 1E AXQ requests all Recordstop (M)
AXQ EX07 ( 7) Error_Flag_00[31:0] = 00048004 Mask = 00047FFB
    Err0[18]: R 1E Timeout on command reissue transaction to Slot0
FAIL Slot SB7: Dstop/Rstop detected by AXQ.
    The FRU for this failure cannot be identified from the available information.
    This error is not diagnosable.
    The FAIL action is just a guess to satisfy the POST design requirement that
    something must be deconfigured after a stop to guarantee that the process
    terminates. The FAILed component is no more suspect than any other hardware
    in the domain.
SDI EX08/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX09/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX10/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX11/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX12/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX12/S0 Recordstop0[31:0] = 00818001
    Rstop0[16]: R 1E DARB texp request Recordstop (M)
    Rstop0[23]: R AXQ requests all Recordstop (M)
SDI EX13/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX14/S0: All SDI is DStopped and RStopped, requested by DARB.
DARB C0: enabled ports (expanders) [17:0]: 0FFFF
DARB C0: exps request Rstop [17:0]: 00080
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 08080
DARB C1: enabled ports (expanders) [17:0]: 0FFFF
DARB C1: exps request Rstop [17:0]: 00080
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 08080

redxl> shaxq 7
Note: Data is displayed from the currently loaded dump file.
AXQ EX7 (7) Component ID = C4312049 Rev 6.0 ExpID[4:0] = 07
Config0[31:0] = 1B380CF9 Config1[31:0] = 00249BC0
Timeout_Conf 1[19:0] = 7BDEF 0[31:0] = 1EF7BE0F
Sec_Config[22:0] = 000000 Csr0_status[4:0] = 0F
ID_Mask[31:0] = 00000000 Home_Mask[31:0] = 00000000
Flow_Ctl_Config[28:0] = 00CF0888 Config6[31:0] = 00000000
Config4[31:0] = 09C00000
Slot0_Domain_Mask[17:0]: Slot1 = 0FFFF Slot0 = 0FFFF Where Slot SB7
Slot0_DomInt_Mask[17:0]: Slot1 = 00000 Slot0 = 0FFFF can send.
Slot1_Domain_Mask[17:0]: Slot1 = 00080 Slot0 = 0FFFF Where Slot IO7
Slot1_DomInt_Mask[17:0]: Slot1 = 00000 Slot0 = 0FFFF can send.
Error_Flag_00[31:0] = 00048004 Mask = 00047FFB
    Err0[18]: R 1E Timeout on command reissue transaction to Slot0
    Port[1:0] = 2 ATransID[3:0] that timed out = 9
    reqagent_errsave0[2:0][31:0] = 0000 00000000 00520000
Error_Flag_01[31:0] = 00000000 Mask = 40047FFB
Error_Flag_02[31:0] = 00000000 Mask = 0000FFFF
Error_Flag_03[31:0] = 00000000 Mask = 21005EFF
Error_Flag_04[31:0] = 00000000 Mask = 01FEFFFF
Error_Flag_05[31:0] = 00000000 Mask = 1024FFFF
Error_Flag_06[31:0] = 00000000 Mask = 7E00FFFF
Error_Flag_07[31:0] = 00000000 Mask = 63FF7D24
Error_Flag_08[31:0] = 00000000 Mask = 0000FFFF
Error_Flag_09[31:0] = 00000000 Mask = 7E00FFFF
Error_Flag_10[31:0] = 00000000 Mask = 7C00FFFF
Error_Flag_11[31:0] = 00000000 Mask = 7FF0FFFF
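In a dump this long, the lines tagged 1E are the quickest way to find the registers of interest. The following Python sketch pulls those lines out of saved wfail text. It assumes the text was captured with one register or decoded bit per line, and that the 1E tag marks a first-error bit; the function name and regular expressions are illustrative only and are not part of redx.

```python
import re

# Excerpt of the wfail output, one record per line (line layout is assumed).
WFAIL_EXCERPT = """\
SDI EX04/S0 Core_Error0[31:0] = 02008200 Mask = 0051FFFF
    CoreErr0[25]: D 1E Command pool timeout, non-split exp (M)
SDI EX07/S0 Recordstop0[31:0] = 00818080
    Rstop0[16]: R DARB texp request Recordstop (M)
    Rstop0[23]: R 1E AXQ requests all Recordstop (M)
AXQ EX07 ( 7) Error_Flag_00[31:0] = 00048004 Mask = 00047FFB
    Err0[18]: R 1E Timeout on command reissue transaction to Slot0
"""

# Register header line, e.g. "SDI EX04/S0 Core_Error0[31:0] = 02008200 ..."
HEADER = re.compile(r"^(SDI|AXQ) (EX\d+)\S*.*?(\w+)\[31:0\] = ([0-9A-F]{8})")
# Indented decoded-bit line, e.g. "    Err0[18]: R 1E Timeout on ..."
BIT_LINE = re.compile(r"^\s+(\w+)\[(\d+)\]: ([RD]) (1E )?(.*)$")


def first_errors(text):
    """Return (chip, expander, bit_name, bit, message) for each line tagged 1E."""
    hits = []
    chip = expander = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # Remember which chip and expander the following bit lines belong to.
            chip, expander = header.group(1), header.group(2)
            continue
        bit_line = BIT_LINE.match(line)
        if bit_line and bit_line.group(4):
            hits.append((chip, expander, bit_line.group(1),
                         int(bit_line.group(2)), bit_line.group(5)))
    return hits


for hit in first_errors(WFAIL_EXCERPT):
    print(hit)
```

On the excerpt above, the last tuple printed is the AXQ EX07 Err0[18] entry that this document is about.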
SOLUTION SUMMARY:
Explanation:
The problem appears on the data path between the system board and the expander. The AXQ chip on the expander reports a command reissue timeout, as shown by the line "1E Timeout on command reissue transaction to Slot0". The message implicates Slot0 (the system board), but historically this is a "victim" error: the faulty component is most likely the source of the transaction, not the destination. In this case the data transaction goes from the EXB to the SB, and the EXB exceeds the data transaction timeout. The SDI (System Data Interface) detects the timeout and prompts the Master Stop on the domain, which results in the dstop condition.
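The bit indices that redx prints (Err0[18], CoreErr0[25], STBErr[16]) can be cross-checked against the raw register values from the dump. The short Python sketch below lists the set bits of each value and confirms that the flagged bit is among them. The helper name is hypothetical, and the meaning of the other set bits is not given in this document, so the sketch only reports their positions.

```python
def set_bits(value, width=32):
    """Return the positions of all set bits in value, lowest bit first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


# Register values and flagged bit indices copied from the wfail output.
REGISTERS = {
    "AXQ EX07 Error_Flag_00": (0x00048004, 18),  # Err0[18]
    "SDI EX04 Core_Error0": (0x02008200, 25),    # CoreErr0[25]
    "SDI EX04 STB_Error": (0x00018001, 16),      # STBErr[16]
}

for name, (value, flagged) in REGISTERS.items():
    bits = set_bits(value)
    status = "set" if flagged in bits else "NOT set"
    print(f"{name} = {value:08X}: bits {bits}, flagged bit {flagged} is {status}")
```

For Error_Flag_00 = 00048004 this reports bits 2, 15, and 18, which agrees with the Err0[18] line quoted above.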
Action:
In this specific case, and in most similar cases, the error is seen under heavy I/O load, such as a benchmark test or SunVTS testing. The AXQ chip at revision 6.0 and below is itself susceptible to this type of failure under heavy I/O loads. Lighter I/O loads can still generate the error, but less frequently than heavy loads.
Several escalations were opened against this issue. The current practice (where parts exist) is to replace every EXB on the platform with a newer EXB that has the AXQ 6.1 chip installed.
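To decide which expanders are candidates for replacement, the AXQ revision can be read from the shaxq header line. The Python sketch below is a minimal illustration that assumes the header has the form shown in the example output ("AXQ EX7 (7) Component ID = ... Rev 6.0 ..."); the function names and the threshold constant are hypothetical and simply encode the "6.0 and below" statement from this document.

```python
import re

# Per the text above: AXQ 6.0 and below is susceptible; 6.1 is the newer chip.
AFFECTED_MAX = (6, 0)


def axq_revision(shaxq_text):
    """Return (expander_number, (major, minor)) from shaxq output, or None."""
    match = re.search(r"AXQ EX(\d+).*?Rev (\d+)\.(\d+)", shaxq_text, re.S)
    if match is None:
        return None
    return int(match.group(1)), (int(match.group(2)), int(match.group(3)))


def is_susceptible(revision):
    """True if the (major, minor) revision is at or below the affected level."""
    return revision <= AFFECTED_MAX


sample = "AXQ EX7 (7) Component ID = C4312049 Rev 6.0 ExpID[4:0] = 07"
expander, revision = axq_revision(sample)
print(expander, revision, is_susceptible(revision))
```

Running shaxq for each expander and feeding the output through such a check would give a list of EXBs that still carry the older AXQ.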
See Bug ID
FCO
Keywords: timeout, dstop, command, reissue, transaction, slot0
INTERNAL SUMMARY:
SUBMITTER: Joshua Freeman
BUG REPORT ID: 4505200, 4508788
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS: