SRDB ID   Synopsis   Date
47414   Sun Fire[TM] 12K/15K: dstop; Timeout on command reissue transaction to Slot0   22 Oct 2002

Status Issued

Description

During heavy I/O loads, a Sun Fire[TM] 12K/15K domain dstops. Here is an example of the wfail output:

redxl> wfail 
SDI EX00/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX01/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX01/S0  Recordstop0[31:0]  = 00818001 
        Rstop0[16]: R 1E DARB texp request Recordstop (M) 
        Rstop0[23]: R    AXQ requests all Recordstop (M) 
SDI EX02/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX03/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX04/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX04/S0  Recordstop0[31:0]  = 00818001 
        Rstop0[16]: R 1E DARB texp request Recordstop (M) 
        Rstop0[23]: R    AXQ requests all Recordstop (M) 
SDI EX04/S0  Core_Error0[31:0]  = 02008200  Mask = 0051FFFF 
        CoreErr0[25]: D 1E Command pool timeout, non-split exp (M) 
            valid_{slot_wr[1:0],read}_TO = 1 (rev 4+) 
            {cmd_pool_loc[5:0],cmd4io,retired,half_used} = 020 
SDI EX04/S0  STB_Error[31:0]    = 00018001  Mask = 7F00FFFF 
        STBErr[16]: D 1E STB entry timeout 
            {loc[4:0],stb_full[1:0],retired,half_used,reord} = 03C 
SDI EX05/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX05/S0  Recordstop0[31:0]  = 00818001 
        Rstop0[16]: R 1E DARB texp request Recordstop (M) 
        Rstop0[23]: R    AXQ requests all Recordstop (M) 
SDI EX06/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX06/S0  Recordstop0[31:0]  = 00818001 
        Rstop0[16]: R 1E DARB texp request Recordstop (M) 
        Rstop0[23]: R    AXQ requests all Recordstop (M) 
SDI EX07/S0  Master_Stop_Status0[31:0] = 4004004F
        MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX07/S0  Recordstop0[31:0]  = 00818080              
        Rstop0[16]: R    DARB texp request Recordstop (M) 
        Rstop0[23]: R 1E AXQ requests all Recordstop (M) 
AXQ EX07 ( 7) Error_Flag_00[31:0] = 00048004  Mask = 00047FFB
        Err0[18]: R 1E Timeout on command reissue transaction to Slot0
FAIL Slot SB7:  Dstop/Rstop detected by AXQ.
The FRU for this failure cannot be identified from the available information.
        This error is not diagnosable. The FAIL action is just a guess to
        satisfy the POST design requirement that something must be 
        deconfigured after a stop to guarantee that the process terminates. 
        The FAILed component is no more suspect than any other hardware 
        in the domain. 
SDI EX08/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX09/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX10/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX11/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX12/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX12/S0  Recordstop0[31:0]  = 00818001 
        Rstop0[16]: R 1E DARB texp request Recordstop (M) 
        Rstop0[23]: R    AXQ requests all Recordstop (M) 
SDI EX13/S0: All SDI is DStopped and RStopped,         requested by DARB. 
SDI EX14/S0: All SDI is DStopped and RStopped,         requested by DARB. 
DARB C0: enabled ports (expanders)          [17:0]: 0FFFF 
DARB C0: exps request Rstop                 [17:0]: 00080 
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 08080 
DARB C1: enabled ports (expanders)          [17:0]: 0FFFF 
DARB C1: exps request Rstop                 [17:0]: 00080 
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 08080 
redxl> shaxq 7 
Note: Data is displayed from the currently loaded dump file. 
AXQ  EX7 (7)   Component ID = C4312049   Rev 6.0 
        ExpID[4:0] = 07 
        Config0[31:0] = 1B380CF9 
        Config1[31:0] = 00249BC0 
        Timeout_Conf 1[19:0] = 7BDEF  0[31:0] = 1EF7BE0F 
        Sec_Config[22:0] = 000000 
        Csr0_status[4:0] = 0F 
        ID_Mask[31:0] = 00000000  Home_Mask[31:0] = 00000000 
        Flow_Ctl_Config[28:0] = 00CF0888 
        Config6[31:0] = 00000000 
        Config4[31:0] = 09C00000 
        Slot0_Domain_Mask[17:0]: Slot1 = 0FFFF  Slot0 = 0FFFF  Where Slot SB7 
        Slot0_DomInt_Mask[17:0]: Slot1 = 00000  Slot0 = 0FFFF     can send. 
        Slot1_Domain_Mask[17:0]: Slot1 = 00080  Slot0 = 0FFFF  Where Slot IO7 
        Slot1_DomInt_Mask[17:0]: Slot1 = 00000  Slot0 = 0FFFF     can send. 
              Error_Flag_00[31:0] = 00048004  Mask = 00047FFB 
        Err0[18]: R 1E Timeout on command reissue transaction to Slot0
             Port[1:0] = 2  ATransID[3:0] that timed out = 9
            reqagent_errsave0[2:0][31:0] = 0000 00000000 00520000 
              Error_Flag_01[31:0] = 00000000  Mask = 40047FFB 
              Error_Flag_02[31:0] = 00000000  Mask = 0000FFFF 
              Error_Flag_03[31:0] = 00000000  Mask = 21005EFF 
              Error_Flag_04[31:0] = 00000000  Mask = 01FEFFFF 
              Error_Flag_05[31:0] = 00000000  Mask = 1024FFFF 
              Error_Flag_06[31:0] = 00000000  Mask = 7E00FFFF 
              Error_Flag_07[31:0] = 00000000  Mask = 63FF7D24 
              Error_Flag_08[31:0] = 00000000  Mask = 0000FFFF 
              Error_Flag_09[31:0] = 00000000  Mask = 7E00FFFF 
              Error_Flag_10[31:0] = 00000000  Mask = 7C00FFFF 
              Error_Flag_11[31:0] = 00000000  Mask = 7FF0FFFF                        
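
When a dump is this long, it can help to pull out only the decoded lines that carry the 1E marker. The following is a minimal Python sketch of that filter. It assumes that 1E marks the first error latched in a register, which is an interpretation this document does not state, and that decoded lines follow the "Name[bit]: R|D [1E] text" layout seen above. The function name first_errors is hypothetical and is not part of redx.

```python
import re

# Layout assumed from the wfail output above: "  Name[bit]: R|D [1E] description"
DECODE_LINE = re.compile(r"^\s+(\w+)\[(\d+)\]:\s+([RD])\s+(1E\s+)?(.+?)\s*$")


def first_errors(lines):
    """Return (register, bit, description) for every decoded line marked 1E."""
    found = []
    for line in lines:
        match = DECODE_LINE.match(line)
        if match and match.group(4):  # group 4 is present only when "1E" is
            found.append((match.group(1), int(match.group(2)), match.group(5)))
    return found


sample = [
    "        Rstop0[16]: R    DARB texp request Recordstop (M) ",
    "        Rstop0[23]: R 1E AXQ requests all Recordstop (M) ",
    "        Err0[18]: R 1E Timeout on command reissue transaction to Slot0",
]
for register, bit, text in first_errors(sample):
    print(register, bit, text)
```

Run against the EX07 lines above, this keeps the Rstop0[23] and Err0[18] entries and drops the Rstop0[16] entry, which has no 1E marker.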

SOLUTION SUMMARY:

Explanation:

The problem appears on the data path between the system board and the expander. The AXQ chip on the expander reports a command reissue timeout, as shown by the line "1E Timeout on command reissue transaction to Slot0". The message implicates Slot0 (the system board), but this error is historically a "victim" error: the fault most likely lies with the source of the transaction, not the destination. In this case the data transaction goes from the EXB (expander board) to the SB (system board). The EXB exceeds the data transaction timeout, the SDI (System Data Interface) detects this, and the resulting Master Stop on the domain produces the dstop condition.
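
The raw register values in the output can be cross-checked against the decoded lines beneath them. The following minimal Python sketch lists which bit positions are set in a hexadecimal register value. The helper name set_bits is hypothetical, and the sketch makes no claim about what each bit means beyond what redx already prints.

```python
from typing import List


def set_bits(value: int, width: int = 32) -> List[int]:
    """Return the positions of the bits set in value, lowest bit first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


# Error_Flag_00 for AXQ EX07, as printed by wfail and shaxq
error_flag_00 = int("00048004", 16)
print(set_bits(error_flag_00))  # prints [2, 15, 18]
```

Bit 18 is the one redx decodes as "Err0[18]: R 1E Timeout on command reissue transaction to Slot0". The same check on Recordstop0 = 00818001 gives bits 0, 15, 16, and 23, which lines up with the Rstop0[16] and Rstop0[23] lines in the output.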

Action:

In this specific case, and in most similar cases, the error is seen under heavy I/O load, such as a benchmark or SunVTS testing. The AXQ chip at revision 6.0 and below is susceptible to this type of failure under heavy I/O loads. Lighter I/O loads can still generate the error, but less frequently than heavy loads do.

Several escalations were opened against this issue. The current practice (where parts exist) is to replace every EXB on the platform, not just the one flagged, with the newer EXBs that have the AXQ 6.1 chip installed.
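
To see which expanders still carry an affected AXQ, the revision can be read from the first line of each shaxq display. The following is a minimal Python sketch, assuming the header keeps the "AXQ  EXn (n)   Component ID = ...   Rev x.y" layout shown in the Description. The names axq_revision and is_susceptible are hypothetical, and the 6.0 threshold comes from the Action text above.

```python
import re

# Header layout assumed from the shaxq output in the Description
AXQ_LINE = re.compile(
    r"^AXQ\s+EX(\d+)\s+\(\s*\d+\)\s+Component ID = (\w+)\s+Rev (\d+)\.(\d+)"
)


def axq_revision(line):
    """Return (expander, (major, minor)) from a shaxq header line, or None."""
    match = AXQ_LINE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), (int(match.group(3)), int(match.group(4)))


def is_susceptible(revision):
    """True for AXQ revision 6.0 and below, per the Action text."""
    return revision <= (6, 0)


header = "AXQ  EX7 (7)   Component ID = C4312049   Rev 6.0 "
expander, revision = axq_revision(header)
print(expander, revision, is_susceptible(revision))
```

Comparing revisions as (major, minor) tuples avoids floating-point comparison of strings such as "6.0" and "6.1".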

See Bug IDs 4505200 and 4508788, as well as escalations 536089, 536128, and 536433, for more details.

FCO A0192-1 also addresses this situation.

Keywords: timeout, dstop, command, reissue, transaction, slot0

INTERNAL SUMMARY:

SUBMITTER: Joshua Freeman
BUG REPORT ID: 4505200, 4508788
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.