SRDB ID | Synopsis | Date |
48207 | Sun Fire[TM] 12K/15K: Dstop: AMX 0-3 flow control didn't arrive simultaneously | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement/Title: Dstop: AMX 0-3 hio flow control didn't arrive simultaneously
- Symptoms: 'wfail' output reports something similar to the following:

01 redxl> dumpf load dsmd.dstop.020204.1016.58
02 Created Mon Feb 4 10:43:33 2002
03 By hpost v. 1.1 Generic 112099-05 Nov 27 2001 12:41:09 executing as pid=2464
04 On ssc name = mayflower-sc0.
05 Domain = 1=B = pilgrim2 Platform = mayflower
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 001E4
08 Slot0[17:0]: 001E4
09 Slot1[17:0]: 00100
10 -D option, -d
11 "DSMD DomainStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX02/S0 Master_Stop_Status0[31:0] = A000004F
15 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
16 SDI EX02/S0 Dstop0[31:0] = 12028200
17 Dstop0[17]: D DARB texp requests Slot0 Dstop (M)
18 Dstop0[25]: D 1E AXQ requests all Dstop (M)
19 Dstop0[28]: D Slot0 asserted Error, enabled to cause Dstop (M)
20 AXQ EX02 ( 2) Error_Flag_03[31:0] = 10009000 Mask = 21005EFF
21 Err3[28]: D 1E AMX data ECC uncorrectable error
22 FAIL EXB EX2: Dstop/Rstop detected by AXQ.
23 Primary service FRU is EXB EX2.
24 SDI EX05/S0 Master_Stop_Status0[31:0] = 8004004F
25 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
26 SDI EX05/S0 Dstop0[31:0] = 12028200
27 Dstop0[17]: D DARB texp requests Slot0 Dstop (M)
28 Dstop0[25]: D 1E AXQ requests all Dstop (M)
29 Dstop0[28]: D Slot0 asserted Error, enabled to cause Dstop (M)
30 AXQ EX05 ( 5) Error_Flag_02[31:0] = 04008400 Mask = 0000FFFF
31 Err2[26]: D 1E AMX 0-3 hs flow control didn't arrive simultaneously
32 FAIL EXB EX5: Dstop/Rstop detected by AXQ.
33 Primary service FRU is EXB EX5.
34 SDI EX06/S0 Master_Stop_Status0[31:0] = A004000F
35 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
36 SDI EX06/S0 Dstop0[31:0] = 02028200
37 Dstop0[17]: D DARB texp requests Slot0 Dstop (M)
38 Dstop0[25]: D 1E AXQ requests all Dstop (M)
39 AXQ EX06 ( 6) Error_Flag_03[31:0] = 10009000 Mask = 21005EFF
40 Err3[28]: D 1E AMX data ECC uncorrectable error
41 FAIL EXB EX6: Dstop/Rstop detected by AXQ.
42 Primary service FRU is EXB EX6.
43 SDI EX07/S0: Slot 0 port is DStopped, SDI is RStopped, requested by DARB.
44 SDI EX08/S0: All SDI is DStopped and RStopped, requested by DARB.
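The bracketed indices that wfail prints (for example Dstop0[17] or Err2[26]) appear to be bit positions within the 32-bit register value shown on the preceding line. The following Python sketch is not part of redx; it only illustrates that reading by listing the bits set in a value. The function name set_bits is hypothetical, and the sketch makes no attempt to interpret the Mask values or the meaning of bits that wfail does not annotate.

```python
def set_bits(value, width=32):
    """Return the positions of the bits set in a register value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


# Register values copied from the wfail output above.
dstop0_ex02 = 0x12028200   # SDI EX02/S0 Dstop0
dstop0_ex06 = 0x02028200   # SDI EX06/S0 Dstop0
err2_ex05 = 0x04008400     # AXQ EX05 Error_Flag_02

for name, value in [("Dstop0 EX02", dstop0_ex02),
                    ("Dstop0 EX06", dstop0_ex06),
                    ("Err2 EX05", err2_ex05)]:
    print(name, set_bits(value))
```

Bits 17, 25 and 28 are among those set in 12028200, and bit 26 is set in 04008400, which matches the Dstop0[17], Dstop0[25], Dstop0[28] and Err2[26] lines in the dump. EX06 reports no Dstop0[28] line, and bit 28 is clear in 02028200.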
SOLUTION SUMMARY:
- Troubleshooting: The dump header shows that this Dstop was generated by dsmd (lines 10,11) while a domain was active. This is also evident from the dump file name: dsmd.dstop files are created by dsmd as part of an ASR.

Walking the error chain:
o The SDI on EX2 calls for Dstop as directed by AXQ2 (line 18).
o The AXQ on EX2 detected an uncorrectable ECC error (line 21).
o The SDI on EX5 calls for Dstop as directed by AXQ5 (line 28).
o The AXQ on EX5 detected a flow control issue from the centerplane (line 31).
o The SDI on EX6 calls for Dstop as directed by AXQ6 (line 38).
o The AXQ on EX6 detected an uncorrectable ECC error (line 40).
o Expanders 2, 5 and 6 are FAILed from the configuration (lines 22,32,41).
o Expanders 2, 5 and 6 are named as FRUs (lines 23,33,42).

This is a large amount of hardware to replace, and it calls for more analysis. The key signature is the AMX flow control error (line 31). Several bugs exist in SMS that result in this type of Dstop. For all of these bugs (listed below), a key factor was that a POST on another domain was in progress at the time of the Dstop.

From the name of the dump file, we know the Dstop was detected at 10:16:58 on Feb 4 2002 (line 01). Note that this is different from the time when the dump was created (line 02). Creation of the dump file (i.e., hpost -D runs and collects the dump) can be significantly delayed by either POST throttling by tmd or split expander configurations (per-expander locks).

Next, look through explorer to find whether another POST was started prior to the Dstop. In this case, a POST was initiated at 10:15:01 Feb 4 2002. Also check the POST patch level from the dump header (line 03) or explorer. In this example, bug 4712287 was encountered.

- Resolution: Ensure patches and workarounds (if applicable) are up to date. If all patches/workarounds were in place, use the Mstop registers in the SDI to identify the first SDI calling for stop (see below). Initiate hardware actions with that boardset.
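The timestamp comparison described in the Troubleshooting notes can be sketched in Python. This is only an illustration, assuming that the dump file name ends in YYMMDD.HHMM.SS and that the header's Created line has the layout shown on line 02. The helper names dstop_time_from_name and dump_created_time are hypothetical and are not part of SMS or redx.

```python
from datetime import datetime


def dstop_time_from_name(name):
    """Parse the detection time from a dump file name.

    Assumed layout: <creator>.dstop.<YYMMDD>.<HHMM>.<SS>
    """
    stamp = ".".join(name.split(".")[-3:])
    return datetime.strptime(stamp, "%y%m%d.%H%M.%S")


def dump_created_time(header_line):
    """Parse a header line such as 'Created Mon Feb 4 10:43:33 2002'.

    The leading 'Created' and the weekday are dropped before parsing.
    """
    stamp = " ".join(header_line.split()[2:])
    return datetime.strptime(stamp, "%b %d %H:%M:%S %Y")


detected = dstop_time_from_name("dsmd.dstop.020204.1016.58")
created = dump_created_time("Created Mon Feb 4 10:43:33 2002")
delay = created - detected

print(detected)   # 2002-02-04 10:16:58
print(delay)      # 0:26:35
```

Under these assumptions the dump was collected about 26 minutes after the Dstop was detected, which is the kind of delay the notes attribute to POST throttling or per-expander locks.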
Note that known hardware causes of AMX flow control errors have had the following characteristics:
o All first errors are isolated to a single boardset (expander)
o The reporting AXQ also logs an AMX flow control/nack parity error

- Summary of part number and patch ID's
Bugs
4505473 - when 2 domains share an expander, hpost -Q on one causes the other to DStop
4644905 - POST of one domain Dstops another - AMXs lose lockstep
4712287 - EXB asic LBIST needs to be skipped when CP is in use.
SMS 1.1
112080-05 (or higher)
Use "no_asic_lbist axq # BugiD 4712287" .postrc workaround
SMS 1.2
112488-07 (or higher)

- References and bug IDs: SunSolve Article 48122

- Additional background information: As seen in this dump, it is possible for multiple SDIs to claim first error system wide. This is because the time between one component calling for Dstop and the SDIs receiving the broadcast from the DARBs is long enough that other errors may be recorded in the SDIs. When multiple SDIs all report first errors, the Mstop0 registers in the master SDI can be used to determine the first DARB stop message sent. The message indicates which expander sourced the request, and therefore which expander was the first to call for stop. For example:

45 redxl> shsdi -e 2
46 Note: Data is displayed from the currently loaded dump file.
47 SDI EX02/S0 Component ID = 54317049
48 Master_Stop_Status0[31:0] = A000004F
49 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
50 Master_Stop_Status1[31:0] = E2E2000E
51 0x02 CP1StopExp[4:0] MSS1[20:16]
52 3 CP1StopSlot[0:1] MSS1[22:21] Dstop is 1st stop
53 1 CP1StopInfoValid MSS1[23]
54 0x02 CP0StopExp[4:0] MSS1[28:24]
55 3 CP0StopSlot[0:1] MSS1[30:29] Dstop is 1st stop
56 1 CP0StopInfoValid MSS1[31]

redx parses the stop message. Lines 51 and 54 indicate which expander is calling for the stop. Here, it is EX2. The other SDIs in the domain configuration can be examined in the same way. Typically they all agree, and attention can be focused on the requesting boardset.
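The field breakdown that redx prints for Master_Stop_Status1 can be reproduced by hand from the bit ranges given on lines 51 to 56. The following Python sketch uses only those ranges. The function names field, decode_mss1 and first_stop_expander are hypothetical, and the rule that the two centerplane fields must agree before an expander is reported is an assumption made for this illustration.

```python
def field(value, hi, lo):
    """Extract bits hi..lo (inclusive) from a register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_mss1(mss1):
    """Split Master_Stop_Status1 into the fields listed by shsdi."""
    return {
        "CP1StopExp": field(mss1, 20, 16),
        "CP1StopSlot": field(mss1, 22, 21),
        "CP1StopInfoValid": field(mss1, 23, 23),
        "CP0StopExp": field(mss1, 28, 24),
        "CP0StopSlot": field(mss1, 30, 29),
        "CP0StopInfoValid": field(mss1, 31, 31),
    }


def first_stop_expander(mss1):
    """Return the expander named by the valid CP fields, or None.

    Assumption for this sketch: None is returned when no field is
    valid or when the two centerplanes name different expanders.
    """
    f = decode_mss1(mss1)
    candidates = {f["CP%dStopExp" % cp] for cp in (0, 1)
                  if f["CP%dStopInfoValid" % cp]}
    return candidates.pop() if len(candidates) == 1 else None


print(decode_mss1(0xE2E2000E))
print(first_stop_expander(0xE2E2000E))   # 2
```

For the value E2E2000E from line 50, both centerplane fields are valid and both name expander 2, which agrees with the redx output above.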
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords: 15K, 12K, SF15K, SF12K, starcat, dstop, AMX 0-3 flow control didn't arrive simultaneously
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
BUG REPORT ID: 4712287, 4505473, 4644905
PATCH ID: 112080-05, 112488-07
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: