SRDB ID | Synopsis | Date |
48493 | Sun Fire[TM] 12K/15K: Dstop: CDC indicates an owner outside the domain | 1 Nov 2002 |
Status | Issued |
Description |
- Problem Statement: Dstop: CDC indicates an owner outside the domain

- Symptoms: 'wfail' output reports something similar to the following:

01 redxl> dumpf load dsmd.dstop.020510.0947.08
02 Created Fri May 10 09:47:10 2002
03 By hpost v. 1.2 Generic 112488-04 Mar 18 2002 14:43:00 executing as pid=6825
04 On ssc name = rasputin-sc0.SD_RASCAL.West.Sun.COM
05 Domain = 0=A Platform = rasputin
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 12100
08 Slot0[17:0]: 12100
09 Slot1[17:0]: 12100
10 -D option, -d
11 "DSMD DomainStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX08/S0 Master_Stop_Status0[31:0] = E004000F
15 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
16 SDI EX08/S0 Dstop0[31:0] = 04218400
17 Dstop0[16]: D DARB texp requests all Dstop (M)
18 Dstop0[21]: D SDI internal STB port requested Dstop
19 Dstop0[26]: D 1E AXQ requests Slot0 Dstop (M)
20 SDI EX08/S0 Recordstop0[31:0] = 00818080
21 Rstop0[16]: R DARB texp request Recordstop (M)
22 Rstop0[23]: R 1E AXQ requests all Recordstop (M)
23 AXQ EX08 ( 8) Error_Flag_07[31:0] = 020B8200 Mask = 63FF7D24
24 Err7[16]: R CDC0 correctable error
25 Err7[17]: R CDC0 address parity error
26 Err7[19]: R CDC1 correctable error
27 Err7[25]: R 1E CDC uncorrectable error
28 AXQ EX08 ( 8) Error_Flag_08[31:0] = 20002000 Mask = 0000FFFF
29 Err8[29]: D CDC indicates an owner outside the domain
30 FAIL CDC Dimm EX8: Dstop/Rstop detected by AXQ.
31 Primary service FRU is EXB EX8.
32 SDI EX13/S0: All SDI is DStopped and RStopped, requested by DARB.
33 SDI EX16/S0: All SDI is DStopped and RStopped, requested by DARB.
34 DARB C0: enabled ports (expanders) [17:0]: 16100
35 DARB C0: exps request Dstop+Rstop [17:0]: 00100
36 DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00100
37 DARB C1: enabled ports (expanders) [17:0]: 16100
38 DARB C1: exps request Dstop+Rstop [17:0]: 00100
39 DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00100
SOLUTION SUMMARY:
- Troubleshooting:

The dump header tells us that this Dstop was generated by dsmd (lines 10,11) while a domain was active. This is also evident from the dump file name: dsmd.dstop files are created by dsmd as part of an ASR.

Walking the error chain:
- Master SDI on EX8 is directed to Dstop by AXQ8 (line 19)
- Master SDI on EX8 is directed to Rstop by AXQ8 (line 22)
- AXQ8 reports several CDC related errors, all indicating Rstop (lines 24-27)
- AXQ8 reports a fatal error in the CDC (line 29)
- The CDC DIMM is FAILed from the configuration (line 30)
- EX8 is named as the FRU (line 31)

The CDC DIMM is divided into 3 SRAMs, read in parallel, forming a 3-way set associative cache. CDC entries contain information about lines of memory recently referenced by SSM logic. Any error (correctable or uncorrectable) in the CDC is recorded and logged, but by itself never causes a Dstop. Entries with correctable errors are written back with the corrected data. Uncorrectable errors are treated as cache misses.

Notice that the errors recorded in AXQ8's Err7 register (lines 24-27) are all Recordstop events ('R' precedes the error description). However, from the name of the dump file (line 01) and the dsmd action (line 11), we know this is a Dstop.

The Dstop is triggered because the data in the CDC indicates that the owner of a cache line is a board that is not among the resources comprising this domain. Either AXQ8 wrote the offending entry, or the CDC entry has been trashed. In either scenario, this fault is deemed serious (coherency within the domain is in question), thus the Dstop. So, although the first error is a Recordstop (line 27), because another error requiring Dstop occurs, the stop acted upon is a Dstop.

In this case, because of the sheer number of CDC-related errors, it is clear that the CDC is in dire straits. That's why the CDC DIMM is FAILed from the configuration (line 30). The CDC is not a FRU, so the expander must be replaced (line 31).
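The Recordstop-versus-Dstop reasoning above can be sketched in a few lines of Python. This is only an illustration: the regular expression is an assumption derived from the sample wfail lines in this article, not a documented redx output format, and the function names (parse_wfail, effective_stop) are hypothetical.

```python
import re

# Assumed shape of a wfail detail line, e.g.
#   "27 Err7[25]: R 1E CDC uncorrectable error"
# optional listing number, register[bit], R or D, optional "1E" (first error).
LINE_RE = re.compile(r"^(?:\d+\s+)?(\w+)\[(\d+)\]:\s+([DR])\s+(1E\s+)?(.*)$")

def parse_wfail(lines):
    """Return one dict per recognised error line; other lines are skipped."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            reg, bit, kind, first, text = m.groups()
            events.append({
                "reg": reg,
                "bit": int(bit),
                "kind": kind,          # 'R' = Recordstop, 'D' = Dstop
                "first": bool(first),  # True when the line carries "1E"
                "text": text.strip(),
            })
    return events

def effective_stop(events):
    """A single Dstop-class error makes the whole event a Dstop."""
    if any(e["kind"] == "D" for e in events):
        return "Dstop"
    return "Recordstop" if events else None

sample = [
    "23 AXQ EX08 ( 8) Error_Flag_07[31:0] = 020B8200 Mask = 63FF7D24",
    "24 Err7[16]: R CDC0 correctable error",
    "25 Err7[17]: R CDC0 address parity error",
    "26 Err7[19]: R CDC1 correctable error",
    "27 Err7[25]: R 1E CDC uncorrectable error",
    "28 AXQ EX08 ( 8) Error_Flag_08[31:0] = 20002000 Mask = 0000FFFF",
    "29 Err8[29]: D CDC indicates an owner outside the domain",
]
events = parse_wfail(sample)
first_errors = [e for e in events if e["first"]]
print(effective_stop(events), "- first error kind:", first_errors[0]["kind"])
```

With the AXQ lines from this dump, the first error is a Recordstop event, yet the overall result is a Dstop because of the Err8[29] entry.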
Also note the blacklisting suggestion made by wfail:

40 redxl> wfail -B
41 membrd SB8 # redx wfail of dump 020510.0947.10

By not using memory on SB8, there is no home memory within EX8. Thus, the CDC DIMM on EX8 is not used.

- Resolution: Repair/replace EX8.

- Summary of part number and patch IDs
http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html

- References and bug IDs
SunSolve Article 48122 SDI ASIC Specification Starcat Architecture, 11/07/2000

- Additional background information:

The details of the CDC DIMM entries in error are available in the AXQ data capture. First, understand the format of a CDC entry:

SHARED ENTRY
============
|3|         2         |         1         |                   |
|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|
+-+-----------------------------------+-----------------------+
|1|        Bitmask of sharers         |          Tag          |
+-+-----------------------------------+-----------------------+

OWNED ENTRY
============
|3|         2         |         1         |                   |
|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|
+-+---------------------+-+-+---------+-----------------------+
|0|       Unused        |V|R|  Owner  |          Tag          |
+-+---------------------+-+-+---------+-----------------------+

V = Valid Entry (1 = valid, 0 = invalid)
R = Retention priority

Bit [30] indicates if the line is Owned or Shared. In Owned entries, bits [16:12] indicate the boardset that owns the line. The owner field is only valid if bit [18] is set. In Shared entries, bits [29:12] indicate which boardsets contain a copy of the cache line. The bit indicating a shared entry (bit [30]) implies a valid entry.

A 3-wide CDC entry spanning the 3 CDC SRAMs is further protected by 8 bit ECC. A 3-wide entry also uses 3 bits of LRM (Least Recently Modified) to help in selection of an entry during victimization.

Examine this dump example:

42 redxl> shaxq -e 8
43 Note: Data is displayed from the currently loaded dump file.
44 AXQ EX8 (8) Component ID = C4312049 Rev 6.0
45 Error_Flag_00[31:0] = 00000000 Mask = 0000FFFF
46 Error_Flag_01[31:0] = 00000000 Mask = 4000FFFF
47 Error_Flag_02[31:0] = 00000000 Mask = 0000FFFF
48 Error_Flag_03[31:0] = 00000000 Mask = 21005EFF
49 Error_Flag_04[31:0] = 00000000 Mask = 01FEFFFF
50 Error_Flag_05[31:0] = 00000000 Mask = 1024FFFF
51 Error_Flag_06[31:0] = 00000000 Mask = 7E00FFFF
52 Error_Flag_07[31:0] = 020B8200 Mask = 63FF7D24
53 Err7[16]: R CDC0 correctable error
54 CDC error count[3:0] = A Read Addr[18:0] = 19172 (GoodApar= 0)
55 CDC 0 sram data[35:0] = E.D0000C9E
56 CDC0 entry: Shared, Mask = 10000, Tag = C9E
57 CDC 1 sram data[35:0] = F.50000E1E
58 CDC1 entry: Shared, Mask = 10000, Tag = E1E
59 CDC 2 sram data[35:0] = A.50000D9E
60 CDC2 entry: Shared, Mask = 10000, Tag = D9E
61 ECC Syndrome[7:0] = 88: Uncorrectable Error
62 cdc_errsave1[19]: Capture is for Outside Domain Error
63 LRU[3:0] = A
64 cdc_errsave0[3:0][31:0] = 6D0000 C9EF5000 0E1EA500 00D9E880
65 cdc_errsave1[31:0] = 0A099172
66 Err7[17]: R CDC0 address parity error
67 CDC error save data is displayed above.
68 Err7[19]: R CDC1 correctable error
69 CDC error save data is displayed above.
70 Err7[25]: R 1E CDC uncorrectable error
71 CDC error save data is displayed above.
72 Error_Flag_08[31:0] = 20002000 Mask = 0000FFFF
73 Err8[29]: D CDC indicates an owner outside the domain
74 CDC error save data is displayed above.
75 Error_Flag_09[31:0] = 00000000 Mask = 7E00FFFF
76 Error_Flag_10[31:0] = 00000000 Mask = 7C00FFFF
77 Error_Flag_11[31:0] = 00000000 Mask = 7FF0FFFF

Let's focus on the CDC0 entry (lines 55,56). The CDC0 entry is decoded by redx. We see the line is shared and the Mask is 10000. Thus, SB16 is the only sharer for this cache line. By our dump header, SB16 is in the domain (line 08). So by the data capture, there is no indication of the owner being outside the domain resources.
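To make the decode concrete, here is a minimal Python sketch that applies the entry layout described in this article (bit [30] shared/owned, bits [29:12] sharer mask, bit [18] valid, bit [17] retention, bits [16:12] owner, bits [11:0] tag) and then compares the result against the board mask from the dump header. The function names are hypothetical, and the sketch assumes that only bits [30:0] of the 36-bit SRAM word form the entry; the meaning of the remaining upper bits is not covered here.

```python
SHARED_BIT = 1 << 30     # 1 = shared entry, 0 = owned entry
VALID_BIT = 1 << 18      # owned entries only
RETENTION_BIT = 1 << 17  # owned entries only

def decode_cdc_entry(word):
    """Decode bits [30:0] of a CDC SRAM word; higher bits are ignored."""
    tag = word & 0xFFF
    if word & SHARED_BIT:
        return {"state": "shared", "mask": (word >> 12) & 0x3FFFF, "tag": tag}
    return {
        "state": "owned",
        "valid": bool(word & VALID_BIT),
        "retention": bool(word & RETENTION_BIT),
        "owner": (word >> 12) & 0x1F,
        "tag": tag,
    }

def boards_in_mask(mask):
    """List the board numbers whose bit is set in an 18-bit [17:0] mask."""
    return [b for b in range(18) if mask & (1 << b)]

def outside_domain(entry, domain_mask):
    """Boards referenced by the entry that are not part of the domain."""
    if entry["state"] == "shared":
        return boards_in_mask(entry["mask"] & ~domain_mask)
    if entry["valid"] and not (domain_mask >> entry["owner"]) & 1:
        return [entry["owner"]]
    return []

DOMAIN_MASK = 0x12100              # Slot0[17:0] from the dump header (line 08)
cdc0 = decode_cdc_entry(0xD0000C9E)  # low 32 bits of "E.D0000C9E" (line 55)

print(cdc0)
print("domain boards:", boards_in_mask(DOMAIN_MASK))
print("outside domain:", outside_domain(cdc0, DOMAIN_MASK))
```

For the captured CDC0 word this reproduces the redx decode (Shared, Mask = 10000, Tag = C9E) and reports no sharer outside the domain, matching the observation above.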
Since the first error was an uncorrectable ECC error, the data in the capture is likely from that event. Subsequent CDC errors are not captured until after the dump is collected and the ASICs are rearmed.

Note that this fault was injected by grounding part of the pathway between the CDC SRAM and AXQ. Fault injection aside, if a "CDC indicates an owner outside the domain" error occurs, it implies one of two things:
o The AXQ is writing faulty ownership/sharing information to the CDC entries
o Multiple bit flips occurred in a CDC entry
In any case, the expander is the FRU.

- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:

- Keywords
15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K, starcat, dstop, CDC indicates an owner outside the domain
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: