SRDB ID | Synopsis | Date |
48192 | Sun Fire[TM] 12K/15K: Dstop: Slot0 target slot transgression error | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Dstop: Slot0 target slot transgression error

- Symptoms: 'wfail' output reports something similar to the following:

    01 redxl> dumpf load dsmd.dstop.020207.0007.29
    02 Created Thu Feb 7 00:07:30 2002
    03 By hpost v. 1.1 Generic 112099-05 Nov 27 2001 12:41:09 executing as pid=14740
    04 On ssc name = xc46-sc0.SD_Lab.West.Sun.COM
    05 Domain = 0=A Platform = sun15
    06 Boards in dump: master SC CPs/CSBs[1:0]: 3
    07 EXB[17:0]: 00011
    08 Slot0[17:0]: 00010
    09 Slot1[17:0]: 00000 Requested/not enabled: 00001
    10 'Not enabled' refers to the Console Bus master port on the parent board.
    11 -D option, -d
    12 "DSMD DomainStop Dump"
    13 Created in a Sun Microsystems Inc. internal environment.
    14 0 errors occurred while creating this dump.
    15 redxl> wfail
    16 SDI EX04/S0 Master_Stop_Status0[31:0] = D004000A
    17 MStop0[3,1]: Slot 0 port is DStopped, SDI is Recordstopped.
    18 SDI EX04/S0 Dstop0[31:0] = 00828080
    19 Dstop0[17]: D DARB texp requests Slot0 Dstop (M)
    20 Dstop0[23]: D 1E SDI internal Slot0 port requested Dstop
    21 SDI EX04/S0 Slot0_Error1[31:0] = 00088008 Mask = 31444EBF
    22 S0Err1[19]: D 1E Slot0 target slot transgression error (M)
    23 {texp[4:0],targ_dev[2:0],s0dtarg,s0dstat[1:0],
    24 s0dtransid[8:0]} = 04844
    25 FAIL Slot SB4: Dstop/Rstop detected by SDI
    26 Primary service FRU is Slot SB4.
    27 Secondary service FRU is EXB EX4.
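The hexadecimal fields in the dump header and in the wfail output are bit vectors. The following is a minimal sketch of decoding them by hand. It assumes that bit n of an 18-bit board mask selects the board on expander n, which is consistent with the way this article reads the masks (00010 as SB4, 00001 as IO0). The helper names set_bits and boards are hypothetical and are not part of redx.

```python
def set_bits(hex_value: str) -> list[int]:
    """Return the positions of the bits set in a hexadecimal register value."""
    value = int(hex_value, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def boards(hex_mask: str, prefix: str) -> list[str]:
    """Name the boards selected by a [17:0] board mask.

    Assumption: bit n selects the board on expander n.
    """
    return [f"{prefix}{bit}" for bit in set_bits(hex_mask)]


# Board masks from the dump header (lines 07-09).
print(boards("00011", "EX"))  # EXB[17:0]
print(boards("00010", "SB"))  # Slot0[17:0]
print(boards("00001", "IO"))  # Requested/not enabled

# Register values reported by wfail (lines 18 and 21).
print(set_bits("00828080"))   # Dstop0
print(set_bits("00088008"))   # Slot0_Error1
```

Bits 17 and 23 of Dstop0 and bit 19 of Slot0_Error1 are the ones wfail annotates in lines 19, 20 and 22.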
SOLUTION SUMMARY:
- Troubleshooting: The dump header tells us that this Dstop was generated by dsmd (lines 11,12) while a domain was active. This is also evident from the dump file name: dsmd.dstop files are created by dsmd as part of an ASR.

The header also reports that hardware state for IO0 was not collected in the dump (lines 09,10). As line 10 of the header notes, the reason is that IO0's console bus master was not enabled. The console bus master for IO0 is SDI4 on EX0. EX0 therefore warrants further investigation. Let's finish with wfail first.

Walking the error chain:

  - The SDI on EX4 calls for Dstop with an internal error with respect to its Slot 0 port (lines 18-20).
  - The SDI register flagged a target slot transgression error (lines 21-24).
  - wfail calls out SB4 as what FAILed and also as the primary FRU (lines 25,26). EX4 is marked as the secondary FRU (line 27).

A transgression is an attempt, successful or unsuccessful, to communicate with a board that is not participating in your domain. By the errors present, SB4 attempted such an operation. The master SDI maintains bit vectors for its slot boards which outline the other boards each is permitted to talk with. Looking at the master SDI on EX4:

    28 redxl> shsdi 4
    29 Note: Data is displayed from the currently loaded dump file.
    30 SDI EX04/S0 Component ID = 64317049
    31 Master_Reset_Config[31:0] = 04000000
    32 Master_Stop_Config[31:0] = 41001997
    33 Core_Config[21:0] = 0DB3E2
    34 Sysreg_Config[23:0] = 200001
    35 STB_Config[23:0] = 20010F
    36 Bogon_Config[63:0] = 00000003 C03C0010
    37 CP_Config[20:0] = 0F0F70
    38 Slot0Config[1:0][31:0,30:0] = 20000000 3CA2A150
    39 Slot1Config[1:0][31:0,31:0] = 2000E0F0 28A83880
    40 Slot0_Domain_Mask[17:0]: Slot1 = 00000 Slot0 = 00010
    41 Slot0_Expand_Mask[17:0]: Slot1 = 00000 Slot0 = 00010
    42 Slot1_Domain_Mask[17:0]: Slot1 = 00010 Slot0 = 00001
    43 Slot1_Expand_Mask[17:0]: Slot1 = 00010 Slot0 = 00001

Looking at the Slot 0 domain and expander masks (lines 40,41), we see that this SDI's Slot 0 board (SB4) is permitted to communicate with Slot 0 on EX4 - in other words, only itself. This is not much of a domain, to say the least. But it does explain why the transgression error was detected: as soon as SB4 attempted any data transmission outside itself, the SDI considered that transmission invalid.

The question now is why EX4's master SDI is programmed this way. It does not reflect a valid domain configuration. But, based on the boards requested in the dump header (lines 08,09), we can infer that the valid domain contained SB4 and IO0.

We previously noted that EX0 warranted a deeper look. Let's do that now:

    44 redxl> shsdi 0
    45 Note: Data is displayed from the currently loaded dump file.
    46 SDI EX00/S0 Component ID = 64317049
    47 Master_Reset_Config[31:0] = 00000018
    48 Master_Stop_Config[31:0] = 41000897
    49 Core_Config[21:0] = 0DA3C2
    50 Sysreg_Config[23:0] = 200001
    51 STB_Config[23:0] = 20010F
    52 Bogon_Config[63:0] = 00000003 C03C0010
    53 CP_Config[20:0] = 0F0F70
    54 Slot0Config[1:0][31:0,30:0] = 20000000 3CA2A150
    55 Slot1Config[1:0][31:0,31:0] = 0000E000 00000000
    56 Slot0_Domain_Mask[17:0]: Slot1 = 00010 Slot0 = 00001
    57 Slot0_Expand_Mask[17:0]: Slot1 = 00010 Slot0 = 00001
    58 Slot1_Domain_Mask[17:0]: Slot1 = 00000 Slot0 = 00000
    59 Slot1_Expand_Mask[17:0]: Slot1 = 00000 Slot0 = 00000
    60 Force_Error[1:0][31:0] = 0000E000 00000000
    61 Csr2Conf[28:0] = 01000000
    62 IBIST_Enbl[1][3:0],[0][29:0] = 0 00000000
    63 Master_Stop_Status0[31:0] = F0000000
    64 Master_Stop_Status1[31:0] = 7F7F0000
    65 Dstop0[31:0] = 00000000
    66 Dstop1[31:0] = 00000000
    67 Recordstop0[31:0] = 00000000
    68 Recordstop1[31:0] = 00000000
    69 Core_Error0[31:0] = 00000000 Mask = 0051FFFF
    70 Core_Error1[31:0] = 00000000 Mask = FFFFFFFF
    71 Sysreg_Error[31:0] = 00000000 Mask = 780377FF
    72 STB_Error[31:0] = 00000000 Mask = 7F00FFFF
    73 CP_Error0[31:0] = 00000000 Mask = 580067FF
    74 CP_Error1[31:0] = 00000000 Mask = 7FFCFFFF
    75 Slot0_Error0[31:0] = 00000000 Mask = 7000FFFF
    76 Slot0_Error1[31:0] = 00000000 Mask = 31444EBF
    77 Slot0_Error2[31:0] = 00000000 Mask = 7FFCFFFF
    78 Slot1_Error0[31:0] = 00000000 Mask = FFFFFFFF
    79 Slot1_Error1[31:0] = 08000000 Mask = FFFFFFFF
    80 Slot1_Error2[31:0] = 00000000 Mask = FFFFFFFF

No errors, not even a DARB request for Dstop. Also, the masks for Slot 1 (lines 58,59) do not permit any transmissions from IO0, even though IO0 is part of the defined domain, as indicated by the dump header (line 09). Again, the question is why the SDI is programmed in such a manner.

The ASICs can only be programmed from the SCs. It's time to look outside the dump file. The dump header tells us the dump was taken at Feb 7, 00:07:30 2002 (line 02).
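Lining up platform log entries with the dump time amounts to a time-window filter. The following is a minimal sketch, assuming each log entry begins with a "Mon D HH:MM:SS YYYY" timestamp, as in the two sample entries taken from this system's platform message log. The function name near and the 10-second window are illustrative choices, not part of any Sun tool, and the month abbreviation is parsed under an English locale.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%b %d %H:%M:%S %Y"


def near(log_lines, dump_time, window_seconds):
    """Return the log lines stamped within window_seconds of dump_time."""
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in log_lines:
        fields = line.split()
        try:
            stamp = datetime.strptime(" ".join(fields[:4]), LOG_TIME_FORMAT)
        except ValueError:
            continue  # not a timestamped message line
        if abs(stamp - dump_time) <= window:
            selected.append(line)
    return selected


# Creation time from line 02 of the dump header.
dump_time = datetime.strptime("Thu Feb 7 00:07:30 2002", "%a %b %d %H:%M:%S %Y")

sample = [
    "Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708520653855 ERR "
    "PciComm.cc 195] Cannot access console bus since the board IO0 is OFF",
    "Feb 7 00:07:35 2002 xc46-sc0 dsmd[448]: [2517 37708971377554 WARNING "
    "EventHandler.cc 155] Domain stop has been detected in domain A.",
]

for line in near(sample, dump_time, window_seconds=10):
    print(line)
```

A symmetric window is used because the log entries may precede or follow the dump's creation time by a few seconds.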
Looking in the platform message log, we see the following messages reported around that time:

    Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708520653855 ERR PciComm.cc 195] Cannot access console bus since the board IO0 is OFF
    Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708521469608 ERR IosramComm.cc 516] Failed to read from offset 1e for key 53444344
    Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708540886871 ERR PciComm.cc 195] Cannot access console bus since the board IO0 is OFF
    Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708541793044 ERR IosramComm.cc 516] Failed to read from offset 1e for key 53444344
    Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1217 37708543136069 ERR PciComm.cc 195] Cannot access console bus since the board IO0 is OFF
    Feb 7 00:07:34 2002 xc46-sc0 hwad[357]: [1132 37708543957908 ERR IosramComm.cc 516] Failed to read from offset c for key 53444344
    Feb 7 00:07:35 2002 xc46-sc0 dsmd[448]: [2517 37708971377554 WARNING EventHandler.cc 155] Domain stop has been detected in domain A.

IO0 is reported as being off. There are also errors reported for key 53444344; this is an IOSRAM key (the IosramComm.cc source file name is a big clue here). IOSRAM access is via the console bus, and the console bus requires power. Thus, we can be highly confident that IO0 was powered off. Since there were no esmd messages indicating a reason for the power off, it was likely done by a system administrator.

It's likely that once we saw the Requested/not enabled report in the dump header (line 09) we would have begun to look outside the dump file. But for the purposes of this discussion, the walk through the hardware provided some insight into why the transgression error was reported. Also, this is a prime example of the limitations of wfail: it can only report and analyze the data made available to it.

- Resolution: Repair/replace IO0.
- Summary of part number and patch ID's

- References and bug IDs: SunSolve Article 48122

- Additional background information: With the conclusion that IO0 was powered off, the programming of the SDIs makes sense. At power off, the power libraries remove the powered-off board(s) from the masks in the SDIs (and AXQs) to ensure no transactions can be sent to, or more importantly sourced from, those board(s).

This in turn also explains why the master SDI on EX0 did not report any errors. We know from the error reporting in the hardware that the DARBs broadcast a stop request to all master SDIs in the system. Each SDI examines the request to determine whether either of its Slot 0/1 boards needs to be stopped. By the time the SDI on EX0 received the stop request, IO0 had already been removed from the mask registers. Thus, the SDI determined it had no slots that needed to participate in the stop.

- Meta-Data/Problem categorization:
  Product/Platform: SF12K/SF15K
  Category:

- Keywords: 15K, 12K, SF15K, SF12K, starcat, dstop, Slot0 target slot transgression error
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: