SRDB ID | Synopsis | Date |
48190 | Sun Fire[TM] 12K/15K: Rstop: No components would be failed based on this state | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Rstop: No components would be failed based on this state
- Symptoms: 'wfail' output reports something similar to the following:

01 redxl> dumpf load dsmd.rstop.020805.1100.46
02 Created Mon Aug 5 11:00:46 2002
03 By hpost v. 1.2 Generic 112488-04 Mar 18 2002 14:43:00 executing as pid=22831
04 On ssc name = sc0.
05 Domain = 4=E = etuac21 Platform = sfgedas1
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 00C00
08 Slot0[17:0]: 00400
09 Slot1[17:0]: 00C00
10 -D option, -d
11 "DSMD RecordStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX10/S0: SDI is RStopped, requested by DARB.
15 SDI EX11/S0 Master_Stop_Status0[31:0] = F0040308
16 MStop0[3]: SDI is Recordstopped
17 SDI EX11/S0 Recordstop0[31:0] = 04018400
18 Rstop0[16]: R DARB texp request Recordstop (M)
19 Rstop0[26]: R 1E Slot0 asserted EccErr, enabled to cause Rstop (M)
20 Note: SDI EX11/S0 detects error from Slot SB11, not in dump. Ignored.
21 DARB C0: enabled ports (expanders) [17:0]: 03FFF
22 DARB C0: exps request Rstop [17:0]: 00800
23 DARB C0: other darb req Rstop for exps [17:0]: 00800
24 DARB C1: enabled ports (expanders) [17:0]: 03FFF
25 DARB C1: exps request Rstop [17:0]: 00800
26 DARB C1: other darb req Rstop for exps [17:0]: 00800
27 No components would be failed based on this state.
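The [17:0] masks in the dump (lines 07 to 09 and 22) can be read by hand, but a small helper makes the board membership easier to see. The following is a minimal Python sketch, not part of redx or hpost. It assumes that bit n of a mask corresponds to expander position n, which is consistent with the EX10/EX11 and SB11 references in the wfail output; the function name boards_in_mask is hypothetical.

```python
def boards_in_mask(mask_hex: str, width: int = 18) -> list[int]:
    """Return the bit positions set in a hex mask such as '00C00'.

    Assumption: bit n of a [17:0] mask stands for expander position n.
    """
    value = int(mask_hex, 16)
    return [bit for bit in range(width) if value & (1 << bit)]


# Mask values copied from the dump (lines 07, 08, 09 and 22).
exb = boards_in_mask("00C00")        # expanders in the dump
slot0 = boards_in_mask("00400")      # Slot 0 (SB) boards in the dump
slot1 = boards_in_mask("00C00")      # Slot 1 (IO) boards in the dump
rstop_req = boards_in_mask("00800")  # expanders requesting Rstop

print("Expanders in dump:    ", exb)
print("Slot 0 boards in dump:", slot0)
print("Slot 1 boards in dump:", slot1)
print("Expanders req. Rstop: ", rstop_req)
print("SB11 in dump?         ", 11 in slot0)
```

Under that assumption, the Slot 0 mask contains position 10 only, so SB11 is absent from the dump even though expander 11 is the one requesting the Rstop.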
SOLUTION SUMMARY:
- Troubleshooting: The dump header shows that this Rstop was generated by dsmd (lines 10, 11) while a domain was active. The dump file name shows this as well: dsmd.rstop files are created by dsmd as part of error capturing.

Walking the error chain:

  - EX11/S0 (SDI0) reports a first error from its Slot 0 board, SB11 (line 19).
  - However, on line 20, wfail notes that SB11 is not in the dump. The "Ignored" statement means that SB11 is not considered in selecting a component to FAIL.
  - As no other errors are present, the diagnosis returns no failures (line 27).

Therefore, the source of the error is attributed to an ECC error from a board that is not in this domain. We can confirm that SB11 was indeed not included in the dump by checking the Slot 0 board mask (line 08). Since POST refers to the PCD to determine which boards are part of a domain, this tells us that SB11 is not part of Domain E in the PCD.

At this point in the analysis, it is wise to examine activity on other domains around the time of this Rstop. Looking through the explorer we see:

28 % ls sf15k/[A-R]/adm/dump/dsmd.rstop.020805.*
29 D/adm/dump/dsmd.rstop.020805.2042.40 E/adm/dump/dsmd.rstop.020805.1100.46
30 F/adm/dump/dsmd.rstop.020805.1101.08

There is also an Rstop on Domain F shortly after the Domain E Rstop. And we find a relationship between Domains E and F:

31 % grep "^[SI][BO]11" sf15k/showboards_-v.out
32 IO11/C3V0 On C3V - - etuac21
33 IO11/C5V0 On C5V - - etuac21
34 IO11/C3V1 On C3V - - etuac21
35 IO11/C5V1 On C5V - - etuac21
36 SB11 On CPU Active Passed etuac31
37 IO11 On HPCI Active Passed etuac21
38 % grep "^[EF]" sf15k/showplatform_-v.out | head -2
39 E etuac21 etuac21 Running Solaris
40 F etuac31 etuac31 Running Solaris

Domains E and F share EX11, and SB11 is assigned to Domain F. We can determine that EX11 is split from the dump file as well:

41 redxl> shsdi -v 11
42 Note: Data is displayed from the currently loaded dump file.
43 SDI EX11/S0 Component ID = 64317049
44 Master_Reset_Config[31:0] = 0B000000
45 0 SDI_diserrlog MResC[0] => SDI Intern Reset
46 0 Slot0_diserrlog MResC[1]
47 0 Slot1_diserrlog MResC[2]
48 0x0B ExpID[4:0] MResC[28:24]
49 0 Mode[2:0] MResC[31:29] Master (0)
50 Master_Stop_Config[31:0] = 41001997
51 1 DstopEnbl MStopC[0]
52 1 RstopEnbl MStopC[1]
53 1 SCIntEnbl MStopC[2]
54 0 L1Err->ErrPause MStopC[3]
55 1 Dstop->ErrPause MStopC[4]
56 0 L1Ecc->ScInt[1:0] MStopC[6:5]
57 3 L1Ecc->Rstop[1:0] MStopC[8:7]
58 0 L1Err->ScInt[1:0] MStopC[10:9] L1Slot asserted err
59 3 L1Err->Dstop[1:0] MStopC[12:11] L1Slot asserted err
60 0 SBBCErr->SCInt MStopC[13]
61 0 SBBCErr->Dstop MStopC[14]
62 1 EnblStopReqChk MStopC[24]
63 0 L1Dstop->ExpDStop MStopC[28]
64 0 AnyDstop->ExpDStop MStopC[29]
65 1 Dstop->DReset MStopC[30] For split exp
66 0 ShiftErrPausePhase MStopC[31]
67 Core_Config[21:0] = 0DB3E2
68 0 Pass4TargIDDisbl CoreC[0] Rev 4+
69 1 Slot1=SerDom1 CoreC[1] Rev 4+
70 1 SplitSlotEnbl CoreC[5] In master SDI (0)

SplitSlotEnbl is set (line 70), so EX11 is a split expander. However, SB11's assignment is only available from explorer (or, of course, a live SC).

Because EX11 is a split expander, and the source of the error is within that expander's boardset, Domain E suffers a residual Rstop. It can be ignored.
- Resolution: The real source of the problem is a stop condition on a different domain that shares this expander (Domain F in the example above). Analyze that stop dump.
- Summary of part number and patch IDs
- References and bug IDs: SunSolve Article 48122
- Additional background information: http://cpre-amer.west.sun.com/esg/hsg/starcat/xctt/hw_expander_split.html
- Meta-Data/Problem categorization:
  Product/Platform: SF12K/SF15K
  Category:
- Keywords: 15K, 12K, SF15K, SF12K, starcat, rstop, split, expander, no, components, failed
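Returning to the shsdi output (lines 41 to 70): the decoded fields can be cross-checked against the raw register values with a generic bit-field extraction. This is a minimal Python sketch, not a redx feature. The bit ranges are taken from the labels shsdi itself prints (MResC[28:24], MStopC[8:7], CoreC[5]); the helper name field is hypothetical.

```python
def field(value: int, hi: int, lo: int) -> int:
    """Extract bits [hi:lo] (inclusive) from a register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


# Raw register values copied from the shsdi output.
master_reset_config = 0x0B000000  # line 44
master_stop_config = 0x41001997   # line 50
core_config = 0x0DB3E2            # line 67

exp_id = field(master_reset_config, 28, 24)  # ExpID[4:0], line 48
l1ecc_rstop = field(master_stop_config, 8, 7)  # L1Ecc->Rstop[1:0], line 57
split_enabled = field(core_config, 5, 5)     # SplitSlotEnbl, line 70

print(f"ExpID         = {exp_id}")
print(f"L1Ecc->Rstop  = {l1ecc_rstop}")
print(f"SplitSlotEnbl = {split_enabled}")
```

The extracted values agree with the decoded lines: the expander ID is 11, L1 ECC errors are enabled to cause an Rstop, and SplitSlotEnbl is 1, which is the indication that EX11 is split.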
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: