SRDB ID | Synopsis | Date |
48197 | Sun Fire[TM] 12K/15K: Dstop: Select command parity error | 31 Oct 2002 |
Status | Issued |
Description |
- Problem Statement: Dstop: Select command parity error
- Symptoms: 'wfail' output reports something similar to the following:

    01 redxl> dumpf load dsmd.dstop.020429.0840.40
    02 Created Mon Apr 29 08:40:41 2002
    03 By hpost v. 1.2 Generic 112488-03 Feb 15 2002 13:40:50 executing as pid=6794
    04 On ssc name = rasputin-sc0.SD_RASCAL.West.Sun.COM
    05 Domain = 0=A Platform = rasputin
    06 Boards in dump: master SC CPs/CSBs[1:0]: 1 Requested/not enabled: 2
    07 EXB[17:0]: 12100
    08 Slot0[17:0]: 12100
    09 Slot1[17:0]: 12100
    10 'Not enabled' refers to the Console Bus master port on the parent board.
    11 -D option, -d
    12 "DSMD DomainStop Dump"
    13 0 errors occurred while creating this dump.
    14 redxl> wfail
    15 SDI EX08/S0 Master_Stop_Status0[31:0] = C00400CF
    16 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
    17 SDI EX08/S0 Dstop0[31:0] = 30019000
    18 Dstop0[16]: D DARB texp requests all Dstop (M)
    19 Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
    20 Dstop0[29]: D Slot1 asserted Error, enabled to cause Dstop (M)
    21 EPLD SB08 Err1_Dom0: Mask= 00 Err= 41 1stErr= 40
    22 Err1[0]: Error reported by AR
    23 Err1[6]: 1E+ Error reported by BBC0
    24 BBC SB08/BB0 Device_Err_Stat[31:0] = 80008010
    25 DevErr[ 4]: 1E DCDS asserted error
    26 DCDSs SB08/DG0 slice 5 CPU[1:0]_Cmd_Err[22:0] = 008008 008008
    27 C0CE[ 3]: 1E C0 Select command parity error
    28 C1CE[ 3]: 1E C1 Select command parity error
    29 FAIL Port SB8/P0: Dstop detected by DCDS.
    30 Primary service FRU is Slot SB8.
    31 FAIL Port SB8/P1: Dstop detected by DCDS.
    32 Primary service FRU is Slot SB8.
    33 SDI EX13/S0: All SDI is DStopped and RStopped, requested by DARB.
    34 SDI EX16/S0: All SDI is DStopped and RStopped, requested by DARB.
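The bracketed indices in the 'wfail' output above (Err1[6], DevErr[ 4], C0CE[ 3], and so on) are bit positions within the hexadecimal register value printed on the preceding line. The following is a minimal Python sketch for cross-checking them by hand. It is not part of redx or SMS; the helper name set_bits is hypothetical, and the 8-bit width used for the EPLD values is an assumption, since the output does not state it.

```python
def set_bits(value, width=32):
    """Return the positions of the 1-bits in value, lowest bit first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


# Register values copied from the 'wfail' output above (all hexadecimal).
# Each entry is (value, width in bits).
registers = {
    "Dstop0": (0x30019000, 32),           # dump line 17
    "Err1": (0x41, 8),                    # dump line 21, Err= (width assumed)
    "1stErr": (0x40, 8),                  # dump line 21, 1stErr= (width assumed)
    "Device_Err_Stat": (0x80008010, 32),  # dump line 24
    "CPU0_Cmd_Err": (0x008008, 23),       # dump line 26
}

for name, (value, width) in registers.items():
    print(f"{name:16s} 0x{value:08X} -> bits {set_bits(value, width)}")
```

For Err1 this yields bits 0 and 6, matching lines 22 and 23, and for 1stErr only bit 6, which is why BBC0 is treated as the first error. Note that 'wfail' does not annotate every set bit: bit 15 is set in the Cmd_Err value, for example, but only bit 3 is listed. This sketch makes no claim about what the unlisted bits mean.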
SOLUTION SUMMARY:
- Troubleshooting: The dump header shows that this Dstop was generated by dsmd (lines 11, 12) while a domain was active. This is also evident from the dumpf file name: dsmd.dstop files are created by dsmd as part of an ASR.

  Walking the error chain:
  - The SDI on EX8 calls for Dstop as directed by its Slot 0 board, SB8 (line 19). There is also a Slot 1 error asserted, but it is not the first error (line 20).
  - The EPLD on SB8 indicates that BBC0 asserted the first error (line 23).
  - BBC0 indicates that the DCDS asserted error (line 25).
  - DCDS slice 5 reports select command parity errors (lines 26-28).
  - The DCDSs off of BBC0 serve processors 0 and 1. Hence, 'wfail' FAILs SB8/P0 and SB8/P1 (lines 29, 31).
  - The FRU called out is SB8 (lines 30, 32).

  The DCDSs are slave ASICs, and all transactions are controlled via select commands sourced by the processors. The select lines are parity protected. DCDS slice 5 is configured as the parity checker, which is why it detects the error. The select line pathways between the processors and the DCDSs are entirely contained within the system board, so the board is the FRU. In the general case, this error could also occur on a MaxCPU board.
- Resolution: Repair/replace SB8. In general, repair/replace the board reporting the error.
- Summary of part numbers and patch IDs
  http://infoserver.central.sun.com/data/syshbk/Devices/System_Board/SYSBD_SunFire_USIIICu.html
  http://infoserver.central.sun.com/data/sshandbook/Devices/CPU_Module/UltraSPARC_MaxCPU.html
- References and bug IDs
  SunSolve Article 48122
  15K System Controller Specification
- Additional background information: The dump header indicates communication problems with CSB 1 (line 06): console bus access to this component was disabled (line 10). The console bus fans out from the SCM ASICs, which are physically located on the CSBs but are part of the SC's power domain. So even if a CSB is powered off, the SCMs still have power.
  Examining the SCMs, we do not see CSB 1 enabled (lines 40, 56):

    35 redxl> shscm 0
    36 Note: Data is displayed from the currently loaded dump file.
    37 scm 0 Component ID = 215C007D
    38 DevTemp[8:0] = 041: Valid 43.83 DegC
    39 CBus_Config[31:0] = 3FFF8103
    40 0x103 MasterPortEnbl[9:0] CbCnf[9:0] EXBs 06100
    41 0 CBH_SlavePortEnbl CbCnf[13]
    42 0 ShortTimeout CbCnf[14]
    43 0x7FFF PortErrMask[14:0] CbCnf[29:15]
    44 0 DisableArb CbCnf[30] With other SC's CBH
    45 0 ForceBusy CbCnf[31] To other SC's CBH
    46 ResetStat[26:0] = 01000000
    47 SCM_Mapping_Reg[5:0] = 10
    48 CBus_PortErr[ 0][25:0] = 0000000 (EXB 14 (master))
    49 CBus_PortErr[ 1][25:0] = 0000000 (EXB 13 (master))
    50 CBus_PortErr[ 8][25:0] = 0000000 (EXB 8 (master))
    51 redxl> shscm 1
    52 Note: Data is displayed from the currently loaded dump file.
    53 scm 1 Component ID = 215C007D
    54 DevTemp[8:0] = 03F: Valid 42.50 DegC
    55 CBus_Config[31:0] = 3FFF8030
    56 0x030 MasterPortEnbl[9:0] CbCnf[9:0] EXBs 10000 CSBs 1
    57 0 CBH_SlavePortEnbl CbCnf[13]
    58 0 ShortTimeout CbCnf[14]
    59 0x7FFF PortErrMask[14:0] CbCnf[29:15]
    60 0 DisableArb CbCnf[30] With other SC's CBH
    61 0 ForceBusy CbCnf[31] To other SC's CBH
    62 ResetStat[26:0] = 01000000
    63 SCM_Mapping_Reg[5:0] = 11
    64 CBus_PortErr[ 4][25:0] = 0000000 (CSB 0 (master))
    65 CBus_PortErr[ 5][25:0] = 0000000 (EXB 16 (master))

  The platform logs for this system should be investigated to determine why CSB1 was deconfigured.
- Meta-Data/Problem categorization:
  Product/Platform: SF12K/SF15K
  Category:
- Keywords
  15K, 12K, SF15K, SF12K, starcat, dstop, Select command parity error
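The CBus_Config breakdown printed by 'shscm' (lines 39-45 and 55-61) can be reproduced from the raw register value using the bit ranges shown in the CbCnf[...] column. The Python sketch below does this and then lists the enabled master ports. It is an illustration only, not a redx feature: the function names are hypothetical, and the port-to-board table is an assumption built solely from the CBus_PortErr lines in this dump (lines 48-50, 64-65), so it may not hold for other configurations.

```python
def cbus_config_fields(value):
    """Split a 32-bit CBus_Config value using the CbCnf bit ranges from shscm."""
    return {
        "MasterPortEnbl": value & 0x3FF,           # CbCnf[9:0]
        "CBH_SlavePortEnbl": (value >> 13) & 0x1,  # CbCnf[13]
        "ShortTimeout": (value >> 14) & 0x1,       # CbCnf[14]
        "PortErrMask": (value >> 15) & 0x7FFF,     # CbCnf[29:15]
        "DisableArb": (value >> 30) & 0x1,         # CbCnf[30]
        "ForceBusy": (value >> 31) & 0x1,          # CbCnf[31]
    }


# Assumption: port numbering taken only from the CBus_PortErr lines above.
PORT_NAMES = {
    0: {0: "EXB 14", 1: "EXB 13", 8: "EXB 8"},  # scm 0
    1: {4: "CSB 0", 5: "EXB 16"},               # scm 1
}


def enabled_master_ports(scm, value):
    """Name each master port whose enable bit is set in MasterPortEnbl."""
    mask = cbus_config_fields(value)["MasterPortEnbl"]
    names = PORT_NAMES.get(scm, {})
    return [names.get(bit, f"port {bit}") for bit in range(10) if (mask >> bit) & 1]


for scm, raw in ((0, 0x3FFF8103), (1, 0x3FFF8030)):
    fields = cbus_config_fields(raw)
    print(f"scm {scm}: MasterPortEnbl=0x{fields['MasterPortEnbl']:03X} "
          f"PortErrMask=0x{fields['PortErrMask']:04X} "
          f"enabled={enabled_master_ports(scm, raw)}")
```

With the two values from this dump, the enabled ports are EXB 14, EXB 13 and EXB 8 on scm 0, and CSB 0 and EXB 16 on scm 1. CSB 1 appears on neither list, which is the observation made from lines 40 and 56.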
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: