SRDB ID: 50095
Synopsis: Sun Fire[TM] 12K/15K: Dstop: Port 0 Safari device asserted error
Date: 23 Jan 2003
- Problem Statement/Title: SF15K Troubleshooting Article:
Dstop: Port 0 Safari device asserted error
- Symptoms:
'wfail' output reports something similar to the following:
01 redxl> dumpf load dsmd.dstop.021220.1126.07
02 Created Fri Dec 20 11:26:08 2002
03 By hpost v. 1.2 Generic 112488-09 Sep 17 2002 13:34:28 executing as pid=15485
04 On ssc name: p015-sc0.
05 Domain = 0=A Platform = p015
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 03FFF
08 Slot0[17:0]: 03FFF
09 Slot1[17:0]: 0003F Requested/not enabled: 001C0
10 'Not enabled' refers to the Console Bus master port on the parent board.
11 -D option, -d
12 "DSMD DomainStop Dump"
13 0 errors occurred while creating this dump.
14 redxl> wfail
15 SDI EX12/S0 Master_Stop_Status0[31:0] = A000004F
16 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
17 SDI EX12/S0 Dstop0[31:0] = 10029000
18 Dstop0[17]: D DARB texp requests Slot0 Dstop (M)
19 Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
20 EPLD SB12 Err1_Dom0: Mask= 00 Err= 82 1stErr= 80
21 Err1[1]: Error reported by SDC
22 Err1[7]: 1E+ Error reported by BBC1
23 BBC SB12/BB1 Device_Err_Stat[31:0] = 80018100
24 DevErr[ 16]: Port 0 bootbus command or protocol error
25 DevErr[ 8]: 1E Port 0 Safari device asserted error
26 Proc SB12/P2 (12.0.2) EmuShad[0:78] = 0000 00000000 00000000 (Note rev order)
27 FAIL Port SB12/P2: Dstop detected by BBC SB12/BB1.
28 Primary service FRU is Slot SB12.
29 DARB C0: enabled ports (expanders) [17:0]: 03FFF
30 DARB C0: other darb req Dstop+Rstop for exps[17:0]: 01000
31 DARB C1: enabled ports (expanders) [17:0]: 03FFF
32 DARB C1: other darb req Dstop+Rstop for exps[17:0]: 01000
SOLUTION SUMMARY:
- Troubleshooting:
The dump header shows that this Dstop dump was generated by dsmd (lines 11,12) while
a domain was active. The dump file name confirms this: dsmd.dstop files are
created by dsmd as part of error capture. Walking the error chain:
- Not all the Slot1 boards were collected in the dump file (lines 09,10).
See Additional Background Information for more on this.
- EX12/S0 (SDI0) reports a first error from its Slot0 board SB12 (line 19).
- SB12's EPLD reports a first error from BBC1 (line 22).
- SB12/BBC1 indicates that Port 0 asserted an error (line 25). Note that for
BBC1, port 0 is processor 2.
- SB12/P2 is displayed, but no errors are logged in the CPU (line 26).
- 'wfail' then recommends failing SB12/P2 (line 27) to avoid the error.
- SB12 is identified as the primary service FRU (line 28).
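As a cross-check on the chain above, the bit numbers that 'wfail' annotates can be recovered from the raw register values. The following is a minimal Python sketch and is not part of redx; the helper name set_bits is hypothetical. 'wfail' annotates only some of the set bits, and this article does not describe the others.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

# Register values copied from the 'wfail' output (dump line numbers in comments).
registers = {
    "SDI EX12/S0 Dstop0": 0x10029000,            # line 17
    "EPLD SB12 Err1 Err": 0x82,                  # line 20
    "EPLD SB12 Err1 1stErr": 0x80,               # line 20
    "BBC SB12/BB1 Device_Err_Stat": 0x80018100,  # line 23
}

for name, value in registers.items():
    print(f"{name}: set bits {set_bits(value)}")
```

Bits 17 and 28 of Dstop0, bits 1 and 7 of Err1, and bits 8 and 16 of Device_Err_Stat are the ones 'wfail' reports on lines 18-19, 21-22 and 24-25. The 1stErr value has only bit 7 set, which matches the BBC1 first error on line 22.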
It is clear that SB12/BBC1 logged an error as directed by SB12/P2. But
'wfail' does not present any information on the processor's error because
the EmuShad register is zeroed. Looking at the processor:
33 redxl> shproc 12 0 2
34 Note: Data is displayed from the currently loaded dump file.
35 Proc SB12/P2 (12.0.2) ComponentID= 4919D07D MaskID= 422 US3+_2.2 EPIC6cu
36 PC[63:6],2b'0 = 00000000.0000000_ ([5:0] not available)
37 tPC[63:0] = FFFFFFFF.F0000040
38 tnPC[63:0] = FFFFFFFF.F0000044
39 tl[2:0] = 2 tt[8:0] = 003: XIR
40 tstate[39:0] = 44.F0003505
41 ASI[7:0] = F0 CWP[2:0] = 5 PState[11:0] = 035 CCR[7:0] = 44
42 EmuShad[0:78] = 0000 00000000 00000000 (Note rev order)
43 AFSR [63:0] = 00000000.00000000 AFAR [42:4] = 000.0000000_
44 AFSR2[63:0] = 00000000.00000000 AFAR2[42:4] = 000.0000000_
45 Data Ecc synd[8:0] = 000: No Error.
46 Mtag Ecc synd[3:0] = 0: No Error.
The tPC and tnPC (lines 37, 38) look odd: the upper word of each is FFFFFFFF. Also
note that the trap type is XIR (line 39). Some external action issued a reset
to this processor, either from the SC or from a spurious signal within the SB.
In this case an XIR trap type is present, but even without it the fault lies
within the system board. The BBC clearly logs that the processor asserted an
error, yet the processor does not record a reason. Either the processor did not
assert an error, in which case the SBBC is at fault, or the processor failed to
record the error, in which case the CPU is at fault. In either case, both
components and their communication pathways are entirely contained within the SB.
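When reading the 'shproc' output, lines 40 and 41 can be checked against each other, because the ASI, CWP, PState and CCR values are fields of tstate. The Python sketch below assumes the SPARC V9 TSTATE bit positions, with the field widths taken from the labels 'shproc' prints. The function name decode_tstate is hypothetical and is not a redx command.

```python
def decode_tstate(tstate):
    """Split a 40-bit tstate value into the fields shown on dump line 41.

    Bit positions are an assumption based on the SPARC V9 TSTATE layout;
    widths follow the shproc labels (CCR[7:0], ASI[7:0], PState[11:0], CWP[2:0]).
    """
    return {
        "CCR": (tstate >> 32) & 0xFF,     # tstate bits 39:32
        "ASI": (tstate >> 24) & 0xFF,     # tstate bits 31:24
        "PState": (tstate >> 8) & 0xFFF,  # tstate bits 19:8
        "CWP": tstate & 0x7,              # tstate bits 2:0
    }

# tstate[39:0] = 44.F0003505 from dump line 40
for name, value in decode_tstate(0x44F0003505).items():
    print(f"{name} = {value:X}")
```

Under these assumptions the decoded fields agree with line 41 (ASI F0, CWP 5, PState 035, CCR 44), so the processor state in the dump is at least self-consistent.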
- Resolution:
Correlating with the platform/domain messages to determine whether an XIR
was issued is a good step, since it would corroborate the trap type listed
for the processor. But even without such evidence, replacement of SB12
is still the best action.
- Summary of part number and patch IDs
http://infoserver.central.sun.com/data/syshbk/Devices/System_Board/SYSBD_SunFire_USIIICu.html
- References and bug IDs
SRDB 48122
- Additional background information:
The dump header in this dump also noted that not all Slot1 boards were
collected in the dump file. This means that POST believed these boards to
be part of the domain (per the PCD), but was unable to collect the ASIC
information. In general, when such messages are displayed, it is an indicator
to examine the platform/domain logs for events that may explain why the
board(s) are unavailable.
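To see which Slot1 boards are affected, the masks on line 09 of the dump header can be expanded into board numbers. This is a minimal Python sketch, assuming that bit n of each [17:0] mask corresponds to expander n; the helper name boards_in_mask is hypothetical.

```python
def boards_in_mask(mask, width=18):
    """Return the expander numbers whose bits are set in an 18-bit board mask."""
    return [slot for slot in range(width) if (mask >> slot) & 1]

# Values copied from dump line 09.
slot1_in_dump = 0x0003F      # Slot1[17:0]
slot1_not_enabled = 0x001C0  # Requested/not enabled

print("Slot1 boards in dump:", boards_in_mask(slot1_in_dump))
print("Slot1 boards requested but not enabled:", boards_in_mask(slot1_not_enabled))
```

Under that assumption the dump contains Slot1 boards 0 through 5, and boards 6, 7 and 8 were requested but not enabled. This matches the choice of expander 6 for the console bus examination in this section.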
From the dump, it is clear that the console bus access to the missing Slot1
boards is disabled. For example, examining the console bus repeater portion
of EX6/SDI4:
47 redxl> shcbr exb 6
48 Note: Data is displayed from the currently loaded dump file.
49 EXBCBR EX6 Component ID = 00000000
50 Port_Config_Stat[31:0] = 068683EF
51 3 SC_PortEnbl[1:0] CnfSt[1:0]
52 1 BBC_PortEnbl CnfSt[2]
53 1 Slot_PortEnbl[1:0] CnfSt[4:3]
54 3 SC_MaskErrs[1:0] CnfSt[6:5]
55 1 BBC_MaskErrs CnfSt[7]
56 3 Slot_MaskErrs[1:0] CnfSt[9:8]
57 0 SC_PortBusy[1:0] CnfSt[11:10]
58 0 BBC_PortBusy CnfSt[12]
59 0 Slot_PortBusy[1:0] CnfSt[14:13]
60 1 AXQ_CB_MaskErrs CnfSt[15] Rev 4+
61 0x06 BoardId[4:0] CnfSt[20:16]
62 4 Mode[2:0] CnfSt[23:21] EXB SDI4
63 6 Chip Version[3:0] CnfSt[27:24] Rev 4+
64 0 CBErr=>SCInt CnfSt[28]
65 0 CBErr=>DStop CnfSt[29]
66 0 Domain[1:0] CnfSt[31:30]
67 AXQ_CBus_&_ForceErr[31:0] = 01800000
68 0x000000 ForceError[22:0] AXQCB[22:0]
69 1 MaskAXQCBErr AXQCB[23]
70 1 AXQ_CBusEnbl AXQCB[24]
...
The port enable to IO6 is not active (line 53). The SDI domain masks
for IO6 are also clear (lines 85,86):
71 redxl> shsdi -s 6
72 Note: Data is displayed from the currently loaded dump file.
73 SDI EX06/S0 Component ID = 64317049
74 Master_Reset_Config[31:0] = 06000018
75 Master_Stop_Config[31:0] = 01000897
76 Core_Config[21:0] = 0DA3C2
77 Sysreg_Config[23:0] = 200001
78 STB_Config[23:0] = 20010F
79 Bogon_Config[63:0] = 00000003 C03C0010
80 CP_Config[20:0] = 0F0F70
81 Slot0Config[1:0][31:0,30:0] = 20000000 3CA2A150
82 Slot1Config[1:0][31:0,31:0] = 0000E000 00000000
83 Slot0_Domain_Mask[17:0]: Slot1 = 0003F Slot0 = 03FFF
84 Slot0_Expand_Mask[17:0]: Slot1 = 0003F Slot0 = 03FFF
85 Slot1_Domain_Mask[17:0]: Slot1 = 00000 Slot0 = 00000
86 Slot1_Expand_Mask[17:0]: Slot1 = 00000 Slot0 = 00000
...
The most common cause for this is a board being powered off. The power
libraries deconfigure the console bus repeater to the board being
powered off and ensure the AXQ/SDI domain masks are reprogrammed
properly to disallow communication to that board.
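The field breakdown that 'shcbr' prints for Port_Config_Stat (lines 50-66) can be reproduced by hand from the raw register value. The Python sketch below uses only the bit ranges shown in that output. The table and function names are hypothetical, and treating bit 1 of Slot_PortEnbl as the Slot1 port is an assumption based on the [1:0] label.

```python
# (field name, high bit, low bit) as printed by shcbr on dump lines 51-66.
PORT_CONFIG_STAT_FIELDS = [
    ("SC_PortEnbl", 1, 0),
    ("BBC_PortEnbl", 2, 2),
    ("Slot_PortEnbl", 4, 3),
    ("SC_MaskErrs", 6, 5),
    ("BBC_MaskErrs", 7, 7),
    ("Slot_MaskErrs", 9, 8),
    ("SC_PortBusy", 11, 10),
    ("BBC_PortBusy", 12, 12),
    ("Slot_PortBusy", 14, 13),
    ("AXQ_CB_MaskErrs", 15, 15),
    ("BoardId", 20, 16),
    ("Mode", 23, 21),
    ("Chip Version", 27, 24),
    ("CBErr=>SCInt", 28, 28),
    ("CBErr=>DStop", 29, 29),
    ("Domain", 31, 30),
]

def decode_fields(value, fields):
    """Extract each (name, high, low) bit field from a register value."""
    return {name: (value >> low) & ((1 << (high - low + 1)) - 1)
            for name, high, low in fields}

decoded = decode_fields(0x068683EF, PORT_CONFIG_STAT_FIELDS)  # dump line 50
for name, field_value in decoded.items():
    print(f"{name:16s} {field_value:#x}")

# Assumption: bit 0 of Slot_PortEnbl is the Slot0 port, bit 1 the Slot1 port.
slot1_port_enabled = bool(decoded["Slot_PortEnbl"] & 0b10)
print("Slot1 console bus port enabled:", slot1_port_enabled)
```

With the value from line 50, Slot_PortEnbl decodes to 1, so only the low bit is set. Under the assumption above, that is the disabled Slot1 (IO6) port noted for line 53.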
- Keywords Section
15K, 12K, SF15K, SF12K, starcat, dstop, Port 0 Safari device asserted error
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:
Copyright (c) 1997-2003 Sun Microsystems, Inc.