SRDB ID: 48494
Synopsis: Sun Fire[TM] 12K/15K: Dstop: Error on notify wire 0 from other DARB
Date: 1 Nov 2002
- Problem Statement:
Dstop: Error on notify wire [01] from other DARB
- Symptoms:
'wfail' output reports something similar to the following:
01 redxl> dumpf load dsmd.dstop.020202.2254.39
02 Created Sat Feb 2 22:54:42 2002
03 By hpost v. 1.2 sms1.2_DR_13 Jan 23 2002 01:06:19 executing as pid=24872
04 On ssc name = xc15p13-sc0.SD_Lab.West.Sun.COM
05 Domain = 4=E = xc15p13-b14 Platform = sun15
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 06000
08 Slot0[17:0]: 04000 Requested/not enabled: 02000
09 Slot1[17:0]: 04000
10 'Not enabled' refers to the Console Bus master port on the parent board.
11 -D option, -d
12 "DSMD DomainStop Dump"
13 Created in a Sun Microsystems Inc. internal environment.
14 4 errors occurred while creating this dump.
15 redxl> wfail
16 SDI EX14/S0: All SDI is DStopped and RStopped, requested by DARB.
17 DARB C0 MasterStatus[31:0] = E0070003
18 MStat[17,18,30,31]: Dstop + Rstop, demand from other DARB
19 MStat[ 0]: Error on notify wire 0 from other DARB
20 MStat[ 1]: Error on notify wire 1 from other DARB
21 MStat[29]: Global Domainstop has occurred
22 MStat[28:25]: ErrorCount[3:0] = 0
23 DARB C0: enabled ports (expanders) [17:0]: 052D2
24 DARB C0: other darb req Dstop+Rstop for exps[17:0]: 052D2
25 FAIL Non-degraded cplane configurations: Dstop/Rstop detected by DARB.
26 Primary service FRU is the logic centerplane.
SOLUTION SUMMARY:
- Troubleshooting:
The dump header tells us that this Dstop was generated by dsmd (lines 11,12)
while a domain was active. This is also evident from the dumpf file name:
dsmd.dstop files are created by dsmd as part of an ASR. Also note that
SB13 was requested during the dump collection (line 08) but was not accessible
by hpost (see Additional background information). Walking the error chain:
- DARB0 reports errors on the notify wires from DARB1 (lines 19,20).
- DARB0 also records a stop demand from DARB1 (line 18) and a Global
Domainstop event (line 21).
- Non-degraded centerplane configurations are FAILed (line 25).
- The centerplane is named as the FRU (line 26).
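The MasterStatus decode that wfail prints (lines 17-22) can be checked by hand. The Python sketch below is illustrative only: the bit positions are copied from the wfail output above, not from the DARB ASIC Specification, and the function and table names are hypothetical. Bits that wfail does not name (for example bit 16, which is also set in E0070003) are left undecoded. The expander mask decode assumes that bit n of an [17:0] mask corresponds to expander EXn.

```python
# Bit meanings copied from the wfail output in this article (lines 18-22).
# This is not a complete register map; all names here are illustrative.
MSTAT_BITS = {
    0: "Error on notify wire 0 from other DARB",
    1: "Error on notify wire 1 from other DARB",
    29: "Global Domainstop has occurred",
}
STOP_DEMAND_BITS = (17, 18, 30, 31)  # Dstop + Rstop, demand from other DARB


def set_bits(value, width):
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


def decode_master_status(value):
    """Decode only the MasterStatus fields that wfail names above."""
    messages = []
    if all((value >> bit) & 1 for bit in STOP_DEMAND_BITS):
        messages.append("Dstop + Rstop, demand from other DARB")
    for bit, text in MSTAT_BITS.items():
        if (value >> bit) & 1:
            messages.append(text)
    error_count = (value >> 25) & 0xF  # MStat[28:25]
    messages.append("ErrorCount[3:0] = %d" % error_count)
    return messages


if __name__ == "__main__":
    for line in decode_master_status(0xE0070003):
        print(line)
    # Expander mask from line 23; assumes bit n corresponds to EXn.
    print("Enabled expanders:", set_bits(0x052D2, 18))
```

For the dump above, the mask 052D2 decodes to expanders 1, 4, 6, 7, 9, 12 and 14, and the same list applies to the stop request on line 24.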
The 12K/15K centerplane (CP) is logically divided into two halves. When
operating at full capacity, the DARB on each logical half of the CP
communicates with the other DARB via connections termed notify wires.
If an error occurs on these pathways, the two logical CP halves can no
longer maintain lockstep.
Since the notify wires are only used while the centerplane is in full
bandwidth, degrading the centerplane to half bandwidth avoids the failure.
Hence the FAIL message on line 25. Furthermore, since the error implies a
loss of lockstep within the centerplane, all domains are affected. Thus
you see the global domain stop indicator (line 21).
The notify wire pathways are copper etch completely contained within the
centerplane. Since the ASICs are also within the centerplane, no interconnects
are involved. The centerplane is the FRU.
- Resolution:
Degrade the centerplane to half bandwidth with 'setbus' until such time as
the centerplane can be repaired/replaced.
- Summary of part numbers and patch IDs
http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html
- References and bug IDs
SunSolve Article 48122
DARB ASIC Specification
- Additional background information:
The signal sent from DARB to DARB is a serial bit stream. A DARB sends
two copies of the signal, in phase, to its peer DARB. The bit stream
includes a checksum. The receiving DARB compares the two notify signals. If
they are not equal and neither checksum is good, a Dstop occurs.
If one of the checksums is good, that copy is trusted and the other is dropped.
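The receive-side decision described above can be sketched in Python as follows. This is a model of the stated rule only, not the DARB implementation: the frame format, the parity check used as a stand-in for the checksum, and all names are assumptions, since this article does not describe the actual checksum. The case where the two copies differ but both checksums are good is not covered by the description, so the sketch reports it as unspecified rather than guessing.

```python
def parity_ok(frame):
    """Stand-in check (assumption): last bit is even parity over the rest."""
    *payload, parity = frame
    return sum(payload) % 2 == parity


def arbitrate(copy_a, copy_b, checksum_ok=parity_ok):
    """Apply the rule from the text to the two received notify copies.

    Returns (action, frame), where action is "accept", "dstop" or
    "unspecified" and frame is the trusted copy, if any.
    """
    if copy_a == copy_b:
        return "accept", copy_a  # copies agree, nothing to arbitrate
    a_ok = checksum_ok(copy_a)
    b_ok = checksum_ok(copy_b)
    if a_ok != b_ok:
        # Exactly one checksum is good: trust that copy, drop the other.
        return "accept", copy_a if a_ok else copy_b
    if not a_ok:
        # Copies differ and neither checksum is good.
        return "dstop", None
    # Copies differ but both checksums are good: not described in the text.
    return "unspecified", None


if __name__ == "__main__":
    good = (1, 0, 1, 0)     # payload 1,0,1 with matching parity bit 0
    damaged = (1, 0, 1, 1)  # same payload, parity bit corrupted
    garbled = (0, 0, 1, 0)  # payload corrupted, parity no longer matches
    print(arbitrate(good, damaged))
    print(arbitrate(damaged, garbled))
```

In the dump analyzed here, MStat[0] and MStat[1] are both set, which is consistent with the "dstop" branch: neither copy could be trusted.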
Also, by examining an SDI's master stop register, we can see the posting
of the global Dstop:
27 redxl> shsdi -e 14
28 Note: Data is displayed from the currently loaded dump file.
29 SDI EX14/S0 Component ID = 54317049
30 Master_Stop_Status0[31:0] = 7004000F
31 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
32 Master_Stop_Status1[31:0] = F2EE0002
33 0x0E CP1StopExp[4:0] MSS1[20:16]
34 3 CP1StopSlot[0:1] MSS1[22:21] Dstop is 1st stop
35 1 CP1StopInfoValid MSS1[23]
36 0x12 CP0StopExp[4:0] MSS1[28:24] Global stop
37 3 CP0StopSlot[0:1] MSS1[30:29] Dstop is 1st stop
38 1 CP0StopInfoValid MSS1[31]
The Expander value 18 (0x12) is used to indicate a global Dstop (line 36). All
SDIs with configured slot boards must acknowledge such a signal and stop. From
this output, we also see the disagreement between the centerplane halves. DARB0
signaled a global Dstop while DARB1 indicated EX14 as the stop requestor (line
33).
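The field extraction behind the shsdi output can be reproduced with a few shifts and masks. The Python sketch below takes the field positions from the MSS1[...] annotations on lines 33-38 and is limited to those fields; the function and table names are hypothetical, and nothing here comes from the SDI register specification itself.

```python
# Field layout copied from the shsdi output (lines 33-38):
# name -> (high bit, low bit) within Master_Stop_Status1.
MSS1_FIELDS = {
    "CP1StopExp": (20, 16),
    "CP1StopSlot": (22, 21),
    "CP1StopInfoValid": (23, 23),
    "CP0StopExp": (28, 24),
    "CP0StopSlot": (30, 29),
    "CP0StopInfoValid": (31, 31),
}
GLOBAL_STOP_EXP = 0x12  # expander value 18 indicates a global Dstop


def extract(value, high, low):
    """Return bits [high:low] of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_mss1(value):
    """Split Master_Stop_Status1 into the fields listed above."""
    return {name: extract(value, high, low)
            for name, (high, low) in MSS1_FIELDS.items()}


def describe_requestor(exp):
    """Name the stop requestor encoded in a StopExp field."""
    return "global stop" if exp == GLOBAL_STOP_EXP else "EX%d" % exp


if __name__ == "__main__":
    fields = decode_mss1(0xF2EE0002)
    for half in ("CP0", "CP1"):
        if fields[half + "StopInfoValid"]:
            exp = fields[half + "StopExp"]
            print(half, "stop requestor:", describe_requestor(exp))
```

Run against F2EE0002, the two halves report different requestors (a global stop on CP0, EX14 on CP1), which is the disagreement noted above.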
Also, in the dump header, it's noted that SB13 was requested but not collected
as part of the dump (line 08). This is an indicator that console bus access to
this component was disabled (line 10). The console bus repeater to a system
board is the expander's SDI (SDI4, specifically). Examine the console bus
repeater on EX13:
39 redxl> shcbr exb 13
40 Note: Data is displayed from the currently loaded dump file.
41 EXBCBR EX13 Component ID = 00000000
42 Port_Config_Stat[31:0] = 048D83E7
43 3 SC_PortEnbl[1:0] CnfSt[1:0]
44 1 BBC_PortEnbl CnfSt[2]
45 0 Slot_PortEnbl[1:0] CnfSt[4:3]
46 3 SC_MaskErrs[1:0] CnfSt[6:5]
47 1 BBC_MaskErrs CnfSt[7]
48 3 Slot_MaskErrs[1:0] CnfSt[9:8]
49 0 SC_PortBusy[1:0] CnfSt[11:10]
50 0 BBC_PortBusy CnfSt[12]
51 0 Slot_PortBusy[1:0] CnfSt[14:13]
52 1 AXQ_CB_MaskErrs CnfSt[15] Rev 4+
We see that console bus access to both SB13 and IO13 is not enabled (line 45).
The platform logs for this system should be investigated to determine the
state of SB13.
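Both observations about SB13 can be cross-checked from the raw values. The Python sketch below decodes the board masks from the dump header (lines 07-09) and the Port_Config_Stat fields from lines 43-52. Field positions are copied from the CnfSt[...] annotations in the shcbr output; the names are hypothetical, bits above CnfSt[15] are ignored because the output does not describe them, and the mask decode assumes that bit n of an [17:0] mask corresponds to expander or board position n.

```python
# Field layout copied from the shcbr output (lines 43-52):
# (name, high bit, low bit) within Port_Config_Stat.
PORT_CONFIG_FIELDS = [
    ("SC_PortEnbl", 1, 0),
    ("BBC_PortEnbl", 2, 2),
    ("Slot_PortEnbl", 4, 3),
    ("SC_MaskErrs", 6, 5),
    ("BBC_MaskErrs", 7, 7),
    ("Slot_MaskErrs", 9, 8),
    ("SC_PortBusy", 11, 10),
    ("BBC_PortBusy", 12, 12),
    ("Slot_PortBusy", 14, 13),
    ("AXQ_CB_MaskErrs", 15, 15),
]


def decode_port_config(value):
    """Split Port_Config_Stat into the fields listed above."""
    fields = {}
    for name, high, low in PORT_CONFIG_FIELDS:
        width = high - low + 1
        fields[name] = (value >> low) & ((1 << width) - 1)
    return fields


def boards_in_mask(mask, width=18):
    """Return the positions set in an [17:0] board mask (bit n = position n)."""
    return [bit for bit in range(width) if (mask >> bit) & 1]


if __name__ == "__main__":
    print("EXB in dump:", boards_in_mask(0x06000))
    print("Slot0 in dump:", boards_in_mask(0x04000))
    print("Slot0 requested/not enabled:", boards_in_mask(0x02000))
    fields = decode_port_config(0x048D83E7)
    if fields["Slot_PortEnbl"] == 0:
        print("EX13: console bus ports to both slots are disabled")
```

The header masks place EX13 and EX14 in the dump, with only the slot 0 board on EX14 collected and the one on EX13 requested but not enabled, matching the Slot_PortEnbl value of 0 on EX13.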
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords
15K, 12K, SF15K, SF12K, Sun Fire 15K, Enterprise, Server, Sun Fire 12K,
starcat, dstop, Error on notify wire [01] from other DARB
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:
Copyright (c) 1997-2003 Sun Microsystems, Inc.