SRDB ID: 48186
Synopsis: Sun Fire[TM] 12K/15K: Dstop: Parity Bidi error
Date: 13 Dec 2002
- Problem Statement:
Dstop: Parity Bidi error
- Symptoms:
'wfail' output reports something similar to the following:
01 redxl> dumpf load dsmd.dstop.011228.1558.01
02 Created Fri Dec 28 15:58:19 2001
03 By hpost v. 1.1 Generic 112080-04 Oct 11 2001 16:13:25 executing as pid=17553
04 On ssc name = sc0.
05 Domain = 1=B Platform = edw
06 Boards in dump: master SC CPs/CSBs[1:0]: 3
07 EXB[17:0]: 3C000
08 Slot0[17:0]: 3C000
09 Slot1[17:0]: 3C000
10 -D option, -d
11 "DSMD DomainStop Dump"
12 0 errors occurred while creating this dump.
13 redxl> wfail
14 SDI EX14/S0 Master_Stop_Status0[31:0] = 5000004F
15 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
16 SDI EX14/S0 Dstop0[31:0] = 10019000
17 Dstop0[16]: D DARB texp requests all Dstop (M)
18 Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
19 EPLD SB14 Err1_Dom0: Mask= 00 Err= 02 1stErr= 02
20 Err1[1]: 1E+ Error reported by SDC
21 SDC SB14 PortErr [0][25:0] = 0028001 (Safari Port 0)
22 P0Err[ 0]: 1E Parity Bidi error
23 P0Err[ 17]: Parity Single error
24 FAIL Port SB14/P0: Dstop detected by SDC
25 Primary service FRU is Slot SB14.
26 SDI EX15/S0: All SDI is DStopped and RStopped, requested by DARB.
27 SDI EX16/S0: All SDI is DStopped and RStopped, requested by DARB.
28 SDI EX17/S0: All SDI is DStopped and RStopped, requested by DARB.
29 DARB C0: enabled ports (expanders) [17:0]: 3C1E8
30 DARB C0: other darb req Dstop+Rstop for exps[17:0]: 04000
31 DARB C1: enabled ports (expanders) [17:0]: 3C1E8
32 DARB C1: other darb req Dstop+Rstop for exps[17:0]: 04000
SOLUTION SUMMARY:
- Troubleshooting:
The dump header tells us that this Dstop was generated by dsmd (lines 10 and 11) while a
domain was active. This is also evident from the dumpf file name: dsmd.dstop files are
created by dsmd as part of an automatic system recovery (ASR). Walking the error chain:
- SDI14 reports a first error of its Slot 0 board asserting error (line 18).
- Next, the EPLD on SB14 is examined, and we see it's reporting an error from
the SDC (line 20).
- Finally, the SDC shows Parity Bidi as its first error (line 22). The parity
error involves Safari Port 0 (line 21), that is, processor SB14/P0.
The SDC on a CPU board interfaces with the four processors, two SBBCs, and the expander
ASICs (AXQ, Master SDI). These connections are parity protected, and the protection
is enforced by the hardware. Thus, we have a parity error on SB14
between the SDC and processor 0. These pathways are entirely contained within SB14.
No interconnects are involved. The FRU is SB14, as called out by wfail (line 25).
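The register values in the wfail output can be cross-checked by hand. The following is a minimal Python sketch, not part of redx, that lists which bits are set in each value quoted above. The helper name set_bits is hypothetical. Only the bits that wfail itself decodes (Dstop0[16], Dstop0[28], P0Err[0], P0Err[17]) have meanings given in this article; the remaining set bits are printed without interpretation.

```python
def set_bits(value: int, width: int) -> list[int]:
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]

# Register values copied from the wfail output and dump header above.
dstop0 = set_bits(0x10019000, 32)    # SDI EX14/S0 Dstop0[31:0]
port_err = set_bits(0x0028001, 26)   # SDC SB14 PortErr[0][25:0]
exb_mask = set_bits(0x3C000, 18)     # EXB[17:0] from the dump header

print("Dstop0 bits set:", dstop0)
print("PortErr bits set:", port_err)
print("Expanders in dump:", [f"EX{n}" for n in exb_mask])
```

Bits 16 and 28 of Dstop0 and bits 0 and 17 of PortErr correspond to the decoded lines 17, 18, 22 and 23 of the listing, and the expander mask matches the EX14 through EX17 entries reported by wfail.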
However, there is a software method that induces a Parity Bidi error. When a
processor is issued a safari reset, it stops generating its piece of the partial
parity that contributes to the Parity Bidi detected by the SDC. Without all
processors generating partial parity, the SDC detects an error, which results in a
Dstop. Therefore, also check whether the Dstop coincides with any invocation of the
'reset' command from the SC.
Unfortunately, executions of 'reset' are not logged. However, there may be some
entries in the message logs that are indicators of a reset. Often, with the execution
of 'reset -d X', messages similar to the following appear in the platform
message log:
Dec 10 17:43:09 2002 mufasa-sc0 hwad[11639]: [0 2591415823836454 NOTICE Index.cc 392]
Lock 283 is already locked by client 489.1 while locking 145 Policy Violation
Dec 10 17:43:09 2002 mufasa-sc0 hwad[11639]: [1156 2591416001226198 ERR InterruptHandler.cc
2159] Domain Stop interrupt detected, domain C
The domain log may have messages similar to the following:
Dec 12 12:52:44 2002 mufasa-sc0 reset[19653]-C(): [0 2746790693483158 NOTICE Index.cc 392]
Lock 283 is already locked by client 19653.1 while locking 145 Policy Violation
Dec 12 12:53:37 2002 mufasa-sc0 dsmd[11689]-C(): [6301 2746844020311971 ERR xirDomain.cc
459] HWAD call failed: iosramProxy.read() , rc = 2
Dec 12 12:53:37 2002 mufasa-sc0 dsmd[11689]-C(): [6309 2746844021544569 ERR xirDomain.cc
150] Error obtaining processor list for domain C
Dec 12 12:53:37 2002 mufasa-sc0 dsmd[11689]-C(): [6309 2746844023627027 ERR xirDomain.cc
1181] Error obtaining processor list for domain C
Dec 12 12:53:55 2002 mufasa-sc0 dsmd[11689]-C(): [2526 2746862101885584 NOTICE Domains
Patrol.cc 132] Domain C stop occurred, rebooting domain.
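Searching the logs for these indicators can be scripted. The following is a rough Python sketch, assuming the log layout shown in the samples above (a timestamp such as "Dec 12 12:52:44 2002" at the start of each entry, with long entries wrapped onto a second line). The function names, the five-minute window, and the choice of "reset" process entries and "Policy Violation" lock messages as reset indicators are assumptions made for illustration, not documented SMS behavior.

```python
import re
from datetime import datetime, timedelta

# Entry header as seen in the samples: "<Mon> <day> <time> <year> <host> <proc>[<pid>]".
STAMP = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) \S+ (\w+)\[\d+\]")

def parse_entries(lines):
    """Group wrapped lines into entries; return (timestamp, process, text) tuples."""
    entries = []
    for line in lines:
        m = STAMP.match(line)
        if m:
            # Month names are parsed with %b, which assumes an English/C locale.
            stamp = " ".join(m.group(1).split())
            when = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
            entries.append([when, m.group(2), line.strip()])
        elif entries:
            entries[-1][2] += " " + line.strip()  # continuation of a wrapped entry
    return [tuple(e) for e in entries]

def resets_near_stops(lines, window=timedelta(minutes=5)):
    """Pair each domain-stop entry with reset indicators logged shortly before it."""
    entries = parse_entries(lines)
    # Assumed indicators: entries written by 'reset', or lock policy violations.
    resets = [when for when, proc, text in entries
              if proc == "reset" or "Policy Violation" in text]
    hits = []
    for when, _, text in entries:
        if "stop occurred" in text or "Domain Stop interrupt" in text:
            near = [r for r in resets if timedelta(0) <= when - r <= window]
            if near:
                hits.append((when, near))
    return hits
```

Passing the lines of a domain log to resets_near_stops returns each domain stop together with the reset indicators that preceded it within the window. An empty result does not prove that no reset was issued, since executions of 'reset' are not logged directly.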
- Resolution:
If the failure occurred close in time to a 'reset' command, the Dstop is a side
effect of the reset and not a hardware error. Otherwise, replace the board reporting
the Parity Bidi error.
- Summary of part number and patch IDs
http://infoserver.central.sun.com/data/sshandbook/Devices/System_Board/SYSBD_SunFire_USIIICu.html
- References and bug IDs
SunSolve Article 48122
4792976 - SBBC's Reset Status Register RESET_SAF inappropriately cleared
- Additional background information:
Among several registers, the SBBC has a device reset register and a reset status
register. From 'redx', they appear as ResetCtl and ResetStat, respectively. For
example:
redxl> shbbc 3 0 0
Note: Data is displayed from the currently loaded dump file.
BBC SB03/BB0 Component ID = 216C407D
DevTemp[8:0] = 04B: Valid 50.49 DegC
DevConfig[31:0] = 18800100
[...]
ResetCtl[29:0] = 00000000
0 Sync[1:0] RstCtl[1:0]
0 StopStick RstCtl[2]
0 SafReset[1:0] RstCtl[5:4]
0 SafXIR[1:0] RstCtl[7:6]
[...]
ResetStat[12:0] = 1000
0 XIR_Asserted[1:0] RstSt[4:3] All fields R/W1C
0 SafResAsserted[1:0] RstSt[6:5]
0 PCIResetAsserted RstSt[7]
[...]
When 'reset' issues a safari reset to a processor, the appropriate SafReset bit is
set. The SBBC tracks this event by setting the corresponding status bit, SafResAsserted.
However, because of bug 4792976, SafResAsserted is inappropriately cleared, so the
Dstop dump contains no hint of a reset action.
Longer term, when 4792976 is addressed, the SafResAsserted bits can also be used
as an indicator of an issued 'reset' command.
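To illustrate how these fields sit inside the raw register values, here is a small Python sketch that extracts the bit fields named in the 'shbbc' listing above. The field positions are copied from that listing. The dictionaries cover only the fields shown there (the listing elides others), and the function name decode is hypothetical.

```python
# Field positions as printed by 'shbbc': name -> (high bit, low bit).
RESET_STAT_FIELDS = {
    "XIR_Asserted": (4, 3),
    "SafResAsserted": (6, 5),
    "PCIResetAsserted": (7, 7),
}
RESET_CTL_FIELDS = {
    "Sync": (1, 0),
    "StopStick": (2, 2),
    "SafReset": (5, 4),
    "SafXIR": (7, 6),
}

def decode(value: int, fields: dict) -> dict:
    """Extract each named bit field from a raw register value."""
    out = {}
    for name, (high, low) in fields.items():
        mask = (1 << (high - low + 1)) - 1
        out[name] = (value >> low) & mask
    return out

reset_stat = decode(0x1000, RESET_STAT_FIELDS)     # ResetStat[12:0] = 1000
reset_ctl = decode(0x00000000, RESET_CTL_FIELDS)   # ResetCtl[29:0] = 00000000
print(reset_stat)
print(reset_ctl)
```

With the values from the listing, every decoded field is zero, including SafResAsserted. A nonzero SafResAsserted field would be the hint of a safari reset described above, once 4792976 is addressed.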
- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:
- Keywords
15K, 12K, SF15K, SF12K, starcat, dstop, Parity Bidi error
SUBMITTER: Scott Davenport
BUG REPORT ID: 4792976
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:
Copyright (c) 1997-2003 Sun Microsystems, Inc.