SRDB ID   Synopsis   Date
48186   Sun Fire[TM] 12K/15K: Dstop: Parity Bidi error   13 Dec 2002

Status Issued

Description
- Problem Statement:

	Dstop: Parity Bidi error

- Symptoms:

	'wfail' output reports something similar to the following:

	   01  redxl> dumpf load dsmd.dstop.011228.1558.01
	   02  Created Fri Dec 28 15:58:19 2001
	   03  By hpost v. 1.1 Generic 112080-04 Oct 11 2001 16:13:25  executing as pid=17553
	   04  On ssc name =  sc0.
	   05  Domain =  1=B    Platform = edw
	   06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
	   07            EXB[17:0]: 3C000
	   08          Slot0[17:0]: 3C000
	   09          Slot1[17:0]: 3C000
	   10  -D option, -d
	   11  "DSMD DomainStop Dump"
	   12  0 errors occurred while creating this dump.
	   13  redxl> wfail
	   14  SDI EX14/S0  Master_Stop_Status0[31:0] = 5000004F
	   15          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
	   16  SDI EX14/S0  Dstop0[31:0] = 10019000
	   17          Dstop0[16]: D    DARB texp requests all Dstop (M)
	   18          Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
	   19  EPLD SB14  Err1_Dom0: Mask= 00  Err= 02  1stErr= 02
	   20          Err1[1]:  1E+ Error reported by SDC
	   21  SDC SB14  PortErr [0][25:0] =  0028001            (Safari Port 0)
	   22          P0Err[    0]:   1E  Parity Bidi error
	   23          P0Err[   17]:       Parity Single error
	   24  FAIL Port SB14/P0:  Dstop detected by SDC
	   25  Primary service FRU is Slot SB14.
	   26  SDI EX15/S0: All SDI is DStopped and RStopped,         requested by DARB.
	   27  SDI EX16/S0: All SDI is DStopped and RStopped,         requested by DARB.
	   28  SDI EX17/S0: All SDI is DStopped and RStopped,         requested by DARB.
	   29  DARB C0: enabled ports (expanders)          [17:0]: 3C1E8
	   30  DARB C0: other darb req Dstop+Rstop for exps[17:0]: 04000
	   31  DARB C1: enabled ports (expanders)          [17:0]: 3C1E8
	   32  DARB C1: other darb req Dstop+Rstop for exps[17:0]: 04000                  
SOLUTION SUMMARY:
- Troubleshooting:

	The dump header shows that this Dstop dump was generated by dsmd (lines 10, 11) while a 
	domain was active. This is also evident from the dumpf file name: dsmd.dstop files are 
	created by dsmd as part of an ASR. Walking the error chain: 

	 - The SDI on EX14 reports, as its first error, that its Slot 0 board asserted 
	   Error (line 18). 
	 - Next, the EPLD on SB14 reports an error from the SDC (line 20). 
	 - Finally, the SDC shows a first error of Parity Bidi error (line 22). The parity 
	   error involves Safari Port 0 (line 21), that is, processor SB14/P0. 

	The SDC on a CPU board interfaces with the four processors, two SBBCs, and the expander 
	ASICs (AXQ, Master SDI). These connections are parity protected, and the protection 
	is enforced by the hardware. Thus, we have a parity error on SB14 
	between the SDC and processor 0. These pathways are entirely contained within SB14. 
	No interconnects are involved. The FRU is SB14, as called out by wfail (line 25).
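
	The error chain walk above can be sketched as a small script. This is only an
	illustration, not a Sun tool: it assumes the 'wfail' text has been saved with the
	dump line numbers stripped, and it assumes that the flag "1E" (optionally followed
	by "+") on a detail line marks a first error. The function and pattern names are
	hypothetical.

```python
import re

# Excerpt of the 'wfail' output shown in the Symptoms section
# (dump line numbers removed).
WFAIL_OUTPUT = """\
SDI EX14/S0  Dstop0[31:0] = 10019000
        Dstop0[16]: D    DARB texp requests all Dstop (M)
        Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
EPLD SB14  Err1_Dom0: Mask= 00  Err= 02  1stErr= 02
        Err1[1]:  1E+ Error reported by SDC
SDC SB14  PortErr [0][25:0] =  0028001            (Safari Port 0)
        P0Err[    0]:   1E  Parity Bidi error
        P0Err[   17]:       Parity Single error
FAIL Port SB14/P0:  Dstop detected by SDC
Primary service FRU is Slot SB14.
"""

# Assumed layout of an indented detail line: "<Reg>[<bit>]: <flags> <description>".
DETAIL = re.compile(r"^\s+(\w+)\[\s*(\d+)\]:\s+(.*)$")
# Assumption: the flag "1E" or "1E+" marks a first error.
FIRST_ERROR = re.compile(r"(?:^|\s)1E\+?\s+(.*)$")
FRU = re.compile(r"^Primary service FRU is (.+?)\.$")


def walk_error_chain(text):
    """Return ([(component, register, bit, description), ...], fru)."""
    chain, fru, component = [], None, None
    for line in text.splitlines():
        fru_match = FRU.match(line)
        if fru_match:
            fru = fru_match.group(1)
            continue
        detail = DETAIL.match(line)
        if detail is None:
            # Non-indented lines name the component being reported,
            # for example "SDC SB14".
            tokens = line.split()
            if len(tokens) >= 2 and not line[0].isspace():
                component = " ".join(tokens[:2])
            continue
        first = FIRST_ERROR.search(detail.group(3))
        if first:
            chain.append((component, detail.group(1),
                          int(detail.group(2)), first.group(1).strip()))
    return chain, fru


chain, fru = walk_error_chain(WFAIL_OUTPUT)
for component, register, bit, description in chain:
    print(f"{component}: {register}[{bit}] {description}")
print("Primary service FRU:", fru)
```

	The last first-error entry in the chain is the deepest reporter, here the SDC on
	SB14, which agrees with the FRU called out by wfail.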
	 
        However, there is also a software action that induces a Parity Bidi error. When a
        processor is issued a safari reset, it stops generating its piece of the partial 
        parity that contributes to the Parity Bidi detected by the SDC. Without all 
        processors generating partial parity, the SDC detects an error, which results in
        a Dstop. Therefore, also correlate the Dstop with any invocations of the 'reset'
        command from the SC.

        Unfortunately, executions of 'reset' are not logged. However, the message logs may
        contain entries that indicate a reset. Often, after 'reset -d X' is executed,
        messages similar to the following appear in the platform message log:

           Dec 10 17:43:09 2002 mufasa-sc0 hwad[11639]: [0 2591415823836454 NOTICE Index.cc 392] 
             Lock 283 is already locked by client 489.1 while locking 145 Policy Violation
           Dec 10 17:43:09 2002 mufasa-sc0 hwad[11639]: [1156 2591416001226198 ERR InterruptHandler.cc 
             2159] Domain Stop interrupt detected, domain C

        The domain log may have messages similar to the following:

           Dec 12 12:52:44 2002 mufasa-sc0 reset[19653]-C(): [0 2746790693483158 NOTICE Index.cc 392] 
             Lock 283 is already locked by client 19653.1 while locking 145 Policy Violation
           Dec 12 12:53:37 2002 mufasa-sc0 dsmd[11689]-C(): [6301 2746844020311971 ERR xirDomain.cc 
             459] HWAD call failed: iosramProxy.read() , rc = 2
           Dec 12 12:53:37 2002 mufasa-sc0 dsmd[11689]-C(): [6309 2746844021544569 ERR xirDomain.cc 
             150] Error obtaining processor list for domain C 
           Dec 12 12:53:37 2002 mufasa-sc0 dsmd[11689]-C(): [6309 2746844023627027 ERR xirDomain.cc 
             1181] Error obtaining processor list for domain C 
           Dec 12 12:53:55 2002 mufasa-sc0 dsmd[11689]-C(): [2526 2746862101885584 NOTICE DomainsPatrol.cc 
             132] Domain C stop occurred, rebooting domain.
- Resolution:

        If the Dstop occurred shortly after a 'reset' command was issued from the SC, it is
        a side effect of the reset and not a hardware error. Otherwise, replace the board
        reporting the Parity Bidi error (SB14 in the example above). 
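
        The correlation between a Dstop and a preceding 'reset' can be sketched as a log
        scan. This is a rough illustration only: the sample lines are copied from the
        domain log excerpt above, the indicator strings are taken from those messages,
        and the 5 minute window is an arbitrary assumption (the logs do not define what
        "proximity" means). All function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Lines copied from the domain log excerpt above, including the line wrapping.
DOMAIN_LOG = """\
Dec 12 12:52:44 2002 mufasa-sc0 reset[19653]-C(): [0 2746790693483158 NOTICE Index.cc 392]
  Lock 283 is already locked by client 19653.1 while locking 145 Policy Violation
Dec 12 12:53:37 2002 mufasa-sc0 dsmd[11689]-C(): [6301 2746844020311971 ERR xirDomain.cc
  459] HWAD call failed: iosramProxy.read() , rc = 2
Dec 12 12:53:55 2002 mufasa-sc0 dsmd[11689]-C(): [2526 2746862101885584 NOTICE Domains
  Patrol.cc 132] Domain C stop occurred, rebooting domain.
"""

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), 1)}

# "<Mon> <day> <hh:mm:ss> <year> <host> <process>[<pid>]"
ENTRY = re.compile(
    r"^(\w{3})\s+(\d+)\s+(\d\d):(\d\d):(\d\d)\s+(\d{4})\s+\S+\s+(\w+)\[\d+\]")


def parse_entries(text):
    """Split a log into entries, folding wrapped lines into their entry."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            mon, day, hh, mm, ss, year, process = match.groups()
            stamp = datetime(int(year), MONTHS[mon], int(day),
                             int(hh), int(mm), int(ss))
            entries.append({"time": stamp, "process": process,
                            "text": line.strip()})
        elif entries and line.strip():
            # Continuation of the previous (wrapped) message.
            entries[-1]["text"] += " " + line.strip()
    return entries


def resets_before_dstop(entries, window=timedelta(minutes=5)):
    """Pair each Dstop entry with reset indicators logged within 'window' before it."""
    resets = [e for e in entries
              if e["process"] == "reset"
              or "is already locked by client" in e["text"]]
    dstops = [e for e in entries
              if "Domain Stop interrupt detected" in e["text"]
              or "stop occurred" in e["text"]]
    return [(r["time"], d["time"])
            for d in dstops for r in resets
            if timedelta(0) <= d["time"] - r["time"] <= window]


for reset_time, dstop_time in resets_before_dstop(parse_entries(DOMAIN_LOG)):
    print(f"reset indicator at {reset_time}, Dstop at {dstop_time}")
```

        A match is only a hint that the Dstop may be a side effect of 'reset'; an empty
        result does not prove a hardware fault, since 'reset' itself is not logged.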

- Summary of part number and patch IDs 

	http://infoserver.central.sun.com/data/sshandbook/Devices/System_Board/SYSBD_SunFire_USIIICu.html
	
- References and bug IDs

	SunSolve Article 48122
        4792976 - SBBC's Reset Status Register RESET_SAF inappropriately cleared

- Additional background information:

	Among its registers, the SBBC has a device reset register and a reset status
        register. In 'redx', they appear as ResetCtl and ResetStat, respectively. For
        example:

           redxl> shbbc 3 0 0
           Note: Data is displayed from the currently loaded dump file.
           BBC SB03/BB0   Component ID = 216C407D
                   DevTemp[8:0] = 04B:  Valid  50.49 DegC
                   DevConfig[31:0] = 18800100
                      [...]
                   ResetCtl[29:0] = 00000000
                      0   Sync[1:0]              RstCtl[1:0]    
                      0   StopStick              RstCtl[2]      
                      0   SafReset[1:0]          RstCtl[5:4]    
                      0   SafXIR[1:0]            RstCtl[7:6]    
                      [...]
                   ResetStat[12:0] = 1000
                      0   XIR_Asserted[1:0]      RstSt[4:3]     All fields R/W1C
                      0   SafResAsserted[1:0]    RstSt[6:5]     
                      0   PCIResetAsserted       RstSt[7]       
                      [...]

        When 'reset' issues a safari reset to a processor, the appropriate SafReset bit is 
        set. The SBBC tracks this event by setting the corresponding SafResAsserted status
        bit. However, because of bug 4792976, SafResAsserted is inappropriately cleared.
        Thus, the Dstop dump does not contain hints of a reset action.

        Longer term, once bug 4792976 is fixed, the SafResAsserted bits can also be used
        as an indicator of an issued 'reset' command.
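
        As an illustration of how these registers break down, the following sketch
        extracts the fields listed in the 'shbbc' excerpt above from a raw register
        value. Only the fields visible in that excerpt are modeled (the elided ones are
        not), the helper names are hypothetical, and the value 0x20 is a made-up example
        of a set SafResAsserted bit, not data from a real dump.

```python
# Field layouts as (high bit, low bit), taken from the 'shbbc' excerpt above.
RESET_CTL_FIELDS = {
    "Sync": (1, 0),
    "StopStick": (2, 2),
    "SafReset": (5, 4),
    "SafXIR": (7, 6),
}
RESET_STAT_FIELDS = {
    "XIR_Asserted": (4, 3),
    "SafResAsserted": (6, 5),
    "PCIResetAsserted": (7, 7),
}


def extract(value, high, low):
    """Return bits [high:low] of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode(value, fields):
    """Decode every known field of a register value."""
    return {name: extract(value, high, low)
            for name, (high, low) in fields.items()}


# ResetStat from the excerpt: none of the listed status fields is set.
print(decode(0x1000, RESET_STAT_FIELDS))

# Hypothetical value with one SafResAsserted bit set.
print(decode(0x20, RESET_STAT_FIELDS))
```

        With the dump value 1000, SafResAsserted decodes to 0, which is consistent with
        the behavior described for bug 4792976: the dump alone gives no hint of a reset.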


- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, starcat, dstop, Parity Bidi error

                              

SUBMITTER: Scott Davenport
BUG REPORT ID: 4792976
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.