SRDB ID   Synopsis   Date
48223   Sun Fire[TM] 12K/15K: Dstop: CP arbiter lockstep consistency check error   31 Oct 2002

Status Issued

Description
- Problem Statement: 

	Dstop: CP arbiter lockstep consistency check error

- Symptoms:

	One or more domains suffer a Dstop with a CP arbiter lockstep
	consistency check error. A typical Dstop signature is:

	   redxl> wfail
	   SDI EX00/S0  Master_Stop_Status0[31:0] = 7004000F
	           MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
	   SDI EX00/S0  Dstop0[31:0] = 00418040
	           Dstop0[16]: D    DARB texp requests all Dstop (M)
	           Dstop0[22]: D 1E SDI internal CP port requested Dstop
	   SDI EX00/S0  CP_Error0[31:0]    = 2000A000  Mask = 580067FF
	           CPErr0[29]: D 1E CP arbiter lockstep consistency check error (M)
	               cp0_{dembusp,texp,unload,demand[1:0]} = 14
	               cp1_{dembusp,texp,unload,demand[1:0]} = 00
	   FAIL EXB EX0:  Dstop/Rstop detected by SDI EX0/S0.
	   Primary service FRU is EXB EX0.
            

SOLUTION SUMMARY:
- Troubleshooting:

	CP arbiter lockstep Dstops have been observed when one or more
	domains were executing a 'setkeyswitch standby' operation or a
	'setkeyswitch on' operation, or when an expander was being inserted
	into the chassis. The domain performing the setkeyswitch operation
	is NOT the domain that experiences the Dstop.

	1. Note the time stamp of the Dstop. For example:

	      % cd EXPLORER_DIR/sf15k
	      % ls [A-R]/adm/dump/dsmd.dstop.020615.20*
	      A/adm/dump/dsmd.dstop.020615.2006.33  C/adm/dump/dsmd.dstop.020615.2009.19
	      B/adm/dump/dsmd.dstop.020615.2008.03  D/adm/dump/dsmd.dstop.020615.2010.22
	   
	   If multiple Dstops are present, use the time stamp of the earliest 
	   Dstop: in this case, the stop for domain A. The time stamp is of 
	   the form YYMMDD.HHMM.SS. Domain A's stop occurred at Jun 15 20:06:33 2002.

	2. Examine the other logs on the system for evidence of a 
	   setkeyswitch operation close to the Dstop time. 
	   'setkeyswitch standby' operations are not logged, so administrator 
	   testimony must be used. 'setkeyswitch on' operations are also 
	   not logged, but since a POST is launched, the POST log can be 
	   used to infer the time of a 'setkeyswitch on'.

	   From explorer data:

	      % cd EXPLORER_DIR/sf15k
	      % ls [A-R]/adm/post/post020615.20*
	      A/adm/post/post020615.2006.34.log  C/adm/post/post020615.2019.57.log
	      A/adm/post/post020615.2009.36.log  C/adm/post/post020615.2027.45.log
	      B/adm/post/post020615.2008.03.log  C/adm/post/post020615.2036.06.log
	      B/adm/post/post020615.2010.18.log  C/adm/post/post020615.2050.55.log
	      B/adm/post/post020615.2019.51.log  D/adm/post/post020615.2010.23.log
	      B/adm/post/post020615.2027.48.log  D/adm/post/post020615.2012.01.log
	      B/adm/post/post020615.2036.15.log  F/adm/post/post020615.2006.52.log
	      B/adm/post/post020615.2050.56.log  F/adm/post/post020615.2024.42.log
	      C/adm/post/post020615.2009.19.log  F/adm/post/post020615.2038.34.log
	      C/adm/post/post020615.2011.02.log  F/adm/post/post020615.2051.56.log

	   'A/adm/post/post020615.2006.34.log' and 'F/adm/post/post020615.2006.52.log' 
	   are very close to the Dstop time stamp (20:06:33). Checking the Cmdline
	   entry in each log, we can infer a 'setkeyswitch on' operation:

	      % grep Cmdline A/adm/post/post020615.2006.34.log
	      # Cmdline:  /opt/SUNWSMS/SMS1.2/bin/hpost -d A -D/var/opt/SUNWSMS/SMS1.2/
	      adm/A/dump/dsmd.dstop.020615.2006.33 -y "DSMD DomainStop Dump"
	      % grep Cmdline F/adm/post/post020615.2006.52.log
	      # Cmdline:  /opt/SUNWSMS/SMS1.2/bin/hpost -d F

	   Domain A's POST run is the processing of the Dstop dump. Domain F's is
	   most likely a 'setkeyswitch on' operation, because 'setkeyswitch on'
	   spawns an 'hpost -d X' process. Also note that the POST log for Domain F
	   is time stamped after the Dstop. This indicates the Dstop occurred during
	   initial processing of the 'setkeyswitch on', before the hpost was spawned.

	   It is possible that a user executed hpost manually, but this is unlikely.
	   This case is therefore consistent with the known causes of CP arbiter
	   lockstep errors.

	   It may be possible to further narrow the cause to a specific bug. Refer
	   to Background Information for more details.

	3. Identify any missing patches that address CP arbiter lockstep causes.
	   Relevant patches are listed below.

	4. In the vast majority of cases, a re-POST (either automatically issued by
	   SMS or manually executed) of the stopped domain(s) clears the CP arbiter 
	   lockstep condition and the domain recovers. In rare cases, the condition
	   is persistent. In such cases, use the following workaround:

		a) Set the keyswitch position for ALL affected domains to OFF.

		   % setkeyswitch -d X off

		b) Degrade the centerplane to half capacity.

		   % setbus -c cs0

		c) Return the centerplane to full capacity.

		   % setbus -c cs0,cs1

		d) Set the keyswitch position for ALL affected domains to ON.

		   % setkeyswitch -d X on

	   Only use this workaround if domains do not recover via normal operations.
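
	The time stamp handling in steps 1 and 2 can be sketched with a few
	shell helpers. This is only an illustration: the function names
	(stamp_of, readable, sort_key, earliest) are hypothetical and are not
	part of SMS or Explorer, and the century is assumed to be 20YY.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not SMS or Explorer tools.

# stamp_of FILE -> prints the YYMMDD.HHMM.SS part of a Dstop dump or POST log name.
stamp_of() {
    base=${1##*/}                 # strip the directory part
    base=${base%.log}             # POST logs end in .log, dumps do not
    case $base in
        dsmd.dstop.*) echo "${base#dsmd.dstop.}" ;;
        post*)        echo "${base#post}" ;;
        *)            return 1 ;;
    esac
}

# readable STAMP -> prints 20YY-MM-DD HH:MM:SS (assumes years 2000-2099).
readable() {
    echo "$1" | sed 's/^\(..\)\(..\)\(..\)\.\(..\)\(..\)\.\(..\)$/20\1-\2-\3 \4:\5:\6/'
}

# sort_key STAMP -> prints YYMMDDHHMMSS, which sorts chronologically as text.
sort_key() {
    echo "$1" | tr -d '.'
}

# earliest FILE... -> prints the file with the earliest time stamp.
earliest() {
    for f in "$@"; do
        echo "$(sort_key "$(stamp_of "$f")") $f"
    done | sort | head -1 | cut -d' ' -f2-
}
```

	With the listing from step 1, 'earliest [A-R]/adm/dump/dsmd.dstop.020615.20*'
	would select domain A's dump, and 'readable 020615.2006.33' prints
	2002-06-15 20:06:33.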

- Resolution:

	Ensure the following patches are applied to SMS 1.2 systems:

	   112481-06 (or higher)
	   112488-06 (or higher)

	** Make sure the special install instructions for 112481 are **
	** performed. If omitted, the system is still vulnerable.    **

	There are currently no patches for SMS 1.1. Upgrading to SMS 1.2 or
	higher is strongly recommended.
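
	A minimal sketch of a revision check for the patches above. The
	function name patch_at_least is hypothetical. It assumes a list of
	installed patch IDs on standard input in the form BASE-REV, such as
	the "Patch: 112481-06 ..." lines that 'showrev -p' typically prints
	on Solaris, and an awk that accepts -v (nawk on older Solaris).

```shell
#!/bin/sh
# Sketch only: patch_at_least is a hypothetical helper, not an SMS command.

# patch_at_least BASE MINREV < patch-list
# Succeeds if standard input contains a field BASE-NN with NN >= MINREV.
patch_at_least() {
    awk -v base="$1" -v min="$2" '
        {
            for (i = 1; i <= NF; i++) {
                # Match only an exact BASE-REV field, e.g. 112481-06.
                if ($i ~ ("^" base "-[0-9]+$")) {
                    split($i, p, "-")
                    if (p[2] + 0 >= min + 0) found = 1
                }
            }
        }
        END { exit(found ? 0 : 1) }
    '
}
```

	For example, 'showrev -p | patch_at_least 112481 6' would succeed only
	if revision 06 or higher is listed. The check says nothing about
	whether the special install instructions for 112481 were performed.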

- Summary of part number and patch IDs:

	112481-06 (or higher)
	112488-06 (or higher)
	
- References and bug IDs

	4671526 - libPower needs to clear board test status when boards are reset 
	4671531 - libKeyswitch needs to deconfigure L1 boards before the expander 
	4699827 - Deconfigure L1 boards should reset Darb ports if necessary 
	4712287 - EXB asic LBIST needs to be skipped when CP is in use 
	4724771 - LibPower should send events synchronously

- Additional background information:

	Note: "Dstop time stamp" in this discussion refers to the time stamp
	      of the _first_ Dstop occurrence, if multiples are present.

	Each DARB has 18 individual ports, each port serving a given expander. 
	At start-of-day for a port, an error present on that port delays the
	initiation of its arbitration cycle, because the error must first be
	handled and cleared. The remaining error-free ports are not affected. 
	Because of the delay, the ports do not all start arbitration cycling
	at the same time. Hence, they fall out of lockstep.

	There are three known triggers for CP arbiter lockstep errors:

	   1. Incomplete deconfiguration of centerplane ports when a domain
	      goes to keyswitch position STANDBY/OFF (4671526, 4671531, 
	      and 4699827)
	   2. Failure to deconfigure the centerplane ports (4724771)
	   3. LBIST test execution on the SDI ASIC (4712287)

	By interpreting the data, it is often possible to determine 
	precisely which trigger and bug were encountered. The first step is
	to obtain the 'cp-ports' utility from:

	   http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/cp-ports-dl.html

	This utility examines either a live system or a hardware state dump
	and outputs the state of the centerplane ports in relevant centerplane
	ASICs. For a given port, the state of the DARB masks and the ASIC ports
	should be consistent (either all "Enabled" or all "Disabled"). For
	example:

	% cp-ports dsmd.dstop.020615.2010.22

	Using dump file dsmd.dstop.020615.2010.22...
	Collecting DARB 0 information...mask is  01E00...done.
	Collecting DARB 1 information...mask is  01E00...done.
	Collecting AMX 0.0 information...done.
	Collecting AMX 0.1 information...done.
	Collecting AMX 1.0 information...done.
	Collecting AMX 1.1 information...done.
	Collecting RMX 0 information...done.
	Collecting RMX 1 information...done.

	        Domain Mask         Port Status         Port Status         Port Status         Port Status
	Port  DARB 0    DARB 1    DARB 0    DARB 1    AMX 0.0   AMX 0.1   AMX 1.0   AMX 1.1   RMX 0     RMX 1
	----  --------  --------  --------  --------  --------  --------  --------  --------  --------  --------
	  0   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	  1   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	  2   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	  3   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	  4   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	  5   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	  6   Disabled  Disabled  Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   
	  7   Disabled  Disabled  Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   
	  8   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	  9   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   
	 10   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   
	 11   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   
	 12   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   Enabled   
	 13   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	 14   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	 15   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	 16   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  
	 17   Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  Disabled  

	From this output, we see that ports 6 and 7 are not completely deconfigured.
	Thus, trigger #1 is the cause. This can be verified further: either a nearby
	POST log starts after the Dstop time stamp (i.e. 'setkeyswitch on') or there
	is no nearby POST log (i.e. 'setkeyswitch standby'). To determine which
	specific bug was encountered, examine the patch level of the system. Perusing
	the patch README for details on which patch level corrected which bug is left
	as an exercise for the reader.
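
	The consistency check can be sketched as a small filter over the
	cp-ports table. The function name inconsistent_ports is hypothetical,
	and the sketch assumes the table layout shown above (a port number
	followed by ten state columns) arrives on standard input.

```shell
#!/bin/sh
# Sketch only: inconsistent_ports is a hypothetical helper, not part of cp-ports.

# inconsistent_ports < cp-ports-output
# Prints the number of every port whose ten state columns are not identical.
inconsistent_ports() {
    awk '
        # Table rows start with a port number and carry ten state columns.
        $1 ~ /^[0-9]+$/ && NF == 11 {
            same = 1
            for (i = 3; i <= NF; i++)
                if ($i != $2) same = 0
            if (!same) print $1
        }
    '
}
```

	Applied to the table above, the filter would report ports 6 and 7.
	An empty result means every port is either fully enabled or fully
	disabled.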

	If the cp-ports output shows all centerplane ports completely deconfigured, 
	the timing of the Dstop versus a nearby POST log can be used to distinguish
	between triggers #2 and #3. As above, if the POST log starts after the Dstop
	time stamp, this is trigger #2. If the POST log starts before the Dstop time
	stamp, the Dstop occurred during POST. This is trigger #3.
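
	The decision rules in this section can be summarized in a short
	sketch. The function name classify_trigger and its arguments are
	hypothetical. The cases this document does not cover (no nearby POST
	log with consistent ports, or identical time stamps) are reported as
	"undetermined" rather than guessed.

```shell
#!/bin/sh
# Sketch only: classify_trigger is a hypothetical helper encoding the rules above.

# classify_trigger CONSISTENT DSTOP_STAMP [POST_STAMP]
#   CONSISTENT  "yes" if cp-ports shows every port fully consistent, else "no"
#   *_STAMP     time stamps of the form YYMMDD.HHMM.SS
# Prints 1, 2, 3 or "undetermined".
classify_trigger() {
    if [ "$1" = "no" ]; then
        echo 1                      # incomplete deconfiguration of ports
        return
    fi
    if [ -z "$3" ]; then
        echo undetermined           # no nearby POST log to compare against
        return
    fi
    dstop=$(echo "$2" | tr -d '.')  # YYMMDDHHMMSS compares as a number
    post=$(echo "$3" | tr -d '.')
    if [ "$post" -gt "$dstop" ]; then
        echo 2                      # POST log starts after the Dstop
    elif [ "$post" -lt "$dstop" ]; then
        echo 3                      # Dstop occurred during POST
    else
        echo undetermined
    fi
}
```

	For example, 'classify_trigger no 020615.2010.22' prints 1, which
	matches the cp-ports example above.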

- Meta-Data/Problem categorization:

Product/Platform: SF12K/SF15K
Category:

- Keywords

dstop, 15K, 12K, SF15K, SF12K, starcat, CP arbiter lockstep consistency check error            

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
BUG REPORT ID: 4671526, 4671531, 4699827, 4712287, 4724771
PATCH ID: 112481-06, 112488-06
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.