Document fins/I0827-1
FIN #: I0827-1
SYNOPSIS: SSP 3.4 and 3.5 'showfailover' output may be inconsistent with the
actual E10K clock source and may lead to domain arbstops when an
active control board is hot-swapped
DATE: May/15/02
KEYWORDS: SSP 3.4 and 3.5 'showfailover' output may be inconsistent with the
actual E10K clock source and may lead to domain arbstops when an
active control board is hot-swapped
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: SSP 3.4 and 3.5 'showfailover' output may be inconsistent with
the actual E10K clock source and may lead to domain arbstops when
an active control board is hot-swapped.
Sun Alert: No
TOP FIN/FCO REPORT: No
PRODUCT_REFERENCE: SSP3.4, SSP3.5 on E10000
PRODUCT CATEGORY: Server / Service
PRODUCTS AFFECTED:
Systems affected:
-----------------
Mkt_ID  Platform  Model  Description             Serial Number
------  --------  -----  -----------             -------------
-       E10000    ALL    Ultra Enterprise 10000  -
X-Options affected:
-------------------
Mkt_ID          Platform  Model  Description                    Serial Number
------          --------  -----  -----------                    -------------
SSP9S-340-SAM9  -         -      E10000 SSP SW 3.4, CD RELEASE  -
SSP9S-350-SAM9  -         -      E10000 SSP SW 3.5, CD RELEASE  -
PART NUMBERS AFFECTED:
Part Number  Description                       Model
-----------  -----------                       -----
501-4345-55  ECB Assy Set Control E10K         -
501-4839-02  ECB Assy Set Control E10K         -
501-5494-02  ECB Assy Set Control Tested STF+  -
REFERENCES:
BugId: 4661808 - RFE: We need to check to provide an option for an SA
to check the hw clock source.
PROBLEM DESCRIPTION:
It has been found that the output from the SSP 'showfailover' command
for SSP 3.4 and 3.5 may be inconsistent with the actual clock source
for an E10000 system. In this situation, all system domains may
arbstop when a service person, using the information from
'showfailover', removes a primary Control Board which is the active
clock source for the system.
In this scenario, the ssp_resource file may show "sysclock:0" as the
clock source while the hardware is actually using "sysclock:1". The
showfailover command then reports the main clock source as Control
Board 0 (CB0). When CB1 is pulled for replacement on the basis of the
'showfailover' output, a global arbstop occurs.
The arbstop produces messages such as the following, which originate
from the CB1 clock:
gaarb 0 arbstoplog[15:0] = 7fff recordstoplog[15:0] = 7fff
gaarb 1 pll error
The showfailover output from SSP 3.4 at the time:
Failover State:
SSP Failover: Disabled
CB Failover: Disabled
Failover Connection Map:
Main SSP to Spare SSP thru Main Hub: GOOD
Main SSP to Spare SSP thru Spare Hub: GOOD
Main SSP to Primary Control Board: GOOD
Main SSP to Spare Control Board: FAILED
Spare SSP to Main SSP thru Main Hub: GOOD
Spare SSP to Main SSP thru Spare Hub: GOOD
Spare SSP to Primary Control Board: GOOD
Spare SSP to Spare Control Board: FAILED
SSP/CB Host Information
Main SSP: ssp-obb1
Spare SSP: -
Primary Control Board (JTAG source): erp-obb1cb0
Spare Control Board: erp-obb1cb1
System Clock source: erp-obb1cb0
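The showfailover listing above can be read mechanically. The following is a minimal sketch, assuming the output always takes the 'label: value' form shown in this FIN; the helper names parse_showfailover, reported_clock_role and failed_links are hypothetical and are not part of the SSP software. Note that it only reports the SSP's view of the clock source, which is exactly the value this FIN warns may be wrong.

```python
# Sketch only: parses showfailover text in the layout shown in this FIN.
# All function names here are illustrative, not SSP commands or APIs.

SAMPLE = """\
Failover State:
SSP Failover: Disabled
CB Failover: Disabled
Failover Connection Map:
Main SSP to Spare SSP thru Main Hub: GOOD
Main SSP to Spare SSP thru Spare Hub: GOOD
Main SSP to Primary Control Board: GOOD
Main SSP to Spare Control Board: FAILED
Spare SSP to Main SSP thru Main Hub: GOOD
Spare SSP to Main SSP thru Spare Hub: GOOD
Spare SSP to Primary Control Board: GOOD
Spare SSP to Spare Control Board: FAILED
SSP/CB Host Information
Main SSP: ssp-obb1
Spare SSP: -
Primary Control Board (JTAG source): erp-obb1cb0
Spare Control Board: erp-obb1cb1
System Clock source: erp-obb1cb0
"""


def parse_showfailover(text):
    """Return a dict of label -> value for every 'label: value' line."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.strip().partition(":")
        # Section headings end in ':' with no value, or have no ':' at all.
        if sep and value.strip():
            fields[label.strip()] = value.strip()
    return fields


def reported_clock_role(fields):
    """Say whether the SSP believes the clock comes from the primary or spare CB."""
    source = fields.get("System Clock source")
    if source == fields.get("Primary Control Board (JTAG source)"):
        return "primary"
    if source == fields.get("Spare Control Board"):
        return "spare"
    return "unknown"


def failed_links(fields):
    """List the connection-map entries reported as FAILED."""
    return [label for label, value in fields.items() if value == "FAILED"]


if __name__ == "__main__":
    info = parse_showfailover(SAMPLE)
    print("SSP-reported clock source:", info["System Clock source"],
          "(" + reported_clock_role(info) + ")")
    print("Failed links:", failed_links(info))
```

For the sample above, the SSP reports the clock source as the primary Control Board (CB0), even though the hardware was in fact clocked from CB1.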
When CB1 was pulled, the system clock to all domains was lost,
causing a global arbstop.
The following is a sample of the messages logged when CB1 was pulled:
Mar 19 15:59:19 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info]
Data passed in is: confConBrdList.0 0
Mar 19 15:59:20 ssp-obb1 actionsysconfchange: [ID 702911 local0.info]
Control board 0 VccFan voltage being set.
Mar 19 15:59:34 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info]
Data passed in is: confConBrdList.0 0-1
Mar 19 15:59:35 ssp-obb1 actionsysconfchange: [ID 702911 local0.info]
Control board 0 VccFan voltage being set.
Mar 19 15:59:38 ssp-obb1 actionsysconfchange: [ID 702911 local0.info]
Control board 1 VccFan voltage being set.
Mar 19 16:01:33 ssp-obb1 syslog: [ID 702911 local0.warning] cb_reset:
WARNING: cb_reset.c, 495: Resetting host erp-obb1cb1
Mar 19 16:03:27 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info]
Data passed in is: confConBrdList.0 0
Mar 19 16:03:27 ssp-obb1 actionsysconfchange: [ID 702911 local0.info]
Control board 0 VccFan voltage being set.
Mar 19 16:03:41 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info]
Data passed in is: confConBrdList.0 0-1
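The excerpt above can be reduced to a timeline of control board list changes. The following is a minimal sketch, assuming that each syslog entry starts with a 'Mon DD HH:MM:SS' timestamp, that wrapped lines belong to the preceding entry, and that confConBrdList.0 carries the control boards the SSP currently sees (an interpretation that this FIN does not state). The helper names are hypothetical.

```python
import re

# Sketch only: the log layout is taken from the excerpt in this FIN.
ENTRY_START = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")
CB_LIST = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) .*confConBrdList\.0 (\S+)")

LOG = """\
Mar 19 15:59:19 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info]
Data passed in is: confConBrdList.0 0
Mar 19 15:59:20 ssp-obb1 actionsysconfchange: [ID 702911 local0.info]
Control board 0 VccFan voltage being set.
Mar 19 15:59:34 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info]
Data passed in is: confConBrdList.0 0-1
Mar 19 16:01:33 ssp-obb1 syslog: [ID 702911 local0.warning] cb_reset:
WARNING: cb_reset.c, 495: Resetting host erp-obb1cb1
Mar 19 16:03:27 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info]
Data passed in is: confConBrdList.0 0
"""


def join_wrapped(text):
    """Re-join entries that were wrapped onto a second line."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)          # a new timestamped entry
        else:
            entries[-1] += " " + line     # continuation of the previous entry
    return entries


def cb_list_timeline(text):
    """Return (timestamp, confConBrdList value) pairs in log order."""
    timeline = []
    for entry in join_wrapped(text):
        match = CB_LIST.match(entry)
        if match:
            timeline.append((match.group(1), match.group(2)))
    return timeline


if __name__ == "__main__":
    for stamp, boards in cb_list_timeline(LOG):
        print(stamp, "control board list:", boards)
```

A list that alternates between "0" and "0-1" within a few minutes, as in the excerpt, is the kind of pattern worth noting before any Control Board is pulled.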
This problem is caused by a "flickering" of the main Control Board,
which continually generates a clock signal and thereby prevents the
sysclock on the domains from failing over.
The permanent fix will be a workaround integrated into the SSP
software; its details will become known once the problem is more fully
understood. In the interim, a procedural change to the Control Board
replacement process will protect customers from unnecessary downtime.
In this new procedure, 'showfailover' can be used to identify a failing
control board. Once a hardware failure of a Control Board on any
E10000 system is identified, a special script is run to determine the
true clock source being used by the system. See details below.
IMPLEMENTATION:
---
| | MANDATORY (Fully Proactive)
---
---
| | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
CORRECTIVE ACTION:
An Authorized Enterprise Field Service Representative may avoid the
above-mentioned problems by following the recommendations below.
Prior to hot-swapping a Control Board:
1. Run the clock-check.sh script. The script is available at:
http://cpre-amer.west/esg/hsg/starfire/tools/cgi-bin/clock-check.sh
Sample of the expected output:
southpark-ssp2:cartman% ./clock-check.sh
Board                     Clock Source
--------------------      -------------------
System Board 0            Control Board 1
System Board 1            Control Board 1
System Board 2            Control Board 1
System Board 3            Not present
System Board 4            Control Board 1
System Board 5            Control Board 1
System Board 6            Control Board 1
System Board 7            Control Board 1
System Board 8            Control Board 1
System Board 9            Control Board 1
System Board A            Control Board 1
System Board B            Control Board 1
System Board C            Control Board 1
System Board D            Control Board 1
System Board E            Control Board 1
System Board F            Control Board 1
Cplane Support Board 0    Control Board 1
Cplane Support Board 1    Control Board 1
Control board 1 is listed as primary in cb_config.
2. If the output matches the above, you may safely pull the spare
Control Board.
3. If the output instead resembles the following, DO NOT pull the
spare Control Board, as this may arbstop the domains. Instead,
schedule a maintenance window in which all domains can be brought
down gracefully to replace the Control Board.
southpark-ssp:chef% ./clock-check.sh
Board                     Clock Source
--------------------      -------------------
System Board 0            Control Board 1
System Board 1            Control Board 1
System Board 2            Control Board 1
System Board 3            Not present
System Board 4            Control Board 1
System Board 5            Control Board 1
System Board 6            Control Board 1
System Board 7            Control Board 1
System Board 8            Control Board 1
System Board 9            Control Board 1
System Board A            Control Board 1
System Board B            Control Board 1
System Board C            Control Board 1
System Board D            Control Board 1
System Board E            Control Board 1
System Board F            Control Board 1
Cplane Support Board 0    Control Board 1
Cplane Support Board 1    Control Board 1
Control board 0 is listed as primary in cb_config.
WARNING: The SSP and hardware disagree on clock source!!
NOTE: An RFE has been filed to include the HW read into showfailover.
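The check in steps 2 and 3 can be sketched as a small parser over saved clock-check.sh output. This is an illustration only, assuming the output keeps the layout of the two samples above; analyze_clock_check and the shortened GOOD and BAD samples are hypothetical and are not part of clock-check.sh or the SSP software. It is a reading aid, not a substitute for the procedure above.

```python
import re

# Patterns follow the sample clock-check.sh output shown in this FIN.
BOARD_RE = re.compile(
    r"^(?:System Board [0-9A-F]|Cplane Support Board \d)\s+"
    r"(?:Control Board (\d)|Not present)$"
)
PRIMARY_RE = re.compile(r"^Control board (\d) is listed as primary in cb_config")

# Shortened samples in the same layout as the full listings above.
GOOD = """\
Board Clock Source
-------------------- -------------------
System Board 0 Control Board 1
System Board 3 Not present
System Board A Control Board 1
Cplane Support Board 0 Control Board 1
Control board 1 is listed as primary in cb_config.
"""
BAD = (
    GOOD.replace("Control board 1 is listed", "Control board 0 is listed")
    + "WARNING: The SSP and hardware disagree on clock source!!\n"
)


def analyze_clock_check(text):
    """Compare the hardware clock sources with the cb_config primary."""
    hw_sources = set()   # control boards the hardware reports as clock source
    ssp_primary = None   # control board listed as primary in cb_config
    warned = False       # True if the script printed its WARNING line
    for raw in text.splitlines():
        line = raw.strip()
        board = BOARD_RE.match(line)
        if board:
            if board.group(1) is not None:   # skip "Not present" slots
                hw_sources.add(int(board.group(1)))
            continue
        primary = PRIMARY_RE.match(line)
        if primary:
            ssp_primary = int(primary.group(1))
        elif line.startswith("WARNING:"):
            warned = True
    # Consistent only if every present board is clocked from the CB that
    # cb_config lists as primary and the script raised no warning.
    consistent = (
        not warned and ssp_primary is not None and hw_sources == {ssp_primary}
    )
    return {
        "hw_sources": hw_sources,
        "ssp_primary": ssp_primary,
        "consistent": consistent,
    }


if __name__ == "__main__":
    for name, text in (("GOOD", GOOD), ("BAD", BAD)):
        result = analyze_clock_check(text)
        verdict = "spare CB may be hot-swapped" if result["consistent"] \
            else "DO NOT hot-swap; schedule a maintenance window"
        print(name, result, "->", verdict)
```

The sketch treats any disagreement, a missing cb_config line, or a WARNING line as unsafe, so an unexpected output format fails toward scheduling a maintenance window rather than toward a hot-swap.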
COMMENTS:
None
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as
the need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO
index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.