Document fins/I0827-1


FIN #: I0827-1

SYNOPSIS: SSP 3.4 and 3.5 'showfailover' output may be inconsistent with the
          actual E10K clock source and may lead to domain arbstops when an
          active control board is hot-swapped

DATE: May/15/02

KEYWORDS: SSP 3.4 and 3.5 'showfailover' output may be inconsistent with the
          actual E10K clock source and may lead to domain arbstops when an
          active control board is hot-swapped


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: SSP 3.4 and 3.5 'showfailover' output may be inconsistent with 
          the actual E10K clock source and may lead to domain arbstops when 
          an active control board is hot-swapped.


Sun Alert:          No

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  SSP3.4, SSP3.5 on E10000 
 
PRODUCT CATEGORY:   Server / Service


PRODUCTS AFFECTED:  

Systems affected:
-----------------  
Mkt_ID   Platform   Model   Description               Serial Number
------   --------   -----   -----------               -------------
  -       E10000     ALL    Ultra Enterprise 10000          -


X-Options affected:
-------------------
Mkt_ID          Platform   Model   Description                   Serial Number
------          --------   -----   -----------                   -------------
SSP9S-340-SAM9     -         -     E10000 SSP SW 3.4, CD RELEASE       -
SSP9S-350-SAM9     -         -     E10000 SSP SW 3.5, CD RELEASE       -


PART NUMBERS AFFECTED: 

Part Number     Description                         Model
-----------     -----------                         -----
501-4345-55     ECB Assy Set Control E10K             -
501-4839-02     ECB Assy Set Control E10K             -
501-5494-02     ECB Assy Set Control Tested STF+      -


REFERENCES:

BugId: 4661808 - RFE: We need to check to provide an option for an SA 
                 to check the hw clock source.
 
     
PROBLEM DESCRIPTION:

It has been found that the output of the SSP 'showfailover' command
in SSP 3.4 and 3.5 may be inconsistent with the actual clock source
of an E10000 system.  In this situation, all system domains may
arbstop when a service person, relying on the information from
'showfailover', removes a Control Board which is in fact the active
clock source for the system.  

In this scenario, the ssp_resource file may show "sysclock:0" as the
clock source, but the hardware is actually using "sysclock:1" as the
clock source.  The showfailover command shows the main clock source on
Control Board 0 (CB0).  When CB1 is replaced, based on the output of
'showfailover', a global arbstop occurs.

When the arbstop occurs, messages such as the following example are
reported, relating to the CB1 clock:

   gaarb 0     arbstoplog[15:0] = 7fff   recordstoplog[15:0] = 7fff
   gaarb 1     pll error  

Example 'showfailover' output from SSP 3.4:

   Failover State:
        SSP Failover: Disabled 
        CB  Failover: Disabled 
   Failover Connection Map:
        Main SSP to Spare SSP thru Main Hub:   GOOD
        Main SSP to Spare SSP thru Spare Hub:  GOOD
        Main SSP to Primary Control Board:     GOOD
        Main SSP to Spare Control Board:       FAILED
        Spare SSP to Main SSP thru Main Hub:   GOOD
        Spare SSP to Main SSP thru Spare Hub:  GOOD
        Spare SSP to Primary Control Board:    GOOD
        Spare SSP to Spare Control Board:      FAILED
    SSP/CB Host Information
        Main SSP:                              ssp-obb1
        Spare SSP:                             -
        Primary Control Board (JTAG source):   erp-obb1cb0
        Spare Control Board:                   erp-obb1cb1
        System Clock source:                   erp-obb1cb0
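
The fields in the output above can be pulled out programmatically when
collecting data for an escalation.  The following is a minimal Python
sketch, not part of SSP: the function name parse_showfailover and the
"label: value" parsing rule are assumptions based only on the sample
shown here.  It reports the SSP's view of the clock source, which, as
this FIN describes, may differ from what the hardware is using.

```python
import re


def parse_showfailover(text):
    """Return a dict of 'label: value' pairs found in showfailover-style output.

    Hypothetical helper: the layout is inferred from the sample in this FIN.
    """
    fields = {}
    for line in text.splitlines():
        # Split at the first colon that is followed by whitespace and a value,
        # so section headings such as "Failover State:" are skipped.
        match = re.match(r"^\s*(.+?):\s+(\S.*?)\s*$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2)
    return fields


SAMPLE = """\
    SSP/CB Host Information
        Primary Control Board (JTAG source):   erp-obb1cb0
        Spare Control Board:                   erp-obb1cb1
        System Clock source:                   erp-obb1cb0
"""

info = parse_showfailover(SAMPLE)
print(info["System Clock source"])  # prints "erp-obb1cb0"
```

The value printed is only what the SSP believes; it should not be used
on its own to decide which Control Board is safe to remove.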

When CB1 was pulled, the system clock to all the domains was lost,
causing a global arbstop.

Here is a sample of the messages logged when CB1 was pulled:

   Mar 19 15:59:19 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info]
        Data passed in is: confConBrdList.0 0
   Mar 19 15:59:20 ssp-obb1 actionsysconfchange: [ID 702911 local0.info] 
        Control board 0 VccFan voltage being set.
   Mar 19 15:59:34 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info] 
        Data passed in is: confConBrdList.0 0-1
   Mar 19 15:59:35 ssp-obb1 actionsysconfchange: [ID 702911 local0.info] 
        Control board 0 VccFan voltage being set.
   Mar 19 15:59:38 ssp-obb1 actionsysconfchange: [ID 702911 local0.info]
        Control board 1 VccFan voltage being set.
   Mar 19 16:01:33 ssp-obb1 syslog: [ID 702911 local0.warning] cb_reset:
        WARNING: cb_reset.c, 495: Resetting host erp-obb1cb1
   Mar 19 16:03:27 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info] 
        Data passed in is: confConBrdList.0 0
   Mar 19 16:03:27 ssp-obb1 actionsysconfchange: [ID 702911 local0.info] 
        Control board 0 VccFan voltage being set.
   Mar 19 16:03:41 ssp-obb1 SystemConfChangeact: [ID 702911 local0.info] 
        Data passed in is: confConBrdList.0 0-1

This problem is caused by a "flickering" of the main Control Board
which continually generates a clock signal, thereby not allowing the
sysclock on the domains to fail over.

The permanent fix is expected to be a workaround integrated into the
SSP software.  Its details will be determined once the problem is more
fully understood.  In the interim, a procedural change to the Control
Board replacement process will protect customers from unnecessary
downtime.

In this new procedure, 'showfailover' can be used to identify a failing
control board.  Once a hardware failure of a Control Board on any
E10000 system is identified, a special script is run to determine the
true clock source being used by the system.  See details below.


IMPLEMENTATION:

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

An Authorized Enterprise Field Service Representative may avoid the
above-mentioned problems by following the recommendations below.

Prior to hot-swapping a Control Board:

   1.  Run the clock-check.sh script.  The script is available at:

       http://cpre-amer.west/esg/hsg/starfire/tools/cgi-bin/clock-check.sh
        
       Sample of the expected output:

          southpark-ssp2:cartman% ./clock-check.sh

          Board                   Clock Source
          --------------------    -------------------
          System Board 0          Control Board 1
          System Board 1          Control Board 1
          System Board 2          Control Board 1
          System Board 3          Not present
          System Board 4          Control Board 1
          System Board 5          Control Board 1
          System Board 6          Control Board 1
          System Board 7          Control Board 1
          System Board 8          Control Board 1
          System Board 9          Control Board 1
          System Board A          Control Board 1
          System Board B          Control Board 1
          System Board C          Control Board 1
          System Board D          Control Board 1
          System Board E          Control Board 1
          System Board F          Control Board 1
          Cplane Support Board 0  Control Board 1
          Cplane Support Board 1  Control Board 1

          Control board 1 is listed as primary in cb_config.

   2.  If the output resembles the above, with the hardware clock
       source matching the Control Board listed as primary in
       cb_config, you may safely pull the spare Control Board.

   3.  If output like the following is displayed, DO NOT pull the
       spare Control Board, as this may arbstop the domains.  Instead,
       schedule a maintenance window in which all domains can be
       brought down gracefully to replace the Control Board.

           southpark-ssp:chef% ./clock-check.sh

           Board                   Clock Source
           --------------------    -------------------
           System Board 0          Control Board 1
           System Board 1          Control Board 1
           System Board 2          Control Board 1
           System Board 3          Not present
           System Board 4          Control Board 1
           System Board 5          Control Board 1
           System Board 6          Control Board 1
           System Board 7          Control Board 1
           System Board 8          Control Board 1
           System Board 9          Control Board 1
           System Board A          Control Board 1
           System Board B          Control Board 1
           System Board C          Control Board 1
           System Board D          Control Board 1
           System Board E          Control Board 1
           System Board F          Control Board 1
           Cplane Support Board 0  Control Board 1
           Cplane Support Board 1  Control Board 1

           Control board 0 is listed as primary in cb_config.
           WARNING: The SSP and hardware disagree on clock source!!
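
The comparison made in steps 2 and 3 can also be applied to saved
clock-check.sh output.  The following is a minimal Python sketch under
stated assumptions: the function name check_clock_output is
hypothetical, and the line formats it matches are taken only from the
two samples above, not from the script itself.  It treats the output as
safe only when every present board reports the same Control Board that
cb_config lists as primary.

```python
import re


def check_clock_output(text):
    """Summarize clock-check.sh-style output (format assumed from this FIN).

    Returns (hardware_sources, configured_primary, agree).
    """
    hardware_sources = set()
    configured_primary = None
    for line in text.splitlines():
        # Rows such as "System Board 0          Control Board 1".
        # Rows reading "Not present" do not match and are skipped.
        row = re.match(
            r"^\s*(?:System Board|Cplane Support Board)\s+\S+\s+"
            r"Control Board (\d)\s*$",
            line,
        )
        if row:
            hardware_sources.add(int(row.group(1)))
            continue
        # Trailer such as "Control board 1 is listed as primary in cb_config."
        cfg = re.match(
            r"^\s*Control board (\d) is listed as primary in cb_config", line
        )
        if cfg:
            configured_primary = int(cfg.group(1))
    # Agreement requires a known primary and exactly one hardware source
    # that is the same board.
    agree = (
        configured_primary is not None
        and hardware_sources == {configured_primary}
    )
    return hardware_sources, configured_primary, agree


MISMATCH = """\
System Board 0          Control Board 1
System Board 3          Not present
Cplane Support Board 0  Control Board 1

Control board 0 is listed as primary in cb_config.
"""

sources, primary, agree = check_clock_output(MISMATCH)
print(agree)  # prints "False": hardware uses CB1, cb_config lists CB0
```

A result of False corresponds to the WARNING case in step 3; the script's
own output remains the authoritative check before any hot-swap.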
        
NOTE: An RFE (BugId 4661808) has been filed to have 'showfailover'
      read the clock source from the hardware.


COMMENTS:  

None 

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to
     contact all affected customers to recommend implementation of
     the FIN.

ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
     support teams will recommend implementation of the FIN  (to their
     respective accounts), at the convenience of the customer.

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as
     the need arises.
----------------------------------------------------------------------------

All released FINs and FCOs can be accessed using your favorite network
browser as follows:

SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.

SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO
  index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------

General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------




Copyright (c) 1997-2003 Sun Microsystems, Inc.