

FIN #: I0852-1

SYNOPSIS: Some pcisch driver panics on F15K systems are unrelated to failed
          hardware

DATE: Nov/15/02

KEYWORDS: Some pcisch driver panics on F15K systems are unrelated to failed
          hardware


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)

           

SYNOPSIS:  Some pcisch driver panics on F15K systems are unrelated to
           failed hardware.

      
SunAlert:           No

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  Sun Fire 15K 
 
PRODUCT CATEGORY:   Server / Service


PRODUCTS AFFECTED:  

Systems Affected:
-----------------  
Mkt_ID   Platform   Model   Description          Serial Number
------   --------   -----   -----------          -------------
  -        F15K      ALL    Sun Fire 15000             -
  

X-Options Affected:
-------------------
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   -------------
  -         -         -          -              -


PART NUMBERS AFFECTED: 

Part Number            Description                         Model
-----------            -----------                         -----
375-3030-01            PCI Dual FC Network Adapter+          -	
375-3019-01            PCI Single FC Host Adapter            -
501-6302-03 or lower   hsPCI I/O Board (w/ Cassettes)        -
501-5397-11 or lower   hsPCI I/O Board (w/o Cassettes)       -
501-5599-07 or lower   3.3V hsPCI Cassette                   -


REFERENCES:

BugId:  4699182 - OS panics w/ PCI SERR that H/W replacements
                  don't alleviate.

ESC:    537306 - SWON/LT/ system generated core file.

MANUAL: 806-3512-10: Sun Fire 15K System Service Manual.

URL:    http://infoserver.central/data/816/816-5002/pdf/816-5002-11.pdf
 
     
PROBLEM DESCRIPTION:

The pcisch driver may panic on Sun Fire 15000 domains due to a parity
error on the PCI bus.  In most cases this is due to a faulty hardware
component.  However, in some cases the panic cannot be corrected by
replacing a hardware FRU.  This second scenario may result in multiple
unexpected domain failures if not corrected.  This FIN describes how to
diagnose and correct this type of pcisch driver panic.

It is important to note that the panic stack for this problem is
IDENTICAL to the panic stack produced by bad hardware.  When
diagnosing these types of errors, the field must troubleshoot the
issue as faulty hardware first.  Only after the panic persists or
repeatedly moves between instances should the field attribute the
problem to the issue outlined in this FIN.

Panics in the pcisch driver cover a wide range of possible failures.
In this case, the control/status register (CSR) calls out the
detection of bad parity on the PCI bus:

  WARNING: pcisch-19: PCI fault log start:
  PCI SERR
  PCI error occurred on device #0
  dwordmask=0 bytemask=0
  pcisch-19: PCI primary error (0):pcisch-19: PCI secondary error (0):pcisch-19:
       PBM AFAR 0.00000000:WARNING: pcisch-19: PCI config space
       CSR=0xc2a0<signaled-system-error,detected-parity-error>
  pcisch-19: PCI fault log end.

  panic[cpu128]/thread=2a10001fd20: pcisch-19: PCI bus 3 error(s)!

  000002a10001bea0 pcisch:pbm_error_intr+148 (30000b643d8, 2772, 30000b84548, 3, 30000b643d8, 3)
    %l0-3: 00000300008b9860 0000000000004000 0000000000000000 0000030000b86584
    %l4-7: 00000300009978c8 0000030008d03ea8 0000000000000000 0000030008d03ed0
  000002a10001bf50 unix:current_thread+44 (0, ffffffffffffffff, 0, 300335b3528, 0, 1044f340)
    %l0-3: 0000000010007450 000002a10001f061 000000000000000e 0000000000000016
    %l4-7: 0000000000010000 00000300339922a8 000000000000000b 000002a10001f910
  000002a10001f9b0 unix:disp_getwork+40 (1044e398, 0, 1044f340, 10457310, 2, 0)
    %l0-3: 000000001010e2d8 0000000010509e00 00000300335bd518 000002a100c37d20
    %l4-7: 000002a100cebd20 0000000002736110 0000000000000000 000002a10001f9c0
  000002a10001fa60 unix:idle+a4 (0, 0, 80, 1044e398, 3000096d980, 0)
    %l0-3: 0000000010043d58 2030205b275d2076 616c20696e646578 000002a10011dd20
    %l4-7: 70636220290a2020 202e22202073703a 20222031205b275d 2076616c20696e64

NOTE:  The stack itself can differ from case to case.  What matters is
       the CSR value (specifically the "detected-parity-error" bit).

In every other reported panic of this nature, a hardware replacement
has resolved the case.  For one customer, however, repeated hardware
replacements did not resolve the issue.  That customer's issue has
since been replicated on multiple machines in an engineering
environment.  Several unique factors are needed to create this
scenario:

  A. To date, this problem has only been seen on 375-3030 (Crystal+) 
     cards.
  B. All the panics have been in either slot 0 or slot 2 of the I/O Boat.
     (Slots 0 and 2 are the lower, 66 MHz slots.)
  C. Schizo 2.3 seems to bring the problem out with more regularity.
  D. Veritas software (specifically adding mirrors to volumes) seems 
     to increase the likelihood of failure.

Steps for Diagnosis
===================

As a reminder, when looking at an F15K I/O boat, the slots are designated:

   -------------------------------------------------------
  | Schizo 1, leaf B (33 MHz) | Schizo 0, leaf B (33 MHz) |
  |---------------------------+---------------------------|
  | Schizo 1, leaf A (66 MHz) | Schizo 0, leaf A (66 MHz) |
   -------------------------------------------------------

  		OR

   -----------------
  | Slot 3 | Slot 1 |
  |   OR   |   OR   |
  | X.1.1.1| X.1.0.1|
  |--------+--------|
  | Slot 2 | Slot 0 |
  |   OR   |   OR   |
  | X.1.1.0| X.1.0.0|
   -----------------

  NOTE: X = hsPCI number (0-17)

To diagnose the pcisch panic from the above stack, follow these steps:

 a) Use the /etc/path_to_inst file on the domain or the cfgadm/rcfgadm
    commands to isolate the slot.  For example, using the two methods with 
    the panic above (pcisch-19):

       # grep pcisch /etc/path_to_inst

    "/pci@3d,600000" 7 "pcisch"
    "/pci@1c,700000" 0 "pcisch"
    "/pci@3c,700000" 4 "pcisch"
    "/pci@9d,600000" 19 "pcisch"  <----------
    "/pci@9c,600000" 17 "pcisch"
    "/pci@3c,600000" 5 "pcisch"
    "/pci@5d,600000" 11 "pcisch"
    "/pci@7d,600000" 15 "pcisch"
    "/pci@1c,600000" 1 "pcisch"
    "/pci@1d,600000" 3 "pcisch"
    "/pci@5c,700000" 8 "pcisch"
    "/pci@7c,700000" 12 "pcisch"
    "/pci@7c,600000" 13 "pcisch"
    "/pci@9c,700000" 16 "pcisch"
    "/pci@9d,700000" 18 "pcisch"
    "/pci@3d,700000" 6 "pcisch"
    "/pci@5c,600000" 9 "pcisch"
    "/pci@1d,700000" 2 "pcisch"
    "/pci@7d,700000" 14 "pcisch"
    "/pci@5d,700000" 10 "pcisch"
    "/pci@11c,700000" 20 "pcisch"
    "/pci@11c,600000" 21 "pcisch"
    "/pci@11d,700000" 22 "pcisch"
    "/pci@11d,600000" 23 "pcisch" 

    In this case, instance 19 is "/pci@9d,600000".  To translate that
    into a slot location, break the 9d down into binary <10011101>,
    then add spaces to obtain <100 1110 1>.  That address breaks down
    to expander 4 (100), a fixed middle section (1110), and Schizo 1
    (the left-hand column in the diagram above).  Because the ",600000"
    offset corresponds to leaf A (the 66 MHz leaf), Schizo 1 plus
    leaf A identifies slot 2.
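
    For repeated use, this hand translation can be scripted.  The
    sketch below (bash or ksh93 syntax) is a hypothetical helper, not
    a Sun-supplied tool; the bit layout and the ",600000" = leaf A /
    ",700000" = leaf B convention are taken from the diagram and
    listings in this FIN:

       # decode_pcisch: hypothetical helper that turns a pcisch device
       # path from /etc/path_to_inst into expander/Schizo/leaf/slot.
       decode_pcisch() {
           dev=$1                                 # e.g. "/pci@9d,600000"
           agent=${dev#/pci@}; agent=${agent%,*}  # hex agent id, e.g. "9d"
           off=${dev##*,}                         # offset: 600000 or 700000
           n=$(( 0x$agent ))
           exp=$(( n >> 5 ))                      # high bits: expander number
           schizo=$(( n & 1 ))                    # low bit: Schizo 0 or 1
           if [ "$off" = "600000" ]; then leaf=A; l=0; else leaf=B; l=1; fi
           slot=$(( schizo * 2 + l ))             # per the slot diagram above
           echo "expander $exp, Schizo $schizo, leaf $leaf -> slot $slot"
       }

       decode_pcisch "/pci@9d,600000"   # expander 4, Schizo 1, leaf A -> slot 2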

    The other option is to use the conversion which the dynamic 
    reconfiguration interface provides:

       # rcfgadm -d a -la | grep pcisch

    pcisch0:e00b1slot1       pci-pci/hp   connected    configured   ok
    pcisch10:e02b1slot3      unknown      connected    unconfigured unknown
    pcisch11:e02b1slot2      pci-pci/hp   connected    configured   ok
    pcisch12:e03b1slot1      pci-pci/hp   connected    configured   ok
    pcisch13:e03b1slot0      pci-pci/hp   connected    configured   ok
    pcisch14:e03b1slot3      unknown      connected    unconfigured unknown
    pcisch15:e03b1slot2      pci-pci/hp   connected    configured   ok
    pcisch16:e04b1slot1      unknown      connected    unconfigured unknown
    pcisch17:e04b1slot0      pci-pci/hp   connected    configured   ok
    pcisch18:e04b1slot3      unknown      connected    unconfigured unknown
-->    pcisch19:e04b1slot2      unknown      empty        unconfigured unknown
    pcisch1:e00b1slot0       unknown      empty        unconfigured unknown
    pcisch20:e08b1slot1      unknown      empty        unconfigured unknown
    pcisch21:e08b1slot0      pci-pci/hp   connected    configured   ok
    pcisch22:e08b1slot3      unknown      empty        unconfigured unknown
    pcisch23:e08b1slot2      unknown      empty        unconfigured unknown
    pcisch2:e00b1slot3       unknown      connected    unconfigured unknown
    pcisch3:e00b1slot2       pci-pci/hp   connected    configured   ok
    pcisch4:e01b1slot1       pci-pci/hp   connected    configured   ok
    pcisch5:e01b1slot0       unknown      empty        unconfigured unknown
    pcisch6:e01b1slot3       unknown      connected    unconfigured unknown
    pcisch7:e01b1slot2       pci-pci/hp   connected    configured   ok
    pcisch8:e02b1slot1       pci-pci/hp   connected    configured   ok
    pcisch9:e02b1slot0       unknown      connected    unconfigured unknown

    In this case, the issue is on expander 4 (ex4), I/O board (b1), slot 2.

 b) Once you identify the correct location, there are three FRUs which
    could be causing the parity error: the hsPCI board (p/n 501-6302-03
    or lower, or 501-5397-11 or lower; also called the I/O board), the
    3.3V cassette (p/n 501-5599-07 or lower), or the adapter itself.

    To narrow down the problem, employ standard hardware
    troubleshooting techniques and move/replace one hardware FRU at a
    time (CPRE recommends moving/replacing the adapter first, then the
    cassette, and finally the hsPCI board).  If the problem follows a
    FRU (on a move) or the panics stop (on a replacement), CPAS the
    offending FRU.

    In the event that you are unable to follow this process, it may
    become necessary to replace all three FRUs at once.  However, this
    is not recommended, as it could impact FRU availability and will
    increase service costs to Sun.
 
 c) Once you identify a failing FRU and have taken appropriate action,
    track the machine's availability for an appropriate amount of
    time.  Depending on the interval at which the machine was
    panicking, the recommendation is to run the machine for twice as
    long as that panic interval.  In some cases that is 1 hour; in
    others it is 24 days.  If the problem persists or shows up on
    another pcisch instance, the machine could be experiencing the
    problem reported in Bug 4699182.  Please escalate to CPRE.

 d) Once CPRE verifies the customer is experiencing this issue, choose
    one of the workaround options listed in the "Corrective Action"
    section below.

The root cause of pcisch driver panics that are unrelated to faulty
hardware is still under investigation.  There is no final fix at this
time.  In the meantime, use the recommended workarounds in the
Corrective Action section below.


IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the
above-mentioned problem.

Troubleshoot pcisch driver panics on F15K domains as outlined above.
If the problem is determined NOT to be caused by faulty hardware,
implement one of the workarounds below.

  A. Replace the 375-3030 (Crystal+) cards with 375-3019 (Amber) cards.
     This has been shown to alleviate the issue after extensive testing. 
     
  OR

  B. Move all 375-3030 cards to either slot 1 or slot 3.  This assumes
     there are enough I/O boats.

  OR

  C. Upgrade the 375-3030 (Crystal+) cards to 375-3108 (Crystal-2A). This 
     will require new drivers to be installed and LC-SC or LC-LC Fibre 
     Cables.  See Product Note 816-5002 for details:

        http://infoserver.central/data/816/816-5002/pdf/816-5002-11.pdf


COMMENTS:  

None

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.