Document fins/I0877-1


FIN #: I0877-1

SYNOPSIS: PCI adapters in Sun Fire 12K/15K domains may intermittently fail

DATE: Sept/16/02

KEYWORDS: PCI adapters in Sun Fire 12K/15K domains may intermittently fail


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)

           

SYNOPSIS: PCI adapters in Sun Fire 12K/15K domains may intermittently fail.
      

SunAlert:           No

TOP FIN/FCO REPORT: Yes 
  
PRODUCT_REFERENCE:  Sun Fire 15K/12K 
 
PRODUCT CATEGORY:   Server / SW Admin


PRODUCTS AFFECTED:  

Systems Affected:
-----------------  
Mkt_ID   Platform   Model   Description            Serial Number
------   --------   -----   -----------            -------------
  -        F15K      ALL    Sun Fire 15K                 -
  -        F12K      ALL    Sun Fire 12K                 -


X-Options Affected:
-------------------
Mkt_ID    Platform   Model   Description                          Serial Number
------    --------   -----   -----------                          -------------
X2222A       -        ALL    Sun Dual FE + Dual SCSI PCI Adapter        -
X1151A       -        ALL    Sun GigaSwift Ethernet MMF PCI Adapter     -


PART NUMBERS AFFECTED: 

Part Number   Description                             Model
-----------   -----------                             -----
501-5727      Dual FE + Dual SCSI PCI Adapter           -
501-5524      GigaSwift Ethernet MMF PCI Adapter        -


REFERENCES:

BugId:   4732416: hpost needs to modify its auto-connect sequence to 
                  properly connect the Cauldron.
         4732369: post needs to increase the wait time during its 
                  auto-connect sequence.
         4735782: SMS needs to modify its auto-connect sequence to 
                  properly connect the Cauldron.
         4735779: SMS needs to increase the wait time during its 
                  auto-connect sequence.
         4723789: PCI devices within Cauldron adapter intermittently 
                  not seen.
         4704847: PCI adapters are intermittently not detected by OBP 
                  probing.
                   

PatchId: 112488: SMS 1.2: domain isolation not seen when degrading 
                    domain using setbus.
         112481: SMS 1.2: AMX0=32768 and AMX1=0 on Centerplane 1 
                    are not equal.

ESC:     538228: PCI Gigabit ethernet cards not able to be recognized by 
                 OBP consistently. 
         539120: devices in S1 storage device attached to F15K disappear 
                 after reboot.
 
     
PROBLEM DESCRIPTION:

PCI adapters in F12K/15K domains may intermittently fail following a
reset, reboot or "setkeyswitch on" operation.  The failure may prevent
the domain from booting or cause a loss of access to devices such as
disks or networks.  It is also possible for the failure to cause an OS
panic.  Any panic may potentially cause disk file system corruption.

Affected configurations are any Sun Fire 12K or 15K without patches
112488 and 112481 with any version of Systems Management Software
(SMS) and with any of the following installed:

	. Sun Dual Ethernet + Dual SCSI adapter (X2222A) Cauldron
	. Sun GigaSwift Ethernet MMF adapter (X1151A) Kuheen
	. 3rd party adapters with JTAG reset implementation similar to
	  X2222A or X1151A.

The failure is intermittent.  CPRE escalations have seen adapters fail
in as few as one or two instances and as often as 90% of the time.
In particular, it has been observed that some Cauldron adapters are
more susceptible than others.  (Note: No design problem exists in the
Cauldron adapter.  The issue resides completely within the software
reset control.)

The failure can manifest in any of several failure signatures.
One distinguishing identifier is that the problem occurs soon after a
system reset.  The reset may be user initiated or initiated through a
system reboot.  Ultimately, personnel will note that devices or whole
adapters are missing.

A domain reset is marked by console logs appearing like the two
examples below.  Domain console logs are found in
/var/opt/SUNWSMS/adm/[A-R]/console.

Example A:
==========
  Jun 22 15:40:18 2002 {20} ok reset-all
  Jun 22 15:40:19 2002 Resetting...
  Jun 22 15:41:59 2002 
  Jun 22 15:42:05 2002 
  Jun 22 15:42:05 2002 
  Jun 22 15:42:05 2002 Sun Fire 15000, using IOSRAM based Console
  Jun 22 15:42:05 2002 Copyright 1998-2001 Sun Microsystems, Inc.  All  
         rights reserved.
  Jun 22 15:42:06 2002 OpenBoot 4.5, 8192 MB memory installed, Serial 
         #44570894.
  Jun 22 15:42:06 2002 Ethernet address 0:0:be:a8:19:e, Host ID: 82a8190e.
  Jun 22 15:42:06 2002 

Example B:
==========
  Jun 22 18:35:40 2002 # init 6
  Jun 22 18:35:50 2002 # 
  Jun 22 18:35:50 2002 INIT: New run level: 6
  Jun 22 18:35:50 2002 The system is coming down.  Please wait.
  Jun 22 18:35:50 2002 System services are now being stopped.
  Jun 22 18:35:50 2002 Unable to stop VERITAS VM Storage Administrator
         Server      
  Jun 22 18:35:50 2002 
  Jun 22 18:35:50 2002 Print services stopped.
  Jun 22 18:36:14 2002 The system is down.
  Jun 22 18:36:15 2002 syncing file systems... done
  Jun 22 18:36:16 2002 rebooting...
  Jun 22 18:36:18 2002 Resetting...
  Jun 22 18:37:58 2002 
  Jun 22 18:38:04 2002 
  Jun 22 18:38:05 2002 
  Jun 22 18:38:05 2002 Sun Fire 15000, using IOSRAM based Console
  Jun 22 18:38:05 2002 Copyright 1998-2001 Sun Microsystems, Inc.  All 
      rights reserved.
  Jun 22 18:38:05 2002 OpenBoot 4.5, 8192 MB memory installed, Serial 
      #44570894.
  Jun 22 18:38:05 2002 Ethernet address 0:0:be:a8:19:e, Host ID: 82a8190e.
  Jun 22 18:38:06 2002 
  Jun 22 18:38:06 2002 
  Jun 22 18:38:06 2002 
  Jun 22 18:38:06 2002 Rebooting with command: boot
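
To locate resets in a saved console log, searching for the
"Resetting..." marker common to both examples is usually sufficient.
The sketch below is illustrative only and is not part of SMS.  The
function name find_resets is hypothetical, and the log path is passed
as an argument.

```shell
# Hypothetical helper (not an SMS command): print every line of a
# domain console log that carries the "Resetting..." marker seen in
# Examples A and B.
find_resets() {
    # The dots are escaped so that they match literally.
    grep 'Resetting\.\.\.' "$1"
}
```

For example, "find_resets /var/opt/SUNWSMS/adm/A/console" would list
the timestamp of each reset recorded for domain A.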

Six example failure signatures are shown below.

Example 1:
==========
The most common failure signature for the X1151A (Kuheen) adapter is
that it is not detected during OBP probing.  An adapter installed in
the /pci@1d,600000 device path (IO0 C3V1) is not detected by OBP.  This
is indicated by the text that states "Nothing there".  (Note:
diag-switch? must be set true for OBP probing to occur.)

  Jun 17 15:23:19 2002 Probing Memory chunk #0 4096 Megabytes
  Jun 17 15:23:19 2002 Probing Memory chunk #1 4096 Megabytes
  Jun 17 15:23:20 2002 Probing gptwo at 0,0 SUNW,UltraSPARC-III+ 
      (900 MHz @ 6:1, 8 MB)
  Jun 17 15:23:20 2002    memory-controller 
  Jun 17 15:23:20 2002 Probing gptwo at 1,0 SUNW,UltraSPARC-III+ 
      (900 MHz @ 6:1, 8 MB)
  Jun 17 15:23:20 2002    memory-controller 
  Jun 17 15:23:20 2002 Probing gptwo at 2,0 SUNW,UltraSPARC-III+ 
      (900 MHz @ 6:1, 8 MB)
  Jun 17 15:23:21 2002    memory-controller 
  Jun 17 15:23:21 2002 Probing gptwo at 3,0 SUNW,UltraSPARC-III+ 
      (900 MHz @ 6:1, 8 MB)
  Jun 17 15:23:21 2002    memory-controller 
  Jun 17 15:23:21 2002 Probing gptwo at 1c,0 
  Jun 17 15:23:21 2002 Probing PCI B pci 
  Jun 17 15:23:22 2002 Probing /pci@1c,700000 Device 1  scsi scsi 
  Jun 17 15:23:22 2002 Probing /pci@1c,700000 Device 2  bootbus-controller 
      iosram 
  Jun 17 15:23:23 2002 Probing /pci@1c,700000 Device 3  pci108e,1100 network 
      firewire usb 
  Jun 17 15:23:23 2002 
  Jun 17 15:23:23 2002 Probing PCI A pci 
  Jun 17 15:23:23 2002 Probing /pci@1c,600000 Device 1  Nothing there 
  Jun 17 15:23:23 2002 
  Jun 17 15:23:24 2002 Probing gptwo at 1d,0 
  Jun 17 15:23:24 2002 Probing PCI B pci 
  Jun 17 15:23:25 2002 Probing /pci@1d,700000 Device 1  pci 
  Jun 17 15:23:25 2002 Probing /pci@1d,700000/pci@1 Device 0  network 
  Jun 17 15:23:25 2002 Probing /pci@1d,700000/pci@1 Device 1  network 
  Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 2  scsi disk 
      tape scsi disk tape 
  Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 3  Nothing there 
  Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 4  Nothing there 
  Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 5  Nothing there 
  Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 6  Nothing there 
  Jun 17 15:23:28 2002 Probing /pci@1d,700000/pci@1 Device 7  Nothing there 
  Jun 17 15:23:28 2002 Probing /pci@1d,700000/pci@1 Device 8  Nothing there 
  Jun 17 15:23:28 2002 Probing /pci@1d,700000/pci@1 Device 9  Nothing there 
  Jun 17 15:23:28 2002 Probing /pci@1d,700000/pci@1 Device a  Nothing there 
  Jun 17 15:23:29 2002 Probing /pci@1d,700000/pci@1 Device b  Nothing there 
  Jun 17 15:23:29 2002 Probing /pci@1d,700000/pci@1 Device c  Nothing there 
  Jun 17 15:23:29 2002 Probing /pci@1d,700000/pci@1 Device d  Nothing there 
  Jun 17 15:23:29 2002 Probing /pci@1d,700000/pci@1 Device e  Nothing there 
  Jun 17 15:23:30 2002 Probing /pci@1d,700000/pci@1 Device f  Nothing there 
  Jun 17 15:23:30 2002 
  Jun 17 15:23:30 2002 
  Jun 17 15:23:30 2002 Probing PCI A pci 
  Jun 17 15:23:30 2002 Probing /pci@1d,600000 Device 1 Nothing there  <---MISSING
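
The check for this signature can be sketched as a log search.  The
helper below is illustrative only and is not an SMS tool.  The
function name probe_missing is hypothetical, and it assumes the probe
line format shown above.  "Nothing there" is normal for an empty slot,
so the result is meaningful only for a path where an adapter is known
to be physically installed.

```shell
# Hypothetical helper: print the OBP probe lines that report
# "Nothing there" directly under the given device path.
#   $1 = console log file, $2 = device path (e.g. /pci@1d,600000)
# The exit status is 0 only if at least one such line is found.
probe_missing() {
    # Requiring " Device" right after the path excludes probes of
    # child bridges such as /pci@1d,700000/pci@1.
    grep "Probing $2 Device" "$1" | grep 'Nothing there'
}
```

For example, "probe_missing /var/opt/SUNWSMS/adm/A/console
/pci@1d,600000" would print the failing probe line from this example.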


Examples 2 through 6 show various observed failure signatures for X2222A 
(Cauldron) adapters.

Example 2:
==========
The simplest example is the inability to see the device.  (Note: This
failure signature can also occur if the device alias is not set
properly.  Ensure that it is correct.)

  Jun  9 12:12:03 2002 {0} ok boot /pci@1d,700000/pci@1/scsi@2/sd@0,0
  Jun  9 12:12:03 2002 Boot device: /pci@1d,700000/pci@1/scsi@2/sd@0,0   
       File and args: 
  Jun  9 12:12:03 2002 
  Jun  9 12:12:03 2002 Can't locate boot device
  Jun  9 12:12:03 2002 

Example 3:
==========
The example below shows no disks seen on the
/pci@5d,700000/pci@1/scsi@2 device path even though disks were
attached.  Output should have appeared similar to the
/pci@7d,700000/pci@1/scsi@2 device path.

  Jul 12 03:19:40 2002 {40} ok probe-scsi-all
  Jul 12 03:19:40 2002 /pci@7d,700000/pci@1/scsi@2,1
  Jul 12 03:19:57 2002 
  Jul 12 03:19:57 2002 /pci@7d,700000/pci@1/scsi@2
  Jul 12 03:19:59 2002 Target 0 
  Jul 12 03:20:00 2002   Unit 0   Disk     SEAGATE ST318305LSUN18G 0340
  Jul 12 03:20:00 2002 Target 1 
  Jul 12 03:20:01 2002   Unit 0   Disk     SEAGATE ST318305LSUN18G 0340
  Jul 12 03:20:01 2002 Target 2 
  Jul 12 03:20:03 2002   Unit 0   Disk     SEAGATE ST318305LSUN18G 0340
  Jul 12 03:20:16 2002 
  Jul 12 03:20:16 2002 /pci@5d,700000/pci@1/scsi@2,1
  Jul 12 03:20:33 2002 
  Jul 12 03:20:33 2002 /pci@5d,700000/pci@1/scsi@2
  Jul 12 03:20:50 2002 
  Jul 12 03:21:08 2002 {40} ok 

Example 4:
==========
The console output below shows an example where the domain booted
successfully, but network devices (ce interfaces) and disks were not
seen afterward.  In this example the failed adapter was an X2222A
(Cauldron) adapter with an attached Netra X1 disk subsystem.  Note the
missing ce2 and ce3 network interfaces as well as the "Disk device not
found" error for disk02.

  Aug 20 21:33:25 2002 Starting VxVM restore daemon...
  Aug 20 21:33:25 2002 VxVM starting in boot mode...
  Aug 20 21:33:26 2002 NOTICE: vxvm:vxdmp: added disk array 700292, 
      datype = EMC
  Aug 20 21:33:26 2002 
  Aug 20 21:33:27 2002 vxvm:vxconfigd: WARNING: Disk disk02 in group rootdg:  
      Disk device not found
  Aug 20 21:33:27 2002 WARNING: Unexpected EOF on line 2 of 
      /kernel/drv/ce.conf
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for DL_ATTACH_REQ(11), 
      errno 8, unix 0
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for DL_BIND_REQ(1),  
      errno 3, unix 0
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for 
      DL_PHYS_ADDR_REQ(49), errno 3, unix 0
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for DL_UNBIND_REQ(2),  
      errno 3, unix 0
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for DL_DETACH_REQ(12), 
      errno 3, unix 0
  Aug 20 21:33:28 2002 ifconfig: SIOCSLIFNAME for ip: ce2: no such interface
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for DL_ATTACH_REQ(11), 
      errno 8, unix 0
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for DL_BIND_REQ(1),  
      errno 3, unix 0
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for 
      DL_PHYS_ADDR_REQ(49), errno 3, unix 0
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for DL_UNBIND_REQ(2),  
      errno 3, unix 0
  Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for DL_DETACH_REQ(12), 
      errno 3, unix 0
  Aug 20 21:33:28 2002 ifconfig: SIOCSLIFNAME for ip: ce3: no such interface
  Aug 20 21:33:28 2002 configuring IPv4 interfaces: ce0 qfe0.
  Aug 20 21:33:28 2002 moving addresses from failed IPv4 interfaces: ce2  
      (moved to qfe0) ce3 (couldn't move, no alternative interface).
  Aug 20 21:33:28 2002 Hostname: host01

Example 5:
==========
Missing network adapters such as those in the prior example may also
show up during OBP probing.  If diag-switch? is set true, the probe
output will lack the affected devices.

  Aug 16 10:16:04 2002 Probing PCI A pci 
  Aug 16 10:16:04 2002 Probing /pci@23d,600000 Device 1  pci 
  Aug 16 10:16:05 2002 Probing /pci@23d,600000/pci@1 Device 0  network 
                                                  <---- device 1 missing!
  Aug 16 10:16:07 2002 Probing /pci@23d,600000/pci@1 Device 2  scsi disk  
      tape scsi disk tape 

Example 6:
==========
This example demonstrates the panic that may occur during the boot
sequence.  Two features distinguish this panic from hardware failure
panics.  First, it occurs during the boot sequence, likely before the
boot has completed.  Second, the panic is marked by "PCI SERR".

  Jul 11 23:43:40 2002 VxVM starting special volumes ( swapvol var )...
  Jul 11 23:43:50 2002 WARNING: /pci@1d,700000/pci@1/scsi@2 (glm2):
  Jul 11 23:43:50 2002         Unexpected DMA state: WAIT. 
      dstat=c0<DMA-FIFO-empty,master-data-parity-error>
  Jul 11 23:43:50 2002 WARNING: /pci@1d,700000/pci@1/scsi@2 (glm2):
  Jul 11 23:43:50 2002         got SCSI bus reset
  Jul 11 23:43:50 2002 WARNING: /pci@1d,700000/pci@1/scsi@2/sd@0,0 (sd30):
  Jul 11 23:43:50 2002         SCSI transport failed: reason 'reset': 
      retrying command
  Jul 11 23:43:50 2002
  Jul 11 23:43:58 2002 VxVM general startup...
  Jul 11 23:44:09 2002 The system is coming up.  Please wait.
  Jul 11 23:44:10 2002 checking vxfs filesystems
  Jul 11 23:44:10 2002 Running parallel replay fsck ...
  Jul 11 23:44:10 2002 /dev/vx/rdsk/datadg/homevol:log replay in progress
  Jul 11 23:44:10 2002 /dev/vx/rdsk/datadg/wwwvol:log replay in progress
  Jul 11 23:44:11 2002 /dev/vx/rdsk/datadg/homevol:replay complete - 
      marking super-block as CLEAN
  Jul 11 23:44:11 2002 /dev/vx/rdsk/datadg/wwwvol:replay complete - 
      marking super-block as CLEAN
  Jul 11 23:44:12 2002 mount: nonexistent mount point: /home/oracle
  Jul 11 23:44:13 2002 WARNING: pcisch-2: PCI fault log start:
  Jul 11 23:44:13 2002 PCI SERR
  Jul 11 23:44:13 2002 PCI error ocurred on device #6
  Jul 11 23:44:14 2002 dwordmask=0 bytemask=0
  Jul 11 23:44:14 2002 pcisch-2: PCI primary error (0):pcisch-2: PCI 
      secondary error (0):pcisch-2: PBM AFAR 0.00000000:WARN<-->
      ING: pcisch2: PCI config space CSR=0x4280<signaled-system-error>
  Jul 11 23:44:14 2002 pcisch-2: PCI fault log end.
  Jul 11 23:44:14 2002
  Jul 11 23:44:14 2002 panic[cpu3]/thread=2a10056bd20: pcisch-2: PCI 
      bus 2 error(s)!

Field personnel can employ the following techniques to help determine
if the customer is experiencing the problem and collect failure
signatures similar to those above.

Diagnostic Technique 1:
=======================
Set the OBP environment variable "diag-switch?" to true.  This will
enable OBP device probing and the output will be logged to the domain
console log (/var/opt/SUNWSMS/adm/[A-R]/console).  For example, you can
set the environment variable from the Main SC.

  % setobpparams -d A "diag-switch?=true"

The setting takes effect at the next reset, when device probing occurs.
Output will be logged as shown in Example 1.
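
When several domains need the setting, the command can be generated
once per domain tag.  The sketch below only prints the commands for
review and does not run them.  The function name diag_switch_cmds is
hypothetical, and the setobpparams arguments are copied from the
example above.

```shell
# Hypothetical helper: print (do not execute) the setobpparams command
# from Diagnostic Technique 1 for each domain tag given as an argument.
diag_switch_cmds() {
    for dom in "$@"; do
        printf 'setobpparams -d %s "diag-switch?=true"\n' "$dom"
    done
}
```

For example, "diag_switch_cmds A B" prints one command for domain A
and one for domain B.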

Diagnostic Technique 2:
=======================
Increase the verbosity level of post logging.  Either the platform
postrc file (/etc/opt/SUNWSMS/config/platform/.postrc) or domain postrc
file (/etc/opt/SUNWSMS/config/[A-R]/.postrc) may be edited to add the
following directive.

  verbose		30

The heightened log level will cause hpost to print a summary of the
GDCD data structure.  The resulting log better demonstrates which
adapters are detected by hpost.  The log may be used to determine
whether hpost is detecting a physically present adapter.

  -------------------------------------------------------------------------
  Creating GDCD IOSRAM handoff structures in Slot IO8...
  Writing domain information to PCD...

  CPU_Brds:  Proc  Mem P/B: 3/1 3/0  2/1 2/0  1/1 1/0  0/1 0/0
  Slot  Gen  3210        /L: 10  10   10  10   10  10   10  10     CDC
  SB07:  P   PPPP            mm  PP   mm  PP   mm  PP   mm  PP      P
  SB08:  P   PPPP            mm  PP   mm  PP   mm  PP   mm  PP      P

  I/O_Brds:         IOC  P1/Bus/Adapt   IOC  P0/Bus/Adapt
  Slot  Gen  Type   P1   B1/10 B0/10    P0   B1/eb10 B0/10  (e=ENet, b=BBC)
  IO08:  P   hsPCI   P    p _p  p _p     P    p PP_m  p _p        

  Configured in 333 with 8 procs, 16.000 GBytes, 3 IO adapters.
  Interconnect frequency is 149.993 MHz, Measured.
  Golden sram is on Slot IO8.
  POST (level=16, verbose=30, -f) execution time 4:38
  # SMI Sun Fire 15K POST log closed Wed Jun 26 09:07:37 2002
  ------------------------------------------------------------------------
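
Before resetting the domain, it can be worth confirming that the
verbose directive described above is present in the postrc file.  The
sketch below is illustrative only.  The function name postrc_verbose
is hypothetical, and it assumes only that the directive appears as the
word "verbose" followed by the level, as shown above.

```shell
# Hypothetical helper: print the level of each "verbose" directive
# found in a postrc file.  Lines whose first word is not exactly
# "verbose" (including commented-out lines) are skipped.
postrc_verbose() {
    awk '$1 == "verbose" { print $2 }' "$1"
}
```

For example, "postrc_verbose /etc/opt/SUNWSMS/config/platform/.postrc"
would print 30 once the directive has been added.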

This issue occurs because software bugs in hwad and hpost cause JTAG
signals to float and the adapter reset to be improperly applied.  This
affects all available SMS versions, including
those patched with hwad patch 112481 and hpost patch 112488.  See
the bugs for further details.

Bugs 4732416 and 4732369 are written against the problems in the hpost
code.  The hpost code is responsible for the failures that occur
immediately following a reset.  The second component of the fix is for
the hwad daemon.  Bugs 4735782 and 4735779 are written against the hwad
bugs.  The problems in the hwad code would only be manifested during a
hot-swap operation.  At this time we have not had a reported instance
of the hot-swap related bugs.

The problem is fixed through separate patches for hpost and hwad.  The
hpost fixes for bugs 4732416 and 4732369 reside in patch 112488.  The
hwad fixes for bugs 4735782 and 4735779 are included in patch 112481.
SMS 1.1 will not be patched
for these problems.  Customers running SMS 1.1 who experience the
failure will need to upgrade to SMS 1.2.



IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        | X |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        |   |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above mentioned 
problem.

Full resolution for this issue is met by installing the currently
available patch 112488 and the future patch for bug 4735782.  The
patch for bug 4735782 has not yet been defined.

Service personnel and customers should not wait for the bug 4735782
patch before installing patch 112488.  Install each patch as it
becomes available.  The fix for bug 4735782 is needed only for
adapter hot-swap operation.  At this time, adapter hot-swap is not
available because of OS bug 4496757.  Until this OS bug is patched, the
fix for 4735782 will have no value.

It is not recommended to replace adapters that are failing because of
the reset bug.  The patches should be installed instead.


COMMENTS:  

None.

============================================================================

Implementation Footnote:

i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.