Document fins/I0877-1
FIN #: I0877-1
SYNOPSIS: PCI adapters in Sun Fire 12K/15K domains may intermittently fail
DATE: Sept/16/02
KEYWORDS: PCI adapters in Sun Fire 12K/15K domains may intermittently fail
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: PCI adapters in Sun Fire 12K/15K domains may intermittently fail.
SunAlert: No
TOP FIN/FCO REPORT: Yes
PRODUCT_REFERENCE: Sun Fire 15K/12K
PRODUCT CATEGORY: Server / SW Admin
PRODUCTS AFFECTED:
Systems Affected:
-----------------
Mkt_ID   Platform   Model   Description    Serial Number
------   --------   -----   -----------    -------------
  -      F15K       ALL     Sun Fire 15K         -
  -      F12K       ALL     Sun Fire 12K         -
X-Options Affected:
-------------------
Mkt_ID   Platform   Model   Description                              Serial Number
------   --------   -----   -----------                              -------------
X2222A      -       ALL     Sun Dual FE + Dual SCSI PCI Adapter            -
X1151A      -       ALL     Sun GigaSwift Ethernet MMF PCI Adapter         -
PART NUMBERS AFFECTED:
Part Number   Description                          Model
-----------   -----------                          -----
501-5727      Dual FE + Dual SCSI PCI Adapter        -
501-5524      GigaSwift Ethernet MMF PCI Adapter     -
REFERENCES:
BugId: 4732416: hpost needs to modify its auto-connect sequence to
properly connect the Cauldron.
4732369: post needs to increase the wait time during its
auto-connect sequence.
4735782: SMS needs to modify its auto-connect sequence to
properly connect the Cauldron.
4735779: SMS needs to increase the wait time during its
auto-connect sequence.
4723789: PCI devices within Cauldron adapter intermittantly
not seen.
4704847: PCI adapters are intermittently not detected by OBP
probing.
PatchId: 112488: SMS 1.2: domain isolation not seen when degrading
domain using setbus.
112481: SMS 1.2: AMX0=32768 and AMX1=0 on Centerplane 1
are not equal.
ESC: 538228: PCI Gigabit ethernet cards not able to be recognized by
OBP consistently.
539120: devices in S1 storage device attached to F15K disappear
after reboot.
PROBLEM DESCRIPTION:
PCI adapters in F12K/15K domains may intermittently fail following a
reset, reboot or "setkeyswitch on" operation. The failure may prevent
the domain from booting or cause a loss of access to devices such as
disks or networks. It is also possible for the failure to cause an OS
panic. Any panic may potentially cause disk file system corruption.
Affected configurations are any Sun Fire 12K or 15K without patches
112488 and 112481 with any version of Systems Management Software
(SMS) and with any of the following installed:
. Sun Dual Ethernet + Dual SCSI adapter (X2222A) Cauldron
. Sun GigaSwift Ethernet MMF adapter (X1151A) Kuheen
. 3rd party adapters with a JTAG reset implementation similar to
X2222A or X1151A.
The failure is intermittent. CPRE escalations have seen adapters fail
as rarely as once or twice and as often as 90% of the time.
In particular, it has been observed that some Cauldron adapters are
more susceptible than others. (Note: No design problem exists in the
Cauldron adapter. The issue resides completely within the software
reset control.)
The failure can be manifested in any of several failing signatures.
One distinguishing identifier is that the problem occurs soon after a
system reset. The reset may be user initiated or initiated through a
system reboot. Ultimately, personnel will note that devices or whole
adapters are missing.
A domain reset is marked by console logs appearing like the two
examples below. Domain console logs are found in
/var/opt/SUNWSMS/adm/[A-R]/console.
Example A:
==========
Jun 22 15:40:18 2002 {20} ok reset-all
Jun 22 15:40:19 2002 Resetting...
Jun 22 15:41:59 2002
Jun 22 15:42:05 2002
Jun 22 15:42:05 2002
Jun 22 15:42:05 2002 Sun Fire 15000, using IOSRAM based Console
Jun 22 15:42:05 2002 Copyright 1998-2001 Sun Microsystems, Inc. All
rights reserved.
Jun 22 15:42:06 2002 OpenBoot 4.5, 8192 MB memory installed, Serial
#44570894.
Jun 22 15:42:06 2002 Ethernet address 0:0:be:a8:19:e, Host ID: 82a8190e.
Jun 22 15:42:06 2002
Example B:
==========
Jun 22 18:35:40 2002 # init 6
Jun 22 18:35:50 2002 #
Jun 22 18:35:50 2002 INIT: New run level: 6
Jun 22 18:35:50 2002 The system is coming down. Please wait.
Jun 22 18:35:50 2002 System services are now being stopped.
Jun 22 18:35:50 2002 Unable to stop VERITAS VM Storage Administrator
Server
Jun 22 18:35:50 2002
Jun 22 18:35:50 2002 Print services stopped.
Jun 22 18:36:14 2002 The system is down.
Jun 22 18:36:15 2002 syncing file systems... done
Jun 22 18:36:16 2002 rebooting...
Jun 22 18:36:18 2002 Resetting...
Jun 22 18:37:58 2002
Jun 22 18:38:04 2002
Jun 22 18:38:05 2002
Jun 22 18:38:05 2002 Sun Fire 15000, using IOSRAM based Console
Jun 22 18:38:05 2002 Copyright 1998-2001 Sun Microsystems, Inc. All
rights reserved.
Jun 22 18:38:05 2002 OpenBoot 4.5, 8192 MB memory installed, Serial
#44570894.
Jun 22 18:38:05 2002 Ethernet address 0:0:be:a8:19:e, Host ID: 82a8190e.
Jun 22 18:38:06 2002
Jun 22 18:38:06 2002
Jun 22 18:38:06 2002
Jun 22 18:38:06 2002 Rebooting with command: boot
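Because every failure signature follows a reset, it can help to first
locate the resets in a saved console log. The following is a minimal
sketch, not an SMS tool. It assumes each console line begins with a
timestamp in the form shown in Examples A and B; the function name
find_resets is hypothetical.

```python
import re

# Assumption: console lines start with a timestamp such as
# "Jun 22 15:40:19 2002", as in Examples A and B.
TIMESTAMP = r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})"
RESET_LINE = re.compile(r"^" + TIMESTAMP + r"\s+Resetting\.\.\.\s*$")


def find_resets(log_text):
    """Return the timestamp of every 'Resetting...' line in the log text."""
    resets = []
    for line in log_text.splitlines():
        match = RESET_LINE.match(line)
        if match:
            resets.append(match.group("ts"))
    return resets


sample = """\
Jun 22 18:36:16 2002 rebooting...
Jun 22 18:36:18 2002 Resetting...
Jun 22 18:37:58 2002
Jun 22 18:38:05 2002 Sun Fire 15000, using IOSRAM based Console
"""
print(find_resets(sample))
```

Console output logged shortly after each reported timestamp is where
the signatures in this FIN would be expected to appear.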
Six example failure signatures are shown below.
Example 1:
==========
The most common failure signature for the X1151A (Kuheen) adapter is
that it is not detected during OBP probing. An adapter installed in
the /pci@1d,600000 device path (IO0 C3V1) is not detected by OBP. This
is indicated by the text that states "Nothing there". (Note:
diag-switch? must be set true for OBP probing to occur.)
Jun 17 15:23:19 2002 Probing Memory chunk #0 4096 Megabytes
Jun 17 15:23:19 2002 Probing Memory chunk #1 4096 Megabytes
Jun 17 15:23:20 2002 Probing gptwo at 0,0 SUNW,UltraSPARC-III+
(900 MHz @ 6:1, 8 MB)
Jun 17 15:23:20 2002 memory-controller
Jun 17 15:23:20 2002 Probing gptwo at 1,0 SUNW,UltraSPARC-III+
(900 MHz @ 6:1, 8 MB)
Jun 17 15:23:20 2002 memory-controller
Jun 17 15:23:20 2002 Probing gptwo at 2,0 SUNW,UltraSPARC-III+
(900 MHz @ 6:1, 8 MB)
Jun 17 15:23:21 2002 memory-controller
Jun 17 15:23:21 2002 Probing gptwo at 3,0 SUNW,UltraSPARC-III+
(900 MHz @ 6:1, 8 MB)
Jun 17 15:23:21 2002 memory-controller
Jun 17 15:23:21 2002 Probing gptwo at 1c,0
Jun 17 15:23:21 2002 Probing PCI B pci
Jun 17 15:23:22 2002 Probing /pci@1c,700000 Device 1 scsi scsi
Jun 17 15:23:22 2002 Probing /pci@1c,700000 Device 2 bootbus-controller
iosram
Jun 17 15:23:23 2002 Probing /pci@1c,700000 Device 3 pci108e,1100 network
firewire usb
Jun 17 15:23:23 2002
Jun 17 15:23:23 2002 Probing PCI A pci
Jun 17 15:23:23 2002 Probing /pci@1c,600000 Device 1 Nothing there
Jun 17 15:23:23 2002
Jun 17 15:23:24 2002 Probing gptwo at 1d,0
Jun 17 15:23:24 2002 Probing PCI B pci
Jun 17 15:23:25 2002 Probing /pci@1d,700000 Device 1 pci
Jun 17 15:23:25 2002 Probing /pci@1d,700000/pci@1 Device 0 network
Jun 17 15:23:25 2002 Probing /pci@1d,700000/pci@1 Device 1 network
Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 2 scsi disk
tape scsi disk tape
Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 3 Nothing there
Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 4 Nothing there
Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 5 Nothing there
Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 6 Nothing there
Jun 17 15:23:28 2002 Probing /pci@1d,700000/pci@1 Device 7 Nothing there
Jun 17 15:23:28 2002 Probing /pci@1d,700000/pci@1 Device 8 Nothing there
Jun 17 15:23:28 2002 Probing /pci@1d,700000/pci@1 Device 9 Nothing there
Jun 17 15:23:28 2002 Probing /pci@1d,700000/pci@1 Device a Nothing there
Jun 17 15:23:29 2002 Probing /pci@1d,700000/pci@1 Device b Nothing there
Jun 17 15:23:29 2002 Probing /pci@1d,700000/pci@1 Device c Nothing there
Jun 17 15:23:29 2002 Probing /pci@1d,700000/pci@1 Device d Nothing there
Jun 17 15:23:29 2002 Probing /pci@1d,700000/pci@1 Device e Nothing there
Jun 17 15:23:30 2002 Probing /pci@1d,700000/pci@1 Device f Nothing there
Jun 17 15:23:30 2002
Jun 17 15:23:30 2002
Jun 17 15:23:30 2002 Probing PCI A pci
Jun 17 15:23:30 2002 Probing /pci@1d,600000 Device 1 Nothing there  <---MISSING
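The check in Example 1 can be sketched as a scan of the console log for
top-level PCI leaves that report "Nothing there". This is an
illustrative helper only; the function name is hypothetical. An empty
slot produces the same text as a failed adapter (see /pci@1c,600000 in
Example 1), so the result must be compared against the adapters that
are physically installed.

```python
import re

# Matches OBP probe lines such as
# "Probing /pci@1d,600000 Device 1 Nothing there".
PROBE_LINE = re.compile(
    r"Probing (?P<path>/pci@\S+) Device (?P<dev>[0-9a-f]+)\s+(?P<result>.*)$"
)


def empty_top_level_slots(log_text):
    """List top-level PCI leaf paths whose probe reported 'Nothing there'."""
    paths = []
    for line in log_text.splitlines():
        match = PROBE_LINE.search(line)
        if not match:
            continue
        # A top-level leaf has one path component; devices behind a
        # bridge (for example /pci@1d,700000/pci@1) are skipped here.
        is_top_level = match.group("path").count("/") == 1
        if is_top_level and match.group("result").startswith("Nothing there"):
            paths.append(match.group("path"))
    return paths


sample = """\
Jun 17 15:23:22 2002 Probing /pci@1c,700000 Device 1 scsi scsi
Jun 17 15:23:23 2002 Probing /pci@1c,600000 Device 1 Nothing there
Jun 17 15:23:27 2002 Probing /pci@1d,700000/pci@1 Device 3 Nothing there
Jun 17 15:23:30 2002 Probing /pci@1d,600000 Device 1 Nothing there
"""
print(empty_top_level_slots(sample))
```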
Examples 2 through 6 show various observed failure signatures for X2222A
(Cauldron) adapters.
Example 2:
==========
The simplest example is the inability to see the device. (Note: This
failure signature can also occur if the device alias is not set
properly. Ensure that it is correct.)
Jun 9 12:12:03 2002 {0} ok boot /pci@1d,700000/pci@1/scsi@2/sd@0,0
Jun 9 12:12:03 2002 Boot device: /pci@1d,700000/pci@1/scsi@2/sd@0,0
File and args:
Jun 9 12:12:03 2002
Jun 9 12:12:03 2002 Can't locate boot device
Jun 9 12:12:03 2002
Example 3:
==========
The example below shows no disks seen on the
/pci@5d,700000/pci@1/scsi@2 device path even though disks were
attached. Output should have appeared similar to the
/pci@7d,700000/pci@1/scsi@2 device path.
Jul 12 03:19:40 2002 {40} ok probe-scsi-all
Jul 12 03:19:40 2002 /pci@7d,700000/pci@1/scsi@2,1
Jul 12 03:19:57 2002
Jul 12 03:19:57 2002 /pci@7d,700000/pci@1/scsi@2
Jul 12 03:19:59 2002 Target 0
Jul 12 03:20:00 2002 Unit 0 Disk SEAGATE ST318305LSUN18G 0340
Jul 12 03:20:00 2002 Target 1
Jul 12 03:20:01 2002 Unit 0 Disk SEAGATE ST318305LSUN18G 0340
Jul 12 03:20:01 2002 Target 2
Jul 12 03:20:03 2002 Unit 0 Disk SEAGATE ST318305LSUN18G 0340
Jul 12 03:20:16 2002
Jul 12 03:20:16 2002 /pci@5d,700000/pci@1/scsi@2,1
Jul 12 03:20:33 2002
Jul 12 03:20:33 2002 /pci@5d,700000/pci@1/scsi@2
Jul 12 03:20:50 2002
Jul 12 03:21:08 2002 {40} ok
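The comparison made in Example 3 can be sketched as a count of targets
reported under each controller path in saved probe-scsi-all output.
This is a rough illustration with a hypothetical function name. It
assumes the timestamped layout shown above. A count of zero is normal
for a controller with nothing attached, so the counts only have meaning
when compared with the known disk configuration.

```python
import re

# Assumption: each line starts with a console log timestamp such as
# "Jul 12 03:19:40 2002".
TS = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4} ?")


def targets_per_controller(log_text):
    """Map each controller path to the number of 'Target N' lines under it."""
    counts = {}
    current = None
    for line in log_text.splitlines():
        body = TS.sub("", line).strip()
        if body.startswith("/pci@"):
            current = body
            counts[current] = 0
        elif body.startswith("Target ") and current is not None:
            counts[current] += 1
    return counts


sample = """\
Jul 12 03:19:40 2002 {40} ok probe-scsi-all
Jul 12 03:19:57 2002 /pci@7d,700000/pci@1/scsi@2
Jul 12 03:19:59 2002 Target 0
Jul 12 03:20:00 2002 Unit 0 Disk SEAGATE ST318305LSUN18G 0340
Jul 12 03:20:00 2002 Target 1
Jul 12 03:20:01 2002 Unit 0 Disk SEAGATE ST318305LSUN18G 0340
Jul 12 03:20:33 2002 /pci@5d,700000/pci@1/scsi@2
Jul 12 03:20:50 2002
Jul 12 03:21:08 2002 {40} ok
"""
for path, count in targets_per_controller(sample).items():
    print(path, count)
```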
Example 4:
==========
The console output below shows an example where the domain booted
successfully, but network devices (ce interfaces) and disks were not
seen afterward. In this example the failed adapter was an X2222A
(Cauldron) adapter with an attached Netra X1 disk subsystem. Note the
missing ce2 and ce3 network interfaces as well as the "Disk device not
found" warning for disk02.
Aug 20 21:33:25 2002 Starting VxVM restore daemon...
Aug 20 21:33:25 2002 VxVM starting in boot mode...
Aug 20 21:33:26 2002 NOTICE: vxvm:vxdmp: added disk array 700292,
datype = EMC
Aug 20 21:33:26 2002
Aug 20 21:33:27 2002 vxvm:vxconfigd: WARNING: Disk disk02 in group rootdg:
Disk device not found
Aug 20 21:33:27 2002 WARNING: Unexpected EOF on line 2 of
/kernel/drv/ce.conf
Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for DL_ATTACH_REQ(11),
errno 8, unix 0
Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for DL_BIND_REQ(1),
errno 3, unix 0
Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for
DL_PHYS_ADDR_REQ(49), errno 3, unix 0
Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for DL_UNBIND_REQ(2),
errno 3, unix 0
Aug 20 21:33:28 2002 ip_rput_dlpi(ce2): DL_ERROR_ACK for DL_DETACH_REQ(12),
errno 3, unix 0
Aug 20 21:33:28 2002 ifconfig: SIOCSLIFNAME for ip: ce2: no such interface
Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for DL_ATTACH_REQ(11),
errno 8, unix 0
Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for DL_BIND_REQ(1),
errno 3, unix 0
Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for
DL_PHYS_ADDR_REQ(49), errno 3, unix 0
Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for DL_UNBIND_REQ(2),
errno 3, unix 0
Aug 20 21:33:28 2002 ip_rput_dlpi(ce3): DL_ERROR_ACK for DL_DETACH_REQ(12),
errno 3, unix 0
Aug 20 21:33:28 2002 ifconfig: SIOCSLIFNAME for ip: ce3: no such interface
Aug 20 21:33:28 2002 configuring IPv4 interfaces: ce0 qfe0.
Aug 20 21:33:28 2002 moving addresses from failed IPv4 interfaces: ce2
(moved to qfe0) ce3 (couldn't move, no alternative interface).
Aug 20 21:33:28 2002 Hostname: host01
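The missing interfaces in Example 4 can be pulled out of a long boot
log by matching the ifconfig "no such interface" message. The following
is a small sketch with a hypothetical function name. It relies only on
the message text shown above and does not by itself prove that the
reset bug is the cause.

```python
import re

# Message text taken from Example 4.
NO_IFACE = re.compile(
    r"ifconfig: SIOCSLIFNAME for ip: (?P<ifname>\w+): no such interface"
)


def missing_interfaces(log_text):
    """Return interface names that ifconfig reported missing, in log order."""
    names = []
    for match in NO_IFACE.finditer(log_text):
        name = match.group("ifname")
        if name not in names:
            names.append(name)
    return names


sample = """\
Aug 20 21:33:28 2002 ifconfig: SIOCSLIFNAME for ip: ce2: no such interface
Aug 20 21:33:28 2002 ifconfig: SIOCSLIFNAME for ip: ce3: no such interface
Aug 20 21:33:28 2002 configuring IPv4 interfaces: ce0 qfe0.
"""
print(missing_interfaces(sample))
```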
Example 5:
==========
The missing network devices from the prior example may also be visible
earlier, during OBP probing, if diag-switch? is set to true. In the
output below, Device 1 is absent from the probe list.
Aug 16 10:16:04 2002 Probing PCI A pci
Aug 16 10:16:04 2002 Probing /pci@23d,600000 Device 1 pci
Aug 16 10:16:05 2002 Probing /pci@23d,600000/pci@1 Device 0 network
<---- device 1 missing!
Aug 16 10:16:07 2002 Probing /pci@23d,600000/pci@1 Device 2 scsi disk
tape scsi disk tape
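A skipped device number such as the one in Example 5 can be found by
collecting the device numbers probed under each path and reporting any
gap. This is a heuristic sketch with a hypothetical function name. It
assumes that OBP normally lists consecutive device numbers under a
bridge, as it does in Example 1.

```python
import re

# Device numbers in OBP probe output are hexadecimal (0 through f).
PROBE_LINE = re.compile(r"Probing (?P<path>/pci@\S+) Device (?P<dev>[0-9a-f]+)\b")


def skipped_devices(log_text):
    """Map each probed path to the device numbers missing from its sequence."""
    probed = {}
    for line in log_text.splitlines():
        match = PROBE_LINE.search(line)
        if match:
            number = int(match.group("dev"), 16)
            probed.setdefault(match.group("path"), set()).add(number)
    gaps = {}
    for path, devices in probed.items():
        expected = set(range(min(devices), max(devices) + 1))
        missing = sorted(expected - devices)
        if missing:
            gaps[path] = missing
    return gaps


sample = """\
Aug 16 10:16:04 2002 Probing /pci@23d,600000 Device 1 pci
Aug 16 10:16:05 2002 Probing /pci@23d,600000/pci@1 Device 0 network
Aug 16 10:16:07 2002 Probing /pci@23d,600000/pci@1 Device 2 scsi disk
"""
print(skipped_devices(sample))
```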
Example 6:
==========
This example demonstrates the panic that may occur during the boot
sequence. Two features distinguish this panic from hardware failure
panics. First, it occurs during the boot sequence, most likely before
the boot has completed. Second, the panic is marked by "PCI SERR".
Jul 11 23:43:40 2002 VxVM starting special volumes ( swapvol var )...
Jul 11 23:43:50 2002 WARNING: /pci@1d,700000/pci@1/scsi@2 (glm2):
Jul 11 23:43:50 2002 Unexpected DMA state: WAIT.
dstat=c0<DMA-FIFO-empty,master-data-parity-error>
Jul 11 23:43:50 2002 WARNING: /pci@1d,700000/pci@1/scsi@2 (glm2):
Jul 11 23:43:50 2002 got SCSI bus reset
Jul 11 23:43:50 2002 WARNING: /pci@1d,700000/pci@1/scsi@2/sd@0,0 (sd30):
Jul 11 23:43:50 2002 SCSI transport failed: reason 'reset':
retrying command
Jul 11 23:43:50 2002
Jul 11 23:43:58 2002 VxVM general startup...
Jul 11 23:44:09 2002 The system is coming up. Please wait.
Jul 11 23:44:10 2002 checking vxfs filesystems
Jul 11 23:44:10 2002 Running parallel replay fsck ...
Jul 11 23:44:10 2002 /dev/vx/rdsk/datadg/homevol:log replay in progress
Jul 11 23:44:10 2002 /dev/vx/rdsk/datadg/wwwvol:log replay in progress
Jul 11 23:44:11 2002 /dev/vx/rdsk/datadg/homevol:replay complete -
marking super-block as CLEAN
Jul 11 23:44:11 2002 /dev/vx/rdsk/datadg/wwwvol:replay complete -
marking super-block as CLEAN
Jul 11 23:44:12 2002 mount: nonexistent mount point: /home/oracle
Jul 11 23:44:13 2002 WARNING: pcisch-2: PCI fault log start:
Jul 11 23:44:13 2002 PCI SERR
Jul 11 23:44:13 2002 PCI error ocurred on device #6
Jul 11 23:44:14 2002 dwordmask=0 bytemask=0
Jul 11 23:44:14 2002 pcisch-2: PCI primary error (0):pcisch-2: PCI
secondary error (0):pcisch-2: PBM AFAR 0.00000000:WARN<-->
ING: pcisch2: PCI config space CSR=0x4280<signaled-system-error>
Jul 11 23:44:14 2002 pcisch-2: PCI fault log end.
Jul 11 23:44:14 2002
Jul 11 23:44:14 2002 panic[cpu3]/thread=2a10056bd20: pcisch-2: PCI
bus 2 error(s)!
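The two distinguishing features of Example 6 can be checked together by
looking for a panic line that is preceded, within a few lines, by
"PCI SERR". The following is a minimal sketch. The function name and
the 20-line window are arbitrary assumptions, and the check does not
confirm that the panic occurred during boot; that still has to be read
from the surrounding log.

```python
def find_pci_serr_panics(log_text, window=20):
    """Return panic lines that have 'PCI SERR' in the preceding lines."""
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if "panic[" not in line:
            continue
        # Look back a limited number of lines for the PCI fault log.
        context = lines[max(0, index - window):index]
        if any("PCI SERR" in earlier for earlier in context):
            hits.append(line.strip())
    return hits


sample = """\
Jul 11 23:44:13 2002 WARNING: pcisch-2: PCI fault log start:
Jul 11 23:44:13 2002 PCI SERR
Jul 11 23:44:14 2002 pcisch-2: PCI fault log end.
Jul 11 23:44:14 2002 panic[cpu3]/thread=2a10056bd20: pcisch-2: PCI bus 2 error(s)!
"""
for hit in find_pci_serr_panics(sample):
    print(hit)
```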
Field personnel can employ the following techniques to help determine
if the customer is experiencing the problem and collect failure
signatures similar to those above.
Diagnostic Technique 1:
=======================
Set the OBP environment variable "diag-switch?" to true. This will
enable OBP device probing and the output will be logged to the domain
console log (/var/opt/SUNWSMS/adm/[A-R]/console). For example, you can
set the environment variable from the Main SC.
% setobpparams -d A "diag-switch?=true"
The setting takes effect at the next reset, after which device probing
output is logged as shown in Example 1.
Diagnostic Technique 2:
=======================
Increase the verbosity level of post logging. Either the platform
postrc file (/etc/opt/SUNWSMS/config/platform/.postrc) or domain postrc
file (/etc/opt/SUNWSMS/config/[A-R]/.postrc) may be edited to add the
following directive.
verbose 30
The heightened log level will cause hpost to print a summary of the
GDCD data structure. The resulting log better demonstrates which
adapters are detected by hpost. The log may be used to determine
whether hpost is detecting a physically present adapter.
-------------------------------------------------------------------------
Creating GDCD IOSRAM handoff structures in Slot IO8...
Writing domain information to PCD...
CPU_Brds: Proc Mem P/B: 3/1 3/0 2/1 2/0 1/1 1/0 0/1 0/0
Slot Gen 3210 /L: 10 10 10 10 10 10 10 10 CDC
SB07: P PPPP mm PP mm PP mm PP mm PP P
SB08: P PPPP mm PP mm PP mm PP mm PP P
I/O_Brds: IOC P1/Bus/Adapt IOC P0/Bus/Adapt
Slot Gen Type P1 B1/10 B0/10 P0 B1/eb10 B0/10 (e=ENet, b=BBC)
IO08: P hsPCI P p _p p _p P p PP_m p _p
Configured in 333 with 8 procs, 16.000 GBytes, 3 IO adapters.
Interconnect frequency is 149.993 MHz, Measured.
Golden sram is on Slot IO8.
POST (level=16, verbose=30, -f) execution time 4:38
# SMI Sun Fire 15K POST log closed Wed Jun 26 09:07:37 2002
------------------------------------------------------------------------
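The postrc edit described in Diagnostic Technique 2 can be sketched as
a text transformation that leaves exactly one "verbose" directive in
the file contents. This is an illustration only. It assumes one
directive per line, and the "level 16" line in the sample is
illustrative content, not taken from a real .postrc file. The function
name is hypothetical. Keep a copy of the original file before changing
it.

```python
def ensure_verbose(postrc_text, level=30):
    """Return postrc text containing exactly one 'verbose <level>' line."""
    kept = []
    for line in postrc_text.splitlines():
        # Drop any existing verbose directive; keep everything else as is.
        if line.split()[:1] == ["verbose"]:
            continue
        kept.append(line)
    kept.append("verbose %d" % level)
    return "\n".join(kept) + "\n"


sample = "level 16\nverbose 20\n"
print(ensure_verbose(sample), end="")
```

The returned text would then be written back to the platform or domain
.postrc file named above.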
This issue is occurring because software bugs exist in hwad and hpost
which cause JTAG signals to float and the adapter reset to be
improperly applied. This affects all available SMS versions, including
those patched with hwad patch 112481 and hpost patch 112488. See
the bugs for further details.
Bugs 4732416 and 4732369 are written against the problems in the hpost
code. The hpost code is responsible for the failures that occur
immediately following a reset. The second component of the fix is for
the hwad daemon. Bugs 4735782 and 4735779 are written against the hwad
bugs. The problems in the hwad code would only be manifested during a
hot-swap operation. At this time we have not had a reported instance
of the hot-swap related bugs.
The fixes for the problem are addressed through separate patches for
hpost and hwad. The fixes for the hpost code and bugs 4732416 and
4732369 reside in patch 112488. The hwad fixes for bugs 4735782 and
4735779 are included in patch 112481. SMS 1.1 will not be patched
for these problems. Customers running SMS 1.1 who experience the
failure will need to upgrade to SMS 1.2.
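Whether patches 112488 and 112481 are present can be checked from saved
"showrev -p" output on the system where SMS is installed. The following
is a hedged sketch with a hypothetical function name. The revision
numbers and package name in the sample are placeholders, and this FIN
does not state which patch revisions contain the fixes, so the reported
revision must still be checked against the patch README.

```python
import re

# showrev -p prints one line per patch, beginning "Patch: NNNNNN-RR".
PATCH_LINE = re.compile(r"^Patch:\s+(?P<base>\d{6})-(?P<rev>\d{2})\b", re.MULTILINE)


def installed_revisions(showrev_text, wanted=("112488", "112481")):
    """Return the highest installed revision of each wanted patch, or None."""
    found = {base: [] for base in wanted}
    for match in PATCH_LINE.finditer(showrev_text):
        base = match.group("base")
        if base in found:
            found[base].append(int(match.group("rev")))
    return {base: (max(revs) if revs else None) for base, revs in found.items()}


# Placeholder sample: revisions and package name are made up.
sample = """\
Patch: 112488-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWexample
Patch: 112488-03 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWexample
Patch: 108528-13 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWexample
"""
print(installed_revisions(sample))
```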
IMPLEMENTATION:
---
| | MANDATORY (Fully Proactive)
---
---
| X | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| | REACTIVE (As Required)
---
CORRECTIVE ACTION:
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above mentioned
problem.
Full resolution for this issue is met by installing the currently
available patch 112488 and the future patch for bug 4735782. The
patch for bug 4735782 has not yet been defined.
Service personnel or customers should not wait for the patch for bug
4735782 before installing patch 112488. Install each patch as it
becomes available. The fix for bug 4735782 is needed only for adapter
hot-swap operation. At this time, adapter hot-swap is not available
because of OS bug 4496757. Until this OS bug is patched, the fix for
4735782 will have no value.
It is not recommended to replace adapters that are failing because of
the reset bug. The patches should be installed instead.
COMMENTS:
None.
============================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.