Document fins/I0765-1
FIN #: I0765-1
SYNOPSIS: Systems based on the UltraSPARC III family of processors may
experience "send mondo" panics for several different reasons
DATE: Jan/31/02
KEYWORDS: Systems based on the UltraSPARC III family of processors may
experience "send mondo" panics for several different reasons
---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
FIELD INFORMATION NOTICE
(For Authorized Distribution by SunService)
SYNOPSIS: Systems based on the UltraSPARC III family of processors may
experience "send mondo" panics for several different reasons.
Sun Alert: No
TOP FIN/FCO REPORT: No
PRODUCT_REFERENCE: UltraSPARC III family of processors
PRODUCT CATEGORY: Server / SW Admin
PRODUCTS AFFECTED:
Systems Affected:
----------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
- A28 ALL Sun Blade 1000 -
- A35 ALL Sun Fire 280R -
- A30 ALL Sun Fire V880 -
- S8 ALL Sun Fire 3800 -
- S12 ALL Sun Fire 4800 -
- S12i ALL Sun Fire 4810 -
- S24 ALL Sun Fire 6800 -
- F15K ALL Sun Fire 15K -
- N28 ALL Netra 20 -
List X-Options affected:
-----------------------
Mkt_ID Platform Model Description Serial Number
------ -------- ----- ----------- -------------
X4007A - - ASSY CPU-4PROC USIIIP 900MHz -
X4525A - - ASSY MAXCPU 900MHz CNFIG F15K -
X4004A - - ASSY CPU-2PROC USIII 750MHz -
X4005A - - ASSY CPU-4PROC USIII 900MHz -
X4006A - - ASSY CPU-2PROC USIIIP 900MHz -
X4046A - - ASSY CPU DUAL 750MHz AL A30 -
X4047A - - ASSY CPU DUAL 750MHz AL A30 -
XCPUBD-4049 - - ASSY CPU-4GB/4PROC USIII 900+M -
XCPUBD-F4089 - - ASSY CPU-8GB/4PROC USIII 900+M -
XCPUBD-F4169 - - ASSY CPU-16GB/4PROC USIII 900+M -
XCPUBD-F4329 - - ASSY CPU-32GB/4PROC USIII 900+M -
XCPUBD-2029 - - ASSY CPU-2GB/2PROC USIII 900+M -
XCPUBD-2049 - - ASSY CPU-4GB/2PROC USIII 900+M -
XCPUBD-2089 - - ASSY CPU-8GB/2PROC USIII 900+M -
SF-XCPUBD-227 - - ASSY CPU-2GB/2PROC USIII 750MHz -
SF-XCPUBD-447 - - ASSY CPU-4GB/4PROC USIII 750MHz -
SF-XCPUBD-487 - - ASSY CPU-8GB/4PROC 512MB USIII -
PART NUMBERS AFFECTED:
Part Number Description Model
----------- ----------- -----
540-5052-02 or below ASSY CPU-4PROC USIIIP 900+ MHz -
540-4729-04 or below ASSY CPU-2PROC USIII 750MHz -
540-4730-04 or below ASSY CPU-4PROC USIII 750MHz -
540-5051-02 or below ASSY CPU-2PROC USIIIP 900+ MHz -
501-5818-06 or below ASSY CPU DUAL 750MHz AL A30 -
540-4934-03 or below ASSY CPU-4GB/4PROC USIII 900+ MHz -
540-4992-02 or below ASSY CPU-8GB/4PROC USIII 900+ MHz -
540-4990-03 or below ASSY CPU-16GB/4PROC USIII 900+ MHz -
540-4993-02 or below ASSY CPU-32GB/4PROC USIII 900+ MHz -
540-4984-02 or below ASSY CPU-2GB/2PROC USIII 900+ MHz -
REFERENCES:
BugId: 4432461 - memscrubber send mondo panic observed when prtdiag
-v was typed on max mem DS.
4618656 - -Xoptimize in JDK 1.2.2 is too agressive.
PatchId: 112127 - Hardware/PROM: Sun Fire 3800/4800/4810/6800 Systems
flashprom update.
111346 - Hardware/PROM: Sun Fire 3800/4800/4810/6800 Systems
flashprom update.
ESC: 533075
533696
PROBLEM DESCRIPTION:
CPR Engineering has become aware of problems in diagnosing "send
mondo" panics on systems based on the UltraSPARC III family of
processors, which includes both the UltraSPARC III and UltraSPARC III
Cu processors. Both the frequency of these panics and the rate of
misdiagnosis appear to be increasing. In many cases this has caused
customer dissatisfaction because parts were replaced unnecessarily
and additional outages occurred.
What is a 'send mondo' panic?
-----------------------------
The following description applies to the issue as seen in the
UltraSPARC III family of processors.
The mondo mechanism is used to send an interrupt to one or more
processors. In a multiprocessor system, when "CPU A" wants to
interrupt "CPU B", CPU A sends a mondo interrupt to CPU B. CPU A is the
initiator and CPU B is supposed to respond to the mondo dispatched by
CPU A. If CPU B does not respond to the request of CPU A, CPU A keeps
retrying for a specified time.
Once this time limit is reached, a "send mondo timeout" panic is
initiated by CPU A. As part of the panic procedure CPU A will attempt
to stop all other CPUs, and it will send an interrupt to all other CPUs
to request this. If some CPUs fail to stop as requested, then CPU A
will complain with "failed to stop" messages; hence a send mondo
timeout to CPU B is often accompanied by a "failed to stop CPU B"
message. This is to be expected, since CPU B has already failed to
respond to one interrupt.
The exact panic messages for mondo timeouts differ slightly depending
on whether the initiating CPU is sending a directed interrupt to one
chosen CPU or an interrupt to a set of CPUs at once. For the former
case, the panic will appear as:
panic[cpu16]/thread=2a100097d40:
send mondo timeout (target 0xb) [694443 NACK 0 BUSY]
The indicated target (0xb) is the non-responsive CPU.
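The fields of the directed-interrupt message can be pulled out
mechanically. The following is a minimal sketch in Python, not a Sun
tool: the function name and the regular expression are illustrative
assumptions based only on the sample message above, and the CPU number
in "panic[cpuNN]" is assumed to be decimal.

```python
import re

# Pattern for the directed (single target) form of the panic message.
# Built from the one sample in this FIN; other releases may format it differently.
DIRECTED_RE = re.compile(
    r"panic\[cpu(?P<initiator>\d+)\]/thread=(?P<thread>[0-9a-f]+):\s*"
    r"send mondo timeout \(target (?P<target>0x[0-9a-f]+)\)\s*"
    r"\[(?P<nack>\d+) NACK (?P<busy>\d+) BUSY\]",
    re.IGNORECASE,
)


def parse_directed_timeout(text):
    """Return initiator/target CPU ids from a directed send mondo panic, or None."""
    m = DIRECTED_RE.search(text)
    if m is None:
        return None
    return {
        "initiator_cpu": int(m.group("initiator")),  # assumed decimal
        "target_cpu": int(m.group("target"), 16),    # printed in hex (0xb)
        "nack": int(m.group("nack")),
        "busy": int(m.group("busy")),
    }


sample = """panic[cpu16]/thread=2a100097d40:
send mondo timeout (target 0xb) [694443 NACK 0 BUSY]"""

info = parse_directed_timeout(sample)
print("initiator CPU:", info["initiator_cpu"])
print("non-responsive CPU:", info["target_cpu"])
```

Note that the target is reported in hexadecimal, so target 0xb is
CPU 11 in decimal.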
When the mondo is being sent to a set of CPUs, the panic will appear as:
send mondo timeout [833333 NACK 0 BUSY]
IDSR 0x1 aids:
0                               <<<<<< hex value of the CPU ID that did not
                                       respond to the mondo interrupt
                                       (CPU 0 in this case) >>>>>>
panic: failed to stop cpu0      <<<<<< non-responsive CPU >>>>>>
panic[cpu1]/thread=30006a96840: <<<<<< victim CPU, which initiated the
                                       mondo request and did not get a
                                       response from the target CPU >>>>>>
send_mondo_set: timeout
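The set form can be read the same way. The sketch below is an
illustrative Python helper (the name parse_set_timeout and the parsing
rules are assumptions drawn from the single sample above, with the
annotations removed). It collects the hex CPU IDs listed after "aids:",
the CPUs named in "failed to stop" messages, and the initiating CPU.

```python
import re

HEX_TOKEN = re.compile(r"(0x)?[0-9a-fA-F]+")


def parse_set_timeout(text):
    """Extract CPU ids from the set form of the send mondo panic output."""
    result = {"initiator_cpu": None, "aids": [], "failed_to_stop": []}
    in_aids = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"panic\[cpu(\d+)\]", line)
        if m:  # the CPU that initiated the mondo and then panicked
            result["initiator_cpu"] = int(m.group(1))
            in_aids = False
            continue
        m = re.match(r"panic: failed to stop cpu(\d+)", line)
        if m:
            result["failed_to_stop"].append(int(m.group(1)))
            in_aids = False
            continue
        if re.match(r"IDSR\s+0x[0-9a-fA-F]+\s+aids:", line):
            in_aids = True
            line = line.split("aids:", 1)[1]  # ids may follow on the same line
        if in_aids:
            tokens = line.split()
            if all(HEX_TOKEN.fullmatch(t) for t in tokens):
                # ids are hex values of the CPUs that did not respond
                result["aids"].extend(int(t, 16) for t in tokens)
            else:
                in_aids = False
    return result


sample = """send mondo timeout [833333 NACK 0 BUSY]
IDSR 0x1 aids:
 0
panic: failed to stop cpu0
panic[cpu1]/thread=30006a96840:
send_mondo_set: timeout"""

print(parse_set_timeout(sample))
```

For the sample, the helper reports CPU 1 as the initiator and CPU 0 as
both the listed non-responder and the CPU that failed to stop.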
For systems based on the UltraSPARC III family of processors, three
scenarios are known to lead to a send mondo panic. Send mondo panics
may also result from other causes not yet identified, including
software issues that arise in the future.
----------------------
| Case 1: Bug 4432461 |
----------------------
On Sun Fire 3800, 4800, 4810, and 6800 systems with a firmware revision
of 5.11.7 or lower, a send mondo panic may occur if the send mondo target
CPU is running prtdiag or prtconf. This is caused by bug 4432461, which
identifies a conflict between Solaris and the OBP. The bug is fixed in
firmware versions 5.11.9 or higher. This case applies to Sun Fire 3800,
4800, 4810, and 6800 ONLY.
To identify if this case is causing the send mondo panic, first check the
firmware revision of the Sun Fire System Controller. "showsc" from the
platform shell of the System Controller will indicate an ScApp revision of
5.11.7 or lower if this case applies.
Example:
System Controller 'heslab-12':
Type 0 for Platform Shell
Type 1 for domain A console
Type 2 for domain B console
Type 3 for domain C console
Type 4 for domain D console
Input: 0
Platform Shell
heslab-12:SC>
heslab-12:SC> showsc
SC: SSC0
SC date: Fri Jan 04 11:54:14 PST 2002
SC uptime: 1 minute 46 seconds
ScApp version: 5.11.7 <--------this will show 5.11.7 or lower
RTOS version: 17
heslab-12:SC>
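The version check can be automated against captured "showsc" output.
The following Python sketch is illustrative only: the function names
are hypothetical, and it assumes the "ScApp version:" line appears as
in the example above. This FIN does not state the status of releases
between 5.11.7 and 5.11.9, so the sketch reports those as "unknown"
rather than guessing.

```python
import re

KNOWN_AFFECTED = (5, 11, 7)  # highest release this FIN lists as affected
FIXED_IN = (5, 11, 9)        # lowest release this FIN lists as fixed


def scapp_version(showsc_output):
    """Return the ScApp version as a tuple of ints, or None if not found."""
    m = re.search(r"ScApp version:\s*(\d+(?:\.\d+)*)", showsc_output)
    if m is None:
        return None
    return tuple(int(part) for part in m.group(1).split("."))


def case1_firmware_status(showsc_output):
    """Classify firmware as 'affected', 'fixed', or 'unknown' for Case 1."""
    version = scapp_version(showsc_output)
    if version is None:
        return "unknown"
    if version >= FIXED_IN:
        return "fixed"
    if version <= KNOWN_AFFECTED:
        return "affected"
    return "unknown"  # between 5.11.7 and 5.11.9: not covered by this FIN


sample = """SC: SSC0
SC date: Fri Jan 04 11:54:14 PST 2002
SC uptime: 1 minute 46 seconds
ScApp version: 5.11.7
RTOS version: 17"""

print(case1_firmware_status(sample))
```

Comparing tuples of integers avoids the string-comparison trap in
which "5.9" would sort above "5.11".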
Next, perform a core dump analysis on the core file. If this case
applies, the analysis will show a stack trace on the unresponsive
CPU which will resemble the following:
2a100f6ac31 client_handler+0x2c
2a100f6ace1 prom_getproplen+0x44
2a100f6adc1 opromioctl_cb+0x24c
2a100f6aea1 prom_tree_access+0x58
2a100f6af51 opromioctl+0x54
2a100f6b031 cdev_ioctl+0x40
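As a rough aid when reviewing many dumps, a stack can be screened for
this pattern. The Python sketch below is a heuristic, not an official
signature: it assumes each frame is printed as "address symbol+offset"
as in the trace above, and the choice of client_handler followed later
by opromioctl as the key frames is an assumption made for illustration.
A match only suggests Case 1 and should be confirmed by reading the
full trace.

```python
# Key frames chosen for illustration: an OBP client call (client_handler)
# reached from the openprom ioctl path (opromioctl). This is an assumption,
# not a signature published by Sun.
KEY_FRAMES = ("client_handler", "opromioctl")


def frame_names(stack_text):
    """Return the symbol names from lines of the form 'addr symbol+offset'."""
    names = []
    for line in stack_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and "+" in parts[-1]:
            names.append(parts[-1].split("+", 1)[0])
    return names


def resembles_case1(stack_text, key_frames=KEY_FRAMES):
    """True if all key frames appear in the stack in the given order."""
    remaining = iter(frame_names(stack_text))
    # 'name in remaining' consumes the iterator, which enforces ordering
    return all(name in remaining for name in key_frames)


sample = """2a100f6ac31 client_handler+0x2c
2a100f6ace1 prom_getproplen+0x44
2a100f6adc1 opromioctl_cb+0x24c
2a100f6aea1 prom_tree_access+0x58
2a100f6af51 opromioctl+0x54
2a100f6b031 cdev_ioctl+0x40"""

print(resembles_case1(sample))
```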
--------------------------------------------------------------------
| Case 2: J2SE v1.2.2 Non Standard JIT Compiler Optimization Option |
--------------------------------------------------------------------
If the system in question is using J2SE v1.2.2 (including product family
updates, e.g., J2SE v1.2.2_10) with the non-standard/experimental JIT
compiler optimization option:
-Xoptimize
send mondo panics may occur due to the code optimization strategy employed
by the J2SE v1.2.2 unsupported "-Xoptimize" option.
The non-standard/experimental J2SE v1.2.2 only JIT compiler optimization
option "-Xoptimize" may be removed or be subject to change in a future
update release of the product.
The issue is only known to occur with the J2SE v1.2.2 product family
on systems based on the UltraSPARC III family of processors. A core
dump analysis will show a JVM related thread on the unresponsive CPU.
To see if you are affected by this issue, first determine that you
are using J2SE v1.2.2:
Example:
prompt% /usr/bin/java -version
java version "1.2.2"
Solaris VM (build Solaris_JDK_1.2.2_10, native threads, sunwjit)
^^^^^^^^^^^^
|
Indicates Java version
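When many hosts must be checked, the version test can be scripted.
This Python sketch is illustrative: the function name is hypothetical
and the patterns are based only on the sample output above. Note that
"java -version" writes its report to standard error, so that stream
must be captured.

```python
import re


def j2se_122_info(version_output):
    """Report whether 'java -version' output belongs to the J2SE v1.2.2 family."""
    version = re.search(r'java version "([^"]+)"', version_output)
    build = re.search(r"build\s+([^,\s)]+)", version_output)
    version_str = version.group(1) if version else None
    is_122 = version_str is not None and (
        version_str == "1.2.2" or version_str.startswith("1.2.2_")
    )
    return {
        "version": version_str,
        "build": build.group(1) if build else None,  # e.g. update level _10
        "is_j2se_122": is_122,
    }


sample = '''java version "1.2.2"
Solaris VM (build Solaris_JDK_1.2.2_10, native threads, sunwjit)'''

print(j2se_122_info(sample))
```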
Next, determine whether the runtime option "-Xoptimize" is being used.
To check whether java is running with -Xoptimize:
Example:
prompt% /usr/ucb/ps auxwww | grep java
root 26388 18.4 0.510164826200 ? S 15:20:35 1:00
/usr/bin/../java/bin/../bin/sparc/native_threads/java -Xms64m -Xmx64m
COM.myapp.Main /var/tmp/testcase/properties.txt
root 26400 16.5 0.54278429432 ? S 15:20:37 0:55
/usr/bin/../java/bin/../bin/sparc/native_threads/java -Xms64m -Xmx64m
-Xgenconfig:4m,4m,semispaces,56m,56m,markcompact -Xoptimize COM.myapp.Mark
-count 100 -file results/hotspot122_NR8_SR16_2.txt
user1 26413 0.0 0.0 944 616 pts/2 S 15:20:50 0:00 grep java
prompt%
In this case, we see PID 26400 is running with -Xoptimize, which makes it
a candidate for this case.
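Because the ps output wraps long command lines, scanning it by eye is
error prone. The Python sketch below is an illustrative helper (its
name and record-detection rule are assumptions): it treats a line that
starts with "user PID %CPU" as the start of a process record, joins the
wrapped lines that follow, and reports the PIDs whose command runs a
"java" binary with the -Xoptimize argument.

```python
import os
import re

# A record starts with: user name, PID, and a %CPU value such as 18.4.
# This rule is inferred from the sample output and is an assumption.
RECORD_START = re.compile(r"^(?P<user>[A-Za-z_]\w*)\s+(?P<pid>\d+)\s+\d+\.\d+\s")


def xoptimize_java_pids(ps_output):
    """Return PIDs of java processes started with -Xoptimize."""
    records = []  # list of (pid, tokens)
    for line in ps_output.splitlines():
        m = RECORD_START.match(line)
        if m:
            records.append((int(m.group("pid")), line.split()))
        elif records:
            records[-1][1].extend(line.split())  # wrapped continuation line
    hits = []
    for pid, tokens in records:
        runs_java = any(os.path.basename(tok) == "java" for tok in tokens)
        if runs_java and "-Xoptimize" in tokens:
            hits.append(pid)
    return hits


sample_ps = """root 26388 18.4 0.510164826200 ? S 15:20:35 1:00
/usr/bin/../java/bin/../bin/sparc/native_threads/java -Xms64m -Xmx64m
COM.myapp.Main /var/tmp/testcase/properties.txt
root 26400 16.5 0.54278429432 ? S 15:20:37 0:55
/usr/bin/../java/bin/../bin/sparc/native_threads/java -Xms64m -Xmx64m
-Xgenconfig:4m,4m,semispaces,56m,56m,markcompact -Xoptimize COM.myapp.Mark
-count 100 -file results/hotspot122_NR8_SR16_2.txt
user1 26413 0.0 0.0 944 616 pts/2 S 15:20:50 0:00 grep java"""

print(xoptimize_java_pids(sample_ps))
```

The "grep java" process itself is ignored because its arguments do not
include -Xoptimize.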
In some cases a product installation that is deployed on the J2SE v1.2.2
family may have been configured to use the non-standard option
"-Xoptimize".
Examine the product-specific installation configuration files for the
"-Xoptimize" option and remove it if found.
Sample Entry (only in the case of iPlanet Portal Server v3.0):
jvm.option=-Xoptimize
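A configuration file can be scanned for active uses of the option. The
Python sketch below is illustrative only: the function name is
hypothetical, and it assumes that lines beginning with "#" are
comments, which may not hold for every product's configuration format.

```python
def find_xoptimize_lines(config_text, comment_prefix="#"):
    """Return (line number, text) for uncommented lines containing -Xoptimize."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comments; '#' as comment marker is an assumption.
        if not stripped or stripped.startswith(comment_prefix):
            continue
        if "-Xoptimize" in stripped:
            hits.append((lineno, stripped))
    return hits


# Hypothetical file contents; only the jvm.option=-Xoptimize entry
# comes from this FIN.
sample_config = """# JVM settings
jvm.option=-Xoptimize
#jvm.option=-Xoptimize
jvm.option=-Xms64m"""

for lineno, text in find_xoptimize_lines(sample_config):
    print("line", lineno, ":", text)
```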
If the initial checks as described above fail to show the use of J2SE
v1.2.2 with the -Xoptimize flag, but a core dump analysis of the system
failure shows a JVM related thread on the unresponsive CPU, proceed to
check with your software vendor regarding the configuration installed
and contact Sun support as required.
-----------------------
| Case 3: Bad Hardware |
-----------------------
send mondo panics can occur as a result of a hardware failure of the CPU
indicated as the send mondo target.
If a core dump analysis on the core file indicates a send mondo
panic has occurred, but cases 1 and 2 do not appear to apply,
a hardware fault can be suspected. Using the previous example:
0x30006db264f: send mondo timeout [833333 NACK 0 BUSY]
IDSR 0x1 aids:
0x30006d4e36f: 0                <<<<<< hex value of the CPU ID that did not
                                       respond to the mondo interrupt
                                       (CPU 0 in this case) >>>>>>
0x300016260af:
0x30006d3b540: panic: failed to stop cpu0  <<<<<< non-responsive CPU >>>>>>
0x30006d49c20:
panic[cpu1]/thread=30006a96840: <<<<<< victim CPU, which initiated the
                                       mondo request and did not get a
                                       response from the target CPU >>>>>>
0x30006d413e0: send_mondo_set: timeout
CPU 1 is the VICTIM in this case. CPU 0 is the non-responsive
CPU and should be replaced.
IMPLEMENTATION:
---
| | MANDATORY (Fully Proactive)
---
---
| | CONTROLLED PROACTIVE (per Sun Geo Plan)
---
---
| X | REACTIVE (As Required)
---
CORRECTIVE ACTION:
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.
The corrective action taken is dependent upon which of the defined
cases is encountered by the system. Please perform the appropriate
actions as defined below for the three cases identified above.
If a customer is experiencing repeated send mondo panics, the customer
does not fit the profile of Case 1 or 2, and send mondo panics continue
after replacement of the implicated CPU module, the case and any other
data should be escalated to CPRE immediately for more detailed
analysis.
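The selection logic described in this FIN can be summarized as a small
decision function. The Python sketch below is a reading aid under
stated assumptions, not an official procedure: the Findings fields and
the returned labels are hypothetical names, and each field stands for a
check the engineer has already performed by hand.

```python
from dataclasses import dataclass

# Case 1 applies only to these Sun Fire models, per this FIN.
CASE1_MODELS = {"3800", "4800", "4810", "6800"}


@dataclass
class Findings:
    platform_model: str                   # e.g. "4800" (hypothetical encoding)
    firmware_5_11_7_or_lower: bool        # from showsc on the System Controller
    prom_stack_on_target: bool            # target CPU stack resembles Case 1
    j2se_122_with_xoptimize: bool         # J2SE v1.2.2 running with -Xoptimize
    jvm_thread_on_target: bool            # core shows JVM thread on target CPU
    panics_after_cpu_replacement: bool = False


def triage(f):
    """Map findings to one of the actions named in this FIN."""
    if (f.platform_model in CASE1_MODELS
            and f.firmware_5_11_7_or_lower
            and f.prom_stack_on_target):
        return "case1"          # upgrade firmware to 5.11.9 or higher
    if f.j2se_122_with_xoptimize:
        return "case2"          # remove -Xoptimize or move to J2SE v1.3.1
    if f.jvm_thread_on_target:
        return "check_vendor"   # JVM thread seen but flag not found
    if f.panics_after_cpu_replacement:
        return "escalate"       # send case and data to CPRE
    return "case3"              # suspect hardware: replace the CPU FRU


print(triage(Findings("4800", True, True, False, False)))
```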
---------
| Case 1: |
---------
Upgrade customer firmware to 5.11.9 or higher. 5.12.5 is highly
recommended. Firmware updates may be obtained from any internal
SunSolve site or http://sunsolve.sun.com.
Preferred:
Patch-ID# 112127
Keywords: Sun_Fire firmware update 5.12.5 ScApp RTOS
Synopsis: Hardware/PROM: Sun Fire 3800/4800/4810/6800 Systems
flashprom update
Date: Oct/18/2001
Acceptable:
Patch-ID# 111346
Keywords: Sun_Fire firmware update 5.11.9 ScApp RTOS
Synopsis: Hardware/PROM: Sun Fire 3800/4800/4810/6800 Systems
flashprom update
Date: Sep/18/2001
For both patches, refer to the Install.info file for instructions on
updating the firmware using the files included in the patch.
---------
| Case 2: |
---------
Remove the -Xoptimize flag from the java command.
Remove or comment out the "-Xoptimize" option string in any J2SE v1.2.2
runtime configuration files (product specific).
Alternatively switch to the J2SE v1.3.1 product family.
---------
| Case 3: |
---------
Based on the analysis of the panic information, a Sun Authorized
Service Representative should replace the FRU associated with the
non-responding CPU. In some systems, this will be the CPU module
itself. In other systems, the FRU will be the System Board containing
the faulty CPU module.
COMMENTS:
None
===========================================================================
Implementation Footnote:
i) In case of MANDATORY FINs, Enterprise Services will attempt to
contact all affected customers to recommend implementation of
the FIN.
ii) For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical
support teams will recommend implementation of the FIN (to their
respective accounts), at the convenience of the customer.
iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the
need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network
browser as follows:
SunWeb Access:
--------------
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
* From there, select the appropriate link to query or browse the FIN and
FCO Homepage collections.
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
* From there, select the appropriate link to browse the FIN or FCO index.
Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist. Edist can be
accessed internally at the following URL: http://edist.corp/.
* From there, follow the hyperlink path of "Enterprise Services
Documentation" and click on "FIN & FCO attachments", then choose the
appropriate folder, FIN or FCO. This will display supporting
directories/files for FINs or FCOs.
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------
Copyright (c) 1997-2003 Sun Microsystems, Inc.