Document fins/I0765-1


FIN #: I0765-1

SYNOPSIS: Systems based on the UltraSPARC III family of processors may
          experience "send mondo" panics for several different reasons

DATE: Jan/31/02

KEYWORDS: Systems based on the UltraSPARC III family of processors may
          experience "send mondo" panics for several different reasons


---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)



SYNOPSIS: Systems based on the UltraSPARC III family of processors may 
          experience "send mondo" panics for several different reasons.


Sun Alert:          No

TOP FIN/FCO REPORT: No 
 
PRODUCT_REFERENCE:  UltraSPARC III family of processors  
 
PRODUCT CATEGORY:   Server / SW Admin


PRODUCTS AFFECTED:  
 
Systems Affected:
----------------
Mkt_ID   Platform   Model   Description         Serial Number
------   --------   -----   -----------         -------------
  -      A28         ALL    Sun Blade 1000            -
  -      A35         ALL    Sun Fire 280R             -
  -      A30         ALL    Sun Fire V880             -
  -      S8          ALL    Sun Fire 3800             -
  -      S12         ALL    Sun Fire 4800             -
  -      S12i        ALL    Sun Fire 4810             -
  -      S24         ALL    Sun Fire 6800             -
  -      F15K        ALL    Sun Fire 15K              -
  -      N28         ALL    Netra 20                  -
 

List X-Options affected:
-----------------------
Mkt_ID          Platform   Model   Description                  Serial Number
------          --------   -----   -----------                  -------------
X4007A             -         -     ASSY CPU-4PROC USIIIP 900MHz       -
X4525A             -         -     ASSY MAXCPU 900MHz CNFIG F15K      -
X4004A             -         -     ASSY CPU-2PROC USIII  750MHz       -
X4005A             -         -     ASSY CPU-4PROC USIII  900MHz       -
X4006A             -         -     ASSY CPU-2PROC USIIIP 900MHz       -
X4046A             -         -     ASSY CPU DUAL 750MHz AL A30        -
X4047A             -         -     ASSY CPU DUAL 750MHz AL A30        -
XCPUBD-4049        -         -     ASSY CPU-4GB/4PROC USIII 900+M     -
XCPUBD-F4089       -         -     ASSY CPU-8GB/4PROC USIII 900+M     -
XCPUBD-F4169       -         -     ASSY CPU-16GB/4PROC USIII 900+M    -
XCPUBD-F4329       -         -     ASSY CPU-32GB/4PROC USIII 900+M    -
XCPUBD-2029        -         -     ASSY CPU-2GB/2PROC USIII 900+M     -
XCPUBD-2049        -         -     ASSY CPU-4GB/2PROC USIII 900+M     -
XCPUBD-2089        -         -     ASSY CPU-8GB/2PROC USIII 900+M     -
SF-XCPUBD-227      -         -     ASSY CPU-2GB/2PROC USIII 750MHz    -
SF-XCPUBD-447      -         -     ASSY CPU-4GB/4PROC USIII 750MHz    -
SF-XCPUBD-487      -         -     ASSY CPU-8GB/4PROC 512MB USIII     -


PART NUMBERS AFFECTED: 

Part Number             Description                              Model  
-----------             -----------                              -----
540-5052-02 or below    ASSY CPU-4PROC USIIIP 900+ MHz             -
540-4729-04 or below    ASSY CPU-2PROC USIII 750MHz                -
540-4730-04 or below    ASSY CPU-4PROC USIII 750MHz                -
540-5051-02 or below    ASSY CPU-2PROC USIIIP 900+ MHz             -
501-5818-06 or below    ASSY CPU DUAL 750MHz AL A30                -
540-4934-03 or below    ASSY CPU-4GB/4PROC USIII 900+ MHz          -
540-4992-02 or below    ASSY CPU-8GB/4PROC USIII 900+ MHz          -
540-4990-03 or below    ASSY CPU-16GB/4PROC USIII 900+ MHz         -
540-4993-02 or below    ASSY CPU-32GB/4PROC USIII 900+ MHz         -
540-4984-02 or below    ASSY CPU-2GB/2PROC USIII 900+ MHz          -


REFERENCES:

BugId:   4432461 - memscrubber send mondo panic observed when prtdiag 
                   -v was typed on max mem DS.
         4618656 - -Xoptimize in JDK 1.2.2 is too aggressive.

PatchId: 112127 - Hardware/PROM: Sun Fire 3800/4800/4810/6800 Systems 
                     flashprom update.
         111346 - Hardware/PROM: Sun Fire 3800/4800/4810/6800 Systems 
                     flashprom update.

ESC:     533075 
         533696


PROBLEM DESCRIPTION:

CPR Engineering has become aware of issues with diagnosing "send
mondo" panics in systems based on the UltraSPARC III family of
processors.  The UltraSPARC III family of processors includes both the
UltraSPARC III and UltraSPARC III Cu processors.  Both the frequency of
these panics and the rate of misdiagnosis seem to be increasing, and in
many cases this has caused customer dissatisfaction due to parts being
replaced unnecessarily and the occurrence of additional outages.

   What is a 'send mondo' panic?  
   -----------------------------

The following description applies to the issue as seen in the
UltraSPARC III family of processors.

The mondo mechanism is used to send an interrupt to one or more
processors.  In a multiprocessor system, when "CPU A" wants to
interrupt "CPU B", CPU A sends a mondo interrupt to CPU B.  CPU A is
the initiator and CPU B is supposed to respond to the mondo dispatched
by CPU A.  If CPU B does not respond to the request of CPU A, CPU A
keeps retrying for a specified time.

Once this time limit is reached, a "send mondo timeout" panic is
initiated by CPU A.  As part of the panic procedure CPU A will attempt
to stop all other CPUs, and it will send an interrupt to all other CPUs
to request this.  If some CPUs fail to stop as requested, then CPU A
will complain with "failed to stop" messages; hence a send mondo
timeout to CPU B is often accompanied by a "failed to stop CPU B"
message.  This will likely happen since we know that CPU B has already
failed to receive one interrupt.

The exact panic messages for mondo timeouts differ slightly depending
on whether the initiating CPU is sending a directed interrupt to one
chosen CPU or an interrupt to a set of CPUs at once.  For the former
case, the panic will appear as:

   panic[cpu16]/thread=2a100097d40: 
	send mondo timeout (target 0xb) [694443 NACK 0 BUSY]

The indicated target (0xb) is the non-responsive CPU.  

When the mondo is being sent to a set of CPUs, the panic will appear as:

   send mondo timeout [833333 NACK 0 BUSY]
   IDSR 0x1  aids:
   0  <<<<<< hex value of cpu id that didn't respond to
             mondo interrupt - CPU 0 in this case >>>>>>
   panic: failed to stop cpu0
          <<<<< Non-responsive CPU >>>>>
   panic[cpu1]/thread=30006a96840:
          <<<<< victim CPU which initiated the mondo request and
                didn't get a response from target CPU >>>>>
   send_mondo_set: timeout
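
As an illustration only, the two message forms above can be pulled apart
with a short script.  The sketch below assumes the console text matches
the samples shown here; the regular expressions and the function name
summarize_panic are hypothetical and are not part of any Sun tool.  It
also assumes the number in "panic[cpuN]" and "failed to stop cpuN" is
decimal, while the "target" value is hexadecimal as printed.

```python
import re

# Patterns modeled on the two sample messages in this FIN. Real console
# output may differ, so treat these patterns as assumptions.
DIRECTED = re.compile(r"send mondo timeout \(target (0x[0-9a-fA-F]+)\)")
FAILED_TO_STOP = re.compile(r"failed to stop cpu(\d+)")
INITIATOR = re.compile(r"panic\[cpu(\d+)\]")


def summarize_panic(text):
    """Return (initiator_cpu, sorted list of non-responsive CPU ids)."""
    initiator = None
    match = INITIATOR.search(text)
    if match:
        initiator = int(match.group(1))
    # Directed form: the target is printed in hex, e.g. 0xb.
    targets = {int(value, 16) for value in DIRECTED.findall(text)}
    # Set form: the target shows up in the "failed to stop" message.
    targets.update(int(value) for value in FAILED_TO_STOP.findall(text))
    return initiator, sorted(targets)


directed = (
    "panic[cpu16]/thread=2a100097d40:\n"
    "        send mondo timeout (target 0xb) [694443 NACK 0 BUSY]\n"
)
cpu_set = (
    "send mondo timeout [833333 NACK 0 BUSY]\n"
    "IDSR 0x1  aids:\n"
    "0\n"
    "panic: failed to stop cpu0\n"
    "panic[cpu1]/thread=30006a96840:\n"
    "send_mondo_set: timeout\n"
)
print(summarize_panic(directed))
print(summarize_panic(cpu_set))
```

Note that target 0xb in the directed example is CPU 11 in decimal.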

For systems based on the UltraSPARC III family of processors, three
scenarios are known to lead to a send mondo panic.  Send mondo panics
may also result from other causes, including software issues that have
not yet been identified.

   ----------------------
  | Case 1:  Bug 4432461 |
   ----------------------

On Sun Fire 3800, 4800, 4810, and 6800 systems with a firmware revision
of 5.11.7 or lower, a send mondo panic may occur if the send mondo target
CPU is running prtdiag or prtconf.  This is caused by bug 4432461, which
identifies a conflict between Solaris and the OBP.  The bug is fixed in
firmware versions 5.11.9 and higher.  This case applies to Sun Fire 3800,
4800, 4810, and 6800 ONLY.

To identify if this case is causing the send mondo panic, first check the
firmware revision of the Sun Fire System Controller.  "showsc" from the
platform shell of the System Controller will indicate an ScApp revision of
5.11.7 or lower if this case applies.

Example:

    System Controller 'heslab-12':

        Type  0  for Platform Shell
    
        Type  1  for domain A console
        Type  2  for domain B console
        Type  3  for domain C console
        Type  4  for domain D console

        Input: 0

    Platform Shell

    heslab-12:SC> 
    heslab-12:SC> showsc

    SC: SSC0  

    SC date: Fri Jan 04 11:54:14 PST 2002
    SC uptime: 1 minute 46 seconds 

    ScApp version: 5.11.7   <--------this will show 5.11.7 or lower
    RTOS version: 17

    heslab-12:SC>
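
The version check above can be sketched as a small script.  This is a
minimal illustration that assumes the "ScApp version:" line appears as
in the sample output; the function names are hypothetical.  Versions
between 5.11.7 and 5.11.9 are not classified by this FIN, so the sketch
reports them as "unknown" rather than guessing.

```python
import re

AFFECTED_MAX = (5, 11, 7)  # "5.11.7 or lower" per this FIN
FIXED_MIN = (5, 11, 9)     # "5.11.9 or higher" per this FIN


def scapp_version(showsc_output):
    """Extract the ScApp version as a tuple of ints, or None if absent."""
    match = re.search(r"ScApp version:\s*(\d+(?:\.\d+)*)", showsc_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def case1_firmware_status(showsc_output):
    """Classify firmware as 'affected', 'fixed', or 'unknown'."""
    version = scapp_version(showsc_output)
    if version is None:
        return "unknown"
    if version <= AFFECTED_MAX:
        return "affected"
    if version >= FIXED_MIN:
        return "fixed"
    return "unknown"  # e.g. 5.11.8, which this FIN does not classify


sample = "    ScApp version: 5.11.7\n    RTOS version: 17\n"
print(case1_firmware_status(sample))
```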

Next, perform a core dump analysis on the core file.  If this case
applies, the analysis will show a stack trace on the unresponsive
CPU which will resemble the following:

   2a100f6ac31   client_handler+0x2c
   2a100f6ace1   prom_getproplen+0x44
   2a100f6adc1   opromioctl_cb+0x24c
   2a100f6aea1   prom_tree_access+0x58
   2a100f6af51   opromioctl+0x54
   2a100f6b031   cdev_ioctl+0x40
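
For illustration, a trace can be compared against the example above
with a few lines of script.  The sketch assumes each line has the form
"address  function+offset" as shown, and it assumes that the presence
of opromioctl and prom_tree_access is a reasonable marker for this
case.  That choice of frames is a guess based on the sample trace, not
a documented signature, so a match still needs review by an engineer.

```python
# Frame names taken from the sample trace in this FIN. Requiring this
# particular pair is an assumption, not a documented signature.
CASE1_FRAMES = ("opromioctl", "prom_tree_access")


def frame_names(stack_text):
    """Extract function names from lines like 'addr  func+0xoff'."""
    names = []
    for line in stack_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            names.append(fields[1].split("+")[0])
    return names


def resembles_case1(stack_text):
    """True if every marker frame appears somewhere in the trace."""
    names = set(frame_names(stack_text))
    return all(frame in names for frame in CASE1_FRAMES)


sample_trace = """\
   2a100f6ac31   client_handler+0x2c
   2a100f6ace1   prom_getproplen+0x44
   2a100f6adc1   opromioctl_cb+0x24c
   2a100f6aea1   prom_tree_access+0x58
   2a100f6af51   opromioctl+0x54
   2a100f6b031   cdev_ioctl+0x40
"""
print(resembles_case1(sample_trace))
```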


   --------------------------------------------------------------------
  | Case 2:  J2SE v1.2.2 Non Standard JIT Compiler Optimization Option |
   --------------------------------------------------------------------

If the system in question is using J2SE v1.2.2 (including product family
updates, e.g., J2SE v1.2.2_10) with the non-standard/experimental JIT 
compiler optimization option:

        -Xoptimize

send mondo panics may occur due to the code optimization strategy employed
by the J2SE v1.2.2 unsupported "-Xoptimize" option.

The non-standard/experimental J2SE v1.2.2 only JIT compiler optimization
option "-Xoptimize" may be removed or be subject to change in a future
update release of the product.

The issue is only known to occur with the J2SE v1.2.2 product family 
on systems based on the UltraSPARC III family of processors.  A core 
dump analysis will show a JVM related thread on the unresponsive CPU.  

To see if you are affected by this issue,  first determine that you 
are using J2SE v1.2.2:

Example:

    prompt% /usr/bin/java -version
    java version "1.2.2"
    Solaris VM (build Solaris_JDK_1.2.2_10, native threads, sunwjit)
                              ^^^^^^^^^^^^
                                   |
                              Indicates Java version

Next, determine if the runtime option "-Xoptimize" is being used.  You
can check to see if java is running with -Xoptimize with:

Example:

    prompt% /usr/ucb/ps auxwww | grep java
    root     26388 18.4  0.510164826200 ?        S 15:20:35  1:00
        /usr/bin/../java/bin/../bin/sparc/native_threads/java -Xms64m
        -Xmx64m COM.myapp.Main /var/tmp/testcase/properties.txt
    root     26400 16.5  0.54278429432 ?        S 15:20:37  0:55
        /usr/bin/../java/bin/../bin/sparc/native_threads/java -Xms64m
        -Xmx64m -Xgenconfig:4m,4m,semispaces,56m,56m,markcompact
        -Xoptimize COM.myapp.Mark -count 100
        -file results/hotspot122_NR8_SR16_2.txt
    user1   26413  0.0  0.0  944  616 pts/2    S 15:20:50  0:00 grep java
    prompt%

In this case, we see PID 26400 is running with -Xoptimize, which makes it
a candidate for this case.
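
On a system with many Java processes, the same check can be scripted.
The sketch below is an assumption-laden illustration: it expects each
process on a single unwrapped line of "/usr/ucb/ps auxwww" output with
the PID in the second column, and the function name xoptimize_pids is
hypothetical.

```python
def xoptimize_pids(ps_lines):
    """Return PIDs of java processes whose command line has -Xoptimize."""
    pids = []
    for line in ps_lines:
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # header line or a wrapped fragment
        is_java = any(f == "java" or f.endswith("/java") for f in fields)
        # Match the option as a whole argument, not as a substring.
        if is_java and "-Xoptimize" in fields:
            pids.append(int(fields[1]))
    return pids


JAVA = "/usr/bin/../java/bin/../bin/sparc/native_threads/java"
ps_output = [
    "root     26388 18.4  0.510164826200 ?        S 15:20:35  1:00 "
    + JAVA + " -Xms64m -Xmx64m COM.myapp.Main",
    "root     26400 16.5  0.54278429432 ?        S 15:20:37  0:55 "
    + JAVA + " -Xms64m -Xmx64m -Xoptimize COM.myapp.Mark -count 100",
    "user1   26413  0.0  0.0  944  616 pts/2    S 15:20:50  0:00 grep java",
]
print(xoptimize_pids(ps_output))
```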

In some cases a product installation that is deployed on the J2SE v1.2.2
family may have been configured to use the non-standard option
"-Xoptimize".  Specific product installation configuration files will
need to be examined for the "-Xoptimize" option, and if it is found it
needs to be removed.

Sample Entry (only in the case of iPlanet Portal Server v3.0):

    jvm.option=-Xoptimize
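
Configuration files can be searched for the option with a short script.
This is a sketch only: it assumes a line-oriented file in which "#"
starts a comment, which may not hold for every product, and the
function name find_xoptimize is hypothetical.

```python
def find_xoptimize(config_text):
    """Return 1-based line numbers of active lines with -Xoptimize."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # assumed comment syntax; already disabled
        if "-Xoptimize" in stripped:
            hits.append(number)
    return hits


sample_config = """\
# JVM settings
jvm.option=-Xms64m
jvm.option=-Xoptimize
#jvm.option=-Xoptimize
"""
print(find_xoptimize(sample_config))
```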

If the initial checks as described above fail to show the use of J2SE
v1.2.2 with the -Xoptimize flag, but a core dump analysis of the system
failure shows a JVM related thread on the unresponsive CPU, proceed to
check with your software vendor regarding the configuration installed
and contact Sun support as required.

    -----------------------
   | Case 3:  Bad Hardware |
    -----------------------

Send mondo panics can occur as a result of a hardware failure of the CPU
indicated as the send mondo target.

If a core dump analysis on the core file indicates a send mondo
panic has occurred, but cases 1 and 2 do not appear to apply, 
a hardware fault can be suspected.  Using the previous example:

   0x30006db264f:  send mondo timeout [833333 NACK 0 BUSY]
   IDSR 0x1  aids:
   0x30006d4e36f:   0  <<<<<< hex value of cpu id that didn't respond
                              to mondo interrupt - CPU 0 in this case
                              >>>>>>
   0x300016260af:  
   0x30006d3b540:  panic: failed to stop cpu0
                       <<<<< Non-responsive CPU >>>>>
   0x30006d49c20:  
   panic[cpu1]/thread=30006a96840:
                       <<<<< victim CPU which initiated the mondo
                             request and didn't get a response from
                             target CPU >>>>>
   0x30006d413e0:  send_mondo_set: timeout

CPU 1 is the VICTIM in this case.  CPU 0 is the non-responsive
CPU and should be replaced.


IMPLEMENTATION: 

         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---


CORRECTIVE ACTION:

The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned problem.

The corrective action taken is dependent upon which of the defined
cases is encountered by the system.  Please perform the appropriate
actions as defined below for the three cases identified above. 

If a customer is experiencing repeated send mondo panics, the customer
does not fit the profile of Case 1 or 2, and send mondo panics continue
after replacement of the implicated CPU module, the case and any other
data should be escalated to CPRE immediately for more detailed
analysis.
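
The decision flow described in this FIN can be summarized as a short
sketch.  The field names, the Findings type, and the recommend function
are hypothetical and exist only to make the order of checks explicit.
The sketch is not a substitute for core dump analysis.

```python
from dataclasses import dataclass

# Platforms to which Case 1 applies, per this FIN.
CASE1_PLATFORMS = {
    "Sun Fire 3800", "Sun Fire 4800", "Sun Fire 4810", "Sun Fire 6800",
}


@dataclass
class Findings:
    platform: str
    scapp_affected: bool        # ScApp version 5.11.7 or lower
    prom_stack_on_target: bool  # trace resembles the Case 1 example
    xoptimize_in_use: bool      # J2SE v1.2.2 running with -Xoptimize
    cpu_already_replaced: bool  # implicated CPU FRU already replaced


def recommend(f):
    """Map findings to the action this FIN gives for each case."""
    if (f.platform in CASE1_PLATFORMS and f.scapp_affected
            and f.prom_stack_on_target):
        return "Case 1: upgrade firmware to 5.11.9 or higher"
    if f.xoptimize_in_use:
        return "Case 2: remove -Xoptimize or switch to J2SE v1.3.1"
    if f.cpu_already_replaced:
        return "Escalate to CPRE"
    return "Case 3: replace the FRU for the non-responding CPU"


example = Findings("Sun Fire 6800", True, True, False, False)
print(recommend(example))
```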
  
   ---------
  | Case 1: |  
   ---------

Upgrade customer firmware to 5.11.9 or higher.  5.12.5 is highly
recommended.  Firmware updates may be obtained from any internal
SunSolve site or http://sunsolve.sun.com.

  Preferred:

    Patch-ID# 112127
    Keywords: Sun_Fire firmware update 5.12.5 ScApp RTOS
    Synopsis: Hardware/PROM: Sun Fire 3800/4800/4810/6800 Systems 
    flashprom update
    Date: Oct/18/2001

  Acceptable:

    Patch-ID# 111346
    Keywords: Sun_Fire firmware update 5.11.9 ScApp RTOS
    Synopsis: Hardware/PROM: Sun Fire 3800/4800/4810/6800 Systems 
    flashprom update
    Date: Sep/18/2001

For both patches, refer to the Install.info file for instructions on 
updating the firmware using the files included in the patch.


   ---------
  | Case 2: |  
   ---------

Remove the -Xoptimize flag from the java command.

Remove or comment out the "-Xoptimize" option in any J2SE v1.2.2
runtime configuration files (product specific).

Alternatively, switch to the J2SE v1.3.1 product family.


   ---------
  | Case 3: |  
   ---------

Based on the analysis of the panic information, a Sun Authorized
Service Representative should replace the FRU associated with the
non-responding CPU.  In some systems, this will be the CPU module
itself.  In other systems, the FRU will be the System Board containing
the faulty CPU module.


COMMENTS:  

None

===========================================================================

Implementation Footnote:
 
i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
  
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
 
* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
  
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
 
* From there, select the appropriate link to browse the FIN or FCO index.
 
Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
   
* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.
    
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------


Copyright (c) 1997-2003 Sun Microsystems, Inc.