InfoDoc ID   Synopsis   Date
48395   Sun Fire[TM] 15K: POST level increments with repeated error   19 Dec 2002

Status Issued

Description

Synopsis: Sun Fire[TM] 15K POST level increments with repeated error

Description:

The POST level on a Sun Fire 15K increments if an error repeats within a given timeframe. Per PTS, this is a desired feature, in which dsmd increases the level at which POST is run. However, manual intervention is still required to blacklist the component. If the component is not blacklisted, it will continue to be included in the POST configuration. Eventually POST "may" deconfigure the component, but this DOES NOT blacklist the component. The user must add the component to the blacklist file manually.

Once the domain recovers and is booted, any subsequent error within a four-hour period is treated as a repeated error. After this four-hour period, the domain is considered recovered and healthy.
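
The four-hour rule can be illustrated with a minimal Python sketch. This is not dsmd code. The function name, the timestamps, and the assumption that the window is measured from the time the domain recovered and booted are all illustrative; the document does not state exactly how dsmd times the window or how it treats the boundary.

```python
from datetime import datetime, timedelta

# Assumption: the window starts when the domain has recovered and booted.
# The source only says "within a four hour period".
REPEAT_WINDOW = timedelta(hours=4)


def is_repeated_error(recovered_at: datetime, error_at: datetime) -> bool:
    """Return True if the error falls inside the repeat window after recovery."""
    elapsed = error_at - recovered_at
    # Boundary handling (exactly four hours) is a guess; it is treated
    # here as already outside the window.
    return timedelta(0) <= elapsed < REPEAT_WINDOW


# Hypothetical timestamps, chosen only to exercise both outcomes.
recovered = datetime(2002, 9, 29, 15, 50)
print(is_repeated_error(recovered, datetime(2002, 9, 29, 15, 58)))  # inside the window
print(is_repeated_error(recovered, datetime(2002, 9, 29, 20, 30)))  # outside the window
```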

NOTE: The level of the initial POST run after the domain is considered "healthy" again depends on the operation that triggers it. If it is due to a reset, the POST run will be a -Q (same as level 7). If it is due to an operator-initiated action, such as setkeyswitch, the level is defined by the contents of the .postrc file used. The default level is 16 if not specified in the .postrc file.
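
The selection described in the note above can be summarized as a small Python sketch. It is only a restatement of the note, not SMS code. The trigger labels "reset" and "operator" and the function name are hypothetical, and the .postrc level is passed in as a plain integer rather than parsed from a file.

```python
from typing import Optional

QUICK_POST_LEVEL = 7     # the note equates a -Q run with level 7
DEFAULT_POST_LEVEL = 16  # used when .postrc does not specify a level


def initial_post_level(trigger: str, postrc_level: Optional[int] = None) -> int:
    """Return the POST level for the first run on a 'healthy' domain.

    trigger: "reset" or "operator" (hypothetical labels for this sketch).
    postrc_level: level taken from the .postrc file in use, or None if the
    file does not specify one.
    """
    if trigger == "reset":
        return QUICK_POST_LEVEL
    if trigger == "operator":
        return postrc_level if postrc_level is not None else DEFAULT_POST_LEVEL
    raise ValueError(f"unknown trigger: {trigger!r}")


print(initial_post_level("reset"))
print(initial_post_level("operator"))
print(initial_post_level("operator", postrc_level=32))
```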

The following example of this behavior shows excerpts from the POST logs and the dstop dump(s) for a case where an error repeated within the given time period.

redxl> ld dsmd.dstop.020929.1549.03
Created Sun Sep 29 15:49:04 2002
By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15 executing as pid=21248
On ssc name = dbsc3-adm.ny.jpmorgan.com.
Domain = 2=C = eqny-comsyb2 Platform = eqny-db2
Boards in dump: master SC CPs/CSBs[1:0]: 3
EXB[17:0]: 02010
Slot0[17:0]: 02010
Slot1[17:0]: 02010
-D option, -d
"DSMD DomainStop Dump"
0 errors occurred while creating this dump.
redxl> wfail
SDI EX04/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX13/S0 Master_Stop_Status0[31:0] = 2004004F
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX13/S0 Dstop0[31:0] = 10019000
Dstop0[16]: D DARB texp requests all Dstop (M)
Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
EPLD SB13 Err1_Dom0: Mask= 00 Err= 80 1stErr= 80
Err1[7]: 1E+ Error reported by BBC1
BBC SB13/BB1 Device_Err_Stat[31:0] = 80008100
DevErr[ 8]: 1E Port 0 Safari device asserted error
Proc SB13/P2 (13.0.2) EmuShad[0:78] = 0020 00000000 00000000 (Note rev order)
EmuSh[ 9]: THUE: Etag ECC UE due to other access (P$, W$, wrback...).
AFSR [63:0] = 03900000.00000000 AFAR [42:4] = 1A3.9E821CE_
AFSR2[63:0] = 01100000.00000000 AFAR2[42:4] = 1A3.9E821CE_
AFSR[52]: 1E PRIV: Priviledged code access error(s) occurred.
AFSR[55]: TUE: Uncorrectable Ecache tag ECC error.
AFSR[56]: 1E TSCE: SW_handled Correctable Ecache tag ECC error.
AFSR[57]: THCE: Hardware corrected Ecache tag ECC error.
FAIL Proc SB13/P2: Dstop detected by Proc SB13/P2.
Primary service FRU is Slot SB13.
DARB C0: enabled ports (expanders) [17:0]: 07E3F
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 02000
DARB C1: enabled ports (expanders) [17:0]: 07E3F
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 02000                  

This error occurred four times before POST deconfigured the component. The level 64 POST run finally "deconfigured" (NOT blacklisted) the suspect component. Please note that, depending on the error, a lower POST level could have caught this, or a higher POST level may have been required to fail the component.

Finally, below you can see the post level changing in the various post runs.

NOTE: A "short" POST is the POST execution that dumps ASIC state for capture into the rstop or dstop dump file. Successful dump captures have an exit code of 85. Unsuccessful captures exit with 86 or 87, depending on the failure. Whether an rstop or a dstop occurs, a short POST log is generated from the capture of ASIC state. The only difference is that after a dstop the domain is rebooted, in which case a full (long) POST log should also appear.
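
As a minimal sketch of how the exit codes in the note could be checked in a script, the Python function below maps them to a success flag. The function name is hypothetical. The note does not say which failure produces 86 and which produces 87, so the sketch does not distinguish them.

```python
def short_post_succeeded(exit_code: int) -> bool:
    """Interpret the exit code of a short (ASIC state capture) POST run."""
    if exit_code == 85:
        return True   # dump captured successfully
    if exit_code in (86, 87):
        return False  # capture failed; the specific cause is not documented here
    # Any other value is outside what the note describes.
    raise ValueError(f"unexpected short POST exit code: {exit_code}")


for code in (85, 86, 87):
    print(code, short_post_succeeded(code))
```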

Therefore, in this scenario there are eight POST log files, four of which were generated by capturing the ASIC state.

post020929.1549.04.log:# pid = 21248 level = 16 verbose_level = 20
--Short post

post020929.1550.13.log:# pid = 21409 level = 16 verbose_level = 20
post020929.1550.13.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 16
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1558.16.log:# pid = 22516 level = 16 verbose_level = 20
--Short post

post020929.1559.12.log:# pid = 22650 level = 16 verbose_level = 20
post020929.1559.12.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 16
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1607.21.log:# pid = 23770 level = 16 verbose_level = 20
--Short post

post020929.1608.22.log:# pid = 23913 level = 32 verbose_level = 20
post020929.1608.22.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 32
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1620.10.log:# pid = 25531 level = 16 verbose_level = 20
--Short post

post020929.1621.05.log:# pid = 25660 level = 64 verbose_level = 20
post020929.1621.05.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 64
--Configured in 333 with 7 procs, 28.000 GBytes, 6 IO adapters.
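
The level progression in the listing above can be extracted with a short Python sketch. It assumes the input has the same grep-style format as this excerpt (file name, then ":# ", then the log line) and that, within this excerpt, a log with a "Cmdline:" line containing -Palt_level is a full POST run rather than a short one. The function name and the LISTING variable are illustrative only.

```python
import re
from typing import List, Tuple

# Lines copied from the listing above (header and Cmdline lines only).
LISTING = """\
post020929.1549.04.log:# pid = 21248 level = 16 verbose_level = 20
post020929.1550.13.log:# pid = 21409 level = 16 verbose_level = 20
post020929.1550.13.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 16
post020929.1558.16.log:# pid = 22516 level = 16 verbose_level = 20
post020929.1559.12.log:# pid = 22650 level = 16 verbose_level = 20
post020929.1559.12.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 16
post020929.1607.21.log:# pid = 23770 level = 16 verbose_level = 20
post020929.1608.22.log:# pid = 23913 level = 32 verbose_level = 20
post020929.1608.22.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 32
post020929.1620.10.log:# pid = 25531 level = 16 verbose_level = 20
post020929.1621.05.log:# pid = 25660 level = 64 verbose_level = 20
post020929.1621.05.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 64
"""

LOG_NAME = r"post\d{6}\.\d{4}\.\d{2}\.log"
HEADER_RE = re.compile(rf"^({LOG_NAME}):# pid = (\d+) level = (\d+) ")
CMDLINE_RE = re.compile(rf"^({LOG_NAME}):# Cmdline: .*-Palt_level (\d+)\s*$")


def full_post_levels(listing: str) -> List[Tuple[str, int]]:
    """Return (log name, level) for each log that has a -Palt_level Cmdline."""
    levels = {}     # log file name -> level reported in its header line
    full_runs = []  # log file names with a Cmdline line, in order of appearance
    for raw in listing.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            levels[header.group(1)] = int(header.group(3))
            continue
        cmdline = CMDLINE_RE.match(line)
        if cmdline and cmdline.group(1) not in full_runs:
            full_runs.append(cmdline.group(1))
    return [(name, levels[name]) for name in full_runs if name in levels]


for name, level in full_post_levels(LISTING):
    print(name, level)
```

Run against the excerpt, this lists the four full POST runs with levels 16, 16, 32 and 64, which matches the escalation described in this document.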
                  

NOTE: Again, please note that proc 418 (the suspect Proc SB13/P2 from the dstop analysis above) was missing from the showdevices and psrinfo output, but was NOT blacklisted.

INTERNAL SUMMARY:

Submitter Jerry Klohr

jkl@sun.com x50578

Applies To Sun Fire 15K, AFO Vertical Team Docs/HAS

Attachments none

SUBMITTER: Gerald Klohr APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000, Hardware/Sun Fire /12000 ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.