SRDB ID   Synopsis   Date
49205   Sun Fire[TM] 12K/15K: Dstop: Slot1 IO Data Valid phase error   3 Dec 2002

Status Issued

Description
- Problem Statement/Title: SF15K Troubleshooting Article:

        Dstop: Slot1 IO Data Valid phase error (SDI)

- Symptoms:

        'wfail' output reports something similar to the following:

           01  redxl> dumpf load dsmd.dstop.021125.1132.46
           02  Created Mon Nov 25 11:32:47 2002
           03  By hpost v. 1.3 sms1.3_14 Nov 14 2002 15:19:45  executing as pid=5395
           04  On ssc name =  xc15p13-sc1.SD_Lab.West.Sun.COM
           05  Domain =  2=C = b1    Platform = sun15
           06  Boards in dump: master SC    CPs/CSBs[1:0]: 3
           07            EXB[17:0]: 000A0
           08          Slot0[17:0]: 000A0
           09          Slot1[17:0]: 000A0
           10  -D option, -d
           11  "DSMD DomainStop Dump"
           12  Created in a Sun Microsystems Inc. internal environment.
           13  7 errors occurred while creating this dump.
           14  redxl> wfail
           15  SDI EX05/S0  Master_Stop_Status0[31:0] = E004000F
           16          MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
           17  SDI EX05/S0  Dstop0[31:0] = 01018100
           18          Dstop0[16]: D    DARB texp requests all Dstop (M)
           19          Dstop0[24]: D 1E SDI internal Slot1 port requested Dstop
           20  SDI EX05/S0  Slot1_Error0[31:0] = 4000C000  Mask = 3000FFFF
           21          S1Err0[30]: D 1E Slot1 IO Data Valid phase error (M)
           22  FAIL Slot IO5:  Dstop/Rstop detected by SDI
           23  Primary service FRU is Slot IO5.
           24  Secondary service FRU is EXB EX5.
           25  SDI EX07/S0: All SDI is DStopped and RStopped,         requested by DARB.
           26  DARB C0: enabled ports (expanders)          [17:0]: 3FDF1
           27  DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00020
           28  DARB C1: enabled ports (expanders)          [17:0]: 3FDF1
           29  DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00020
                  

SOLUTION SUMMARY:
- Troubleshooting:

        The dump header shows that this Dstop dump was generated by dsmd (lines 10,11)
        while a domain was active. This is also evident from the dumpf file name:
        dsmd.dstop files are created by dsmd as part of an ASR. Also note that errors
        occurred while creating the dump (line 13). This typically indicates that
        a component was not available while register information was being collected.
        Walking the error chain:

         - Master SDI on EX5 is directed to Dstop by itself (lines 18,19)
         - Master SDI on EX5 reports a Data Valid phase error (line 21)
         - IO5 is FAILed from the configuration (line 22)
         - IO5 and EX5 are named primary and secondary FRUs (lines 23,24)
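
        The register values in the 'wfail' output can be cross-checked by hand or
        with a few lines of code. The following is a minimal sketch, not part of
        redx or SMS: the helper names are hypothetical, and it assumes that a 1 in
        the Mask value means the corresponding error bit is ignored (which is
        consistent with 'wfail' reporting only S1Err0[30] in this example).

```python
def set_bits(value, width=32):
    """Return the positions of all bits set in value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


def unmasked_bits(value, mask, width=32):
    """Return the set bits of value that are not covered by mask.

    Assumption: a 1 in the mask means "ignore this bit".
    """
    return set_bits(value & ~mask, width)


# Values copied from the dump header and 'wfail' output in this article.
dstop0 = 0x01018100          # line 17
slot1_error0 = 0x4000C000    # line 20
slot1_mask = 0x3000FFFF      # line 20
exb_in_dump = 0x000A0        # line 07
darb_other_req = 0x00020     # lines 27 and 29

print("Dstop0 bits set:            ", set_bits(dstop0))
print("Slot1_Error0 unmasked bits: ", unmasked_bits(slot1_error0, slot1_mask))
print("Expanders in dump:          ", set_bits(exb_in_dump, 18))
print("Expanders with Dstop req:   ", set_bits(darb_other_req, 18))
```

        Run as-is, this lists bits 8, 15, 16 and 24 for Dstop0 ('wfail' annotates
        only 16 and 24), bit 30 for Slot1_Error0, expanders 5 and 7 as present in
        the dump, and expander 5 as the one for which Dstop was requested.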

        As part of its normal operation, the SDI compares its input clock with
        that of its Slot 1 board. Specifically, the SDC on the Slot 1 board
        produces a signal (dataid_vld_l) that is the inverse of the input clock
        signal. If the SDI detects that the inverse signal is no longer in phase
        with the input signal, an error occurs.
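
        As a purely conceptual illustration of this check, the sketch below models
        both signals as lists of 0/1 samples and flags every sample where
        dataid_vld_l is not the inverse of the input clock. The sampled-list model
        and the function name are assumptions made for illustration; the real
        comparison is done in SDI hardware and is not described in this article.

```python
def phase_errors(input_clock, dataid_vld_l):
    """Return sample indexes where dataid_vld_l is not the inverse of the clock."""
    return [
        index
        for index, (clk, vld_l) in enumerate(zip(input_clock, dataid_vld_l))
        if vld_l != 1 - clk
    ]


clock = [0, 1, 0, 1, 0, 1, 0, 1]
healthy = [1, 0, 1, 0, 1, 0, 1, 0]      # exact inverse of the clock
interrupted = [1, 0, 1, 0, 0, 0, 0, 0]  # stops toggling after sample 3

print(phase_errors(clock, healthy))      # no mismatches
print(phase_errors(clock, interrupted))  # mismatches once the signal stops
```

        In this toy model, an interrupted signal (for example, from a board that
        loses power) is detected as soon as it stops tracking the clock.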

        IO5 and EX5 are named as suspect FRUs because the dataid_vld_l signal
        crosses an interconnect. IO5 is the one FAILed because, overall, its
        removal has less impact on the domain/system.

- Resolution:

        In general, a lone "Data Valid phase error" is an indicator that the Slot 1
        board is at fault. However, if multiple components report clocking errors,
        further analysis is required (see SRDB 48293).

        In addition to the dump file, investigate the following to gather additional
        evidence that may point to a specific FRU:

           - Were any clock input failures reported? (Assumes SMS 1.2 with
             patch 112481-06, or higher)
           - Did a failover occur prior to the Dstop? When an SC becomes
             MAIN, it will attempt to migrate clocks to the MAIN.
           - Was the IO board powered off?
           - Was an SC powered off?
           
        Any of the above can account for an interruption in the clock. On an
        older/unpatched version of SMS, running 'showclocksrc' (downloadable
        from CPRE) may also reveal bad inputs.

- Summary of part number and patch IDs:

        http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html
        Patch ID 112481-06

- References and bug IDs:

        SRDB 48122
        SRDB 48293
        http://pts-americas.west.sun.com/esg/hsg/starcat/tools/showclocksrc.html

- Additional background information:

        In this example, the dump header shows errors during dump creation (line 13).
        By trying various "show" commands in 'redx', you may determine which
        component(s) caused the errors. A good starting point is the suspect
        parts called out by 'wfail'. For this example:

           30  redxl> shar 5 1
           31  Note: Data is displayed from the currently loaded dump file.
           32  AR   IO5 (5.1)   Component ID = DEAD2BBC
           33          NOTE: 3 errors in the process of creating this structure
           34  redxl> shsdc 5 1
           35  Note: Data is displayed from the currently loaded dump file.
           36  SDC IO05   Component ID = DEAD2BBC
           37          NOTE: 3 errors in the process of creating this structure
           38  redxl> shdx 5 1
           39  Note: Data is displayed from the currently loaded dump file.
           40  DXs IO5  Component ID = DEADBAD0
           41          .0    Dev_Temp[8:0] = 000:  Valid  0.51 DegC
           42          .1    Dev_Temp[8:0] = 000:  Valid  0.51 DegC
           43          DX asics on board with non-0 error status [1:0] = 0
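
        Component ID values in output like the above can also be checked
        mechanically. The sketch below is an illustration only: it assumes the
        redx output has been saved as text without this article's line numbers,
        and it treats only DEAD2BBC and DEADBAD0 (the two values seen in this
        example) as signs of a component that could not be read. The function
        name is hypothetical.

```python
import re

# Assumption: only the two values seen in this example are treated as placeholders.
PLACEHOLDER_IDS = {"DEAD2BBC", "DEADBAD0"}

COMPONENT_ID = re.compile(
    r"^(?P<component>.+?)\s+Component ID = (?P<id>[0-9A-F]{8})\s*$"
)


def unreadable_components(redx_output):
    """Return (component, id) pairs whose Component ID is a placeholder value."""
    found = []
    for line in redx_output.splitlines():
        match = COMPONENT_ID.match(line.strip())
        if match and match.group("id") in PLACEHOLDER_IDS:
            # Collapse the column padding inside the component name.
            name = " ".join(match.group("component").split())
            found.append((name, match.group("id")))
    return found


sample = """\
AR   IO5 (5.1)   Component ID = DEAD2BBC
        NOTE: 3 errors in the process of creating this structure
SDC IO05   Component ID = DEAD2BBC
        NOTE: 3 errors in the process of creating this structure
DXs IO5  Component ID = DEADBAD0
"""

for name, component_id in unreadable_components(sample):
    print(f"{name}: {component_id}")
```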

        Nothing on IO5 is readable. The ASICs that are directly accessible via the
        console bus (AR, SDC, DX) report either DEAD2BBC or DEADBAD0. This is an
        indicator of a power issue. Also, from the platform message log:

           Nov 25 11:21:27 2002 xc15p13-sc1 esmd[27845]: [2000 343712253731524 ERR 
            SysControl.cc 2635] A power failure has been detected on a redundant power 
            supply at +1.5_vdc1_ok; located on HPCI+ at IO15. SCHEDULE REPLACEMENT of 
            HPCI+ at IO15 as soon as possible. If an additional failure occurs on this 
            supply it may crash any dependent domain(s).

        For this example, the HPCI+ board was experiencing power issues. IO5 is the
        target FRU.
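
        Correlating platform log messages with the dump creation time can be
        scripted. The sketch below is a minimal example under stated assumptions:
        each log entry begins with a "Mon DD HH:MM:SS YYYY" timestamp as in the
        esmd message above, month and day names are in English (C locale), and
        the 30 minute look-back window is an arbitrary choice. The function names
        are hypothetical and are not part of SMS.

```python
from datetime import datetime, timedelta


def parse_log_time(line):
    """Parse a leading 'Mon DD HH:MM:SS YYYY' timestamp; return None if absent."""
    fields = line.split()
    if len(fields) < 4:
        return None
    try:
        return datetime.strptime(" ".join(fields[:4]), "%b %d %H:%M:%S %Y")
    except ValueError:
        return None  # continuation line or unrelated text


def messages_before(log_lines, dump_created, window=timedelta(minutes=30)):
    """Return (age, line) pairs for messages logged within window before the dump."""
    hits = []
    for line in log_lines:
        stamp = parse_log_time(line)
        if stamp is None:
            continue
        age = dump_created - stamp
        if timedelta(0) <= age <= window:
            hits.append((age, line))
    return hits


# "Created" time from line 02 of the dump header.
dump_created = datetime.strptime("Mon Nov 25 11:32:47 2002", "%a %b %d %H:%M:%S %Y")

log = [
    "Nov 25 11:21:27 2002 xc15p13-sc1 esmd[27845]: [2000 343712253731524 ERR "
    "SysControl.cc 2635] A power failure has been detected on a redundant power "
    "supply at +1.5_vdc1_ok; located on HPCI+ at IO15.",
]

for age, line in messages_before(log, dump_created):
    print(f"{age} before dump: {line}")
```

        For this example, the power failure message was logged 11 minutes and
        20 seconds before the dump was created.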

Keywords Section
-------------------
15K, 12K, SF15K, SF12K, starcat, dstop, Slot1 IO Data Valid phase error
                  

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
PATCH ID: 112481-06
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS:


Copyright (c) 1997-2003 Sun Microsystems, Inc.