SRDB ID | Synopsis | Date |
49205 | Sun Fire[TM] 12K/15K: Dstop: Slot1 IO Data Valid phase error | 3 Dec 2002 |
Status | Issued |
Description |
- Problem Statement/Title:
  SF15K Troubleshooting Article: Dstop: Slot1 IO Data Valid phase error (SDI)

- Symptoms:
  'wfail' output reports something similar to the following:

  01 redxl> dumpf load dsmd.dstop.021125.1132.46
  02 Created Mon Nov 25 11:32:47 2002
  03 By hpost v. 1.3 sms1.3_14 Nov 14 2002 15:19:45 executing as pid=5395
  04 On ssc name = xc15p13-sc1.SD_Lab.West.Sun.COM
  05 Domain = 2=C = b1 Platform = sun15
  06 Boards in dump: master SC CPs/CSBs[1:0]: 3
  07 EXB[17:0]: 000A0
  08 Slot0[17:0]: 000A0
  09 Slot1[17:0]: 000A0
  10 -D option, -d
  11 "DSMD DomainStop Dump"
  12 Created in a Sun Microsystems Inc. internal environment.
  13 7 errors occurred while creating this dump.
  14 redxl> wfail
  15 SDI EX05/S0 Master_Stop_Status0[31:0] = E004000F
  16 MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
  17 SDI EX05/S0 Dstop0[31:0] = 01018100
  18 Dstop0[16]: D DARB texp requests all Dstop (M)
  19 Dstop0[24]: D 1E SDI internal Slot1 port requested Dstop
  20 SDI EX05/S0 Slot1_Error0[31:0] = 4000C000 Mask = 3000FFFF
  21 S1Err0[30]: D 1E Slot1 IO Data Valid phase error (M)
  22 FAIL Slot IO5: Dstop/Rstop detected by SDI
  23 Primary service FRU is Slot IO5.
  24 Secondary service FRU is EXB EX5.
  25 SDI EX07/S0: All SDI is DStopped and RStopped, requested by DARB.
  26 DARB C0: enabled ports (expanders) [17:0]: 3FDF1
  27 DARB C0: other darb req Dstop+Rstop for exps[17:0]: 00020
  28 DARB C1: enabled ports (expanders) [17:0]: 3FDF1
  29 DARB C1: other darb req Dstop+Rstop for exps[17:0]: 00020
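The [17:0] fields in the output above (lines 07-09, 26-29) are hexadecimal bitmasks. As a reading aid, the sketch below lists the set bit positions of such a mask. It assumes that bit n corresponds to expander n; that mapping is an assumption, not stated in the dump, although it is consistent with mask 000A0 and the SDI EX05 and SDI EX07 entries reported by 'wfail'. The helper name set_bits is hypothetical and is not part of redx or SMS.

```python
def set_bits(value, width=18):
    """Return the positions of the set bits in a width-bit mask, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


# Masks copied from the dump above. Assumption: bit n == expander n.
boards_in_dump = set_bits(0x000A0)   # EXB/Slot0/Slot1 [17:0], lines 07-09
darb_enabled = set_bits(0x3FDF1)     # DARB enabled ports, lines 26 and 28
darb_stop_req = set_bits(0x00020)    # DARB Dstop+Rstop requests, lines 27 and 29

print("Boards in dump:      ", boards_in_dump)
print("DARB enabled ports:  ", darb_enabled)
print("Dstop+Rstop requests:", darb_stop_req)
```

Under that assumption, the Dstop+Rstop request mask 00020 decodes to position 5 only, which matches the EX5/IO5 FRUs named by 'wfail'.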
SOLUTION SUMMARY:
- Troubleshooting:
  The dump header tells us that this Dstop was generated by dsmd (lines 10,11) while a domain was active. This is also evident from the dump file name: dsmd.dstop files are created by dsmd as part of an ASR. Also note that errors occurred while creating the dump (line 13). This typically indicates that a component was not available while register information was being collected.

  Walking the error chain:
  - Master SDI on EX5 is directed to Dstop by itself (lines 18,19)
  - Master SDI on EX5 reports a Data Valid phase error (line 21)
  - IO5 is FAILed from the configuration (line 22)
  - IO5 and EX5 are named primary and secondary FRUs (lines 23,24)

  As part of its normal operation, the SDI compares its input clock with that of its Slot 1 board. Specifically, the SDC on the Slot 1 board produces a signal (dataid_vld_l) that is the inverse of the input clock signal. If the SDI detects that the inverse signal is no longer in phase with the input signal, an error occurs. IO5 and EX5 are named as suspect FRUs because the dataid_vld_l signal crosses an interconnect. IO5 is the one FAILed because, overall, its removal has less impact on the domain/system.

- Resolution:
  In general, a lone "Data Valid phase error" indicates that the Slot 1 board is at fault. However, if multiple components report clocking errors, further analysis is required (see SRDB 48293). In addition to the dump file, investigate the following to gather additional evidence that may point to a specific FRU:
  - Were any clock input failures reported? (Assumes SMS 1.2 with patch 112481-06, or higher)
  - Did a failover occur prior to the Dstop? When an SC becomes MAIN, it will attempt to migrate clocks to the MAIN.
  - Was the IO board powered off?
  - Was an SC powered off?

  Any of the above can account for an interruption in clock. If running an older/unpatched version of SMS, executing 'showclocksrc' (downloadable from CPRE) may also reveal bad inputs.
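When walking the error chain by hand, it can help to see which bits of an error register survive its mask. The sketch below is an illustration only: it assumes that a set bit in the Mask value suppresses the corresponding error bit. That interpretation is not stated in this article, though it agrees with line 20 (Slot1_Error0 = 4000C000, Mask = 3000FFFF) producing only the S1Err0[30] report on line 21. The function name unmasked_error_bits is hypothetical.

```python
def unmasked_error_bits(value, mask, width=32):
    """Return the set bit positions in value that are not suppressed by mask.

    Assumption: a 1 in mask means "ignore this error bit".
    """
    active = value & ~mask & ((1 << width) - 1)
    return [bit for bit in range(width) if (active >> bit) & 1]


# Register and mask values copied from line 20 of the dump.
slot1_err = unmasked_error_bits(0x4000C000, 0x3000FFFF)
print("Unmasked Slot1_Error0 bits:", slot1_err)
```

With these inputs, the only remaining bit is 30, the "Slot1 IO Data Valid phase error" that 'wfail' reports.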
- Summary of part number and patch ID's:
  http://infoserver.central.sun.com/data/syshbk/Systems/SunFire15K/component.centerplane.html
  Patch ID 112481-06

- References and bug IDs:
  SRDB 48122
  SRDB 48293
  http://pts-americas.west.sun.com/esg/hsg/starcat/tools/showclocksrc.html

- Additional background information:
  In this example, the dump header shows errors during dump creation (line 13). By trying various "show" commands in 'redx', you may determine which component(s) resulted in errors. A good starting point is the suspect parts called out by 'wfail'. For this example:

  30 redxl> shar 5 1
  31 Note: Data is displayed from the currently loaded dump file.
  32 AR IO5 (5.1) Component ID = DEAD2BBC
  33 NOTE: 3 errors in the process of creating this structure
  34 redxl> shsdc 5 1
  35 Note: Data is displayed from the currently loaded dump file.
  36 SDC IO05 Component ID = DEAD2BBC
  37 NOTE: 3 errors in the process of creating this structure
  38 redxl> shdx 5 1
  39 Note: Data is displayed from the currently loaded dump file.
  40 DXs IO5 Component ID = DEADBAD0
  41 .0 Dev_Temp[8:0] = 000: Valid 0.51 DegC
  42 .1 Dev_Temp[8:0] = 000: Valid 0.51 DegC
  43 DX asics on board with non-0 error status [1:0] = 0

  Nothing on IO5 is readable. The ASICs directly accessible via console bus (AR, SDC, DX) report either DEAD2BBC or DEADBAD0. This is an indicator of a power issue. Also, from the platform message log:

  Nov 25 11:21:27 2002 xc15p13-sc1 esmd[27845]: [2000 343712253731524 ERR SysControl.cc 2635] A power failure has been detected on a redundant power supply at +1.5_vdc1_ok; located on HPCI+ at IO15. SCHEDULE REPLACEMENT of HPCI+ at IO15 as soon as possible. If an additional failure occurs on this supply it may crash any dependent domain(s).

  For this example, the HPCI+ board was experiencing power issues. IO5 is the target FRU.

Keywords Section
-------------------
15K, 12K, SF15K, SF12K, starcat, dstop, Slot1 IO Data Valid phase error
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
PATCH ID: 112481-06
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: