SRDB ID | Synopsis | Date |
47424 | Sun Fire[TM] 12K/15K: HPOST failure, Error in LBIST signature | 23 Oct 2002 |
Status | Issued |
Description |
During the LBIST (Logic Built-in Selftest) stage of HPOST, a domain reports an LBIST signature error, as in the following examples:
Example #1
kangaroo-sc0:sms-svc:6> setkeyswitch -d a on
Powering on: CSB at CS0
Waiting on exclusive access to EXB(s): 3FFFF.
Powering on: CSB at CS1
Powering on: EXB at EX0
Powering on: HPCI at IO0
Powering on: CPU at SB0
Significant contents of .postrc (domain) /etc/opt/SUNWSMS/SMS1.2/config/A/.postrc:
allow_us3_cpus
Reading domain blacklist file /etc/opt/SUNWSMS/config/A/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading platform blacklist file /etc/opt/SUNWSMS/config/platform/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading system ASR blacklist file /etc/opt/SUNWSMS/config/asr/blacklist ...
stage lport_reset: Assert reset to IOC ports in -Q mode...
stage_lport_reset(): Not -Q mode; Skipping Stage lport_reset
stage asic_probe: ASIC probe and JTAG/CBus integrity test...
stage brd_rev_eval: Board Revision Evaluation and Compliance...
stage cpu_probe: CPU Module probe...
stage cdc_probe: CDC DIMM probe...
stage mem_probe: Memory dimm probe...
Dimm SB0/P0/B1/D0 appears missing.
Dimm SB0/P1/B1/D0 appears missing.
Dimm SB0/P2/B1/D0 appears missing.
Dimm SB0/P3/B1/D0 appears missing.
stage adapter_probe: I/O adapter probe...
stage cp_shorts: Centerplane Shorts...
found zero bit from EXB EX4 xdata_04_0_l.
stage lbist: Logic BIST...
ERR: DMX C1/D0 Error in LBIST signature. Expected 0xCCCE41E9 Got 0x63D3013A.
**** CAUTION: DMX C1/D0 has failed lbist (Logic Built-in Selftest). Since cp_shorts (Centerplane Short test) failed it is probable that this is caused by either a short to ground on a chip interface that is not used or needed in the current hardware configuration, or an error condition in one of the interfacing asics. Cp_shorts has already deconfigured the offender. Current policy is to not deconfigure this component, but to provide this caution that there is some fault present, although possibly in unused hardware.
DCDS LBIST DISABLED
DX LBIST DISABLED
DCDS LBIST DISABLED
DX LBIST DISABLED
stage ibist: Interconnect BIST...
stage field_ict: Field Interconnect Tests...
stage mbist1: Internal memory BIST...
stage mbist2: External memory BIST...
stage cbus_bbsram: Console Bus test of bootbus sram...
stage sc_interrupt: DARB to SC interrupt...
stage cdc_clear: CDC DIMM clear...
stage cpu_lpost: Test all L1 CPU boards...
Performing ASIC config with bus config a/d/r = 333...
Slot0 in domain: 00001
Slot1 in domain: 00001
EXBs in use: 00000
stage nmb_cpu_lpost: Non-Mem Board Proc tests...
Performing ASIC config with bus config a/d/r = 333...
Slot0 in domain: 00001
Slot1 in domain: 00001
EXBs in use: 00000
stage_cpu_lpost(): No NMB Boards in config. Skipping Stage nmb_cpu_lpost.
stage wib_lpost: Wildcat interface board tests...
stage_wib_lpost(): No good Wcis; Skipping Stage wib_lpost
stage pci_lpost: Test all L1 I/O boards...
stage exp_lpost: Domain-level board and system tests...
stage cpu_lpost_II: CPU L1 domain/system tests...
stage pci_lpost_Q: Init all L1 I/O boards under -Q...
stage cpu_lpost_II_Q: CPU L1 domain/system init under -Q...
stage final_config: Final configuration...
Creating CPU SRAM handoff structures...
Creating GDCD IOSRAM handoff structures in Slot IO0...
Writing domain information to PCD...
Configured in 333 with 4 procs, 8.000 GBytes, 2 IO adapters.
Interconnect frequency is 149.984 MHz, Measured.
Golden sram is on Slot IO0.
POST (level=16, verbose=20) execution time 4:35
kangaroo-sc0:sms-svc:7>
Example #2
<snipped>
stage lbist: Logic BIST...
<snipped>
ERR: SDC SB14 Error in LBIST signature. Expected 0xEEAEDD73 Got 0xC372CCFF. (416c107d)
FAIL Slot SB14: slot failure
Primary service FRU is Slot SB14.
SOLUTION SUMMARY:
Explanation:
In example #1, the DMX C1/D0 chip failed LBIST (Logic Built-in Selftest); in example #2, the failing chip is the SDC for SB14. As the caution in example #1 explains, since cp_shorts (Centerplane Short test) failed, the LBIST failure is probably caused by either a short to ground on a chip interface that is not used or needed in the current hardware configuration, or an error condition in one of the interfacing ASICs. Cp_shorts has already deconfigured the offender.
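When triaging a captured HPOST log, the failing chip and its signatures can be pulled out mechanically. The following is a minimal Python sketch, not an SMS tool: the function names are hypothetical, and the patterns are assumed from the two error lines quoted in the examples in this document, so other HPOST releases may word these messages differently.

```python
import re

# Patterns assumed from the error lines quoted in this document only.
LBIST_ERR = re.compile(
    r"ERR:\s+(?P<asic>\S+)\s+(?P<location>\S+)\s+Error in LBIST signature\.\s+"
    r"Expected\s+(?P<expected>0x[0-9A-Fa-f]+)\s+Got\s+(?P<got>0x[0-9A-Fa-f]+)"
)
FRU = re.compile(r"Primary service FRU is Slot\s+(?P<slot>\S+?)\.")


def find_lbist_errors(log_text):
    """Return one dict per LBIST signature error found in log_text."""
    return [m.groupdict() for m in LBIST_ERR.finditer(log_text)]


def find_primary_fru(log_text):
    """Return the slot named as primary service FRU, or None if absent."""
    m = FRU.search(log_text)
    return m.group("slot") if m else None


# Sample text stitched together from the error lines of both examples.
sample = (
    "stage lbist: Logic BIST... "
    "ERR: DMX C1/D0 Error in LBIST signature. Expected 0xCCCE41E9 Got 0x63D3013A. "
    "ERR: SDC SB14 Error in LBIST signature. Expected 0xEEAEDD73 Got 0xC372CCFF. (416c107d) "
    "FAIL Slot SB14: slot failure Primary service FRU is Slot SB14."
)

errors = find_lbist_errors(sample)
for err in errors:
    print(err["asic"], err["location"], "expected", err["expected"], "got", err["got"])
print("Primary FRU:", find_primary_fru(sample))
```

The patterns are searched across the whole text rather than line by line, so the sketch works whether or not the captured log has kept its original line breaks.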
To further verify that the SB is indeed bad, skip the LBIST check in the HPOST run and see whether the system board reports a failure later in HPOST testing. To do this, add the following entry to the domain's .postrc file (/etc/opt/SUNWSMS/config/<domain_id>/.postrc):
no_asic_lbist sdc
With this entry, HPOST runs but skips the LBIST tests. A failure later in HPOST would then confirm that the SB itself is bad.
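If the .postrc change is scripted rather than made by hand, it helps to avoid adding the same entry twice. The following is a minimal Python sketch under that assumption; `ensure_postrc_entry` is a hypothetical helper, not part of SMS, and the demonstration writes to a temporary file rather than to a real domain's .postrc.

```python
import os
import tempfile


def ensure_postrc_entry(postrc_path, entry="no_asic_lbist sdc"):
    """Append entry to the .postrc file unless an identical line exists.

    Returns True if the file was modified, False otherwise.
    """
    lines = []
    if os.path.exists(postrc_path):
        with open(postrc_path) as f:
            lines = f.read().splitlines()
    if any(line.strip() == entry for line in lines):
        return False
    lines.append(entry)
    with open(postrc_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return True


# Demonstration against a throwaway file that mimics the .postrc
# contents shown in example #1 (a single allow_us3_cpus line).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, ".postrc")
with open(demo_path, "w") as f:
    f.write("allow_us3_cpus\n")

first = ensure_postrc_entry(demo_path)   # adds the entry
second = ensure_postrc_entry(demo_path)  # entry already present, no change
with open(demo_path) as f:
    demo_contents = f.read()
print(demo_contents, end="")
```

On a real system, the path would be the domain's .postrc location given above, with `<domain_id>` replaced by the domain in question.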
It is also important to note that SMS 1.1 does not have SDC LBIST, so a failure of this sort should not occur in SMS 1.1. With SMS 1.2 the SDC LBIST does exist, so a failure of this type in SMS 1.2 does indicate the SB as the primary FRU.
Action:
To confirm that the SB is bad, add the entry above to the domain's .postrc file and rerun HPOST to isolate the fault further. If time does not allow further testing, replace the implicated SB as the most likely failed component.
INTERNAL SUMMARY:
SUBMITTER: Joshua Freeman
APPLIES TO: AFO Vertical Team Docs/HAS, Hardware/Sun Fire /15000
ATTACHMENTS: