SRDB ID | Synopsis | Date |
48123 | Sun Fire[TM] 12K/15K: Troubleshooting the I1 MAN Network | 29 Oct 2002 |
Status | Issued |
Description |
- Problem Statement:
  Troubleshooting the I1 MAN Network

- Symptoms:
  Link(s) on the MAN Network are either down or unstable. The following
  appears in /var/adm/messages on the domain and/or the SCs:

    SUNW,eri0 : No response from Ethernet network : Link down -- cable problem?

  Similarly, in a 'console' session, a user may see:

    using IOSRAM based Console
SOLUTION SUMMARY:
- Troubleshooting:

1. Check that the latest eri driver patches are applied to both the domain
   and the SCs.

2. Confirm that SSCPOST passed on the SCs.

     % prtconf -vp |grep ssc-post
     ssc-post-results: 'CP1500 POST Passed; SSC POST v1.15 Passed'

3. Confirm that the IP configuration of the domain's dman0 interface matches
   what the SC expects. The simplest method is the following:

     domain# ifconfig dman0
     dman0: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 2
             inet 10.1.1.3 netmask ffffffe0 broadcast 10.1.1.31
             ether 8:0:20:b8:74:d0

     domain# ndd /dev/dman man_get_hostinfo
     manc_magic = 0x4d414e43
     manc_version = 01
     manc_csum = 0x0
     manc_ip_type = AF_INET
     manc_dom_ipaddr = 10.1.1.3
     manc_dom_ip_netmask = 255.255.255.224
     manc_dom_ip_netnum = 10.1.1.0
     manc_sc_ipaddr = 10.1.1.1
     manc_dom_eaddr = 8:0:20:b8:74:d0
     manc_sc_eaddr = 8:0:20:da:26:9c
     manc_iob_bitmap = 0x4000 io boards = 14.1,
     manc_golden_iob = 14

4. Confirm that the scman driver has been programmed with appropriate path
   information. For example:

     # ndd /dev/scman man_pathgroups_report
     MAN Pathgroup report: (* == failed)
     ===============================================================================
     Interface  Destination        Active Path  Alternate Paths
     -------------------------------------------------------------------------------
     scman0     B 8:0:20:b8:74:d0  eri16        eri16 exp 14

     Interface  Destination        Active Path  Alternate Paths
     -------------------------------------------------------------------------------
     scman1     Other SSC          hme1         eri0 exp 0, hme1 exp 0

   The Alternate Paths for the domain in question should list the same
   'exp ##' as seen in the manc_iob_bitmap above.

5. Check prior POST logs for any failures of the RIO or hub on an IO Board.
   Example error messages are:

     FAIL Man Ether IOx: Error reading Man Ether Hub EXxx component ID
     FAIL RIO x.1.0: EpiRIOR1_sc_tfunc(): Test FAILED
     TSTATE_FAILED: Test ID 191: RIO Basic Tests
     TSTATE_FAILED: Test ID 192: EpiRIOR1

6.
   Test connectivity with ping, and while doing so, trace the flow of packets
   through the hardware. Using a combination of Solaris[TM] statistics and
   hub statistics, a ping can be traced by watching various counters
   increment. The flow is:

                 X                  X
                 X                  X
+--------------+ X +--------------+ X +--------------+
| Ping from SC |-->| SC hub port  |-->| icmpInEchoes |--+
| to Domain    | X | count + 1    | X | count + 1    |  |
+--------------+ X +--------------+ X +--------------+  |
                 X                  X                   |
+--------------+ X +--------------+ X +--------------+  |
|icmpInEchoReps|<--| Domain hub   |<--|icmpOutEchoRep|<-+
| count + 1    | X | count + 1    | X | count + 1    |
+--------------+ X +--------------+ X +--------------+
                 X                  X
   SC Solaris    X    IO Board      X  Domain Solaris

   To trace the packet through the hardware you'll need:

     o 1 window on the Domain
     o 2 windows on the MAIN System Controller
     o The showhubstats utility

7. In the Domain window, shut down the dman0 interface and plumb up the
   specific eri (likely the active MAN interface) with an unused IP address.
   For example:

     domain# ifconfig dman0 down
     domain# ifconfig eriX plumb
     domain# ifconfig eriX inet 200.200.200.1 up

   Then, begin monitoring the ICMP statistics of the system.

     domain# while 1
     ? netstat -s -P icmp | grep Echo      (or icmp6 if IPv6)
     ? sleep 3
     ? end

8. In a window on the SC, plumb up the eri corresponding to that of the
   domain with an unused IP address and start monitoring hub statistics.

     sc# ifconfig eriX plumb
     sc# ifconfig eriX inet 200.200.200.2 up
     sc% showhubstats -d X -b IOx 3

9. In a second window on the SC, issue a ping to the IP address plumbed for
   the domain.

     sc% ping 200.200.200.1

   From the counters, determine whether there is a breakdown in the
   communication between the SC and the domain. Additionally, the kstats for
   the given interfaces can be examined for excessive errors.

- Resolution:

Software causes:
----------------

o Any missing/downrev eri driver patches should be addressed to eliminate
  known instability in the I1 MAN Network.
o If the network information delivered to the domain is incorrect (i.e. ndd
  man_get_hostinfo), re-run 'smsconfig -m' on the SC to correct the
  configuration.

o If the dman0 interface is plumbed with an inappropriate interface, correct
  the /etc/hostname.dman0 file on the domain.

Hardware causes:
----------------

While theory dictates that any component between the two network interfaces
may be at fault, experience has shown that the faulty component is either the
System Controller or the IO Board.

If a particular RIO or hub shows a history of POST failures, suspect a
hardware issue with that IO Board. Also, correlate the failure to the MAN
path that was active at the time of the network instability.

In the cases where breakdowns in packet flow are evident, the items listed
below give the more likely of the two end points (SC / IO Board) for a given
counter failure. This does not exclude the other component from suspicion.

o If the SC's hub port does not increment the packet count, check the kstats
  for the SC's eri interface. If errors are incrementing, the problem is
  likely the SC.

o If the domain's icmpInEchoes fails to increment, the IO Board is the likely
  problem.

o If the domain's icmpOutEchoReps fails to increment, this indicates an issue
  internal to Solaris. No hardware can be conclusively blamed for such an
  error.

o If the domain's hub port does not increment, the IO Board is the likely
  problem.

o If the SC's icmpInEchoReps does not increment, the likely problem is the
  System Controller.

If all counters do increment, but problems are still experienced on the I1
Network, examine the showhubstats information more closely.

o If the domain's 'Ierrs' is high, the IO Board is the likely problem. This
  indicates that Solaris is receiving errors from the network. The hub should
  discard errored Ethernet frames prior to transmitting them to the domain.
  This implies some problem or interference between the hub and the RIO. Both
  are on the IO Board.
o If the domain's hub port 'parts' is high, the IO Board is the likely
  problem. The hub partitions a port when it receives too many errors. This
  implies some problem or interference between the RIO and the hub. Both are
  on the IO Board.

o If the SC's 'Ierrs' and 'parts' are high, the most likely fault is with the
  System Controller.

To return the system to its original state, tear down the individual eri
links and plumb up dman0 on the domain.

  sc# ifconfig eriX down; ifconfig eriX unplumb
  domain# ifconfig eriX down; ifconfig eriX unplumb
  domain# ifconfig dman0 up

- Summary of part number and patch ID's

  110723-xx SunOS 5.8: /kernel/drv/sparcv9/eri patch
  109882-xx SunOS 5.8: eri header files patch

- References and bug IDs

  http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/showhubstats.html
  http://cpre-amer.west.sun.com/esg/hsg/starcat/binaries/man.pdf
  SunSolve Article 48120
  SunSolve Article 48136

- Additional background information:

  The hub on the IO Board is wired as follows:

    Port
    +---+
    | 0 |-------> to SC0
    +---+
    | 1 |-------> to SC1
    +---+
    | 2 |-------> to IO Board RIO
    +---+
    | 3 |--| unused
    +---+
    | 4 |--| unused
    +---+

  Only one SC port is active at any time. This is always the port routing to
  the MAIN SC.

  Hubs are only present on hsPCI and wPCI boards. MaxCPU boards do not have
  hubs.

- Meta-Data/Problem categorization:

  Product/Platform: SF12K/SF15K
  Category:

- Keywords

  15K, 12K, SF15K, SF12K, starcat, MAN, I1
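
- Illustrative sketch for troubleshooting steps 3 and 4:

  The comparisons in steps 3 and 4 involve two small conversions: ifconfig
  prints the netmask in hex (ffffffe0) while man_get_hostinfo prints it in
  dotted decimal (255.255.255.224), and manc_iob_bitmap (0x4000) has to be
  matched against the 'exp ##' values in the pathgroups report. The following
  is a minimal sketch of both conversions, not a Sun-supplied tool. The
  function names are hypothetical. The mapping "bit N set means expander N" is
  an assumption inferred only from the sample output above (0x4000 alongside
  "io boards = 14.1"). It also assumes a shell with POSIX arithmetic and hex
  constants (for example bash), which the legacy Bourne /bin/sh does not
  provide.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Solaris or SMS.

# Convert a hex netmask as printed by ifconfig (e.g. ffffffe0)
# into dotted decimal for comparison with manc_dom_ip_netmask.
hex_netmask_to_dotted() {
    mask=$(( 0x$1 ))
    echo "$(( (mask >> 24) & 255 )).$(( (mask >> 16) & 255 )).$(( (mask >> 8) & 255 )).$(( mask & 255 ))"
}

# Print the position of every bit set in a bitmap, one per line.
# ASSUMPTION: bit N of manc_iob_bitmap corresponds to expander N,
# inferred from the sample output (0x4000 -> io board 14).
decode_iob_bitmap() {
    bitmap=$(( $1 ))
    bit=0
    while [ "$bitmap" -gt 0 ]; do
        if [ $(( bitmap & 1 )) -eq 1 ]; then
            echo "$bit"
        fi
        bitmap=$(( bitmap >> 1 ))
        bit=$(( bit + 1 ))
    done
}

# Values taken from the example output in steps 3 and 4.
hex_netmask_to_dotted ffffffe0    # prints 255.255.255.224
decode_iob_bitmap 0x4000          # prints 14
```

  With the sample values, the first result matches manc_dom_ip_netmask and the
  second matches the 'exp 14' listed under Alternate Paths for scman0, which
  is the agreement steps 3 and 4 look for.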
INTERNAL SUMMARY:
SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: