Document intsrdb/48123

SRDB ID		Synopsis		Date
48123		Sun Fire[TM] 12K/15K: Troubleshooting the I1 MAN Network		29 Oct 2002

Status

Issued

Description

- Problem Statement:

Troubleshooting The I1 MAN Network

- Symptoms:

Link(s) on the MAN Network are either down or unstable. The following appears in /var/adm/messages on the domain and/or the SCs:
SUNW,eri0 : No response from Ethernet network : Link down -- cable problem?
Similarly, in a 'console' session, a user may see:
using IOSRAM based Console

SOLUTION SUMMARY:

- Troubleshooting:

    1. Check that the latest eri driver patches are applied to both the
       domain and the SCs.
    2. Confirm that SSCPOST passed on the SCs.

       % prtconf -vp |grep ssc-post
       ssc-post-results: 'CP1500 POST Passed; SSC POST v1.15 Passed'

    3. Confirm that the IP configuration of the domain's dman0 interface
       matches what the SC expects. The simplest method is the following:

       domain# ifconfig dman0
       dman0: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 2
         inet 10.1.1.3 netmask ffffffe0 broadcast 10.1.1.31
         ether 8:0:20:b8:74:d0 

       domain# ndd /dev/dman man_get_hostinfo
       manc_magic = 0x4d414e43
       manc_version = 01
       manc_csum = 0x0
       manc_ip_type = AF_INET
       manc_dom_ipaddr = 10.1.1.3
       manc_dom_ip_netmask = 255.255.255.224
       manc_dom_ip_netnum = 10.1.1.0
       manc_sc_ipaddr = 10.1.1.1
       manc_dom_eaddr = 8:0:20:b8:74:d0
       manc_sc_eaddr = 8:0:20:da:26:9c
       manc_iob_bitmap = 0x4000        io boards = 14.1, 
       manc_golden_iob = 14

    4. Confirm that the scman driver has been programmed with appropriate 
       path information. For example:

       # ndd /dev/scman man_pathgroups_report
       MAN Pathgroup report: (* == failed)
       ===============================================================================
       Interface       Destination             Active Path     Alternate Paths
       -------------------------------------------------------------------------------
       scman0          B 8:0:20:b8:74:d0       eri16           eri16 exp 14

       Interface       Destination             Active Path     Alternate Paths
       -------------------------------------------------------------------------------
       scman1          Other SSC               hme1            eri0 exp 0, hme1 exp 0

       The Alternate Paths for the domain in question should list the
       same 'exp ##' as seen in the manc_iob_bitmap above.

    5. Check prior POST logs for any failures of the RIO or hub on
       an IO Board. Example error messages are:

         FAIL Man Ether IOx: Error reading Man Ether Hub EXxx component ID

         FAIL RIO x.1.0: EpiRIOR1_sc_tfunc(): Test FAILED
          TSTATE_FAILED:  Test ID 191: RIO Basic Tests
          TSTATE_FAILED:  Test ID 192: EpiRIOR1

    6. Test connectivity with ping, but while doing so, trace the flow
       of packets through the hardware. Using a combination of Solaris[TM]
       statistics and hub statistics, a ping is traceable by watching
       the increments of various counters. The flow is:

                           X                  X
                           X                  X
          +--------------+ X +--------------+ X +--------------+
          | Ping from SC |-->| SC hub port  |-->| icmpInEchoes |--+
          |  to Domain   | X |  count + 1   | X |  count + 1   |  |
          +--------------+ X +--------------+ X +--------------+  |
                           X                  X                   |
          +--------------+ X +--------------+ X +--------------+  |
          |icmpInEchoReps|<--|  Domain hub  |<--|icmpOutEchoRep|<-+
          |   count + 1  | X |  count + 1   | X |  count + 1   |
          +--------------+ X +--------------+ X +--------------+
                           X                  X
             SC Solaris    X     IO Board     X  Domain Solaris

       To trace the packet through the hardware you'll need:
          o 1 window on the Domain
          o 2 windows on the MAIN System Controller
          o The showhubstats utility

    7. In the Domain window, shut down the dman0 interface and plumb
       up the specific eri (likely the active MAN interface) with an
       unused IP address. For example:

          domain# ifconfig dman0 down
          domain# ifconfig eriX plumb
          domain# ifconfig eriX inet 200.200.200.1 up

       Then, begin monitoring the ICMP statistics of the system.

          domain# while 1
          ? netstat -s -P icmp | grep Echo     (or icmp6 if IPv6)
          ? sleep 3
          ? end

    8. In a window on the SC, plumb up the eri corresponding to that of
       the domain with an unused IP address and start monitoring hub 
       statistics.

          sc# ifconfig eriX plumb
          sc# ifconfig eriX inet 200.200.200.2 up
          sc% showhubstats -d X -b IOx 3

    9. In a second window on the SC, issue a ping to the IP address
       plumbed for the domain.

          sc% ping 200.200.200.1

    From the counters, determine if there is a breakdown in the
    communication between the SC and domain. Additionally, the
    kstats for the given interfaces can be examined for excessive
    errors.
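
    The counter watching in steps 7 through 9 amounts to comparing two
    snapshots of the same statistics. The following is a minimal sketch of
    that comparison, not a supported tool. It assumes a POSIX shell and a
    POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris), and it assumes the
    statistics are printed as "name = value" pairs. The function names
    get_counter and counter_delta are hypothetical, and the counter name
    should be copied from the actual netstat output.

```shell
# Print the value of one counter from "name = value" statistics text on stdin.
# Assumption: pairs look like "icmpInEchos = 4" or "icmpInEchos =4",
# possibly several pairs per line.
get_counter() {
    sed 's/=/ = /g' | tr -s ' \t' '\n\n' | awk -v name="$1" '
        p2 == name && p1 == "=" { print; exit }
        { p2 = p1; p1 = $0 }'
}

# Print how much a counter grew between two snapshots.
# $1 = counter name, $2 = earlier snapshot text, $3 = later snapshot text
counter_delta() {
    before=$(printf '%s\n' "$2" | get_counter "$1")
    after=$(printf '%s\n' "$3" | get_counter "$1")
    if [ -z "$before" ] || [ -z "$after" ]; then
        echo "unknown"      # counter name not present in a snapshot
        return 1
    fi
    echo $(( $after - $before ))
}

# Example use on the domain (snapshots taken around the ping from the SC):
#   s1=$(netstat -s -P icmp)
#   ... ping is issued from the SC ...
#   s2=$(netstat -s -P icmp)
#   counter_delta icmpInEchoReps "$s1" "$s2"
```

    A delta of 0 for a counter that should have moved marks the stage of
    the flow diagram where the packet was lost.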

- Resolution:

    Software causes:
    ----------------
    o Any missing/downrev eri driver patches should be addressed to
      eliminate known instability in the I1 MAN Network.
    o If the network information delivered to the domain is incorrect
      (as reported by ndd man_get_hostinfo), re-run 'smsconfig -m' on the SC
      to correct the configuration.
    o If the dman0 interface is plumbed with an inappropriate IP address,
      correct the /etc/hostname.dman0 file on the domain.
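
    Comparing the ifconfig and ndd output from step 3 by eye is error
    prone because the netmask appears in hex in one and dotted-quad form in
    the other, and step 4 relies on reading an expander number out of
    manc_iob_bitmap. The sketch below handles both conversions. It assumes
    a POSIX shell whose arithmetic accepts 0x constants. The mapping of bit
    n to expander n is an assumption inferred from the sample output above
    (0x4000 with io boards = 14.1), and all function names are hypothetical.

```shell
# Convert an ifconfig-style hex netmask (e.g. ffffffe0) to dotted-quad form.
hexmask_to_dotted() {
    m=$((0x$1))
    echo "$(( ($m >> 24) & 255 )).$(( ($m >> 16) & 255 )).$(( ($m >> 8) & 255 )).$(( $m & 255 ))"
}

# Compare values copied by hand from the two commands.
# $1 = ifconfig inet address, $2 = ifconfig hex netmask,
# $3 = manc_dom_ipaddr,       $4 = manc_dom_ip_netmask
check_dman0() {
    if [ "$1" = "$3" ] && [ "$(hexmask_to_dotted "$2")" = "$4" ]; then
        echo "match"
    else
        echo "MISMATCH"
    fi
}

# List the bit positions set in manc_iob_bitmap.
# Assumption: bit n corresponds to expander n.
bitmap_to_expanders() {
    bits=$(($1)); n=0; out=""
    while [ "$bits" -gt 0 ]; do
        if [ $(( $bits & 1 )) -eq 1 ]; then out="$out${out:+ }$n"; fi
        bits=$(( $bits >> 1 )); n=$(( $n + 1 ))
    done
    echo "$out"
}
```

    With the sample values from step 3, check_dman0 10.1.1.3 ffffffe0
    10.1.1.3 255.255.255.224 reports a match, and bitmap_to_expanders
    0x4000 prints 14, the expander that the Alternate Paths column in
    step 4 is expected to show.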

    Hardware causes:
    ----------------
    While theory dictates that any component between the
    two network interfaces may be at fault, experience has shown that
    the faulty component is either the System Controller or the IO
    Board. 

    If a particular RIO or hub shows a history of POST failures, suspect
    a hardware issue with that IO Board. Also, correlate the failure to
    the MAN path that is active at the time of the network instability.
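
    Building that history means picking the RIO and hub failures out of
    saved POST output. The filter below is a rough sketch that matches only
    the two message forms quoted in step 5. The function names are
    hypothetical, the location of the POST logs is not stated here and must
    be supplied by the reader, and grep -E is assumed to be available (on
    Solaris 8 that may mean /usr/xpg4/bin/grep or egrep).

```shell
# Print only MAN hub and RIO failure lines from POST log text on stdin.
post_man_failures() {
    grep -E 'FAIL (Man Ether|RIO) '
}

# Count those failure lines, e.g. to compare one IO Board's logs over time.
post_man_failure_count() {
    post_man_failures | wc -l | tr -d ' '
}

# Example use, with the log file chosen by the reader:
#   post_man_failures < /path/to/saved/post.log
```

    Lines that merely contain TSTATE_FAILED are not matched, since they
    follow a FAIL line that is already reported.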
    

    Where a breakdown in packet flow is evident, the items below name
    the more likely of the two end points (SC or IO Board) for a given
    counter failure. They do not exclude the other component from
    suspicion.

       o If the SC's hub port does not increment the packet count, check
         the kstats for the SC's eri interface. If errors are incrementing,
         the problem is likely the SC. 
       o If the domain's icmpInEchoes fails to increment, the IO Board is
         the likely problem.
       o If the domain's icmpOutEchoReps fails to increment, this indicates
         an issue internal to Solaris. No hardware can be conclusively 
         blamed for such an error.
       o If the domain's hub port does not increment, the IO Board is the
         likely problem.
       o If the SC's icmpInEchoReps does not increment, the likely problem
         is the System Controller.
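
    The list above is effectively a lookup from the first stage that
    fails to increment to the more likely suspect. It can be written down
    as follows. This is only a restatement of the list as a sketch: the
    function name and the stage labels (sc_hub, dom_in_echo, and so on)
    are hypothetical and exist only for this example.

```shell
# Map the first non-incrementing stage to the more likely suspect.
likely_suspect() {
    case "$1" in
        sc_hub)          echo "System Controller, if SC eri kstat errors are incrementing" ;;
        dom_in_echo)     echo "IO Board" ;;
        dom_out_echorep) echo "Solaris (no hardware conclusively implicated)" ;;
        dom_hub)         echo "IO Board" ;;
        sc_in_echorep)   echo "System Controller" ;;
        *)               echo "unknown stage"; return 1 ;;
    esac
}
```

    For example, likely_suspect dom_hub prints "IO Board". As with the
    list itself, the answer is the more likely end point and does not
    clear the other one.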

    If all counters do increment, but problems are still experienced on 
    the I1 Network, examine the showhubstats information more closely.
    
       o If the domain's 'Ierrs' is high, the IO Board is the likely 
         problem.  This indicates that Solaris is receiving errors from 
         the network. The hub should discard errored Ethernet frames
         prior to transmitting them to the domain. This implies some 
         problem or interference between the hub and the RIO. Both are 
         on the IO board.
       o If the domain's hub port 'parts' is high, the IO Board is the
         likely problem. The hub partitions a port when it receives too 
         many errors. This implies some problem or interference between 
         the RIO and the hub. Both are on the IO board.
       o If the SC's 'Ierrs' and 'parts' are high, the most likely 
         fault is with the System Controller.

    To return the system to its original state, tear down the individual
    eri links and plumb up dman0 on the domain.

       sc# ifconfig eriX down; ifconfig eriX unplumb
       domain# ifconfig eriX down; ifconfig eriX unplumb
       domain# ifconfig dman0 up

- Summary of part numbers and patch IDs

    110723-xx SunOS 5.8: /kernel/drv/sparcv9/eri patch
    109882-xx SunOS 5.8: eri header files patch

- References and bug IDs

    http://cpre-amer.west.sun.com/esg/hsg/starcat/tools/showhubstats.html
    http://cpre-amer.west.sun.com/esg/hsg/starcat/binaries/man.pdf
    SunSolve Article 48120
    SunSolve Article 48136

- Additional background information:

    The hub on the IO Board is wired as follows:

        Port
       +---+
       | 0 |-------> to SC0
       +---+
       | 1 |-------> to SC1
       +---+
       | 2 |-------> to IO Board RIO
       +---+
       | 3 |--| unused
       +---+
       | 4 |--| unused
       +---+

    Only one SC port is active at any time. This is always the
    port routing to the MAIN SC. Hubs are only present on hsPCI and
    wPCI boards. MaxCPU boards do not have hubs.

- Meta-Data/Problem categorization:
Product/Platform: SF12K/SF15K
Category:

- Keywords

15K, 12K, SF15K, SF12K, starcat, MAN, I1

INTERNAL SUMMARY:

SUBMITTER: Scott Davenport
APPLIES TO: Hardware/Sun Fire /15000, Hardware/Sun Fire /12000
ATTACHMENTS: