[rancid] Re: Rancid/Expect failing on FreeBSD/SMP systems

Thu Jan 10 15:04:02 UTC 2008

Hi Lars,

You should use expect-devel port to avoid hunging on pty

have a look at
http://www.freebsd.org/cgi/query-pr.cgi?pr=118452

Janos Mohacsi
Network Engineer, Research Associate, Head of Network Planning and Projects
NIIF/HUNGARNET, HUNGARY
Key 70EF9882: DEC2 C685 1ED4 C95A 145F  4300 6F64 7B00 70EF 9882

On Thu, 10 Jan 2008, Lars Erik Gullerud wrote:

> My apologies for posting this both to the Rancid list and FreeBSD-STABLE,
> however I am not sure where to start troubleshooting this issue - I am
> suspecting it is a FreeBSD issue, but I am thinking we are probably not
> the only shop running RANCID (ports/net-mgmt/rancid) on FreeBSD (since it
> is quite popular in ISP environments), so hopefully someone can look at
> it from the RANCID angle and give some helpful input on how to
> troubleshoot this further.
>
> The problem: After finally giving in and starting to phase out some of our
> oldest FreeBSD 4.11 servers and replace them with FreeBSD 6.x on some
> fresh hardware, I got around to start moving our RANCID server. This
> however, has been the start of a real nightmare. I don't think the
> problems I am seeing are in RANCID itself, however it can be reliable
> reproduced every time i run RANCID and I have not been able to reproduce
> it in any other way with pure expect test-cases directly.
>
> What happens:
>
> Expect processes "hang" during RANCID runs, and go into infinite loops
> eating 100% CPU (on one CPU core). The problem is reliably reproduced
> everytime we do a full rancid-run, but the actual device it chokes on
> varies between runs so it is not device-related. It does seem to happen
> most often when collecting Juniper M-series gear with large configurations
> though, using jrancid and ssh.
>
> We can NOT seem to reproduce it by running jrancid (or any other) on a
> single device at at time - which is somewhat confusing at is DOES happen
> when setting PAR_COUNT to 1 and doing a rancid-run (which
> should IMHO be pretty much the same as doing sequential single device
> runs...)
>
> Our environment:
>
> We run RANCID extensively to collect a few hundred devices, including
> Cisco, Cisco-CatOS, Juniper, Extreme, Extreme-XOS, Riverstone,
> FortiNet/FortiGate, etc. We want to start storing CPE configs in addition
> to our own core gear in RANCID now, which means we will be putting several
> thousand routers into RANCID, which also explains the need for fresher
> hardware...
>
> RANCID version does not seem to matter, I have tested with both some
> ancient 2.3.0 scripts and 2.3.2a7, same behaviour.
>
> Using the same RANCID instance (I have tarballed it up and installed it on
> a bunch of servers, i.e. using the same CVS and the same router.db files
> etc.), it fails on:
>
> FreeBSD 7.0-BETA4, amd64, SMP kernel, 8 x CPU cores (2 x quad Xeon 5335)
> FreeBSD 6.2-STABLE, i386, SMP kernel, 2 x CPU cores (2 x single-core Xeon)
>
> Both have perl-5.8.8_1, expect 5.43.0_3 and tcl-8.4.16,1 built from ports.
>
> It however seems to work fine on:
>
> Linux CentOS 4.5 x86-64, 4 x CPU cores (2 x dual Xeon 5130)
> FreeBSD 4.11 i386, UP kernel, 1 x CPU core (1 x single-core Xeon)
> FreeBSD 7.0-RC1, i386, UP kernel, 1 x CPU core (1 x P4)
>
> (Linux box has Expect 5.42 and Tcl 8.3...)
>
> So it only seems to be on newer FreeBSD with SMP. (If anyone have RANCID
> working okay on FreeBSD 6.x/7.x on SMP systems at all, please let me
> know...)
>
> Now, for some details, if anyone has any ideas. What is actually
> happening is this, when truss'ing the stuck Expect-process:
>
> fcntl(4,F_GETFL,)                   = 0 (0x0)
> fcntl(4,F_SETFL,0x0)                ERR#25 'Inappropriate ioctl for device'
> fcntl(4,F_GETFL,)                   = 0 (0x0)
> fcntl(4,F_SETFL,0x0)                ERR#25 'Inappropriate ioctl for device'
> <looping ad nauseum>
>
> So, which device is it trying to fcntl, and what is it trying to do? lsof
> shows the following:
>
> expect     1417 rancid  cwd   VDIR      0,86     2048 7607662 /local/rancid/var/core/configs
> expect     1417 rancid  rtd   VDIR      0,81      512       2 /
> expect     1417 rancid    0r  VCHR      0,24      0t0      24 /dev/null
> expect     1417 rancid    2r  VCHR      0,24      0t0      24 /dev/null
> expect     1417 rancid    3r  VCHR      0,24      0t0      24 /dev/null
> expect     1417 rancid    4r  VCHR      0,24      0t0      24 /dev/null
>
> file descriptor 4 is /dev/null. Why is it trying to F_SETFL /dev/null to
> BLOCKING mode (which is failing)? Why should it be playing with /dev/null
> at all? Well, digging a little, this is what the lsof output looked like
> 10 seconds earlier:
>
> expect     1417 rancid  cwd   VDIR               0,86     2048 7607662 /local/rancid/var/core/configs
> expect     1417 rancid  rtd   VDIR               0,81      512       2 /
> expect     1417 rancid    0r  VCHR               0,24      0t0      24 /dev/null
> expect     1417 rancid    1u  PIPE          0x38bfcf8        0         ->0xffffff00038bfba0
> expect     1417 rancid    2w  VREG               0,86       76 7583772 /local (/dev/mfid0s1f)
> expect     1417 rancid    3u  VCHR              0,108      0t0     108 /dev/ttyp2
> expect     1417 rancid    4u  VCHR              0,117     0t45     117 /dev/ptyp7
> ssh        1418 rancid  cwd   VDIR               0,86     2048 7607662 /local/rancid/var/core/configs
> ssh        1418 rancid  rtd   VDIR               0,81      512       2 /
> ssh        1418 rancid  txt                                            unknown file system type:  8\xb9^_^B\xff\xff\xff^Xb\xab)^B\xff\xff\xffE
> ssh        1418 rancid    0u  VCHR              0,118      0t0     118 /dev/ttyp7
> ssh        1418 rancid    1u  VCHR              0,118      0t0     118 /dev/ttyp7
> ssh        1418 rancid    2u  VCHR              0,118      0t0     118 /dev/ttyp7
> ssh        1418 rancid    3w  VREG               0,86       76 7583772 /local (/dev/mfid0s1f)
> ssh        1418 rancid    4u  IPv4 0xffffff008c030240      0t0     TCP *:27776->*:49323
> ssh        1418 rancid    5u  VCHR              0,118     0t45     118 /dev/ttyp7
>
> Here, fd 4 is actually a pty (pty7), which seems to be a fork to PID 1418,
> the ssh session to the router, and everything seems to be normal.
>
> PID 1418 is no longer there on the most recent lsof, so 1418 seems to
> have died(?) and PID 1417 now has /dev/null on its file descriptor 4. I
> don't know why that is, but why is it trying to fcntl it to Blocking I/O
> mode? Here is a gdb attach to the PID and a backtrace:
>
> (gdb) bt
> #0  0x0000000800aefc9c in fcntl () from /lib/libc.so.7
> #1  0x00000000004072c5 in ?? ()
> #2  0x00000008006a8c18 in StackSetBlockMode ()
>    from /usr/local/lib/libtcl84.so.1
> #3  0x00000008006a8c54 in SetBlockMode () from
> /usr/local/lib/libtcl84.so.1
> #4  0x00000008006acf75 in Tcl_SetChannelOption ()
>    from /usr/local/lib/libtcl84.so.1
> #5  0x00000008006aeda0 in TclFinalizeIOSubsystem ()
>    from /usr/local/lib/libtcl84.so.1
> #6  0x0000000800697f74 in Tcl_FinalizeThread ()
>    from /usr/local/lib/libtcl84.so.1
> #7  0x0000000800698081 in Tcl_Finalize () from
> /usr/local/lib/libtcl84.so.1
> #8  0x000000080069833a in Tcl_Exit () from /usr/local/lib/libtcl84.so.1
> #9  0x0000000000409610 in ?? ()
> #10 0x00000008006742be in TclInvokeStringCommand ()
>    from /usr/local/lib/libtcl84.so.1
> #11 0x0000000800675944 in TclEvalObjvInternal ()
>    from /usr/local/lib/libtcl84.so.1
> #12 0x0000000800675dff in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
> #13 0x00000008006b55d9 in Tcl_FSEvalFile () from
> /usr/local/lib/libtcl84.so.1
> #14 0x00000008006b5690 in Tcl_EvalFile () from
> /usr/local/lib/libtcl84.so.1
> #15 0x0000000000404f58 in ?? ()
> #16 0x0000000000404d47 in ?? ()
>
>> From the functions it is running in Tcl, it seems it is Tcl's cleanup
> code that is failing, when it is trying to restore a Tcl "channel" to
> normal mode during an exit event.
>
> This is where my clue runs out, and I am at a loss as to how to proceed
> from here. I have tried digging in both Tcl and Expect source code to see
> if can catch anything obvious, but alas, this is somewhat outside my area
> of expertise (I am a networking guy, not a programmer)...
>
> Any suggestions on how to proceed to find and fix this issue would be
> welcome, as the only other option for us is to abandon FreeBSD and go with
> Linux on the server, and we have already replaced too many FreeBSD boxes
> with Linux for my liking, I don't want to see yet another one go...
>
> Regards,
> Lars Erik Gullerud
> _______________________________________________
> Rancid-discuss mailing list
> Rancid-discuss at shrubbery.net
> http://www.shrubbery.net/mailman/listinfo.cgi/rancid-discuss
>