[rancid] Rancid/Expect failing on FreeBSD/SMP systems

Thu Jan 10 13:26:40 UTC 2008

My apologies for posting this both to the Rancid list and FreeBSD-STABLE, 
however I am not sure where to start troubleshooting this issue - I am 
suspecting it is a FreeBSD issue, but I am thinking we are probably not 
the only shop running RANCID (ports/net-mgmt/rancid) on FreeBSD (since it 
is quite popular in ISP environments), so hopefully someone can look at 
it from the RANCID angle and give some helpful input on how to 
troubleshoot this further.

The problem: After finally giving in and starting to phase out some of our 
oldest FreeBSD 4.11 servers and replace them with FreeBSD 6.x on some 
fresh hardware, I got around to start moving our RANCID server. This 
however, has been the start of a real nightmare. I don't think the 
problems I am seeing are in RANCID itself, however it can be reliable 
reproduced every time i run RANCID and I have not been able to reproduce 
it in any other way with pure expect test-cases directly.

What happens:

Expect processes "hang" during RANCID runs, and go into infinite loops 
eating 100% CPU (on one CPU core). The problem is reliably reproduced 
everytime we do a full rancid-run, but the actual device it chokes on 
varies between runs so it is not device-related. It does seem to happen 
most often when collecting Juniper M-series gear with large configurations 
though, using jrancid and ssh.

We can NOT seem to reproduce it by running jrancid (or any other) on a 
single device at at time - which is somewhat confusing at is DOES happen 
when setting PAR_COUNT to 1 and doing a rancid-run (which 
should IMHO be pretty much the same as doing sequential single device 
runs...)

Our environment:

We run RANCID extensively to collect a few hundred devices, including 
Cisco, Cisco-CatOS, Juniper, Extreme, Extreme-XOS, Riverstone, 
FortiNet/FortiGate, etc. We want to start storing CPE configs in addition 
to our own core gear in RANCID now, which means we will be putting several 
thousand routers into RANCID, which also explains the need for fresher 
hardware...

RANCID version does not seem to matter, I have tested with both some 
ancient 2.3.0 scripts and 2.3.2a7, same behaviour.

Using the same RANCID instance (I have tarballed it up and installed it on 
a bunch of servers, i.e. using the same CVS and the same router.db files 
etc.), it fails on:

FreeBSD 7.0-BETA4, amd64, SMP kernel, 8 x CPU cores (2 x quad Xeon 5335)
FreeBSD 6.2-STABLE, i386, SMP kernel, 2 x CPU cores (2 x single-core Xeon)

Both have perl-5.8.8_1, expect 5.43.0_3 and tcl-8.4.16,1 built from ports.

It however seems to work fine on:

Linux CentOS 4.5 x86-64, 4 x CPU cores (2 x dual Xeon 5130)
FreeBSD 4.11 i386, UP kernel, 1 x CPU core (1 x single-core Xeon)
FreeBSD 7.0-RC1, i386, UP kernel, 1 x CPU core (1 x P4)

(Linux box has Expect 5.42 and Tcl 8.3...)

So it only seems to be on newer FreeBSD with SMP. (If anyone have RANCID 
working okay on FreeBSD 6.x/7.x on SMP systems at all, please let me 
know...)

Now, for some details, if anyone has any ideas. What is actually 
happening is this, when truss'ing the stuck Expect-process:

fcntl(4,F_GETFL,)                   = 0 (0x0)
fcntl(4,F_SETFL,0x0)                ERR#25 'Inappropriate ioctl for device'
fcntl(4,F_GETFL,)                   = 0 (0x0)
fcntl(4,F_SETFL,0x0)                ERR#25 'Inappropriate ioctl for device'
<looping ad nauseum>

So, which device is it trying to fcntl, and what is it trying to do? lsof 
shows the following:

expect     1417 rancid  cwd   VDIR      0,86     2048 7607662 /local/rancid/var/core/configs
expect     1417 rancid  rtd   VDIR      0,81      512       2 /
expect     1417 rancid    0r  VCHR      0,24      0t0      24 /dev/null
expect     1417 rancid    2r  VCHR      0,24      0t0      24 /dev/null
expect     1417 rancid    3r  VCHR      0,24      0t0      24 /dev/null
expect     1417 rancid    4r  VCHR      0,24      0t0      24 /dev/null

file descriptor 4 is /dev/null. Why is it trying to F_SETFL /dev/null to 
BLOCKING mode (which is failing)? Why should it be playing with /dev/null 
at all? Well, digging a little, this is what the lsof output looked like 
10 seconds earlier:

expect     1417 rancid  cwd   VDIR               0,86     2048 7607662 /local/rancid/var/core/configs
expect     1417 rancid  rtd   VDIR               0,81      512       2 /
expect     1417 rancid    0r  VCHR               0,24      0t0      24 /dev/null
expect     1417 rancid    1u  PIPE          0x38bfcf8        0         ->0xffffff00038bfba0
expect     1417 rancid    2w  VREG               0,86       76 7583772 /local (/dev/mfid0s1f)
expect     1417 rancid    3u  VCHR              0,108      0t0     108 /dev/ttyp2
expect     1417 rancid    4u  VCHR              0,117     0t45     117 /dev/ptyp7
ssh        1418 rancid  cwd   VDIR               0,86     2048 7607662 /local/rancid/var/core/configs
ssh        1418 rancid  rtd   VDIR               0,81      512       2 /
ssh        1418 rancid  txt                                            unknown file system type:  8\xb9^_^B\xff\xff\xff^Xb\xab)^B\xff\xff\xffE
ssh        1418 rancid    0u  VCHR              0,118      0t0     118 /dev/ttyp7
ssh        1418 rancid    1u  VCHR              0,118      0t0     118 /dev/ttyp7
ssh        1418 rancid    2u  VCHR              0,118      0t0     118 /dev/ttyp7
ssh        1418 rancid    3w  VREG               0,86       76 7583772 /local (/dev/mfid0s1f)
ssh        1418 rancid    4u  IPv4 0xffffff008c030240      0t0     TCP *:27776->*:49323
ssh        1418 rancid    5u  VCHR              0,118     0t45     118 /dev/ttyp7

Here, fd 4 is actually a pty (pty7), which seems to be a fork to PID 1418, 
the ssh session to the router, and everything seems to be normal.

PID 1418 is no longer there on the most recent lsof, so 1418 seems to 
have died(?) and PID 1417 now has /dev/null on its file descriptor 4. I 
don't know why that is, but why is it trying to fcntl it to Blocking I/O 
mode? Here is a gdb attach to the PID and a backtrace:

(gdb) bt
#0  0x0000000800aefc9c in fcntl () from /lib/libc.so.7
#1  0x00000000004072c5 in ?? ()
#2  0x00000008006a8c18 in StackSetBlockMode ()
    from /usr/local/lib/libtcl84.so.1
#3  0x00000008006a8c54 in SetBlockMode () from 
/usr/local/lib/libtcl84.so.1
#4  0x00000008006acf75 in Tcl_SetChannelOption ()
    from /usr/local/lib/libtcl84.so.1
#5  0x00000008006aeda0 in TclFinalizeIOSubsystem ()
    from /usr/local/lib/libtcl84.so.1
#6  0x0000000800697f74 in Tcl_FinalizeThread ()
    from /usr/local/lib/libtcl84.so.1
#7  0x0000000800698081 in Tcl_Finalize () from 
/usr/local/lib/libtcl84.so.1
#8  0x000000080069833a in Tcl_Exit () from /usr/local/lib/libtcl84.so.1
#9  0x0000000000409610 in ?? ()
#10 0x00000008006742be in TclInvokeStringCommand ()
    from /usr/local/lib/libtcl84.so.1
#11 0x0000000800675944 in TclEvalObjvInternal ()
    from /usr/local/lib/libtcl84.so.1
#12 0x0000000800675dff in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
#13 0x00000008006b55d9 in Tcl_FSEvalFile () from 
/usr/local/lib/libtcl84.so.1
#14 0x00000008006b5690 in Tcl_EvalFile () from 
/usr/local/lib/libtcl84.so.1
#15 0x0000000000404f58 in ?? ()
#16 0x0000000000404d47 in ?? ()

>From the functions it is running in Tcl, it seems it is Tcl's cleanup 
code that is failing, when it is trying to restore a Tcl "channel" to 
normal mode during an exit event.

This is where my clue runs out, and I am at a loss as to how to proceed 
from here. I have tried digging in both Tcl and Expect source code to see 
if can catch anything obvious, but alas, this is somewhat outside my area 
of expertise (I am a networking guy, not a programmer)...

Any suggestions on how to proceed to find and fix this issue would be 
welcome, as the only other option for us is to abandon FreeBSD and go with 
Linux on the server, and we have already replaced too many FreeBSD boxes 
with Linux for my liking, I don't want to see yet another one go...

Regards,
Lars Erik Gullerud