[rancid] Rancid/Expect failing on FreeBSD/SMP systems
Lars Erik Gullerud
lerik at nolink.net
Thu Jan 10 13:26:40 UTC 2008
My apologies for posting this to both the Rancid list and FreeBSD-STABLE;
I am not sure where to start troubleshooting this issue. I suspect it is
a FreeBSD problem, but we are probably not the only shop running RANCID
(ports/net-mgmt/rancid) on FreeBSD, since it is quite popular in ISP
environments, so hopefully someone can look at it from the RANCID angle
and give some helpful input on how to troubleshoot this further.
The problem: After finally giving in and starting to phase out some of our
oldest FreeBSD 4.11 servers and replace them with FreeBSD 6.x on fresh
hardware, I got around to moving our RANCID server. This, however, has been
the start of a real nightmare. I don't think the problems I am seeing are
in RANCID itself, but they can be reliably reproduced every time I run
RANCID, and I have not been able to reproduce them in any other way with
pure Expect test-cases directly.
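(For illustration, the standalone test-cases I have been trying are roughly
along these lines - just a stripped-down sketch of a single ssh collection,
with the hostname, username and prompt/command as placeholders rather than
anything real:)

#!/usr/local/bin/expect --
# minimal standalone test: spawn ssh to one router, pull the config,
# log out and exit cleanly
spawn ssh -x rancid@router.example.net
expect ">"
send "show configuration | no-more\r"
expect ">"
send "exit\r"
expect eof
exit 0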
What happens:
Expect processes "hang" during RANCID runs and go into infinite loops
eating 100% CPU (on one CPU core). The problem is reliably reproduced
every time we do a full rancid-run, but the actual device it chokes on
varies between runs, so it is not device-related. It does seem to happen
most often when collecting Juniper M-series gear with large configurations,
though, using jrancid and ssh.
We can NOT seem to reproduce it by running jrancid (or any of the other
collection scripts) on a single device at a time - which is somewhat
confusing, as it DOES happen when setting PAR_COUNT to 1 and doing a
rancid-run (which should IMHO be pretty much the same as doing sequential
single-device runs; the only difference I can think of is that rancid-run
drives the scripts through par/control_rancid with stdin/stderr redirected
to /dev/null and a logfile instead of a terminal, which also matches the
lsof output further down...)
Our environment:
We run RANCID extensively to collect a few hundred devices, including
Cisco, Cisco-CatOS, Juniper, Extreme, Extreme-XOS, Riverstone,
FortiNet/FortiGate, etc. We want to start storing CPE configs in addition
to our own core gear in RANCID now, which means we will be putting several
thousand routers into RANCID, which also explains the need for fresher
hardware...
RANCID version does not seem to matter, I have tested with both some
ancient 2.3.0 scripts and 2.3.2a7, same behaviour.
Using the same RANCID instance (I have tarballed it up and installed it on
a bunch of servers, i.e. using the same CVS and the same router.db files
etc.), it fails on:
FreeBSD 7.0-BETA4, amd64, SMP kernel, 8 x CPU cores (2 x quad Xeon 5335)
FreeBSD 6.2-STABLE, i386, SMP kernel, 2 x CPU cores (2 x single-core Xeon)
Both have perl-5.8.8_1, expect 5.43.0_3 and tcl-8.4.16,1 built from ports.
It however seems to work fine on:
Linux CentOS 4.5 x86-64, 4 x CPU cores (2 x dual Xeon 5130)
FreeBSD 4.11 i386, UP kernel, 1 x CPU core (1 x single-core Xeon)
FreeBSD 7.0-RC1, i386, UP kernel, 1 x CPU core (1 x P4)
(Linux box has Expect 5.42 and Tcl 8.3...)
So it only seems to happen on newer FreeBSD with SMP. (If anyone has RANCID
working okay on FreeBSD 6.x/7.x on SMP systems at all, please let me
know...)
Now for some details, in case anyone has any ideas. This is what is
actually happening, according to truss on the stuck Expect process:
fcntl(4,F_GETFL,) = 0 (0x0)
fcntl(4,F_SETFL,0x0) ERR#25 'Inappropriate ioctl for device'
fcntl(4,F_GETFL,) = 0 (0x0)
fcntl(4,F_SETFL,0x0) ERR#25 'Inappropriate ioctl for device'
<looping ad nauseam>
So, which device is it trying to fcntl, and what is it trying to do? lsof
shows the following:
expect 1417 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs
expect 1417 rancid rtd VDIR 0,81 512 2 /
expect 1417 rancid 0r VCHR 0,24 0t0 24 /dev/null
expect 1417 rancid 2r VCHR 0,24 0t0 24 /dev/null
expect 1417 rancid 3r VCHR 0,24 0t0 24 /dev/null
expect 1417 rancid 4r VCHR 0,24 0t0 24 /dev/null
File descriptor 4 is /dev/null. Why is it trying to F_SETFL /dev/null into
BLOCKING mode (and failing)? Why should it be touching /dev/null at all?
Well, digging a little, this is what the lsof output looked like
10 seconds earlier:
expect 1417 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs
expect 1417 rancid rtd VDIR 0,81 512 2 /
expect 1417 rancid 0r VCHR 0,24 0t0 24 /dev/null
expect 1417 rancid 1u PIPE 0x38bfcf8 0 ->0xffffff00038bfba0
expect 1417 rancid 2w VREG 0,86 76 7583772 /local (/dev/mfid0s1f)
expect 1417 rancid 3u VCHR 0,108 0t0 108 /dev/ttyp2
expect 1417 rancid 4u VCHR 0,117 0t45 117 /dev/ptyp7
ssh 1418 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs
ssh 1418 rancid rtd VDIR 0,81 512 2 /
ssh 1418 rancid txt unknown file system type: 8\xb9^_^B\xff\xff\xff^Xb\xab)^B\xff\xff\xffE
ssh 1418 rancid 0u VCHR 0,118 0t0 118 /dev/ttyp7
ssh 1418 rancid 1u VCHR 0,118 0t0 118 /dev/ttyp7
ssh 1418 rancid 2u VCHR 0,118 0t0 118 /dev/ttyp7
ssh 1418 rancid 3w VREG 0,86 76 7583772 /local (/dev/mfid0s1f)
ssh 1418 rancid 4u IPv4 0xffffff008c030240 0t0 TCP *:27776->*:49323
ssh 1418 rancid 5u VCHR 0,118 0t45 118 /dev/ttyp7
Here, fd 4 is actually a pty (ptyp7), connected to the child PID 1418,
the ssh session to the router, and everything looks normal.
PID 1418 is no longer there in the most recent lsof, so 1418 seems to
have died(?), and PID 1417 now has /dev/null on its file descriptor 4. I
don't know why that is, but why is it then trying to fcntl it into blocking
I/O mode? Here is a gdb attach to the PID and a backtrace:
(gdb) bt
#0 0x0000000800aefc9c in fcntl () from /lib/libc.so.7
#1 0x00000000004072c5 in ?? ()
#2 0x00000008006a8c18 in StackSetBlockMode ()
from /usr/local/lib/libtcl84.so.1
#3 0x00000008006a8c54 in SetBlockMode () from
/usr/local/lib/libtcl84.so.1
#4 0x00000008006acf75 in Tcl_SetChannelOption ()
from /usr/local/lib/libtcl84.so.1
#5 0x00000008006aeda0 in TclFinalizeIOSubsystem ()
from /usr/local/lib/libtcl84.so.1
#6 0x0000000800697f74 in Tcl_FinalizeThread ()
from /usr/local/lib/libtcl84.so.1
#7 0x0000000800698081 in Tcl_Finalize () from
/usr/local/lib/libtcl84.so.1
#8 0x000000080069833a in Tcl_Exit () from /usr/local/lib/libtcl84.so.1
#9 0x0000000000409610 in ?? ()
#10 0x00000008006742be in TclInvokeStringCommand ()
from /usr/local/lib/libtcl84.so.1
#11 0x0000000800675944 in TclEvalObjvInternal ()
from /usr/local/lib/libtcl84.so.1
#12 0x0000000800675dff in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
#13 0x00000008006b55d9 in Tcl_FSEvalFile () from
/usr/local/lib/libtcl84.so.1
#14 0x00000008006b5690 in Tcl_EvalFile () from
/usr/local/lib/libtcl84.so.1
#15 0x0000000000404f58 in ?? ()
#16 0x0000000000404d47 in ?? ()
From the functions it is running in Tcl, it seems it is Tcl's cleanup
code that is failing, while trying to restore a Tcl "channel" to blocking
mode during exit.
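As a crude standalone check (this is only my reading of the backtrace, not
anything confirmed from the Tcl sources), the same step can be mimicked in
plain tclsh - with /dev/null standing in for the dead pty, since I cannot
recreate the revoked-pty state outside a rancid-run:

#!/usr/local/bin/tclsh8.4
# mimic the cleanup step the backtrace points at: put a channel into
# non-blocking mode, then switch it back to blocking the way the exit
# code appears to do; on Unix that last step should boil down to the
# same fcntl(F_SETFL) seen in the truss output.
# /dev/null is only a stand-in for whatever fd 4 really points at.
set chan [open /dev/null r]
fconfigure $chan -blocking 0
if {[catch {fconfigure $chan -blocking 1} err]} {
    puts "switching channel back to blocking failed: $err"
} else {
    puts "channel switched back to blocking without error"
}
close $chan

If this trivial case works (as I would expect), then whatever state fd 4
is left in after the ssh child dies is presumably what makes the real
fcntl fail - and, apparently, loop forever instead of giving up.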
This is where my clue runs out, and I am at a loss as to how to proceed
from here. I have tried digging in both the Tcl and Expect source code to
see if I can catch anything obvious, but alas, this is somewhat outside my
area of expertise (I am a networking guy, not a programmer)...
Any suggestions on how to find and fix this issue would be welcome, as the
only other option for us is to abandon FreeBSD and go with Linux on this
server - and we have already replaced too many FreeBSD boxes with Linux for
my liking, I don't want to see yet another one go...
Regards,
Lars Erik Gullerud