[rancid] Re: Rancid/Expect failing on FreeBSD/SMP systems
Mohacsi Janos
mohacsi at niif.hu
Thu Jan 10 15:04:02 UTC 2008
Hi Lars,
You should use expect-devel port to avoid hunging on pty
have a look at
http://www.freebsd.org/cgi/query-pr.cgi?pr=118452
Janos Mohacsi
Network Engineer, Research Associate, Head of Network Planning and Projects
NIIF/HUNGARNET, HUNGARY
Key 70EF9882: DEC2 C685 1ED4 C95A 145F 4300 6F64 7B00 70EF 9882
On Thu, 10 Jan 2008, Lars Erik Gullerud wrote:
> My apologies for posting this both to the Rancid list and FreeBSD-STABLE,
> however I am not sure where to start troubleshooting this issue - I am
> suspecting it is a FreeBSD issue, but I am thinking we are probably not
> the only shop running RANCID (ports/net-mgmt/rancid) on FreeBSD (since it
> is quite popular in ISP environments), so hopefully someone can look at
> it from the RANCID angle and give some helpful input on how to
> troubleshoot this further.
>
> The problem: After finally giving in and starting to phase out some of our
> oldest FreeBSD 4.11 servers and replace them with FreeBSD 6.x on some
> fresh hardware, I got around to start moving our RANCID server. This
> however, has been the start of a real nightmare. I don't think the
> problems I am seeing are in RANCID itself, however it can be reliable
> reproduced every time i run RANCID and I have not been able to reproduce
> it in any other way with pure expect test-cases directly.
>
> What happens:
>
> Expect processes "hang" during RANCID runs, and go into infinite loops
> eating 100% CPU (on one CPU core). The problem is reliably reproduced
> everytime we do a full rancid-run, but the actual device it chokes on
> varies between runs so it is not device-related. It does seem to happen
> most often when collecting Juniper M-series gear with large configurations
> though, using jrancid and ssh.
>
> We can NOT seem to reproduce it by running jrancid (or any other) on a
> single device at at time - which is somewhat confusing at is DOES happen
> when setting PAR_COUNT to 1 and doing a rancid-run (which
> should IMHO be pretty much the same as doing sequential single device
> runs...)
>
> Our environment:
>
> We run RANCID extensively to collect a few hundred devices, including
> Cisco, Cisco-CatOS, Juniper, Extreme, Extreme-XOS, Riverstone,
> FortiNet/FortiGate, etc. We want to start storing CPE configs in addition
> to our own core gear in RANCID now, which means we will be putting several
> thousand routers into RANCID, which also explains the need for fresher
> hardware...
>
> RANCID version does not seem to matter, I have tested with both some
> ancient 2.3.0 scripts and 2.3.2a7, same behaviour.
>
> Using the same RANCID instance (I have tarballed it up and installed it on
> a bunch of servers, i.e. using the same CVS and the same router.db files
> etc.), it fails on:
>
> FreeBSD 7.0-BETA4, amd64, SMP kernel, 8 x CPU cores (2 x quad Xeon 5335)
> FreeBSD 6.2-STABLE, i386, SMP kernel, 2 x CPU cores (2 x single-core Xeon)
>
> Both have perl-5.8.8_1, expect 5.43.0_3 and tcl-8.4.16,1 built from ports.
>
> It however seems to work fine on:
>
> Linux CentOS 4.5 x86-64, 4 x CPU cores (2 x dual Xeon 5130)
> FreeBSD 4.11 i386, UP kernel, 1 x CPU core (1 x single-core Xeon)
> FreeBSD 7.0-RC1, i386, UP kernel, 1 x CPU core (1 x P4)
>
> (Linux box has Expect 5.42 and Tcl 8.3...)
>
> So it only seems to be on newer FreeBSD with SMP. (If anyone have RANCID
> working okay on FreeBSD 6.x/7.x on SMP systems at all, please let me
> know...)
>
> Now, for some details, if anyone has any ideas. What is actually
> happening is this, when truss'ing the stuck Expect-process:
>
> fcntl(4,F_GETFL,) = 0 (0x0)
> fcntl(4,F_SETFL,0x0) ERR#25 'Inappropriate ioctl for device'
> fcntl(4,F_GETFL,) = 0 (0x0)
> fcntl(4,F_SETFL,0x0) ERR#25 'Inappropriate ioctl for device'
> <looping ad nauseum>
>
> So, which device is it trying to fcntl, and what is it trying to do? lsof
> shows the following:
>
> expect 1417 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs
> expect 1417 rancid rtd VDIR 0,81 512 2 /
> expect 1417 rancid 0r VCHR 0,24 0t0 24 /dev/null
> expect 1417 rancid 2r VCHR 0,24 0t0 24 /dev/null
> expect 1417 rancid 3r VCHR 0,24 0t0 24 /dev/null
> expect 1417 rancid 4r VCHR 0,24 0t0 24 /dev/null
>
> file descriptor 4 is /dev/null. Why is it trying to F_SETFL /dev/null to
> BLOCKING mode (which is failing)? Why should it be playing with /dev/null
> at all? Well, digging a little, this is what the lsof output looked like
> 10 seconds earlier:
>
> expect 1417 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs
> expect 1417 rancid rtd VDIR 0,81 512 2 /
> expect 1417 rancid 0r VCHR 0,24 0t0 24 /dev/null
> expect 1417 rancid 1u PIPE 0x38bfcf8 0 ->0xffffff00038bfba0
> expect 1417 rancid 2w VREG 0,86 76 7583772 /local (/dev/mfid0s1f)
> expect 1417 rancid 3u VCHR 0,108 0t0 108 /dev/ttyp2
> expect 1417 rancid 4u VCHR 0,117 0t45 117 /dev/ptyp7
> ssh 1418 rancid cwd VDIR 0,86 2048 7607662 /local/rancid/var/core/configs
> ssh 1418 rancid rtd VDIR 0,81 512 2 /
> ssh 1418 rancid txt unknown file system type: 8\xb9^_^B\xff\xff\xff^Xb\xab)^B\xff\xff\xffE
> ssh 1418 rancid 0u VCHR 0,118 0t0 118 /dev/ttyp7
> ssh 1418 rancid 1u VCHR 0,118 0t0 118 /dev/ttyp7
> ssh 1418 rancid 2u VCHR 0,118 0t0 118 /dev/ttyp7
> ssh 1418 rancid 3w VREG 0,86 76 7583772 /local (/dev/mfid0s1f)
> ssh 1418 rancid 4u IPv4 0xffffff008c030240 0t0 TCP *:27776->*:49323
> ssh 1418 rancid 5u VCHR 0,118 0t45 118 /dev/ttyp7
>
> Here, fd 4 is actually a pty (pty7), which seems to be a fork to PID 1418,
> the ssh session to the router, and everything seems to be normal.
>
> PID 1418 is no longer there on the most recent lsof, so 1418 seems to
> have died(?) and PID 1417 now has /dev/null on its file descriptor 4. I
> don't know why that is, but why is it trying to fcntl it to Blocking I/O
> mode? Here is a gdb attach to the PID and a backtrace:
>
> (gdb) bt
> #0 0x0000000800aefc9c in fcntl () from /lib/libc.so.7
> #1 0x00000000004072c5 in ?? ()
> #2 0x00000008006a8c18 in StackSetBlockMode ()
> from /usr/local/lib/libtcl84.so.1
> #3 0x00000008006a8c54 in SetBlockMode () from
> /usr/local/lib/libtcl84.so.1
> #4 0x00000008006acf75 in Tcl_SetChannelOption ()
> from /usr/local/lib/libtcl84.so.1
> #5 0x00000008006aeda0 in TclFinalizeIOSubsystem ()
> from /usr/local/lib/libtcl84.so.1
> #6 0x0000000800697f74 in Tcl_FinalizeThread ()
> from /usr/local/lib/libtcl84.so.1
> #7 0x0000000800698081 in Tcl_Finalize () from
> /usr/local/lib/libtcl84.so.1
> #8 0x000000080069833a in Tcl_Exit () from /usr/local/lib/libtcl84.so.1
> #9 0x0000000000409610 in ?? ()
> #10 0x00000008006742be in TclInvokeStringCommand ()
> from /usr/local/lib/libtcl84.so.1
> #11 0x0000000800675944 in TclEvalObjvInternal ()
> from /usr/local/lib/libtcl84.so.1
> #12 0x0000000800675dff in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
> #13 0x00000008006b55d9 in Tcl_FSEvalFile () from
> /usr/local/lib/libtcl84.so.1
> #14 0x00000008006b5690 in Tcl_EvalFile () from
> /usr/local/lib/libtcl84.so.1
> #15 0x0000000000404f58 in ?? ()
> #16 0x0000000000404d47 in ?? ()
>
>> From the functions it is running in Tcl, it seems it is Tcl's cleanup
> code that is failing, when it is trying to restore a Tcl "channel" to
> normal mode during an exit event.
>
> This is where my clue runs out, and I am at a loss as to how to proceed
> from here. I have tried digging in both Tcl and Expect source code to see
> if can catch anything obvious, but alas, this is somewhat outside my area
> of expertise (I am a networking guy, not a programmer)...
>
> Any suggestions on how to proceed to find and fix this issue would be
> welcome, as the only other option for us is to abandon FreeBSD and go with
> Linux on the server, and we have already replaced too many FreeBSD boxes
> with Linux for my liking, I don't want to see yet another one go...
>
> Regards,
> Lars Erik Gullerud
> _______________________________________________
> Rancid-discuss mailing list
> Rancid-discuss at shrubbery.net
> http://www.shrubbery.net/mailman/listinfo.cgi/rancid-discuss
>
More information about the Rancid-discuss
mailing list