From danm at prime.gushi.org Wed Jun 18 23:22:23 2025
From: danm at prime.gushi.org (Dan Mahoney (Gushi))
Date: Wed, 18 Jun 2025 23:22:23 +0000 (UTC)
Subject: [rancid] Intermittent Rancid Failures
Message-ID: <1236f54c-2c37-31be-c52f-0e0feb191b4c@prime.gushi.org>

Hey there all,

Something's driving me batty.

My ASR-1001-X is only able to be connected to intermittently. Rancid (run
as the rancid user) always works from the command line, but rancid-run fails
for some reason.

When I watch rancid-run, I see several ssh processes start up, trying to
shell to the router in question, but of course, the output of those isn't
logged anywhere? Clogin works. Running all the commands in rancid -d works
(though of course there are many extra commands in there).

But every time I call rancid-run groupname, I get the "routers have not been
contacted in over 24 hours" email. And only intermittently. (It's been a
little over 24 hours with no changes now).

Rancid is rancid3-3.13_3 under FreeBSD 13.4

I notice that some of the commands that one might need to run these things
from the command line get put into /usr/local/libexec/rancid/ which isn't in
the default path, but this only breaks command-line testing, not actual
production runs. (Maybe the rancid-run script sets the right path?)

I'll also note that these routers need some special things in .ssh/config
for some older ciphers, etc, but that would be an all-or-nothing type
problem, I'd think.

-Dan

--

--------Dan Mahoney--------
Techie, Sysadmin, WebGeek
Gushi on efnet/undernet IRC
FB: fb.com/DanielMahoneyIV
LI: linkedin.com/in/gushi
Site: http://www.gushi.org
---------------------------

From me at falz.net Thu Jun 19 20:07:34 2025
From: me at falz.net (Chris Wopat)
Date: Thu, 19 Jun 2025 15:07:34 -0500
Subject: [rancid] Rancid-discuss Digest, Vol 163, Issue 1
In-Reply-To:
References:
Message-ID:

While I don't currently have FreeBSD, rancid-run should log to logdir.
On a system I have handy these end up in /var/lib/rancid/logs/

from our rancid.conf:

BASEDIR=/var/lib/rancid; export BASEDIR
LOGDIR=$BASEDIR/logs; export LOGDIR

Whatever your issue is, the contents of the logs are probably vital.

From heas at shrubbery.net Sat Jun 21 17:12:47 2025
From: heas at shrubbery.net (heasley)
Date: Sat, 21 Jun 2025 17:12:47 +0000
Subject: [rancid] Intermittent Rancid Failures
In-Reply-To: <1236f54c-2c37-31be-c52f-0e0feb191b4c@prime.gushi.org>
References: <1236f54c-2c37-31be-c52f-0e0feb191b4c@prime.gushi.org>
Message-ID:

Wed, Jun 18, 2025 at 11:22:23PM +0000, Dan Mahoney (Gushi):
> Hey there all,
>
> Something's driving me batty.
>
> My ASR-1001-X is only able to be connected to intermittently. Rancid (run
> as the rancid user) always works from the command line, but rancid-run fails
> for some reason.
>
> When I watch rancid-run, I see several ssh processes start up, trying to
> shell to the router in question, but of course, the output of those isn't
> logged anywhere? Clogin works. Running all the commands in rancid -d works
> (though of course there are many extra commands in there).

There should only be 1 ssh process per device, though it will try
rancid.conf:MAX_ROUNDS times.

Much of the output is filtered, but effort is made to log relevant
errors to rancid.conf:${LOGDIR}/.

It is possible that the device is simply slow executing some commands.
This is not unusual for older devices or because of bugs such as
memory leaks. Increasing the timeout can test this theory, either
increase the timeout for all devices of type cisco,
    rancid.types.base: cisco;timeout;120
or specific devices,
    ~rancid/.cloginrc: add timeout {}

> But every time I call rancid-run groupname, I get the "routers have not been
> contacted in over 24 hours" email. And only intermittently. (It's been a
> little over 24 hours with no changes now).
Another thing to check, which would also be revealed in the
aforementioned logs, is that the repository is not buggered in
some manner that control_rancid cannot resolve.
    su - rancid
    cd
    update or status
and look for errors.

Those are the things that I would investigate or try first.

> Rancid is rancid3-3.13_3 under FreeBSD 13.4
>
> I notice that some of the commands that one might need to run these things
> from the command line get put into /usr/local/libexec/rancid/ which isn't in
> the default path, but this only breaks command-line testing, not actual
> production runs. (Maybe the rancid-run script sets the right path?)

The PATH for the environment is set in rancid.conf, which should include
that directory. You too can source that file if your shell is
sh-compatible. You have seen the ssh processes, so this is not your
problem.

> I'll also note that these routers need some special things in .ssh/config
> for some older ciphers, etc, but that would be an all-or-nothing type
> problem, I'd think.

That should not matter.

From danm at prime.gushi.org Tue Jun 24 18:28:16 2025
From: danm at prime.gushi.org (Dan Mahoney (Gushi))
Date: Tue, 24 Jun 2025 18:28:16 +0000 (UTC)
Subject: [rancid] Intermittent Rancid Failures
In-Reply-To:
References: <1236f54c-2c37-31be-c52f-0e0feb191b4c@prime.gushi.org>
Message-ID:

On Sat, 21 Jun 2025, heasley wrote:

> Wed, Jun 18, 2025 at 11:22:23PM +0000, Dan Mahoney (Gushi):
>> Hey there all,
>>
>> Something's driving me batty.
>>
>> My ASR-1001-X is only able to be connected to intermittently. Rancid (run
>> as the rancid user) always works from the command line, but rancid-run fails
>> for some reason.
>>
>> When I watch rancid-run, I see several ssh processes start up, trying to
>> shell to the router in question, but of course, the output of those isn't
>> logged anywhere? Clogin works. Running all the commands in rancid -d works
>> (though of course there are many extra commands in there).
>
> There should only be 1 ssh process per device, though it will try
> rancid.conf:MAX_ROUNDS times.
>
> Much of the output is filtered, but effort is made to log relevant
> errors to rancid.conf:${LOGDIR}/.
>
> It is possible that the device is simply slow executing some commands.
> This is not unusual for older devices or because of bugs such as
> memory leaks. Increasing the timeout can test this theory, either
> increase the timeout for all devices of type cisco,
>     rancid.types.base: cisco;timeout;120

Interesting, this line wasn't in my existing rancid.types.base for type
cisco. I've added it at 300 in both the conf file and cloginrc. But it
seems not to be honored.

For example, at the time of one of the failures, I get:

$ time rancid-run
57.43 real 3.20 user 0.40 sys

And also, ps seems to report it's being hard-set at 90:

rancid 87909 2.1 0.1 18324 6952 0 S+ 17:40 0:00.06 /usr/local/bin/expect -- /usr/local/libexec/rancid/clogin -t 90 -c show version;show redundancy secondary;show idprom backplane;show install active;show env all;show rsp chassis-info;show gsr chassis;show diag chassis-info;show boot;show bootvar;show variables boot;show license udi;show license feature;show license;show license summary;show activation-key (...)

Weirdly, sitting on the router and stalking "who" I see the rancid login
happen multiple times.

Adding a couple of quotes and running the full clogin command line always
runs quickly.

> or specific devices,
>     ~rancid/.cloginrc: add timeout {}
>
>> But every time I call rancid-run groupname, I get the "routers have not been
>> contacted in over 24 hours" email. And only intermittently. (It's been a
>> little over 24 hours with no changes now).
>
> Another thing to check, which would also be revealed in the
> aforementioned logs, is that the repository is not buggered in
> some manner that control_rancid cannot resolve.
>     su - rancid
>     cd
>     update or status
> and look for errors.
>
> Those are the things that I would investigate or try first.

cvs up/cvs status run clean. I even deleted and re-added the file from cvs.

When it works, it works. This is what's confusing me.

===

(a few hours later)

I think I have one (silly) theory about what's going wrong. I have a bit of
ASCII art in the motd, and when I removed it, things started running more
fluidly. (It has # signs, carets, and slashes in it).

https://www.gushi.org/routerferret.png Too many weasels in the router.

I still don't know why this would only break things half the time, though.

I still don't know why things always work fluidly when I just paste commands
in -- perhaps the clogin goes fine, but what happens after is breaking.

I also still don't know why -t 90 is being reported if I've set an explicit
timeout of longer.

I'm also not sure why rancid does something like:

more system:running-config;show running-config view full;show
running-config;write term -- if multiple of these commands work, are they
post-processed/deduplicated down to a single config before they're committed
to CVS?

Does it make sense to pare these down to a single command-set that works
only on my version of IOS-XE, and define my own device type for it? Rancid
seems to have a very "throw all the commands at the wall and see what
sticks" point of view.

-Dan

--

--------Dan Mahoney--------
Techie, Sysadmin, WebGeek
Gushi on efnet/undernet IRC
FB: fb.com/DanielMahoneyIV
LI: linkedin.com/in/gushi
Site: http://www.gushi.org
---------------------------

From heas at shrubbery.net Tue Jun 24 19:09:48 2025
From: heas at shrubbery.net (heasley)
Date: Tue, 24 Jun 2025 19:09:48 +0000
Subject: [rancid] Intermittent Rancid Failures
In-Reply-To:
References: <1236f54c-2c37-31be-c52f-0e0feb191b4c@prime.gushi.org>
Message-ID:

Tue, Jun 24, 2025 at 06:28:16PM +0000, Dan Mahoney (Gushi):
> (a few hours later)
>
> I think I have one (silly) theory about what's going wrong.
> I have a bit of ASCII art in the motd, and when I removed it, things started
> running more fluidly. (It has # signs, carets, and slashes in it).

clogin(1):
BUGS
    Do not use greater than (>) or pound sign (#) in device banners or
    hostnames or prompts. These are the normal terminating characters of
    device prompts and the login scripts need to locate the initial
    prompt. Afterward, the full prompt is collected and makes a more
    precise match so that the scripts know when the device is ready for
    the next command.

> https://www.gushi.org/routerferret.png Too many weasels in the router.
>
> I still don't know why this would only break things half the time, though.

That is simply timing. How the input is chunked is largely random. When
it first connects, the prompt match is somewhat loose; when it finds the
prompt, it adjusts the match to be more precise.

You can override this in cloginrc(5):
    add prompt {}
        Match login prompt, or initial login prompt in the case of some
        of the login scripts. This is provided only as a work-around for
        login banners that contain forbidden characters that conflict
        with CLI prompt markers. Note that not all login scripts support
        this.

> I still don't know why things always work fluidly when I just paste commands
> in -- perhaps the clogin goes fine, but what happens after is breaking.

When you use clogin for an interactive login, it connects, logs in, then
steps out of the way.

> I also still don't know why -t 90 is being reported if I've set an explicit
> timeout of longer.

This I will have to look into.

> I'm also not sure why rancid does something like:
>
> more system:running-config;show running-config view full;show
> running-config;write term -- if multiple of these commands work, are they
> post-processed/deduplicated down to a single config before they're committed
> to CVS?

It is because cisco evolved and/or had trouble being consistent or one
version shows more data than the other but is not supported everywhere.
You can create per-family device types of your own in rancid.types.conf
(separate file in same dir) that exclude those extra commands.

> Does it make sense to pare these down to a single command-set that works
> only on my version of IOS-XE, and define my own device type for it? Rancid
> seems to have a very "throw all the commands at the wall and see what
> sticks" point of view.

YMMV; collect a bit more output that is discarded or an error that is
ignored - meh.
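
As a rough illustration of the clogin(1) BUGS note quoted in this thread, a
banner or motd can be scanned for the prompt-terminating characters (> and #)
before it is installed on a device. This is only a sketch and not part of
rancid; the function name find_prompt_conflicts and the sample banner are
made up for the example.

```python
# Hypothetical helper, not part of rancid: flag banner lines containing
# the characters that clogin(1) lists as prompt terminators.
PROMPT_CHARS = (">", "#")


def find_prompt_conflicts(banner: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing '>' or '#'."""
    conflicts = []
    for number, line in enumerate(banner.splitlines(), start=1):
        if any(char in line for char in PROMPT_CHARS):
            conflicts.append((number, line))
    return conflicts


if __name__ == "__main__":
    # Made-up banner for illustration only.
    sample = "Authorized access only\n  /\\_/\\  #\n ( o.o ) >\nContact the NOC"
    for number, line in find_prompt_conflicts(sample):
        print(f"line {number}: {line!r}")
```

A check like this only flags candidate lines; whether a given banner actually
trips the login script still depends on timing, as described above.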