[rancid] archive cisco command and rancid

Alan McKinnon alan.mckinnon at gmail.com
Tue Mar 24 07:31:32 UTC 2015


On 23/03/2015 20:26, 'Heasley' wrote:
> Mon, Mar 23, 2015 at 06:35:18PM +0100, alligator94:
>> > We use rancid to back up around 3700 cisco devices daily (routers and switches + some WAP and FW) all around the world, and roughly 10 percent may randomly be unreachable because they are switched off at night or because of some other connectivity issue. As we have the standard rancid configuration, I think there are 3 retries, so it may take time.
>> > 
>> > I have no access to the rancid config right now, but several clogin processes run in parallel.
>> > 
>> > We have a lot of different models of cisco devices, connected through a stable, lightly loaded MPLS network or through IPsec tunnels. Some use satellite connectivity in Far East countries.
> A few things I can suggest to improve the collection time:
> - since you have a lot of devices (probably) with long RTTs
> 	- increase rancid.conf:PAR_COUNT, perhaps to double the number of CPUs;
> 	  most processes will be waiting on the network.  If the host *only*
> 	  does rancid, increase it further - perhaps 4 times.  You will have
> 	  to play with the value a bit to find your acceptable balance of
> 	  load versus run time.
> 	- if you can separate topologically distant devices from nearby ones
> 	  into groups, you could use <group>/rancid.conf to tailor PAR_COUNT
> 	  to each workload w/ 3.2.
> - if devices are often turned off or suffer frequent outages, they could be
>   moved into a separate group that uses <group>/rancid.conf to lower the
>   MAX_ROUNDS variable.
> - you could also try lowering the timeout in cloginrc for devices that are
>   often inaccessible.
> - you may also consider switching to svn, which is faster than cvs, or to git,
>   but please create a test instance for yourself before moving to git, as the
>   support is new.
> - rancid.conf:NOPIPE=YES will improve performance of the perl part of a
>   collection a little.
> - also, see the FAQ for triggering rancid runs from syslog configuration
>   change messages.  Use that for daily activity and run once a week to CYA.
> 
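For reference, most of those knobs are one-liners in rancid.conf (or a
per-group <group>/rancid.conf) and .cloginrc. A rough sketch only; the
paths, host patterns and values here are purely illustrative:

  # rancid.conf - global defaults
  PAR_COUNT=32        # parallel collector processes
  NOPIPE=YES          # collect via temporary files instead of a pipe
  RCSSYS=svn          # cvs (default), svn, or git (git support is new; test first)

  # <group>/rancid.conf - per-group override for often-unreachable devices (3.2+)
  MAX_ROUNDS=1        # fewer retry passes for this group

  # .cloginrc - shorter login timeout for hosts that are usually down
  add timeout  remote-branch-*  15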


I have some experience with setups like this:

More than 8000 CE devices (mostly Cisco) distributed throughout Africa
over whatever links happened to be available at the time. The list of
devices was constantly changing; any given device might or might not be
up, the username/password might or might not follow the standard, and
the device in question might not even exist in the real world at all.

With 2.3.8 on a single-CPU VM with 512MB RAM, I got this to run in about
4 hours:

- Crank PAR_COUNT way up. I had mine set to 50 IIRC.
- Split your devices up into groups of a few hundred each
- Set the telnet/ssh timeout as high as it needs to be to work
reliably 95% of the time (sketch below)
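
A minimal .cloginrc illustration of that last point (the host patterns
and numbers are invented; only the timeout directive itself is real):

  # .cloginrc - per-pattern login timeouts, in seconds
  add timeout  sat-*   120   # satellite sites with very long RTT
  add timeout  *       45    # everything else

Entries are matched top-down and the first match wins, so the specific
pattern has to come before the catch-all.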

The rancid perl processes all share the machine comfortably; the code
spends most of its time waiting on the network as characters come in one
by one. The actual amount of CPU work done per process is minuscule and
disk accesses are so infrequent you can almost ignore their effect, so
don't be scared to set PAR_COUNT very high.

top and load measurements tend to go very high with rancid; ignore those
numbers. They are misleading because the machine spends most of its time
waiting on the network rather than doing real work.

I found, somewhat counter-intuitively, that with one large group of 8000
devices cvs itself was adding a significant amount of time to each
commit. Maybe cvs was misconfigured on my end, but things got much better
when I created 26 groups (keyed on the initial letter of the hostname),
so I never took the time to investigate cvs further.
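
If you want to reproduce that split, something along these lines works.
This is only a sketch: it assumes a 2.3.x-style colon-separated router.db
and a hypothetical flat devices.txt with one hostname per line (rancid
3.x uses semicolons as the field separator):

  #!/bin/sh
  # Split a flat device list into 26 rancid groups keyed on the first
  # letter of the hostname.  Clear any old router.db files before
  # re-running, since this script appends.
  BASEDIR=/usr/local/rancid/var     # adjust to your install's BASEDIR
  while read -r host; do
      letter=$(echo "$host" | cut -c1 | tr 'A-Z' 'a-z')
      mkdir -p "$BASEDIR/$letter"
      echo "$host:cisco:up" >> "$BASEDIR/$letter/router.db"
  done < devices.txt

Each of the 26 group names also has to be listed in LIST_OF_GROUPS in
rancid.conf, and rancid-cvs run once so the repositories exist.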

I also had MAX_ROUNDS set to 0 so rancid never retried a given device.
My logic was that connectivity to the device was not my problem at all,
so I didn't need to deal with it; rancid would simply poll the device
again in 12 hours (mine ran twice a day). The OP might not have the
freedom to work under a policy like this, though.
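
In concrete terms that policy is just a setting plus a cron schedule,
roughly like this, with the times and the install path illustrative:

  # rancid.conf (or a <group>/rancid.conf): never retry a failed device
  MAX_ROUNDS=0

  # crontab for the rancid user: two collection runs per day
  0 5,17 * * *   /usr/local/rancid/bin/rancid-run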


-- 
Alan McKinnon
alan.mckinnon at gmail.com
