[rancid] Improving Rancid's processing speed when having 1k+ devices

Fri Jul 26 11:29:51 UTC 2019

> 9 minutes for 1200 devices seems reasonable to me. :)

Heh - I've got around 3,000.  I'm having an issue with PAR that I haven't fully addressed, so I'm still only doing 5 at a time and getting 4- to 5-hour run times.  We made a choice at one point to put all "do-diff" groups on one line in cron; that didn't help at all, but we haven't yet backed it out.  If we were to break that up appropriately, we'd have around 1200 in the largest group, several hundred in a few, and a number of groups (about 15 altogether) with <10.  We could break things up further, but at some point you have to just accept large router.db files, because there's managerial overhead in maintaining a large number of rancid groups and keeping them synchronized against CDP and LLDP discoveries and the CMDB in a dynamic environment.
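For a rough sense of what those numbers imply, here is a back-of-envelope sketch in Python. It assumes run time scales linearly with the number of parallel slots, which is an assumption and not something measured here; the function names are made up for illustration.

```python
def per_device_seconds(wall_seconds, par, devices):
    """Average time one device occupies a slot, given a full run's wall time."""
    return wall_seconds * par / devices


def projected_wall_seconds(devices, seconds_per_device, par):
    """Projected run time at a different parallelism (assumes linear scaling)."""
    return devices * seconds_per_device / par


# 3,000 devices, 5 at a time, 4 to 5 hours per run (figures from this message)
low = per_device_seconds(4 * 3600, 5, 3000)
high = per_device_seconds(5 * 3600, 5, 3000)
print(low, high)

# Hypothetical: the same devices at 100 at a time
print(projected_wall_seconds(3000, low, 100) / 60, "minutes")
```

In practice the SCM and post-processing steps probably do not scale this cleanly, so treat the projection as a best case.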

Our old server we stood up in 2002 using rancid 1.2 was set to PAR=100 and getting about 45min for the entire suite.  We never actually hit 100 simultaneous connections, we maxed out at around 60-70 because by the time the 71st connection was opened the 1st was completing.  Of course, that was for a server stood-up in 2002, so take that for whatever it's worth.
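One plausible way to read the 60-70 ceiling is Little's law: sessions open at once is roughly the launch rate times the session length, capped by PAR. A minimal sketch, where the launch interval and session length are hypothetical values picked only to land in that range, not measurements:

```python
def steady_state_sessions(par, launch_interval_s, session_s):
    """Approximate sessions open at once when one starts every launch_interval_s."""
    # Little's law: concurrency = arrival rate * time in system, capped at par.
    return min(par, int(session_s / launch_interval_s))


# Hypothetical: a new session every 0.5 s, each lasting about 35 s
print(steady_state_sessions(100, 0.5, 35.0))
```

Under that reading, raising PAR past the point where launches keep up with completions buys nothing.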

Is 9 min too long?

weylin

On 7/25/19, 12:55 PM, "john heasley" <heas at shrubbery.net> wrote:

    Thu, Jul 25, 2019 at 02:29:37PM +0200, Florin Vlad Olariu:
    > Well, as per title, is there any way to improve rancid's speed with so many
    > devices? At the moment I set PAR_COUNT to 300, so it will connect in
    > parallel to 300 devices at a time, but the reality is that most time does
    > not seem to be taken by connecting and retrieving config but by what
    > happens next in the file processing and git-committing.
    > 
    > To give you some stats, with current settings it takes around 9 minutes to
    > do 1200 devices. I have only 1 group with all devices under the same group.
    > 
    > Any trick you might have, please let me know!

    Typically, the network and, more so, the devices are the slow part.  Some
    devices are much slower than others.  More parallelism helps a lot, which
    your high PAR_COUNT already provides.  Other thoughts:

    - cvs is slow.  use svn or git.  svn is probably faster; but I have not
      benchmarked the two for the functions that rancid uses.
    - make sure that the rancid user is not process rlimited to less than ~605
      processes; or PAR_COUNT * 2 + 5 or so.
    - perl is a memory pig.  if the host/vm has memory pressure, this would be
      something to address.
    - retrieving device output does not require much cpu, but processing does
      use some - don't starve it
    - use rancid.conf:NOPIPE=YES; I think this is faster because perl is a pig.
    - if you only need configs, then reduce what is collected to just show version
      and show running.  or have one hourly group that collects that, and a daily
      group that collects everything.  less processing, and esp many fewer regexes.
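The process-limit bullet above can be checked with a short Python sketch. It uses the standard `resource` module (Unix only; `RLIMIT_NPROC` is available on Linux and the BSDs) and should be run as the rancid user. The helper names are made up for illustration, and the limit counts all of that user's processes, so the headroom shown is optimistic.

```python
import resource


def required_processes(par_count, per_job=2, overhead=5):
    """Rule of thumb from the list above: PAR_COUNT * 2 + 5 or so."""
    return par_count * per_job + overhead


def nproc_headroom(par_count):
    """Soft process limit minus the estimate; None if the limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - required_processes(par_count)


print(required_processes(300))  # 605, matching the figure quoted above
print(nproc_headroom(300))      # negative means the limit is too low
```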

    multiple groups might help, at least for the SCM part.  split your one large
    group into a few.  make sure to use a separate cron for each so that they run
    in parallel.
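A minimal sketch of that split, in Python. It deals router.db entries round-robin into a few groups and prints one cron line per group. Entries are treated as opaque lines; the rancid-run path, the hourly schedule, and the group names are assumptions for illustration only.

```python
def split_round_robin(entries, n_groups):
    """Deal entries into n_groups lists of near-equal size."""
    groups = [[] for _ in range(n_groups)]
    for i, entry in enumerate(entries):
        groups[i % n_groups].append(entry)
    return groups


def cron_lines(group_names, rancid_run="/usr/local/rancid/bin/rancid-run"):
    """One crontab entry per group so the groups run in parallel."""
    # The path and the schedule are placeholders; adjust to the local install.
    return [f"0 * * * * {rancid_run} {name}" for name in group_names]


devices = [f"router{i}" for i in range(1, 6)]  # stand-in for router.db lines
print(split_round_robin(devices, 2))
for line in cron_lines(["core-a", "core-b"]):
    print(line)
```

Splitting by site or device type may be easier to maintain than round-robin; the point is only that each group gets its own cron entry.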

    I haven't attempted to benchmark or optimize any parts for a while.  There was
    a complaint about the start-up time for control_rancid, which seems to me to
    be inconsequential, but I do not know what the users were attempting to do
    with rancid that made this matter.  There are other benefits to this, so I've
    started to re-write it; this is not ready yet.

    9 minutes for 1200 devices seems reasonable to me. :)