Page MenuHomeSoftware Heritage

Improve handling of recurrent loading tasks in scheduler
Open, HighPublic


The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings:

  1. does not take into account "freshness" information provided by a lister. Two consequences
    • lots of lag accrued on origins with updates
    • substantial amount of time wasted on origins with no updates
    • some amount of time wasted on completely dead origins
  2. uses (apparently unreliable) scheduler information as feedback loop
    • lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive
  3. feedback loop is very inflexible
    • the "visit interval" target has never been met, or even calibrated to our bandwidth
    • the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not
    • save code now requests are completely ignored by the recurrent tasks

To handle this functionality better, I propose introducing a separate task scheduler for recurrent origin visits, with a few components:

  • a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler (for tasks that come from listers)
  • a new runner for these origin visit tasks
    • TBD: generate one-shot tasks in the swh scheduler? send tasks directly to celery?
  • a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks

Common goals for a new origin visit scheduler

Some common goals that seem desirable:

  • loading origins at least once, as soon as possible after they appear in a lister
  • minimizing the number of "useless visits" with no updates to integrate
  • "smooth" the size of visits, by visiting active origins more often and doing less work each time (which reduces memory pressure on the workers)
  • make use of forge-provided "last modification times" to reduce the amount of useless work done

Baseline for the recurrence of origin visits

As a common baseline, we can handle the list of origins to be visited as a queue.

The way origins are ordered in this queue can be materialized by using two variables

  • a next visit target, the "time" at which we expect the origin to have new objects, and we should visit it again.
  • a visit interval, which is the duration that we expect to wait between visits of this origin

While conceptually a timestamp, the next visit target value is only really used as a queue index; the visit interval is an offset by which we move the origin within this queue when a visit completes.

As we should keep our infrastructure busy, and attempt origin visits as often as possible (with a minimal cooldown period between visits of a given origin, to avoid DoSing hosters), the next visit target currently being serviced can drift away from the current clock according to the behavior of our infrastructure.

The visit interval is really an index in a list of possible visit intervals. This allows us to set a smooth increase at first, and a stiffer increase later:

  • index 0 and 1, interval 1 day
  • index 2, 3 and 4, interval 2 days
  • index 5 and up, interval 4^(n-4) days (4, 16, 64, 256, 1024)

The breakpoints of this "exponential" can be adjusted to match the reality of our loading infrastructure, by monitoring the skew between the next visit targets and the actual scheduling time of visits.

The next visit target of an origin is updated after each visit completes:

  • if the visit has failed, increase the next visit target by the minimal visit interval (to take into account transient loading issues)
    • if the visit has failed several times in a row, disable the origin until the next run of the lister
  • if the visit is successful, and records some changes, decrease the visit interval index by 2 (visit the origin *way* more often).
  • if the visit is successful, and records no changes, increase the visit interval index by 1 (visit the origin less often).

We set the next visit target to its current value + the new visit interval multiplied by a random fudge factor (picked in the -/+ 10% range).

The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins e.g. when a number of origins from a single hoster are processed at once.

Bootstrapping of the next visit target and visit interval values for new origins

There is an obvious bootstrap issue: new origins do not have a next visit target. (We can, however, bootstrap the visit interval index to a default value, e.g. 4)

The first of our stated goals is visiting origins at least once as soon as possible.

There's multiple ways to achieve this:

  • we can schedule new origins separately
    • pros
      1. no fudging of the next visit target needed
      2. can precisely control when new origins are loaded
    • cons
      1. makes the scheduling logic more complex
      2. needs careful monitoring to make sure the number doesn't grow unbounded
  • we can generate a next visit target for new origins
    • pros
      1. simpler scheduling logic: origins are picked from a single queue
      2. can handle spreading the visits new origins over a longer time in a "oneshot" fashion
    • cons
      1. needs careful consideration to avoid DoSing new hosters by bursting requests to new origins
  • Once the first visit happens, we can set the next visit target to the "current next visit target being scheduled" + the default visit interval * the random fudge factor.

Optimizations for listers providing a date of last modification

When the lister provides a date of last modification for origins, we can do some more subtle management of the scheduling of origin visits. Instead of scheduling according to the next visit target, we can schedule visits to origins in the following three pools:

  1. origins that have never successfully loaded
    • ordered by increasing date of last modification (oldest first)
NOTE: if the lister returns a creation date for the origin, we could instead use a decreasing interval between creation time to last update time as sorting heuristic: this would favor "more active" origins, but could leave origins only updated once behind.
  1. origins where the date of last modification is more recent than the latest (successful) load date
    • ordered by decreasing difference between last load date and date of last modification
NOTE: this favors more recently active origins to the detriment of origins which had some activity right after being loaded, then went silent. There's a good chance that this heuristic will not converge (and, if the infrastructure struggles, a bunch of modifications to origins that happened right after a visit will never be recorded). To reduce the impact of this, we could mix both ends of the heuristic: do some amount of visits to active origins, and do some amount of "easy" visits to origins that haven't been updated very much.
  1. other origins
    • ordered by next visit target
NOTE: this is only a last-resort strategy:
  • listers might not be running all the time, but we still want a chance to update repositories
  • the modification time information provided by listers might not be reliable
  • if everything works perfectly, these last-resort updates should mostly be short no-ops

Actual visit scheduling policy

The next visit target for origins without date of last modification (group A) give them a total ordering.

For origins with an upstream-provided date of last modification (group B), the three scheduling pools don't give us a total ordering of visits.

To handle both of these groups when the execution queue has n slots free, we can :

  • get the ratio of "last modification time-provided" origins r = #B / (#A + #B)
  • schedule (1-r) n origins from group A according to their next visit target
  • schedule n r/3 origins from each of the 3 pools of origin group B.

We need to monitor the number of origins in each of the pools of group B, to make sure that our sorting heuristics are not making our work diverge.

Potential future improvements

  • Prioritize non-fork repositories over forked repositories?
    • not in the first draft; still, record the fork information if possible so we can adjust the scheduling policies afterwards

Feedback loop in the origin_visit journal client

  1. Update last visit date, status, and eventfulness
    • keep time of last successful visit
    • if status is failed, keep same last visit date, increase failure count
      • if failure count too high (3 ?) disable origin until next run of lister
    • else, reset failure count
  2. Update visit interval index and next visit target

Proposed metrics for the new scheduler

  • number of origins for every scheduling policy group and subgroup (probably grouped by lister).
  • number of active origins (last modification \in [previous listing, current listing]), by lister.
  • drift between real time and current next visit target scheduled
  • ...

Related Objects

Work in Progressolasd
Work in Progressolasd
Work in Progressolasd

Event Timeline

olasd triaged this task as High priority.Apr 1 2020, 8:03 PM
olasd created this task.
olasd created this object with visibility "olasd (Nicolas Dandrimont)".
olasd changed the visibility from "olasd (Nicolas Dandrimont)" to "Public (No Login Required)".
olasd updated the task description. (Show Details)

This task describes in detail what kind of scheduling policy we should implement, but it doesn't help much figure out what the next steps should be.

After some more discussion, we think that this can be broken down and implemented through three separate components:

  • a unified listing api and table within the scheduler, which would collate the information from all listers that create recurrent visit tasks:
    • available origins and expected visit type (plus potential extra arguments for the loading tasks)
    • date of last listing for a given origin
    • date of last update if available
    • whether the origin is declared as a fork
  • a cache for quick, bulk access to the information about the latest visits for a given origin / visit type; a journal client which keeps this cache up to date
    • when was the latest visit for a given origin/type pair? What happened?
    • when was the latest _eventful_ visit for a given origin/type pair?
    • when do we expect the next eventful visit to be ?
  • a scheduling component which merges the information from listers and the visit cache, to handle the actual scheduling of tasks.

There's a good chance that some features of the origin visit cache will need to be tailor made for the policy chosen for the scheduling component.

olasd changed the status of subtask T2973: Implement a scheduler simulator from Open to Work in Progress.Jan 18 2021, 2:12 PM

Here's my understanding of the status of the migration to the next generation scheduler as of today:


  • all the listers have been ported to the new API, and released. The legacy SQLAlchemy based core loader has been removed. We missed a prime opportunity for a 1.0 release ;)
  • all the listers based on the new API have been deployed in staging and in production.
    • Listers now only create entries in the listers table of the scheduler.
    • There are some (small) production issues to solve (T3032)

Scheduler journal client:

  • the scheduler journal client is deployed in staging and production.
    • it feeds the origin_visit_stats table from the origin_visit_status data in swh.journal.
  • the production instance of the journal client has a data consistency issue that is under investigation (T3000).
    • From the investigation, the issue seems to be purely within the journal client settings (i.e. the journal itself has all the expected data). Hopefully we can tune this and get it solved in the next few days.

Origin visit scheduling:

  • only fairly basic scheduling policies have been implemented : a generic FIFO policy, as well as two policies based on the visit status cache and the date of last update of the origin provided by the lister.
    • performance tuning of the existing policies and implementation of other scheduling policies, is blocked on:
      • production-sized lister data (running all listers to completion in prod, at least once)
      • accurate origin_visit_status cache data, from the archive
    • we need to implement the "fallback" scheduling when listers aren't able to provide us with a last update value
  • scheduling of origin visits from the new data / APIs has not yet been deployed, either in staging or production.
  • the production infra is churning on the (very large) "backlog" of recurrent tasks that are available in the "legacy" scheduler, generated by the old listers. These lists of tasks aren't updated anymore, but we have plenty of work to do still.
  • once we have tested scheduling policies against production data, we can disable the old recurrent tasks and subsume them with tasks from the new listers.
  • once scheduling of origin visit tasks is we should consider removing recurrent tasks from the old-style scheduler

Scheduler simulator:

  • it's a neat tool/toy, but it hasn't been confronted to production-scale data yet
  • it will be very useful as is, as an integration testing framework to tune the performance of the new components once we have some production-scale data dumps available
  • we will need to improve the set of metrics (T2974) and reporting to:
    • validate the simulator behavior against prod
    • make the simulator a more useful "management" tool allowing us to tune the parameters of the scheduling policies.