Change Details

The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings: 1. does not take into account "freshness" information provided by a lister. Two consequences - lots of lag accrued on origins with updates - substantial amount of time wasted on origins with no updates - some amount of time wasted on completely dead origins 1. uses (apparently unreliable) scheduler information as feedback loop - lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive 1. feedback loop is very inflexible - the "visit interval" target has never been met, or even calibrated to our bandwidth - the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not - save code now requests are completely ignored by the recurrent tasks To handle this functionality better, I propose introducing a separate task scheduler for recurrent origin visits, with a few components: - a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler (for tasks that come from listers) - a new runner for these origin visit tasks - TBD: generate one-shot tasks in the swh scheduler? send tasks directly to celery? - a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks === Common goals for a new origin visit scheduler === Some common goals that seem desirable: - loading origins at least once, as soon as possible after they appear in a lister - minimizing the number of "useless visits" with no updates to integrate - "smooth" the size of visits, by visiting active origins more often and doing less work each time (which reduces memory pressure on the workers) - make use of forge-provided "last modification times" to reduce the amount of useless work done === Baseline for the recurrence of origin visits === As a common baseline, we can handle the list of origins to be visited as a queue. The way origins are ordered in this queue can be materialized by using two variables - a //next visit target//, the "time" at which we expect the origin to have new objects, and we should visit it again. - a //visit interval//, which is the duration that we expect to wait between visits of this origin While conceptually a timestamp, the //next visit target// value is only really used as a queue index; the //visit interval// is an offset by which we move the origin within this queue when a visit completes. As we should keep our infrastructure busy, and attempt origin visits as often as possible (with a minimal cooldown period between visits of a given origin, to avoid DoSing hosters), the //next visit target// currently being serviced can drift away from the current clock according to the behavior of our infrastructure. The visit interval is really an index in a list of possible visit intervals. This allows us to set a smooth increase at first, and a stiffer increase later: - index 0 and 1, interval 1 day - index 2, 3 and 4, interval 2 days - index 5 and up, interval 4^(n-4) days (4, 16, 64, 256, 1024) The breakpoints of this "exponential" can be adjusted to match the reality of our loading infrastructure, by monitoring the skew between the next visit targets and the actual scheduling time of visits. The next visit target of an origin is updated after each visit completes: - if the visit has failed, increase the next visit target by the minimal visit interval (to take into account transient loading issues) - if the visit has failed several times in a row, disable the origin until the next run of the lister - if the visit is successful, and records some changes, decrease the visit interval index by 2 (visit the origin *way* more often). - if the visit is successful, and records no changes, increase the visit interval index by 1 (visit the origin less often). We set the next visit target to its current value + the new visit interval multiplied by a random fudge factor (picked in the -/+ 10% range). The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins e.g. when a number of origins from a single hoster are processed at once. ==== Bootstrapping of the //next visit target// and //visit interval// values for new origins ==== There is an obvious bootstrap issue: new origins do not have a next visit target. (We can, however, bootstrap the visit interval index to a default value, e.g. 4) The first of our stated goals is visiting origins at least once as soon as possible. There's multiple ways to achieve this: - we can schedule new origins separately - pros # no fudging of the next visit target needed # can precisely control when new origins are loaded - cons # makes the scheduling logic more complex # needs careful monitoring to make sure the number doesn't grow unbounded - we can generate a //next visit target// for new origins - pros # simpler scheduling logic: origins are picked from a single queue # can handle spreading the visits new origins over a longer time in a "oneshot" fashion - cons # needs careful consideration to avoid DoSing new hosters by bursting requests to new origins - Once the first visit happens, we can set the next visit target to the "current next visit target being scheduled" + the default visit interval * the random fudge factor. === Optimizations for listers providing a date of last modification === When the lister provides a date of last modification for origins, we can do some more subtle management of the scheduling of origin visits. Instead of scheduling according to the next visit target, we can schedule visits to origins in the following three pools: 1. origins that have never successfully loaded - ordered by increasing date of last modification (oldest first) NOTE: if the lister returns a creation date for the origin, we could instead use a decreasing interval between //creation time// to //last update time// as sorting heuristic: this would favor "more active" origins, but could leave origins only updated once behind. 2. origins where the date of last modification is more recent than the latest (successful) load date - ordered by decreasing difference between last load date and date of last modification NOTE: this favors more recently active origins to the detriment of origins which had some activity right after being loaded, then went silent. There's a good chance that this heuristic will not converge (and, if the infrastructure struggles, a bunch of modifications to origins that happened right after a visit will never be recorded). To reduce the impact of this, we could mix both ends of the heuristic: do some amount of visits to active origins, and do some amount of "easy" visits to origins that haven't been updated very much. 3. other origins - ordered by next visit target NOTE: this is only a last-resort strategy: - listers might not be running all the time, but we still want a chance to update repositories - the modification time information provided by listers might not be reliable - if everything works perfectly, these last-resort updates should mostly be short no-ops === Actual visit scheduling policy === The //next visit target// for origins without date of last modification (group A) give them a total ordering. For origins with an upstream-provided date of last modification (group B), the three scheduling pools don't give us a total ordering of visits. To handle both of these groups when the execution queue has **n** slots free, we can : - get the ratio of "last modification time-provided" origins **r** = **#**B / (**#**A + **#**B) - schedule (1-**r**) **n** origins from group A according to their //next visit target// - schedule **n** **r**/3 origins from each of the 3 pools of origin group B. We need to monitor the number of origins in each of the pools of group B, to make sure that our sorting heuristics are not making our work diverge. === Potential future improvements === - Prioritize non-fork repositories over forked repositories? - not in the first draft; still, record the fork information if possible so we can adjust the scheduling policies afterwards === Feedback loop in the origin_visit journal client === 1. Update last visit date, status, and eventfulness - keep time of last successful visit - if status is failed, keep same last visit date, increase failure count - if failure count too high (3 ?) disable origin until next run of lister - else, reset failure count 1. Update visit interval index and next visit target === Proposed metrics for the new scheduler === - number of origins for every scheduling policy group and subgroup (probably grouped by lister). - number of active origins (last modification \in [previous listing, current listing]), by lister. - drift between real time and current //next visit target// scheduled - ...

The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings: 1. does not take into account "freshness" information provided by a lister. Two consequences - lots of lag accrued on origins with updates - substantial amount of time wasted on origins with no updates - some amount of time wasted on completely dead origins 1. uses (apparently unreliable) scheduler information as feedback loop - lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive 1. feedback loop is very inflexible - the "visit interval" target has never been met, or even calibrated to our bandwidth - the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not - save code now requests are completely ignored by the recurrent tasks To handle this functionality better, I propose introducing a separate task scheduler for recurrent origin visits, with a few components: - a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler (for tasks that come from listers) - a new runner for these origin visit tasks - TBD: generate one-shot tasks in the swh scheduler? send tasks directly to celery? - a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks === Common goals for a new origin visit scheduler === Some common goals that seem desirable: - loading origins at least once, as soon as possible after they appear in a lister - minimizing the number of "useless visits" with no updates to integrate - "smooth" the size of visits, by visiting active origins more often and doing less work each time (which reduces memory pressure on the workers) - make use of forge-provided "last modification times" to reduce the amount of useless work done === Baseline for the recurrence of origin visits === As a common baseline, we can handle the list of origins to be visited as a queue. The way origins are ordered in this queue can be materialized by using two variables - a //next visit target//, the "time" at which we expect the origin to have new objects, and we should visit it again. - a //visit interval//, which is the duration that we expect to wait between visits of this origin While conceptually a timestamp, the //next visit target// value is only really used as a queue index; the //visit interval// is an offset by which we move the origin within this queue when a visit completes. As we should keep our infrastructure busy, and attempt origin visits as often as possible (with a minimal cooldown period between visits of a given origin, to avoid DoSing hosters), the //next visit target// currently being serviced can drift away from the current clock according to the behavior of our infrastructure. The visit interval is really an index in a list of possible visit intervals. This allows us to set a smooth increase at first, and a stiffer increase later: - index 0 and 1, interval 1 day - index 2, 3 and 4, interval 2 days - index 5 and up, interval 4^(n-4) days (4, 16, 64, 256, 1024) The breakpoints of this "exponential" can be adjusted to match the reality of our loading infrastructure, by monitoring the skew between the next visit targets and the actual scheduling time of visits. The next visit target of an origin is updated after each visit completes: - if the visit has failed, increase the next visit target by the minimal visit interval (to take into account transient loading issues) - if the visit has failed several times in a row, disable the origin until the next run of the lister - if the visit is successful, and records some changes, decrease the visit interval index by 2 (visit the origin *way* more often). - if the visit is successful, and records no changes, increase the visit interval index by 1 (visit the origin less often). We set the next visit target to its current value + the new visit interval multiplied by a random fudge factor (picked in the -/+ 10% range). The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins e.g. when a number of origins from a single hoster are processed at once. ==== Bootstrapping of the //next visit target// and //visit interval// values for new origins ==== There is an obvious bootstrap issue: new origins do not have a next visit target. (We can, however, bootstrap the visit interval index to a default value, e.g. 4) The first of our stated goals is visiting origins at least once as soon as possible. There's multiple ways to achieve this: - we can schedule new origins separately - pros # no fudging of the next visit target needed # can precisely control when new origins are loaded - cons # makes the scheduling logic more complex # needs careful monitoring to make sure the number doesn't grow unbounded - we can generate a //next visit target// for new origins - pros # simpler scheduling logic: origins are picked from a single queue # can handle spreading the visits new origins over a longer time in a "oneshot" fashion - cons # needs careful consideration to avoid DoSing new hosters by bursting requests to new origins - Once the first visit happens, we can set the next visit target to the "current next visit target being scheduled" + the default visit interval * the random fudge factor. === Optimizations for listers providing a date of last modification === When the lister provides a date of last modification for origins, we can do some more subtle management of the scheduling of origin visits. Instead of scheduling according to the next visit target, we can schedule visits to origins in the following three pools: 1. origins that have never successfully loaded - ordered by increasing date of last modification (oldest first) NOTE: if the lister returns a creation date for the origin, we could instead use a decreasing interval between //creation time// to //last update time// as sorting heuristic: this would favor "more active" origins, but could leave origins only updated once behind. 2. origins where the date of last modification is more recent than the latest (successful) load date - ordered by decreasing difference between last load date and date of last modification NOTE: this favors more recently active origins to the detriment of origins which had some activity right after being loaded, then went silent. There's a good chance that this heuristic will not converge (and, if the infrastructure struggles, a bunch of modifications to origins that happened right after a visit will never be recorded). To reduce the impact of this, we could mix both ends of the heuristic: do some amount of visits to active origins, and do some amount of "easy" visits to origins that haven't been updated very much. 3. other origins - ordered by next visit target NOTE: this is only a last-resort strategy: - listers might not be running all the time, but we still want a chance to update repositories - the modification time information provided by listers might not be reliable - if everything works perfectly, these last-resort updates should mostly be short no-ops === Actual visit scheduling policy === The //next visit target// for origins without date of last modification (group A) give them a total ordering. For origins with an upstream-provided date of last modification (group B), the three scheduling pools don't give us a total ordering of visits. To handle both of these groups when the execution queue has **n** slots free, we can : - get the ratio of "last modification time-provided" origins **r** = **#**B / (**#**A + **#**B) - schedule (1-**r**) **n** origins from group A according to their //next visit target// - schedule **n** **r**/3 origins from each of the 3 pools of origin group B. We need to monitor the number of origins in each of the pools of group B, to make sure that our sorting heuristics are not making our work diverge. === Potential future improvements === - Prioritize non-fork repositories over forked repositories? - not in the first draft; still, record the fork information if possible so we can adjust the scheduling policies afterwards === Feedback loop in the origin_visit journal client === 1. Update last visit date, status, and eventfulness - keep time of last successful visit - if status is failed, keep same last visit date, increase failure count - if failure count too high (3 ?) disable origin until next run of lister - else, reset failure count 1. Update visit interval index and next visit target === Proposed metrics for the new scheduler === - number of origins for every scheduling policy group and subgroup (probably grouped by lister). - number of active origins (last modification \in [previous listing, current listing]), by lister. - drift between real time and current //next visit target// scheduled - ... ---- Discussion on implementation detail [1] [1] https://hedgedoc.softwareheritage.org/jF0v5LZITGqhpZ-hgiy8zw