Change Details

The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings:

1. does not take into account "freshness" information provided by a lister. Consequences:
  - lots of lag accrued on origins with updates
  - substantial amount of time wasted on origins with no updates
  - some amount of time wasted on completely dead origins
1. uses (apparently unreliable) scheduler information as feedback loop
  - lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive
1. feedback loop is very inflexible
  - the "visit interval" target has never been met, or even calibrated to our bandwidth
  - the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not
  - Save Code Now requests are completely ignored by the recurrent tasks

To handle this functionality better, I propose introducing some new components:

- a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler
- a new runner, which would generate one-shot tasks for origins "ready to be loaded" according to a bespoke policy
- a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks

=== Policy for prioritizing origin loading tasks ===

==== If the lister provides a date of last modification ====

1. schedule origins that have never successfully loaded
  - ordered by increasing date of last modification (oldest first)
  - non forks, then forks? maybe not available at the lister level.
1. schedule origins where the date of last modification is more recent than the latest (successful) load date
  - ordered by decreasing difference between last load date and date of last modification
1. schedule other origins
  - ordered by next run target

==== If the lister does not provide a date of last modification ====

1. schedule origins that have never successfully loaded
  - ordered by increasing date of creation (oldest first)
1. schedule origins that have successfully loaded once
  - ordered by date of last visit; clamped to $minimum_interval (oldest first)
1. schedule origins that have been visited to completion more than once
  - ordered by next run target

=== Feedback loop in the origin_visit listener ===

1. Update last visit date, status, and eventfulness
  - keep time of last successful visit
  - if status is failed, keep same last visit date, increase failure count
    - if failure count too high (3?), disable task until next run of lister
  - else, reset failure count
1. Update next run target
  - if failed: now + 1 day
  - else
    - get duration since last successful visit
    - if last visit eventful, divide by $adjust_factor; clamp to $minimum_interval (1 day?)
    - if last visit uneventful, multiply by $adjust_factor
    - set to now + adjusted interval

=== Proposed fields for the new table ===

| column | type | source | attributes | comments |
|--------|------|--------|------------|----------|
| origin_url | text | lister | not null | |
| loader_task_type | text | lister | not null | |
| extra_task_kwargs | jsonb | lister | defaults to `{}` | |
| enabled | boolean | lister or journal client | | |
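
The feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: `OriginState`, `record_visit`, and all constant values are hypothetical names and defaults chosen for the example. It also assumes that "too high" means reaching 3 failures, and that a first successful visit starts from the minimum interval, neither of which is settled in the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical defaults; the proposal leaves these values open.
ADJUST_FACTOR = 2.0                     # stands in for $adjust_factor
MINIMUM_INTERVAL = timedelta(days=1)    # stands in for $minimum_interval ("1 day?")
MAX_FAILURES = 3                        # the "(3?)" threshold, assumed inclusive
RETRY_DELAY = timedelta(days=1)         # "if failed: now + 1 day"


@dataclass
class OriginState:
    """Per-origin bookkeeping (illustrative field names)."""
    last_successful_visit: Optional[datetime] = None
    failure_count: int = 0
    enabled: bool = True
    next_run: Optional[datetime] = None


def record_visit(state: OriginState, now: datetime,
                 failed: bool, eventful: bool) -> OriginState:
    """Apply one origin_visit status update to the origin's state."""
    if failed:
        # Keep the same last visit date, count the failure.
        state.failure_count += 1
        if state.failure_count >= MAX_FAILURES:
            # Disabled until the lister runs again (re-enabling not shown).
            state.enabled = False
        state.next_run = now + RETRY_DELAY
        return state

    state.failure_count = 0
    if state.last_successful_visit is None:
        # Assumption: no previous success, start from the minimum interval.
        interval = MINIMUM_INTERVAL
    else:
        interval = now - state.last_successful_visit

    if eventful:
        # Active origin: visit sooner, but never below the minimum interval.
        interval = max(interval / ADJUST_FACTOR, MINIMUM_INTERVAL)
    else:
        # Inactive origin: back off.
        interval = interval * ADJUST_FACTOR

    state.last_successful_visit = now
    state.next_run = now + interval
    return state
```

With these defaults, an origin last loaded 8 days ago gets its next run in 4 days after an eventful visit, and in 16 days after an uneventful one.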

The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings:

1. does not take into account "freshness" information provided by a lister. Consequences:
  - lots of lag accrued on origins with updates
  - substantial amount of time wasted on origins with no updates
  - some amount of time wasted on completely dead origins
1. uses (apparently unreliable) scheduler information as feedback loop
  - lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive
1. feedback loop is very inflexible
  - the "visit interval" target has never been met, or even calibrated to our bandwidth
  - the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not
  - Save Code Now requests are completely ignored by the recurrent tasks

To handle this functionality better, I propose introducing some new components:

- a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler
- a new runner for these origin loading tasks
  - TBD: generate one-shot tasks in the other scheduler? send directly to celery?
- a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks

=== Policy for prioritizing origin loading tasks ===

To split the runner's capacity between the two cases (origins with and without a date of last modification), we can use the ratio between the two kinds of origins (origins with a date of last modification will likely be the overwhelming majority).

To begin with, we can make each loop of the runner pick an equal number of origins in each subgroup. If the first subgroup is exhausted, pick more tasks from the next ones.

==== If the lister provides a date of last modification ====

1. schedule origins that have never successfully loaded
  - ordered by increasing date of last modification (oldest first)
  - non forks, then forks? maybe not available at the lister level.
1. schedule origins where the date of last modification is more recent than the latest (successful) load date
  - ordered by decreasing difference between last load date and date of last modification
1. schedule other origins
  - ordered by next run target

==== If the lister does not provide a date of last modification ====

1. schedule origins that have never successfully loaded
  - ordered by increasing date of creation (oldest first)
1. schedule origins that have successfully loaded once
  - ordered by date of last visit; clamped to $minimum_interval (oldest first)
1. schedule origins that have been visited to completion more than once
  - ordered by next run target

=== Feedback loop in the origin_visit listener ===

1. Update last visit date, status, and eventfulness
  - keep time of last successful visit
  - if status is failed, keep same last visit date, increase failure count
    - if failure count too high (3?), disable task until next run of lister
  - else, reset failure count
1. Update next run target
  - if failed: now + 1 day
  - else
    - get duration since last successful visit
    - if last visit eventful, divide by $adjust_factor; clamp to $minimum_interval (1 day?)
    - if last visit uneventful, multiply by $adjust_factor
    - set to now + adjusted interval

=== Proposed metrics ===

- number of origins for every policy group and subgroup (probably grouped by lister).
- number of active origins (last modification within [previous listing, current listing]), by lister
- ...
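
One possible reading of the "equal number of origins in each subgroup, spilling over to the next ones" rule is sketched below. The function name `pick_origins` and the `batch_size` parameter are hypothetical, and each subgroup is assumed to be already sorted according to its ordering in the policy above.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pick_origins(subgroups: Sequence[Sequence[T]], batch_size: int) -> List[T]:
    """Pick up to batch_size origins, giving each subgroup an equal share.

    Quota left unused by an exhausted subgroup is redistributed over the
    subgroups that come after it (never back to earlier ones).
    """
    picked: List[T] = []
    remaining = batch_size
    for index, group in enumerate(subgroups):
        groups_left = len(subgroups) - index
        # Equal share of what is still to be picked, rounded up.
        quota = -(-remaining // groups_left)
        chosen = list(group[:quota])
        picked.extend(chosen)
        remaining -= len(chosen)
    return picked


# First subgroup has a single origin, so its unused quota goes to the others.
batch = pick_origins(
    [["a1"], ["b1", "b2", "b3", "b4"], ["c1", "c2", "c3"]],
    batch_size=6,
)
```

Here `batch` ends up with one origin from the first subgroup, three from the second, and two from the third. Whether slack should also flow back to earlier subgroups is a design choice the proposal leaves open.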
