The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings:
1. does not take into account "freshness" information provided by a lister. Consequences:
- lots of lag accrued on origins with updates
- substantial amount of time wasted on origins with no updates
- some amount of time wasted on completely dead origins
1. uses (apparently unreliable) scheduler information as feedback loop
- lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive
1. feedback loop is very inflexible
- the "visit interval" target has never been met, or even calibrated to our bandwidth
- the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not
- save code now requests are completely ignored by the recurrent tasks
To handle this functionality better, I propose introducing some new components:
- a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler
- a new runner, which would generate one-shot tasks for origins "ready to be loaded" according to a bespoke policy
- a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks
=== Policy for prioritizing origin loading tasks ===
==== If the lister provides a date of last modification ====
1. schedule origins that have never successfully loaded
- ordered by increasing date of last modification (oldest first)
- non forks, then forks? maybe not available at the lister level.
1. schedule origins where the date of last modification is more recent than the latest (successful) load date
- ordered by decreasing difference between last load date and date of last modification
1. schedule other origins
- ordered by next run target
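
A rough Python sketch of the three tiers above, expressed as a sort key. The `OriginState` record and its field names are made up for illustration and are not the proposed schema; the fork/non-fork ordering is left out since it may not be available at the lister level.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class OriginState:
    # Hypothetical record; field names are illustrative only.
    url: str
    last_update: datetime                  # date of last modification, from the lister
    last_successful: Optional[datetime]    # latest successful load, None if never loaded
    next_run_target: Optional[datetime] = None


def priority_key(o: OriginState) -> Tuple[int, float]:
    """Sort key: smaller tuples are scheduled first."""
    if o.last_successful is None:
        # Tier 0: never successfully loaded, oldest modification first.
        return (0, o.last_update.timestamp())
    if o.last_update > o.last_successful:
        # Tier 1: modified since the last successful load, largest lag first.
        lag = (o.last_update - o.last_successful).total_seconds()
        return (1, -lag)
    # Tier 2: everything else, by next run target (missing targets go last).
    target = o.next_run_target.timestamp() if o.next_run_target else float("inf")
    return (2, target)


def schedule_order(origins: List[OriginState]) -> List[str]:
    """Return origin URLs in the order they would be scheduled."""
    return [o.url for o in sorted(origins, key=priority_key)]
```

Encoding the tier as the first tuple element keeps the policy in a single sort; in the scheduler backend the same ordering would more likely be an `ORDER BY` clause on the new table.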
==== If the lister does not provide a date of last modification ====
1. schedule origins that have never successfully loaded
- ordered by increasing date of creation (oldest first)
1. schedule origins that have successfully loaded once
- ordered by date of last visit; clamped to $minimum_interval (oldest first)
1. schedule origins that have been visited to completion more than once
- ordered by next run target
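
A similar sketch for listers without a last modification date. Two assumptions are baked in: "date of creation" is taken to mean the date the origin was first recorded (`first_seen`, a hypothetical field), and "clamped to $minimum_interval" is read as "an origin is not eligible again until at least $minimum_interval has passed since its last visit". Both readings would need confirming.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class OriginState:
    # Hypothetical record; field names are illustrative only.
    url: str
    first_seen: datetime                    # assumed meaning of "date of creation"
    successful_visits: int = 0
    last_visit: Optional[datetime] = None
    next_run_target: Optional[datetime] = None


def schedule_order(
    origins: List[OriginState], now: datetime, minimum_interval: timedelta
) -> List[OriginState]:
    never, once, many = [], [], []
    for o in origins:
        if o.successful_visits == 0:
            never.append(o)
        elif o.successful_visits == 1:
            # Assumed reading of the clamp: skip origins visited too recently.
            if o.last_visit is not None and now - o.last_visit >= minimum_interval:
                once.append(o)
        else:
            many.append(o)

    never.sort(key=lambda o: o.first_seen)    # oldest creation first
    once.sort(key=lambda o: o.last_visit)     # oldest visit first
    # Origins without a next run target sort after those that have one.
    many.sort(key=lambda o: (o.next_run_target is None, o.next_run_target or now))
    return never + once + many
```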
=== Feedback loop in the origin_visit listener ===
1. Update last visit date, status, and eventfulness
- keep time of last successful visit
- if status is failed, keep same last visit date, increase failure count
  - if failure count too high (3?), disable task until next run of lister
- else, reset failure count
1. Update next run target
- if failed: now + 1 day
- else
  - get duration since last successful visit
  - if last visit eventful, divide by $adjust_factor; clamp to $minimum_interval (1 day?)
  - if last visit uneventful, multiply by $adjust_factor
  - set to now + adjusted interval
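
The update rules above could look roughly like this in the journal client. This is a sketch under assumptions: the `TaskState` record is hypothetical, `$adjust_factor` is set to 2, the failure threshold is read as "3 or more", any status other than `"failed"` counts as a successful visit, and an origin with no previous successful visit starts from `$minimum_interval`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ADJUST_FACTOR = 2.0                      # $adjust_factor (assumed value)
MINIMUM_INTERVAL = timedelta(days=1)     # $minimum_interval (1 day?)
MAX_FAILURES = 3                         # "too high (3?)", read as >= 3
RETRY_DELAY = timedelta(days=1)          # "if failed: now + 1 day"


@dataclass
class TaskState:
    # Hypothetical per-origin state; not the proposed table schema.
    enabled: bool = True
    last_visit: Optional[datetime] = None
    last_successful: Optional[datetime] = None
    last_eventful: Optional[bool] = None
    failure_count: int = 0
    next_run_target: Optional[datetime] = None


def record_visit(state: TaskState, now: datetime, status: str, eventful: bool) -> TaskState:
    if status == "failed":
        # Keep the same last visit date, count the failure, retry in a day.
        state.failure_count += 1
        if state.failure_count >= MAX_FAILURES:
            state.enabled = False        # until the lister re-enables the task
        state.next_run_target = now + RETRY_DELAY
        return state

    # Duration since the previous successful visit (assumed fallback if none).
    previous = state.last_successful
    interval = (now - previous) if previous is not None else MINIMUM_INTERVAL

    state.failure_count = 0
    state.last_visit = now
    state.last_successful = now
    state.last_eventful = eventful

    if eventful:
        interval = max(interval / ADJUST_FACTOR, MINIMUM_INTERVAL)
    else:
        interval = interval * ADJUST_FACTOR
    state.next_run_target = now + interval
    return state
```

For example, with a previous successful visit 8 days ago, an eventful visit sets the next run target 4 days out and an uneventful one 16 days out.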
=== Proposed fields for the new table ===
| column | type | source | attributes | comments |
|--------|------|--------|------------|----------|
| origin_url | text | lister | not null | |
| loader_task_type | text | lister | not null | |
| extra_task_kwargs | jsonb | lister | defaults to `{}` | |
| enabled | boolean | lister or journal client | | |