Improve handling of recurrent loading tasks in scheduler
Closed, Migrated

Description

The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings:

does not take into account "freshness" information provided by a lister. Consequences:
- lots of lag accrued on origins with updates
- substantial amount of time wasted on origins with no updates
- some amount of time wasted on completely dead origins
uses (apparently unreliable) scheduler information as feedback loop
- lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive
feedback loop is very inflexible
- the "visit interval" target has never been met, or even calibrated to our bandwidth
- the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not
- save code now requests are completely ignored by the recurrent tasks

To handle this functionality better, I propose introducing a separate task scheduler for recurrent origin visits, with a few components:

a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler (for tasks that come from listers)
a new runner for these origin visit tasks
- TBD: generate one-shot tasks in the swh scheduler? send tasks directly to celery?
a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks

Common goals for a new origin visit scheduler

Some common goals that seem desirable:

loading origins at least once, as soon as possible after they appear in a lister
minimizing the number of "useless visits" with no updates to integrate
"smooth" the size of visits, by visiting active origins more often and doing less work each time (which reduces memory pressure on the workers)
make use of forge-provided "last modification times" to reduce the amount of useless work done

Baseline for the recurrence of origin visits

As a common baseline, we can handle the list of origins to be visited as a queue.

The way origins are ordered in this queue can be materialized by using two variables

a next visit target, the "time" at which we expect the origin to have new objects, and we should visit it again.
a visit interval, which is the duration that we expect to wait between visits of this origin

While conceptually a timestamp, the next visit target value is only really used as a queue index; the visit interval is an offset by which we move the origin within this queue when a visit completes.

As we should keep our infrastructure busy, and attempt origin visits as often as possible (with a minimal cooldown period between visits of a given origin, to avoid DoSing hosters), the next visit target currently being serviced can drift away from the current clock according to the behavior of our infrastructure.

The visit interval is really an index in a list of possible visit intervals. This allows us to set a smooth increase at first, and a stiffer increase later:

index 0 and 1, interval 1 day
index 2, 3 and 4, interval 2 days
index 5 and up, interval 4^(n-4) days (4, 16, 64, 256, 1024)

The breakpoints of this "exponential" can be adjusted to match the reality of our loading infrastructure, by monitoring the skew between the next visit targets and the actual scheduling time of visits.
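The index-to-interval mapping listed above can be sketched as a small function. This is only an illustration: the name `visit_interval` is hypothetical, and the breakpoints are the initial values proposed above, which are expected to be tuned.

```python
from datetime import timedelta


def visit_interval(index: int) -> timedelta:
    """Map a visit interval index to a duration (hypothetical helper).

    Breakpoints follow the list above: indexes 0-1 give 1 day, indexes 2-4
    give 2 days, and indexes 5 and up give 4^(index - 4) days.
    """
    if index < 0:
        raise ValueError("visit interval index must be non-negative")
    if index <= 1:
        days = 1
    elif index <= 4:
        days = 2
    else:
        days = 4 ** (index - 4)
    return timedelta(days=days)
```

Adjusting the breakpoints then only means changing the two thresholds and the base of the exponent.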

The next visit target of an origin is updated after each visit completes:

if the visit has failed, increase the next visit target by the minimal visit interval (to take into account transient loading issues)
- if the visit has failed several times in a row, disable the origin until the next run of the lister
if the visit is successful, and records some changes, decrease the visit interval index by 2 (visit the origin *way* more often).
if the visit is successful, and records no changes, increase the visit interval index by 1 (visit the origin less often).

We set the next visit target to its current value + the new visit interval multiplied by a random fudge factor (picked in the -/+ 10% range).

The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins e.g. when a number of origins from a single hoster are processed at once.
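A minimal sketch of the update rules above, under a few assumptions that are not part of the proposal: the names `QueuePosition` and `update_position` are hypothetical, the index is clamped to the bounds of the interval table, the table stops at 1024 days, and the fudge factor is only applied after successful visits. Disabling origins after repeated failures is left out.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta

# Interval in days for each visit interval index (values from the list above;
# stopping the table at 1024 days is an assumption of this sketch).
INTERVAL_DAYS = [1, 1, 2, 2, 2, 4, 16, 64, 256, 1024]


@dataclass
class QueuePosition:
    next_visit_target: datetime
    interval_index: int


def update_position(pos, failed, eventful, rng=random):
    """Return the new queue position of an origin after a visit completes."""
    if failed:
        # Transient failure: retry after the minimal interval, index unchanged.
        retry = timedelta(days=INTERVAL_DAYS[0])
        return QueuePosition(pos.next_visit_target + retry, pos.interval_index)
    # Changes recorded: visit way more often (-2); no changes: less often (+1).
    index = pos.interval_index - 2 if eventful else pos.interval_index + 1
    # Assumption: clamp the index to the bounds of the table.
    index = max(0, min(index, len(INTERVAL_DAYS) - 1))
    # Random fudge factor in the -/+ 10% range, to spread visits out.
    fudge = rng.uniform(0.9, 1.1)
    offset = timedelta(days=INTERVAL_DAYS[index]) * fudge
    return QueuePosition(pos.next_visit_target + offset, index)
```

Passing a seeded `random.Random` instance as `rng` makes the result reproducible, which helps when replaying a schedule in a simulator.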

Bootstrapping of the next visit target and visit interval values for new origins

There is an obvious bootstrap issue: new origins do not have a next visit target. (We can, however, bootstrap the visit interval index to a default value, e.g. 4)

The first of our stated goals is visiting origins at least once as soon as possible.

There are multiple ways to achieve this:

we can schedule new origins separately
- pros
  1. no fudging of the next visit target needed
  2. can precisely control when new origins are loaded
- cons
  1. makes the scheduling logic more complex
  2. needs careful monitoring to make sure the number doesn't grow unbounded
we can generate a next visit target for new origins
- pros
  1. simpler scheduling logic: origins are picked from a single queue
  2. can handle spreading the visits of new origins over a longer time in a "oneshot" fashion
- cons
  1. needs careful consideration to avoid DoSing new hosters by bursting requests to new origins
Once the first visit happens, we can set the next visit target to the "current next visit target being scheduled" + the default visit interval * the random fudge factor.

Optimizations for listers providing a date of last modification

When the lister provides a date of last modification for origins, we can do some more subtle management of the scheduling of origin visits. Instead of scheduling according to the next visit target, we can schedule visits to origins in the following three pools:

origins that have never successfully loaded
- ordered by increasing date of last modification (oldest first)

NOTE: if the lister returns a creation date for the origin, we could instead use a decreasing interval between creation time to last update time as sorting heuristic: this would favor "more active" origins, but could leave origins only updated once behind.

origins where the date of last modification is more recent than the latest (successful) load date
- ordered by decreasing difference between last load date and date of last modification

NOTE: this favors more recently active origins to the detriment of origins which had some activity right after being loaded, then went silent. There's a good chance that this heuristic will not converge (and, if the infrastructure struggles, a bunch of modifications to origins that happened right after a visit will never be recorded). To reduce the impact of this, we could mix both ends of the heuristic: do some amount of visits to active origins, and do some amount of "easy" visits to origins that haven't been updated very much.

other origins
- ordered by next visit target

NOTE: this is only a last-resort strategy:

listers might not be running all the time, but we still want a chance to update repositories
the modification time information provided by listers might not be reliable
if everything works perfectly, these last-resort updates should mostly be short no-ops
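The three pools and their orderings can be sketched as follows. The `Origin` fields and function names are hypothetical, and treating origins without a lister-provided date as part of the "other" pool is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Origin:
    url: str
    last_update: Optional[datetime]           # date of last modification, from the lister
    last_successful_load: Optional[datetime]
    next_visit_target: Optional[datetime]


def pool_of(origin: Origin) -> str:
    """Classify an origin into one of the three scheduling pools."""
    if origin.last_update is not None:
        if origin.last_successful_load is None:
            return "never_loaded"
        if origin.last_update > origin.last_successful_load:
            return "updated_since_load"
    # Assumption: origins without a last modification date also land here.
    return "other"


def order_pools(origins):
    """Split origins into pools, each sorted by its own heuristic."""
    pools = {"never_loaded": [], "updated_since_load": [], "other": []}
    for origin in origins:
        pools[pool_of(origin)].append(origin)
    # Oldest modification first.
    pools["never_loaded"].sort(key=lambda o: o.last_update)
    # Largest gap between last modification and last load first.
    pools["updated_since_load"].sort(
        key=lambda o: o.last_update - o.last_successful_load, reverse=True
    )
    # By next visit target; origins without one go last.
    pools["other"].sort(
        key=lambda o: (o.next_visit_target is None, o.next_visit_target or datetime.min)
    )
    return pools
```

Counting the length of each pool over time gives a direct way to check whether the sorting heuristics let any pool grow without bound.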

Actual visit scheduling policy

The next visit target for origins without date of last modification (group A) gives them a total ordering.

For origins with an upstream-provided date of last modification (group B), the three scheduling pools don't give us a total ordering of visits.

To handle both of these groups when the execution queue has n slots free, we can:

get the ratio of "last modification time-provided" origins r = #B / (#A + #B)
schedule (1-r) n origins from group A according to their next visit target
schedule n r/3 origins from each of the 3 pools of origin group B.
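As a rough sketch, the split of n free slots could be computed like this. The function name is hypothetical, and the rounding rule (integer division for the pools of group B, leftover slots given to group A) is an assumption, since the proposal does not say how fractional slots are handled.

```python
def allocate_slots(n: int, count_a: int, count_b: int) -> dict:
    """Split n free slots between group A and the three pools of group B.

    count_a and count_b are the numbers of origins in groups A and B.
    """
    total = count_a + count_b
    if n <= 0 or total == 0:
        return {"group_a": 0, "group_b_per_pool": 0}
    # n * r / 3 with r = count_b / total, kept in integer arithmetic.
    per_pool = (n * count_b) // (3 * total)
    # Assumption: slots lost to rounding go to group A.
    return {"group_a": n - 3 * per_pool, "group_b_per_pool": per_pool}
```

For example, with 100 free slots and three quarters of the origins in group B, each of the three pools gets 25 slots and group A gets the remaining 25. A real implementation would also need to redistribute slots when a group or pool has fewer candidates than its share.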

We need to monitor the number of origins in each of the pools of group B, to make sure that our sorting heuristics are not making our work diverge.

Potential future improvements

Prioritize non-fork repositories over forked repositories?
- not in the first draft; still, record the fork information if possible so we can adjust the scheduling policies afterwards

Feedback loop in the origin_visit journal client

Update last visit date, status, and eventfulness
- keep time of last successful visit
- if status is failed, keep same last visit date, increase failure count
  - if failure count too high (3 ?) disable origin until next run of lister
- else, reset failure count
Update visit interval index and next visit target
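The bookkeeping part of this feedback loop might look like the sketch below. All names are hypothetical, every status other than "failed" is treated as a success for simplicity, and eventfulness is assumed to mean "the snapshot differs from the previous one"; the threshold of 3 is the tentative value given above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

MAX_FAILURES = 3  # tentative threshold from the list above


@dataclass
class VisitStats:
    # Failed visits leave last_visit untouched, so it is also the time of
    # the last successful visit.
    last_visit: Optional[datetime] = None
    last_visit_status: Optional[str] = None
    last_snapshot: Optional[str] = None
    failure_count: int = 0
    enabled: bool = True


def record_visit(stats: VisitStats, status: str, date: datetime,
                 snapshot: Optional[str] = None) -> bool:
    """Update the stats of an origin from one visit status message.

    Returns True when the visit was eventful.
    """
    stats.last_visit_status = status
    if status == "failed":
        stats.failure_count += 1
        if stats.failure_count >= MAX_FAILURES:
            # Disabled until the next run of the lister re-enables it.
            stats.enabled = False
        return False
    stats.failure_count = 0
    stats.last_visit = date
    # Assumption: a visit is eventful when it produced a new snapshot.
    eventful = snapshot != stats.last_snapshot
    stats.last_snapshot = snapshot
    return eventful
```

The returned eventfulness flag is what the update of the visit interval index and next visit target would then consume.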

Proposed metrics for the new scheduler

number of origins for every scheduling policy group and subgroup (probably grouped by lister).
number of active origins (last modification \in [previous listing, current listing]), by lister.
drift between real time and current next visit target scheduled
...

Discussion on implementation detail [1]

[1] https://hedgedoc.softwareheritage.org/jF0v5LZITGqhpZ-hgiy8zw

Revisions and Commits

rDSCH Scheduling utilities
	D5980	rDSCH8281e351d6a1 journal_client: Disable origins when too many visited attempts failed
	D5978	rDSCH1bcf84d5e66d Add a successive_visits counter to origin visit stats
	D5956	rDSCHd58776ab0b41 Introduce new scheduling policy to grab origins without last update
	D5956	rDSCH825e8cfe7d24 grab_next_visits: make the handling of CTEs more modular
	D5950	rDSCH8c4ae9f14d6a journal_client: Compute next position for origin visit
	D5919	rDSCHcb1edf1ab24d Introduce storage for the recurrent visit scheduler queue position
	D5919	rDSCHc486b28ece7c journal_client: Explicit docstring
	D5919	rDSCHec6e69f6415a Start handling of recurrent loading tasks in scheduler
	D5914	rDSCH1006f0aee494 journal_client: Auto-generate the empty object from model fields
	D5914	rDSCH6400cc2b95cb backend: Auto-generate origin visit stats upsert query

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2345 Improve handling of recurrent loading tasks in scheduler
Migrated	gitlab-migration	T2444 Implement the scheduling policy for the recurrent visit scheduler
Migrated	gitlab-migration	T2442 Provide a unified API for listers to interact with the scheduler
Migrated	gitlab-migration	T2955 Port Bitbucket lister to the new Lister API
Migrated	gitlab-migration	T2956 Port PyPI lister to the new Lister API
Migrated	gitlab-migration	T2972 Port npm lister to the new Lister API
Migrated	gitlab-migration	T2979 Port debian lister to the new Lister API
Migrated	gitlab-migration	T2989 Port CRAN lister to the new Lister API
Migrated	gitlab-migration	T2990 Port GNU lister to the new Lister API
Migrated	gitlab-migration	T2991 Port packagist lister to the new Lister API
Migrated	gitlab-migration	T2992 Port launchpad lister to the new Lister API
Migrated	gitlab-migration	T2987 Port gitlab lister to the new `swh.lister.pattern.Lister` API
Migrated	gitlab-migration	T2984 Port cgit lister to the new Lister API
Migrated	gitlab-migration	T3073 Properly document the new unified API lister
Migrated	gitlab-migration	T2443 Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler
Migrated	gitlab-migration	T2963 Add visit_type field to OriginVisitStatus model object
Migrated	gitlab-migration	T2964 Adapt origin_visit_status_(get\|add) api to deal with the visit_type
Migrated	gitlab-migration	T2965 Adapt storage to actually write the visit_type in the origin_visit_status topic
Migrated	gitlab-migration	T2966 Backfill origin_visit_status with the `visit_type` field properly given
Migrated	gitlab-migration	T2967 Write journal client subscribed to origin_visit_status topics
Migrated	gitlab-migration	T2978 Deploy visit-stats journal client on staging
Migrated	gitlab-migration	T2993 Deploy visit-stats journal client on production
Migrated	gitlab-migration	T2973 Implement a scheduler simulator
Migrated	gitlab-migration	T2974 Define (and implement) scheduler performance metrics
Migrated	gitlab-migration	T3399 Improve PyPI lister to pull last update information when running incrementally
Migrated	gitlab-migration	T3456 staging: Deploy scheduler v0.17
Migrated	gitlab-migration	T3471 production: Deploy swh.scheduler v0.17
Migrated	gitlab-migration	T3538 Send scheduler metrics to prometheus
Migrated	gitlab-migration	T3667 Orchestrate origins scheduling according to scheduler metrics feedback
Migrated	gitlab-migration	T3674 Determine how to reference manually listed/ingested origins in the scheduler metrics

Event Timeline

olasd triaged this task as High priority. Apr 1 2020, 8:03 PM

olasd created this task.

olasd created this object with visibility "olasd (Nicolas Dandrimont)".

olasd updated the task description. Apr 1 2020, 11:31 PM

olasd updated the task description. Apr 2 2020, 12:11 AM

olasd changed the visibility from "olasd (Nicolas Dandrimont)" to "Public (No Login Required)".

olasd added projects: Scheduling utilities, Archive coverage. Apr 2 2020, 9:39 AM

olasd updated the task description. Apr 2 2020, 10:02 AM

olasd mentioned this in T2346: Decide on the semantics of origin-visit status(es). Apr 2 2020, 11:00 AM

olasd updated the task description. Apr 2 2020, 7:23 PM

olasd updated the task description.

olasd updated the task description. Apr 3 2020, 9:34 AM

olasd updated the task description. Apr 6 2020, 7:06 PM

This task describes in detail what kind of scheduling policy we should implement, but it doesn't help much in figuring out what the next steps should be.

After some more discussion, we think that this can be broken down and implemented through three separate components:

a unified listing api and table within the scheduler, which would collate the information from all listers that create recurrent visit tasks:
- available origins and expected visit type (plus potential extra arguments for the loading tasks)
- date of last listing for a given origin
- date of last update if available
- whether the origin is declared as a fork
a cache for quick, bulk access to the information about the latest visits for a given origin / visit type; a journal client which keeps this cache up to date
- when was the latest visit for a given origin/type pair? What happened?
- when was the latest _eventful_ visit for a given origin/type pair?
- when do we expect the next eventful visit to be?
a scheduling component which merges the information from listers and the visit cache, to handle the actual scheduling of tasks.

There's a good chance that some features of the origin visit cache will need to be tailor-made for the policy chosen for the scheduling component.
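To make the split between the three components more concrete, here is a sketch of the two records and of how the scheduling component could merge them. Every name and field below is hypothetical and only mirrors the bullet points above; it is not the actual schema of the scheduler.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ListedOriginRecord:
    """One row of the unified listing table, as fed by the listers."""
    url: str
    visit_type: str
    extra_loader_arguments: Dict[str, Any] = field(default_factory=dict)
    last_listed: Optional[datetime] = None   # date of last listing
    last_update: Optional[datetime] = None   # date of last update, if available
    is_fork: Optional[bool] = None           # None when the lister cannot tell


@dataclass
class VisitCacheEntry:
    """Latest visit information, kept up to date by a journal client."""
    url: str
    visit_type: str
    last_visit: Optional[datetime] = None
    last_visit_status: Optional[str] = None
    last_eventful: Optional[datetime] = None
    next_expected_eventful: Optional[datetime] = None


def merge(
    listed: List[ListedOriginRecord], cache: List[VisitCacheEntry]
) -> List[Tuple[ListedOriginRecord, Optional[VisitCacheEntry]]]:
    """Pair each listed origin with its cache entry, if any.

    Origins are matched on the (url, visit_type) pair; origins that were
    never visited are paired with None.
    """
    by_key = {(e.url, e.visit_type): e for e in cache}
    return [(r, by_key.get((r.url, r.visit_type))) for r in listed]
```

Keeping never-visited origins in the merged result (paired with `None`) matters, since those are precisely the origins that the scheduling policies want to load first.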

olasd mentioned this in T723: General improvements to the scheduler. Sep 22 2020, 6:20 PM

ardumont added a project: Sprint 2021 01. Jan 11 2021, 12:36 PM

ardumont changed the status of subtask T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler from Open to Work in Progress. Jan 11 2021, 2:03 PM

vlorentz changed the status of subtask T2444: Implement the scheduling policy for the recurrent visit scheduler from Open to Work in Progress. Jan 18 2021, 2:08 PM

olasd changed the status of subtask T2973: Implement a scheduler simulator from Open to Work in Progress. Jan 18 2021, 2:12 PM

olasd created subtask T2974: Define (and implement) scheduler performance metrics. Jan 18 2021, 2:17 PM

ardumont closed subtask T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler as Resolved. Jan 25 2021, 8:42 AM

anlambert closed subtask T2442: Provide a unified API for listers to interact with the scheduler as Resolved. Feb 2 2021, 4:08 PM

Here's my understanding of the status of the migration to the next generation scheduler as of today:

Listers:

all the listers have been ported to the new API, and released. The legacy SQLAlchemy based core loader has been removed. We missed a prime opportunity for a 1.0 release ;)
all the listers based on the new API have been deployed in staging and in production.
- Listers now only create entries in the listers table of the scheduler.
- There are some (small) production issues to solve (T3032)

Scheduler journal client:

the scheduler journal client is deployed in staging and production.
- it feeds the origin_visit_stats table from the origin_visit_status data in swh.journal.
the production instance of the journal client has a data consistency issue that is under investigation (T3000).
- From the investigation, the issue seems to be purely within the journal client settings (i.e. the journal itself has all the expected data). Hopefully we can tune this and get it solved in the next few days.

Origin visit scheduling:

only fairly basic scheduling policies have been implemented: a generic FIFO policy, as well as two policies based on the visit status cache and the date of last update of the origin provided by the lister.
- performance tuning of the existing policies, and implementation of other scheduling policies, is blocked on:
  - production-sized lister data (running all listers to completion in prod, at least once)
  - accurate origin_visit_status cache data, from the archive
- we need to implement the "fallback" scheduling when listers aren't able to provide us with a last update value
scheduling of origin visits from the new data / APIs has not yet been deployed, either in staging or production.
the production infra is churning on the (very large) "backlog" of recurrent tasks that are available in the "legacy" scheduler, generated by the old listers. These lists of tasks aren't updated anymore, but we have plenty of work to do still.
once we have tested scheduling policies against production data, we can disable the old recurrent tasks and subsume them with tasks from the new listers.
once scheduling of origin visit tasks is deployed, we should consider removing recurrent tasks from the old-style scheduler

Scheduler simulator:

it's a neat tool/toy, but it hasn't been confronted with production-scale data yet
it will be very useful as is, as an integration testing framework to tune the performance of the new components once we have some production-scale data dumps available
we will need to improve the set of metrics (T2974) and reporting to:
- validate the simulator behavior against prod
- make the simulator a more useful "management" tool allowing us to tune the parameters of the scheduling policies.

ardumont mentioned this in T376: ingest git.eclipse.org repositories. Feb 10 2021, 9:20 AM

amottier added a subscriber: amottier. Mar 24 2021, 11:51 AM

jayeshv added a subscriber: jayeshv. May 17 2021, 6:15 PM

ardumont mentioned this in T1524: save code now: also add new origins for unknown repos. Jun 16 2021, 3:26 PM

Summary of the data available in the listed_origins table, broken down by lister and "known state" of origins:

14:12 guest@softwareheritage-scheduler => select id, name, instance_name, visit_type, count(*) as total, count(*) filter (where last_scheduled is NULL) as not_scheduled, count(*) filter (where last_snapshot is null) as no_snapshot, count(*) filter(where last_update is null) as no_last_update from listed_origins left join origin_visit_stats using (url, visit_type) inner join listers on listed_origins.lister_id = listers.id group by id, visit_type;
                  id                  │     name      │        instance_name         │ visit_type │   total   │ not_scheduled │ no_snapshot │ no_last_update 
──────────────────────────────────────┼───────────────┼──────────────────────────────┼────────────┼───────────┼───────────────┼─────────────┼────────────────
 0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4 │ CRAN          │ cran                         │ tar        │     18152 │         18152 │       18152 │             11
 0d4fa765-e989-40e1-b91f-8fe217729946 │ phabricator   │ blender                      │ git        │        41 │            41 │          41 │             41
 0d4fa765-e989-40e1-b91f-8fe217729946 │ phabricator   │ blender                      │ svn        │         6 │             6 │           1 │              6
 194b1af4-ba03-438a-961d-7b0be7cdc7f3 │ cgit          │ gnu-savannah                 │ git        │      1029 │          1028 │          39 │              0
 1ac30f61-69a8-44b8-ae44-7e21514f0c2e │ cgit          │ fedora                       │ git        │       866 │           851 │          54 │            110
 25d2aec0-91ce-446c-be64-d1ff9a79dc4e │ gitea         │ git.fsfe.org                 │ git        │       401 │           400 │          51 │              0
 29c69bc1-e815-4f5a-b009-c6854697fec7 │ pypi          │ pypi                         │ pypi       │    313887 │          4523 │       11844 │         313887
 31df8830-df11-4120-8e76-323981a9c88d │ phabricator   │ swh                          │ git        │       189 │           189 │          29 │            189
 3310a3f7-5f6e-4367-b93f-659033b1e735 │ cgit          │ baserock                     │ git        │      1524 │          1500 │          70 │              1
 41f76395-d290-4610-9584-3b4a8be548c5 │ cgit          │ yoctoproject                 │ git        │       175 │            72 │          10 │              0
 43b9a56a-b6ff-4e43-b9ca-53e231715713 │ cgit          │ zx2c4                        │ git        │       159 │           159 │          14 │              0
 4d7a9674-e2c4-40ce-94b3-677e3a4e8995 │ debian        │ Debian                       │ deb        │     35100 │         35100 │          85 │          35100
 59354ffc-0a34-4140-8503-5f398a763097 │ cgit          │ git-kernel                   │ git        │      1091 │           824 │         615 │              0
 5bb5ddb7-2a78-4051-9d84-ec07a3834031 │ debian        │ Debian-Security              │ deb        │       779 │           779 │         259 │            779
 6632ef5e-322b-402b-8f28-d090f76ed6b7 │ github        │ github                       │ git        │ 180292687 │     174024061 │    62966568 │         118140
 7338c20c-ffda-4a75-88fe-099f619a0fe2 │ GNU           │ GNU                          │ tar        │       386 │           386 │          32 │              0
 7378f526-bb27-46c9-9940-25c240509dc6 │ cgit          │ git.gnu.org.ua               │ git        │       145 │           124 │         145 │              7
 75627aa9-58c6-4ca6-ae11-eee5026bcefc │ gitlab        │ framagit                     │ git        │     20283 │         19735 │        5169 │              0
 7a775770-2b2f-4139-aacb-ad715c022b9d │ cgit          │ eclipse                      │ git        │      1375 │          1312 │        1314 │              3
 7b7ef365-b065-4e46-be98-19d3ca6a1633 │ gitlab        │ common-lisp                  │ git        │       825 │           801 │          65 │              0
 7fb5da29-6b90-4ce6-af17-5b1e2a56f794 │ npm           │ npm                          │ npm        │   1629224 │       1448537 │        3335 │            209
 860d41f8-d0c0-4733-a4d8-437c386bc31f │ save-code-now │ archive.softwareheritage.org │ git        │       694 │           633 │           0 │            694
 860d41f8-d0c0-4733-a4d8-437c386bc31f │ save-code-now │ archive.softwareheritage.org │ hg         │         8 │             8 │           0 │              8
 860d41f8-d0c0-4733-a4d8-437c386bc31f │ save-code-now │ archive.softwareheritage.org │ svn        │         1 │             1 │           0 │              1
 9de8141b-e441-4ffd-b40d-d438b29c03fc │ launchpad     │ launchpad                    │ git        │     23724 │         22642 │        4667 │              0
 a60261fe-1125-4a46-bd0b-c914709ca10e │ bitbucket     │ bitbucket                    │ git        │   2779174 │       2742821 │     1096182 │              0
 a96dea47-13c0-4f11-bf96-70b576b604a0 │ gitlab        │ gite.lirmm                   │ git        │       638 │           606 │         255 │              0
 b35c74ea-b1b4-4dfc-858e-80809f6b5790 │ cgit          │ qt.io                        │ git        │       278 │           266 │          13 │             50
 b50360dc-2f66-4e40-a789-b22d9375e875 │ gitea         │ codeberg.org                 │ git        │      8233 │          8233 │        4930 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ bzr        │       290 │           290 │         290 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ cvs        │     28622 │         28622 │       28622 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ git        │    180740 │         58760 │       68531 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ hg         │     27550 │         27550 │       27541 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ svn        │    101722 │         38228 │       40853 │              0
 b9b8e226-3452-4812-a6df-cab546b9ee11 │ gitlab        │ inria                        │ git        │      3286 │          3106 │        1387 │              0
 baf89663-feae-4850-a8ec-3a21e699cc0b │ gitlab        │ gitlab                       │ git        │    200200 │        177160 │       20696 │              0
 ca2fcc66-8844-480c-b0ba-a91f890b0554 │ gitlab        │ gnome                        │ git        │     13176 │         12814 │        4772 │              0
 ceecf814-da90-43a4-8de0-dd853072145f │ gitlab        │ lip6                         │ git        │        69 │            58 │          65 │              0
 cef20b61-30d4-4526-baa1-5db8a78f2b57 │ cgit          │ openembedded                 │ git        │        16 │             9 │          16 │              1
 d229771c-1610-4e2c-a67b-a8d5b6f1b43c │ cgit          │ tor                          │ git        │       519 │           514 │         519 │             26
 df07990a-5e00-4ea7-af6e-2c03bcee028a │ cgit          │ alpinelinux                  │ git        │         6 │             5 │           0 │              0
 f4b55119-837d-461b-bd3a-0e07d324aabf │ gitlab        │ riseup                       │ git        │      1255 │          1221 │         428 │              0
 f4ea15d9-97b5-4bdc-a5d6-2062fe0acfef │ cgit          │ git.joeyh.name               │ git        │        62 │            51 │          62 │              0
 f788a57f-6c9b-4c5e-8200-3359d5bf4405 │ gitlab        │ ow2                          │ git        │      1297 │          1158 │         284 │              0
 ff34a2b5-2e81-4566-9627-61fab06f8f52 │ phabricator   │ kde                          │ git        │      1036 │          1036 │        1025 │           1036
 fffaba23-b6ad-4c02-a6e7-dcff8170b6f0 │ gitlab        │ freedesktop                  │ git        │      8008 │          7828 │        3448 │              0
(46 rows)

Time: 611092.651 ms (10:11.093)

ardumont updated the task description. Jun 21 2021, 5:50 PM

ardumont added a revision: D5914: backend: Auto-generate origin visit stats upsert query. Jun 23 2021, 3:32 PM

ardumont added a revision: D5919: Start handling of recurrent loading tasks in scheduler. Jun 23 2021, 6:11 PM

ardumont added a commit: rDSCH6400cc2b95cb: backend: Auto-generate origin visit stats upsert query. Jun 25 2021, 1:51 PM

ardumont added a commit: rDSCH1006f0aee494: journal_client: Auto-generate the empty object from model fields.

ardumont added a revision: D5950: journal_client: Compute next position for origin visit. Jul 1 2021, 10:14 AM

ardumont added a revision: D5956: Introduce new scheduling policy to grab origins without last update. Jul 1 2021, 12:34 PM

ardumont mentioned this in D4895: Add a successive_visits counter to OriginVisitStats. Jul 7 2021, 4:54 PM

ardumont added a revision: D5978: Add a successive_visits counter to origin visit stats. Jul 7 2021, 5:26 PM

ardumont added a revision: D5980: journal_client: Disable origins when too many visited attempts failed. Jul 8 2021, 11:26 AM

Status on the latest development for this task: the "Baseline for the recurrence of
origin visits" chapter has been implemented in the following stacked diffs (in review):

D5919: Start handling of recurrent loading tasks in scheduler
D5950: journal_client: Compute next position for origin visit
D5956: Introduce new scheduling policy to grab origins without last update
D5978: Add successive visits counter to origin visit stats (out of D4895)
D5980: journal_client: Deactivate origins when too many visited attempts failed

Related to this task, some work has been started to make the pypi lister list its
origins with the last_update information, in diff D5977 / T3399 (the review is done and
the implementation needs to be improved, but still ;).

ardumont closed subtask T3399: Improve PyPI lister to pull last update information when running incrementally as Resolved. Jul 9 2021, 2:52 PM

> Related to this task, some work has been started to make the pypi lister list its
> origins with the last_update information, in diff D5977 / T3399 (the review is done
> and the implementation needs to be improved, but still ;).

Done and deployed.

And now, heads up: with the new pypi lister, the most prominent entry, 'pypi' (with
no_last_update at 313887 at the time [1]), decreased to 8 entries [2].

So what remains now is the github entry with 118150 [1], but if I remember correctly,
@olasd mentioned that these are "old/invalid" origins from github. They will most likely
subside when the new scheduling policy [3] lands (if those origins no longer exist,
they will eventually get disabled).

[1] T2345#66559

[2]

14:25:30 softwareheritage-scheduler@belvedere:5432=> select now(), id, name, instance_name, visit_type, count(*) as total, count(*) filter (where last_scheduled is NULL) as not_scheduled, count(*) filter (where last_snapshot is null) as no_snapshot, count(*) filter(where last_update is null) as no_last_update from listed_origins left join origin_visit_stats using (url, visit_type) inner join listers on listed_origins.lister_id = listers.id group by id, visit_type having name='pypi';

+-------------------------------+--------------------------------------+------+---------------+------------+--------+---------------+-------------+----------------+
|              now              |                  id                  | name | instance_name | visit_type | total  | not_scheduled | no_snapshot | no_last_update |
+-------------------------------+--------------------------------------+------+---------------+------------+--------+---------------+-------------+----------------+
| 2021-07-09 12:52:09.438329+00 | 29c69bc1-e815-4f5a-b009-c6854697fec7 | pypi | pypi          | pypi       | 390659 |         75449 |       73223 |              8 |
+-------------------------------+--------------------------------------+------+---------------+------------+--------+---------------+-------------+----------------+
(1 row)

[3] T2345#67279

Updated stats in descending order on the no_last_update column:

15:01:21 softwareheritage-scheduler@belvedere:5432=> select now(), id, name, instance_name, visit_type, count(*) as total, count(*) filter (where last_scheduled is NULL) as not_scheduled, count(*) filter (where last_snapshot is null) as no_snapshot, count(*) filter(where last_update is null) as no_last_update from listed_origins left join origin_visit_stats using (url, visit_type) inner join listers on listed_origins.lister_id = listers.id group by id, visit_type order by no_last_update desc;
+-------------------------------+--------------------------------------+---------------+------------------------------+------------+-----------+---------------+-------------+----------------+
|              now              |                  id                  |     name      |        instance_name         | visit_type |   total   | not_scheduled | no_snapshot | no_last_update |
+-------------------------------+--------------------------------------+---------------+------------------------------+------------+-----------+---------------+-------------+----------------+
| 2021-07-09 13:01:37.678047+00 | 6632ef5e-322b-402b-8f28-d090f76ed6b7 | github        | github                       | git        | 180392112 |     170538428 |    61596806 |         118150 |
| 2021-07-09 13:01:37.678047+00 | 4d7a9674-e2c4-40ce-94b3-677e3a4e8995 | debian        | Debian                       | deb        |     35100 |         35100 |          85 |          35100 |
| 2021-07-09 13:01:37.678047+00 | 860d41f8-d0c0-4733-a4d8-437c386bc31f | save-code-now | archive.softwareheritage.org | git        |      2020 |          1855 |           0 |           2020 |
| 2021-07-09 13:01:37.678047+00 | ff34a2b5-2e81-4566-9627-61fab06f8f52 | phabricator   | kde                          | git        |      1036 |          1036 |        1025 |           1036 |
| 2021-07-09 13:01:37.678047+00 | 5bb5ddb7-2a78-4051-9d84-ec07a3834031 | debian        | Debian-Security              | deb        |       787 |           787 |         267 |            787 |
| 2021-07-09 13:01:37.678047+00 | 7fb5da29-6b90-4ce6-af17-5b1e2a56f794 | npm           | npm                          | npm        |   1629224 |       1448537 |        3302 |            209 |
| 2021-07-09 13:01:37.678047+00 | 31df8830-df11-4120-8e76-323981a9c88d | phabricator   | swh                          | git        |       189 |           189 |          29 |            189 |
| 2021-07-09 13:01:37.678047+00 | 1ac30f61-69a8-44b8-ae44-7e21514f0c2e | cgit          | fedora                       | git        |       866 |           850 |          53 |            110 |
| 2021-07-09 13:01:37.678047+00 | b35c74ea-b1b4-4dfc-858e-80809f6b5790 | cgit          | qt.io                        | git        |       278 |           234 |          13 |             50 |
| 2021-07-09 13:01:37.678047+00 | 0d4fa765-e989-40e1-b91f-8fe217729946 | phabricator   | blender                      | git        |        41 |            41 |          41 |             41 |
| 2021-07-09 13:01:37.678047+00 | d229771c-1610-4e2c-a67b-a8d5b6f1b43c | cgit          | tor                          | git        |       519 |           514 |         519 |             26 |
| 2021-07-09 13:01:37.678047+00 | 860d41f8-d0c0-4733-a4d8-437c386bc31f | save-code-now | archive.softwareheritage.org | hg         |        13 |            13 |           0 |             13 |
| 2021-07-09 13:01:37.678047+00 | 0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4 | CRAN          | cran                         | tar        |     18276 |         18276 |       18276 |             11 |
| 2021-07-09 13:01:37.678047+00 | 29c69bc1-e815-4f5a-b009-c6854697fec7 | pypi          | pypi                         | pypi       |    390659 |         75449 |       73223 |              8 |
| 2021-07-09 13:01:37.678047+00 | 7378f526-bb27-46c9-9940-25c240509dc6 | cgit          | git.gnu.org.ua               | git        |       145 |           124 |         145 |              7 |
| 2021-07-09 13:01:37.678047+00 | 0d4fa765-e989-40e1-b91f-8fe217729946 | phabricator   | blender                      | svn        |         6 |             6 |           1 |              6 |
| 2021-07-09 13:01:37.678047+00 | 860d41f8-d0c0-4733-a4d8-437c386bc31f | save-code-now | archive.softwareheritage.org | svn        |         6 |             6 |           0 |              6 |
| 2021-07-09 13:01:37.678047+00 | 7a775770-2b2f-4139-aacb-ad715c022b9d | cgit          | eclipse                      | git        |      1375 |          1305 |        1307 |              3 |
| 2021-07-09 13:01:37.678047+00 | cef20b61-30d4-4526-baa1-5db8a78f2b57 | cgit          | openembedded                 | git        |        16 |             9 |          16 |              1 |
| 2021-07-09 13:01:37.678047+00 | 3310a3f7-5f6e-4367-b93f-659033b1e735 | cgit          | baserock                     | git        |      1524 |          1500 |          70 |              1 |
| 2021-07-09 13:01:37.678047+00 | ca2fcc66-8844-480c-b0ba-a91f890b0554 | gitlab        | gnome                        | git        |     13176 |         12395 |        4771 |              0 |
| 2021-07-09 13:01:37.678047+00 | ceecf814-da90-43a4-8de0-dd853072145f | gitlab        | lip6                         | git        |        69 |            58 |          65 |              0 |
| 2021-07-09 13:01:37.678047+00 | df07990a-5e00-4ea7-af6e-2c03bcee028a | cgit          | alpinelinux                  | git        |         6 |             4 |           0 |              0 |
| 2021-07-09 13:01:37.678047+00 | f4b55119-837d-461b-bd3a-0e07d324aabf | gitlab        | riseup                       | git        |      1255 |          1207 |         428 |              0 |
| 2021-07-09 13:01:37.678047+00 | f4ea15d9-97b5-4bdc-a5d6-2062fe0acfef | cgit          | git.joeyh.name               | git        |        62 |            51 |          62 |              0 |
| 2021-07-09 13:01:37.678047+00 | f788a57f-6c9b-4c5e-8200-3359d5bf4405 | gitlab        | ow2                          | git        |      1297 |          1076 |         284 |              0 |
| 2021-07-09 13:01:37.678047+00 | fffaba23-b6ad-4c02-a6e7-dcff8170b6f0 | gitlab        | freedesktop                  | git        |      8008 |          7415 |        3448 |              0 |
| 2021-07-09 13:01:37.678047+00 | 194b1af4-ba03-438a-961d-7b0be7cdc7f3 | cgit          | gnu-savannah                 | git        |      1029 |          1027 |          39 |              0 |
| 2021-07-09 13:01:37.678047+00 | 25d2aec0-91ce-446c-be64-d1ff9a79dc4e | gitea         | git.fsfe.org                 | git        |       401 |           346 |          51 |              0 |
| 2021-07-09 13:01:37.678047+00 | 41f76395-d290-4610-9584-3b4a8be548c5 | cgit          | yoctoproject                 | git        |       175 |            26 |          10 |              0 |
| 2021-07-09 13:01:37.678047+00 | 43b9a56a-b6ff-4e43-b9ca-53e231715713 | cgit          | zx2c4                        | git        |       159 |           159 |          14 |              0 |
| 2021-07-09 13:01:37.678047+00 | 59354ffc-0a34-4140-8503-5f398a763097 | cgit          | git-kernel                   | git        |      1091 |           817 |         377 |              0 |
| 2021-07-09 13:01:37.678047+00 | 7338c20c-ffda-4a75-88fe-099f619a0fe2 | GNU           | GNU                          | tar        |       386 |           386 |          32 |              0 |
| 2021-07-09 13:01:37.678047+00 | 75627aa9-58c6-4ca6-ae11-eee5026bcefc | gitlab        | framagit                     | git        |     20399 |         19599 |        5238 |              0 |
| 2021-07-09 13:01:37.678047+00 | 7b7ef365-b065-4e46-be98-19d3ca6a1633 | gitlab        | common-lisp                  | git        |       825 |           780 |          65 |              0 |
| 2021-07-09 13:01:37.678047+00 | 9de8141b-e441-4ffd-b40d-d438b29c03fc | launchpad     | launchpad                    | git        |     23998 |         17054 |        4938 |              0 |
| 2021-07-09 13:01:37.678047+00 | a60261fe-1125-4a46-bd0b-c914709ca10e | bitbucket     | bitbucket                    | git        |   2803985 |       2762110 |     1118887 |              0 |
| 2021-07-09 13:01:37.678047+00 | a96dea47-13c0-4f11-bf96-70b576b604a0 | gitlab        | gite.lirmm                   | git        |       638 |           588 |         255 |              0 |
| 2021-07-09 13:01:37.678047+00 | b50360dc-2f66-4e40-a789-b22d9375e875 | gitea         | codeberg.org                 | git        |      8233 |          7632 |        4928 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | bzr        |       290 |           290 |         290 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | cvs        |     28622 |         28622 |       28622 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | git        |    181171 |         49365 |       57848 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | hg         |     27601 |         27601 |       27592 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | svn        |    101823 |          3221 |        5195 |              0 |
| 2021-07-09 13:01:37.678047+00 | b9b8e226-3452-4812-a6df-cab546b9ee11 | gitlab        | inria                        | git        |      3341 |          3056 |        1438 |              0 |
| 2021-07-09 13:01:37.678047+00 | baf89663-feae-4850-a8ec-3a21e699cc0b | gitlab        | gitlab                       | git        |    200200 |        176191 |       20643 |              0 |
+-------------------------------+--------------------------------------+---------------+------------------------------+------------+-----------+---------------+-------------+----------------+
(46 rows)

Time: 393492.776 ms (06:33.493)

ardumont added a commit: rDSCHc486b28ece7c: journal_client: Explicit docstring. (Jul 22 2021, 2:22 PM)

ardumont added a commit: rDSCHec6e69f6415a: Start handling of recurrent loading tasks in scheduler.

ardumont added a commit: rDSCHcb1edf1ab24d: Introduce storage for the recurrent visit scheduler queue position.

ardumont added a commit: rDSCH8c4ae9f14d6a: journal_client: Compute next position for origin visit.

olasd added a commit: rDSCH825e8cfe7d24: grab_next_visits: make the handling of CTEs more modular. (Jul 22 2021, 2:22 PM)

olasd added a commit: rDSCHd58776ab0b41: Introduce new scheduling policy to grab origins without last update.

ardumont mentioned this in rDSCH3b929d0bd9dc: journal_client: Refactor by inlining the update_position_offset. (Jul 30 2021, 3:41 PM)

ardumont mentioned this in rDSCH4fa29fe128c2: journal_client: Update get_last_status docstring.

Status update, following the recent refactoring done with @olasd to simplify the
implementation (backend and journal client). What remains to be done:

Refactor the journal client slightly to update a docstring and inline one function (done: these are the two commits mentioned just above this comment).

Deactivate failing visits (delegating to listers the job of re-enabling origins that become live again). I have diffs dealing with this that need a rebase and some rework to match the latest changes (I need to get back to it) [1].

Deploy the current scheduler implementation (master) once the previous point is done. (That is the goal I want to reach before a vacation break.)

Later, try to implement an orchestrator of scheduling policies, driven by a feedback loop on the stats from that query [2].

Update the scheduler's listed_origins / scheduler_metrics tables with origins that do not come from actual listers (manual ingestion and the like).

[1] D5978 D5980

[2] Or some form of that query; the stats could perhaps come directly from the
scheduler_metrics table (whichever makes more sense). See T2345#67328.

(^ for a while ;)

Deactivate failing visits (delegating to listers the job of re-enabling origins
that become live again). I have diffs dealing with this that need a rebase and
some rework to match the latest changes (I need to get back to it) [1].

Done, diffs D5978 and D5980 updated.
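
To illustrate the idea of deactivating failing visits, here is a minimal sketch, not the code from D5978 / D5980. It assumes a per-origin counter of successive visits ending with the same status. The VisitStats class, the status strings and the MAX_SUCCESSIVE_FAILURES cutoff are hypothetical names and values chosen for the example; this task does not state the actual cutoff.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the real cutoff is not given in this task; 3 is illustrative.
MAX_SUCCESSIVE_FAILURES = 3


@dataclass
class VisitStats:
    """Hypothetical per-origin record, loosely modelled on origin visit stats."""

    last_status: Optional[str] = None
    successive_visits: int = 0  # visits in a row that ended with last_status
    enabled: bool = True


def record_visit(stats: VisitStats, status: str) -> VisitStats:
    """Update the counter for a new visit status and disable on repeated failure."""
    if status == stats.last_status:
        stats.successive_visits += 1
    else:
        # Status changed: restart the streak.
        stats.last_status = status
        stats.successive_visits = 1
    if status == "failed" and stats.successive_visits >= MAX_SUCCESSIVE_FAILURES:
        stats.enabled = False
    return stats


def relist(stats: VisitStats) -> VisitStats:
    """A lister seeing the origin again is what re-enables it."""
    stats.enabled = True
    return stats
```

In this sketch the journal client side only ever disables an origin; turning it back on is left to the lister, as described above.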

ardumont added a commit: rDSCH1bcf84d5e66d: Add a successive_visits counter to origin visit stats. (Aug 4 2021, 10:06 AM)

ardumont added a commit: rDSCH8281e351d6a1: journal_client: Disable origins when too many visited attempts failed.

ardumont changed the status of subtask T3456: staging: Deploy scheduler v0.17 from Open to Work in Progress. (Aug 6 2021, 12:11 PM)

ardumont closed subtask T3456: staging: Deploy scheduler v0.17 as Resolved. (Aug 9 2021, 11:07 AM)

ardumont changed the status of subtask T3471: production: Deploy swh.scheduler v0.17 from Open to Work in Progress. (Aug 13 2021, 10:33 AM)

ardumont closed subtask T3471: production: Deploy swh.scheduler v0.17 as Resolved. (Aug 13 2021, 3:48 PM)

Refactor the journal client slightly to update a docstring and inline one function (done: these are the two commits mentioned just above this comment).

Deactivate failing visits (delegating to listers the job of re-enabling origins that become live again). I have diffs dealing with this that need a rebase and some rework to match the latest changes (I need to get back to it) [1].

Deploy the current scheduler implementation (master) once the previous point is done. (That is the goal I want to reach before a vacation break.)

done

What's next, in summary (subtasks to be created later):

T3674: Fill in the blanks for old manual ingestion
T3538: Send scheduler metrics to prometheus to be able to analyse ingestion trends
T3667: Determine the scheduling policies and ratio per visit type before actually orchestrating them

|----------------------------------------------------------------+-----------------------------------+-------|
| visit_type                                                     | scheduling policies               | ratio |
|----------------------------------------------------------------+-----------------------------------+-------|
| package-loader: archive, cran, debian, npm, nixguix, pypi, ... | already_visited_order_by_lag      |    50 |
|                                                                | never_visited_oldest_update_first |    50 |
|----------------------------------------------------------------+-----------------------------------+-------|
| git, svn, hg                                                   | already_visited_order_by_lag      |    49 |
|                                                                | never_visited_oldest_update_first |    49 |
|                                                                | origins_without_last_update       |     2 |
|----------------------------------------------------------------+-----------------------------------+-------|
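
One possible reading of the ratio column is as a percentage of each scheduling batch. The sketch below splits a number of free slots among the policies accordingly. The POLICY_RATIOS layout, the "vcs" grouping key and the split_slots helper are assumptions made for the example, not part of swh.scheduler.

```python
from typing import Dict

# Ratios copied from the table above; the dictionary layout is illustrative.
POLICY_RATIOS: Dict[str, Dict[str, int]] = {
    "package-loader": {
        "already_visited_order_by_lag": 50,
        "never_visited_oldest_update_first": 50,
    },
    "vcs": {  # git, svn, hg
        "already_visited_order_by_lag": 49,
        "never_visited_oldest_update_first": 49,
        "origins_without_last_update": 2,
    },
}


def split_slots(ratios: Dict[str, int], slots: int) -> Dict[str, int]:
    """Divide `slots` among policies proportionally to their ratio.

    Largest-remainder rounding keeps the shares summing exactly to `slots`.
    """
    total = sum(ratios.values())
    # (quotient, remainder) of the exact proportional share, in integers.
    quotas = {policy: divmod(slots * ratio, total) for policy, ratio in ratios.items()}
    shares = {policy: quotient for policy, (quotient, _) in quotas.items()}
    leftover = slots - sum(shares.values())
    # Give the slots lost to rounding to the policies with the largest remainder.
    by_remainder = sorted(quotas, key=lambda policy: quotas[policy][1], reverse=True)
    for policy in by_remainder[:leftover]:
        shares[policy] += 1
    return shares
```

With 1000 free slots for git, this gives 490 slots to each of the first two policies and 20 to origins_without_last_update.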

T3667: Orchestrator of the next-gen scheduler runner; the algorithm would be something along the lines of the following:

for each visit type:
  - "queue filled-in" state: the queue is monitored; when its size drops below
    a given threshold, goto "fill-in the void" state
  - "fill-in the void" state:
    for each policy (per ratio) for that visit type:
    - fetch origins and schedule them
    - continue until the threshold is reached or the policy has no more results
    - when the threshold is reached, goto "queue filled-in" state
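
The loop above could be sketched as follows. This is only an illustration under stated assumptions: queues are plain in-memory lists, each policy is a hypothetical callable returning up to N origin URLs, and a single threshold both triggers the refill and serves as the fill target. None of these names (fill_queue, run_once, make_policy) come from swh.scheduler.

```python
from typing import Callable, Dict, List

# Hypothetical policy signature: given a count, return up to that many origin URLs.
FetchOrigins = Callable[[int], List[str]]


def fill_queue(
    queue: List[str],
    threshold: int,
    ratios: Dict[str, int],
    policies: Dict[str, FetchOrigins],
) -> int:
    """'Fill-in the void' state: top up `queue` to `threshold` entries.

    Returns the number of origins scheduled. Stops early when every policy
    has run out of results.
    """
    scheduled = 0
    active = dict(ratios)  # policies that may still have results
    while len(queue) < threshold and active:
        free = threshold - len(queue)
        total = sum(active.values())
        for name, ratio in list(active.items()):
            remaining = threshold - len(queue)
            if remaining <= 0:
                break
            # At least one slot per policy, so small ratios are not starved.
            want = min(max(1, free * ratio // total), remaining)
            origins = policies[name](want)[:want]
            queue.extend(origins)  # stands in for "schedule"
            scheduled += len(origins)
            if len(origins) < want:
                del active[name]  # policy has no more results
    return scheduled


def run_once(
    queues: Dict[str, List[str]],
    threshold: int,
    ratios_by_type: Dict[str, Dict[str, int]],
    policies: Dict[str, FetchOrigins],
) -> Dict[str, int]:
    """'Queue filled-in' state: refill each visit-type queue under the threshold."""
    return {
        visit_type: fill_queue(queue, threshold, ratios_by_type[visit_type], policies)
        for visit_type, queue in queues.items()
        if len(queue) < threshold
    }


def make_policy(prefix: str, available: int) -> FetchOrigins:
    """Demo stub: a policy backed by a finite pool of fake origin names."""
    pool = [f"{prefix}-{i}" for i in range(available)]

    def fetch(count: int) -> List[str]:
        batch, pool[:] = pool[:count], pool[count:]
        return batch

    return fetch
```

When one policy runs dry, its unused share is redistributed among the remaining policies on the next pass, which is one way to read "continue until threshold is reached or policy has no results". A real runner would call run_once periodically, or react to queue-length events.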

ardumont added a subtask: T3538: Send scheduler metrics to prometheus. (Sep 3 2021, 5:17 PM)

ardumont closed subtask T3538: Send scheduler metrics to prometheus as Resolved. (Sep 14 2021, 11:00 AM)

ardumont mentioned this in T3674: Determine how to reference manually listed/ingested origins in the scheduler metrics. (Oct 20 2021, 12:15 PM)

ardumont added a subtask: T3674: Determine how to reference manually listed/ingested origins in the scheduler metrics. (Oct 20 2021, 12:19 PM)

ardumont changed the status of subtask T3667: Orchestrate origins scheduling according to scheduler metrics feedback from Open to Work in Progress. (Oct 28 2021, 4:34 PM)

amottier removed a subscriber: amottier. (Oct 28 2021, 4:54 PM)

ardumont closed subtask T3667: Orchestrate origins scheduling according to scheduler metrics feedback as Resolved. (Oct 28 2021, 5:13 PM)

gitlab-migration changed the status of subtask T3456: staging: Deploy scheduler v0.17 from Resolved to Migrated. (Oct 19 2022, 6:03 PM)

gitlab-migration changed the status of subtask T3471: production: Deploy swh.scheduler v0.17 from Resolved to Migrated.

gitlab-migration changed the status of subtask T3538: Send scheduler metrics to prometheus from Resolved to Migrated.

gitlab-migration changed the status of subtask T3667: Orchestrate origins scheduling according to scheduler metrics feedback from Resolved to Migrated.

gitlab-migration changed the status of subtask T2442: Provide a unified API for listers to interact with the scheduler from Resolved to Migrated. (Jan 8 2023, 4:30 PM)