Improve handling of recurrent loading tasks in scheduler
Started, Work in Progress, High, Public

Description

The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings:

  1. does not take into account "freshness" information provided by a lister. Consequences:
    • lots of lag accrued on origins with updates
    • substantial amount of time wasted on origins with no updates
    • some amount of time wasted on completely dead origins
  2. uses (apparently unreliable) scheduler information as feedback loop
    • lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive
  3. feedback loop is very inflexible
    • the "visit interval" target has never been met, or even calibrated to our bandwidth
    • the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not
    • save code now requests are completely ignored by the recurrent tasks

To handle this functionality better, I propose introducing a separate task scheduler for recurrent origin visits, with a few components:

  • a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler (for tasks that come from listers)
  • a new runner for these origin visit tasks
    • TBD: generate one-shot tasks in the swh scheduler? send tasks directly to celery?
  • a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks

Common goals for a new origin visit scheduler

Some common goals that seem desirable:

  • loading origins at least once, as soon as possible after they appear in a lister
  • minimizing the number of "useless visits" with no updates to integrate
  • "smooth" the size of visits, by visiting active origins more often and doing less work each time (which reduces memory pressure on the workers)
  • make use of forge-provided "last modification times" to reduce the amount of useless work done

Baseline for the recurrence of origin visits

As a common baseline, we can handle the list of origins to be visited as a queue.

The way origins are ordered in this queue can be materialized by using two variables

  • a next visit target, the "time" at which we expect the origin to have new objects, and we should visit it again.
  • a visit interval, which is the duration that we expect to wait between visits of this origin

While conceptually a timestamp, the next visit target value is only really used as a queue index; the visit interval is an offset by which we move the origin within this queue when a visit completes.

As we should keep our infrastructure busy, and attempt origin visits as often as possible (with a minimal cooldown period between visits of a given origin, to avoid DoSing hosters), the next visit target currently being serviced can drift away from the current clock according to the behavior of our infrastructure.

The visit interval is really an index in a list of possible visit intervals. This allows us to set a smooth increase at first, and a stiffer increase later:

  • index 0 and 1, interval 1 day
  • index 2, 3 and 4, interval 2 days
  • index 5 and up, interval 4^(n-4) days (4, 16, 64, 256, 1024)

The breakpoints of this "exponential" can be adjusted to match the reality of our loading infrastructure, by monitoring the skew between the next visit targets and the actual scheduling time of visits.
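
As an illustration, the index-to-interval mapping above can be sketched as follows. The function and constant names are hypothetical; the source lists values up to 1024 days but does not say whether the index is capped, so this sketch leaves it uncapped.

```python
# Sketch only: names are illustrative, not taken from the scheduler code.
FLAT_INTERVALS_DAYS = [1, 1, 2, 2, 2]  # indices 0 to 4, as listed above


def visit_interval_days(index: int) -> int:
    """Return the visit interval, in days, for a visit interval index."""
    if index < 0:
        raise ValueError("visit interval index must be non-negative")
    if index < len(FLAT_INTERVALS_DAYS):
        return FLAT_INTERVALS_DAYS[index]
    # Index 5 and up: 4^(n-4) days, i.e. 4, 16, 64, 256, 1024, ...
    return 4 ** (index - 4)


print([visit_interval_days(i) for i in range(10)])
# → [1, 1, 2, 2, 2, 4, 16, 64, 256, 1024]
```

Adjusting the breakpoints then amounts to editing the flat list and the base of the exponential.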

The next visit target of an origin is updated after each visit completes:

  • if the visit has failed, increase the next visit target by the minimal visit interval (to take into account transient loading issues)
    • if the visit has failed several times in a row, disable the origin until the next run of the lister
  • if the visit is successful, and records some changes, decrease the visit interval index by 2 (visit the origin *way* more often).
  • if the visit is successful, and records no changes, increase the visit interval index by 1 (visit the origin less often).

We set the next visit target to its current value + the new visit interval multiplied by a random fudge factor (picked in the -/+ 10% range).

The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins e.g. when a number of origins from a single hoster are processed at once.
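
A minimal sketch of the update rules above, under stated assumptions: positions are expressed in days, the index is clamped at 0, the fudge factor applies to the interval only (not to the whole sum), and a failed visit leaves the index unchanged. All names (`OriginState`, `update_after_visit`) are hypothetical, and disabling after repeated failures is left out.

```python
import random
from dataclasses import dataclass

MIN_INTERVAL_DAYS = 1.0  # minimal visit interval (index 0 in the list above)


def interval_days(index: int) -> float:
    """Interval table from the text: 1, 1, 2, 2, 2, then 4^(n-4) days."""
    return float([1, 1, 2, 2, 2][index]) if index < 5 else float(4 ** (index - 4))


@dataclass
class OriginState:
    next_visit_target: float  # queue position, expressed in days
    interval_index: int


def update_after_visit(state, failed, eventful, rng=random):
    """Return the new state of an origin after a visit completes."""
    if failed:
        # Possibly transient issue: retry after the minimal interval.
        return OriginState(
            state.next_visit_target + MIN_INTERVAL_DAYS, state.interval_index
        )
    if eventful:
        # Changes recorded: visit way more often (assumed clamp at index 0).
        index = max(0, state.interval_index - 2)
    else:
        # No changes: visit less often.
        index = state.interval_index + 1
    fudge = rng.uniform(0.9, 1.1)  # random fudge factor in the -/+ 10% range
    return OriginState(state.next_visit_target + interval_days(index) * fudge, index)
```

Passing a seeded `random.Random` instance as `rng` makes the behavior reproducible, which is convenient when replaying a policy in a simulator.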

Bootstrapping of the next visit target and visit interval values for new origins

There is an obvious bootstrap issue: new origins do not have a next visit target. (We can, however, bootstrap the visit interval index to a default value, e.g. 4)

The first of our stated goals is visiting origins at least once as soon as possible.

There are multiple ways to achieve this:

  • we can schedule new origins separately
    • pros
      1. no fudging of the next visit target needed
      2. can precisely control when new origins are loaded
    • cons
      1. makes the scheduling logic more complex
      2. needs careful monitoring to make sure the number doesn't grow unbounded
  • we can generate a next visit target for new origins
    • pros
      1. simpler scheduling logic: origins are picked from a single queue
      2. can handle spreading the visits of new origins over a longer time in a "oneshot" fashion
    • cons
      1. needs careful consideration to avoid DoSing new hosters by bursting requests to new origins
  • Once the first visit happens, we can set the next visit target to the "current next visit target being scheduled" + the default visit interval * the random fudge factor.
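
The last point can be sketched as below, assuming the default visit interval index of 4 suggested above (a 2-day interval in the proposed table) and queue positions expressed in days. The function name and the `current_queue_position` parameter are hypothetical.

```python
import random

DEFAULT_INTERVAL_INDEX = 4   # default suggested in the text ("e.g. 4")
DEFAULT_INTERVAL_DAYS = 2.0  # interval for index 4 in the proposed table


def bootstrap_next_visit_target(current_queue_position, rng=random):
    """Place a freshly visited origin relative to the queue position being serviced."""
    fudge = rng.uniform(0.9, 1.1)  # random fudge factor in the -/+ 10% range
    return current_queue_position + DEFAULT_INTERVAL_DAYS * fudge
```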

Optimizations for listers providing a date of last modification

When the lister provides a date of last modification for origins, we can do some more subtle management of the scheduling of origin visits. Instead of scheduling according to the next visit target, we can schedule visits to origins in the following three pools:

  1. origins that have never successfully loaded
    • ordered by increasing date of last modification (oldest first)
    NOTE: if the lister returns a creation date for the origin, we could instead use a decreasing interval between creation time to last update time as sorting heuristic: this would favor "more active" origins, but could leave origins only updated once behind.
  2. origins where the date of last modification is more recent than the latest (successful) load date
    • ordered by decreasing difference between last load date and date of last modification
    NOTE: this favors more recently active origins to the detriment of origins which had some activity right after being loaded, then went silent. There's a good chance that this heuristic will not converge (and, if the infrastructure struggles, a bunch of modifications to origins that happened right after a visit will never be recorded). To reduce the impact of this, we could mix both ends of the heuristic: do some amount of visits to active origins, and do some amount of "easy" visits to origins that haven't been updated very much.
  3. other origins
    • ordered by next visit target
    NOTE: this is only a last-resort strategy:
      • listers might not be running all the time, but we still want a chance to update repositories
      • the modification time information provided by listers might not be reliable
      • if everything works perfectly, these last-resort updates should mostly be short no-ops
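
The three pools and their orderings could be sketched as follows. The `Origin` record and its field names are hypothetical stand-ins for whatever the scheduler backend stores, and the alternative heuristics from the notes are not implemented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Origin:
    url: str
    last_update: datetime  # date of last modification, from the lister
    last_successful_load: Optional[datetime]
    next_visit_target: float


def pool_of(o: Origin) -> int:
    """Return the pool (1, 2 or 3) an origin belongs to."""
    if o.last_successful_load is None:
        return 1  # never successfully loaded
    if o.last_update > o.last_successful_load:
        return 2  # modified since the latest successful load
    return 3  # other origins


def sort_key(o: Origin):
    pool = pool_of(o)
    if pool == 1:
        return o.last_update  # oldest modification first
    if pool == 2:
        # Largest gap between last load and last modification first.
        return -(o.last_update - o.last_successful_load).total_seconds()
    return o.next_visit_target


def build_pools(origins):
    """Group origins by pool, each pool sorted in scheduling order."""
    pools = {1: [], 2: [], 3: []}
    for o in origins:
        pools[pool_of(o)].append(o)
    for members in pools.values():
        members.sort(key=sort_key)
    return pools
```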

Actual visit scheduling policy

The next visit target for origins without date of last modification (group A) gives them a total ordering.

For origins with an upstream-provided date of last modification (group B), the three scheduling pools don't give us a total ordering of visits.

To handle both of these groups when the execution queue has n slots free, we can:

  • get the ratio of "last modification time-provided" origins r = #B / (#A + #B)
  • schedule (1-r) n origins from group A according to their next visit target
  • schedule n r/3 origins from each of the 3 pools of origin group B.
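
The split above can be sketched as a small function. The rounding choices are assumptions (the text does not say how fractional slots are handled): the share of group B is rounded to the nearest integer, and leftover slots go to the first pools.

```python
def allocate_slots(n: int, count_a: int, count_b: int):
    """Split n free slots between group A and the three pools of group B.

    Returns (slots for group A, [slots for pool 1, pool 2, pool 3]).
    """
    total = count_a + count_b
    if total == 0 or n <= 0:
        return 0, [0, 0, 0]
    r = count_b / total  # ratio of origins with a last modification time
    slots_b = round(n * r)
    slots_a = n - slots_b  # (1 - r) * n, up to rounding
    base, extra = divmod(slots_b, 3)
    # Assumption: slots left over after dividing by 3 go to the first pools.
    per_pool = [base + (1 if i < extra else 0) for i in range(3)]
    return slots_a, per_pool


print(allocate_slots(100, 1000, 3000))
# → (25, [25, 25, 25])
```

If a pool holds fewer origins than its share, the unused slots would have to be redistributed; this sketch does not handle that case.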

We need to monitor the number of origins in each of the pools of group B, to make sure that our sorting heuristics are not making our work diverge.

Potential future improvements

  • Prioritize non-fork repositories over forked repositories?
    • not in the first draft; still, record the fork information if possible so we can adjust the scheduling policies afterwards

Feedback loop in the origin_visit journal client

  1. Update last visit date, status, and eventfulness
    • keep time of last successful visit
    • if status is failed, keep same last visit date, increase failure count
      • if failure count too high (3 ?) disable origin until next run of lister
    • else, reset failure count
  2. Update visit interval index and next visit target
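
Step 1 of this feedback loop might look like the sketch below. The `VisitStats` record and its fields are hypothetical, any status other than "failed" is treated as a success for simplicity, and the threshold of 3 is the tentative value from the list above. Re-enabling an origin on the next lister run is not modeled.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

MAX_FAILURES = 3  # tentative threshold from the text ("3 ?")


@dataclass
class VisitStats:
    last_visit: Optional[datetime] = None
    last_successful_visit: Optional[datetime] = None
    last_status: Optional[str] = None
    last_eventful: Optional[bool] = None
    failure_count: int = 0
    enabled: bool = True


def record_visit(stats, date, status, eventful):
    """Update the stats of one origin from one visit status message."""
    stats.last_status = status
    if status == "failed":
        # Keep the same last visit date, only count the failure.
        stats.failure_count += 1
        if stats.failure_count >= MAX_FAILURES:
            stats.enabled = False  # until the next run of the lister
        return stats
    stats.failure_count = 0
    stats.last_visit = date
    stats.last_successful_visit = date
    stats.last_eventful = eventful
    return stats
```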

Proposed metrics for the new scheduler

  • number of origins for every scheduling policy group and subgroup (probably grouped by lister).
  • number of active origins (last modification in [previous listing, current listing]), by lister.
  • drift between real time and current next visit target scheduled
  • ...

Discussion on implementation detail [1]

[1] https://hedgedoc.softwareheritage.org/jF0v5LZITGqhpZ-hgiy8zw

Event Timeline

olasd triaged this task as High priority. (Apr 1 2020, 8:03 PM)
olasd created this task.
olasd created this object with visibility "olasd (Nicolas Dandrimont)".
olasd changed the visibility from "olasd (Nicolas Dandrimont)" to "Public (No Login Required)".
olasd updated the task description.

This task describes in detail what kind of scheduling policy we should implement, but it doesn't help much figure out what the next steps should be.

After some more discussion, we think that this can be broken down and implemented through three separate components:

  • a unified listing api and table within the scheduler, which would collate the information from all listers that create recurrent visit tasks:
    • available origins and expected visit type (plus potential extra arguments for the loading tasks)
    • date of last listing for a given origin
    • date of last update if available
    • whether the origin is declared as a fork
  • a cache for quick, bulk access to the information about the latest visits for a given origin / visit type; a journal client which keeps this cache up to date
    • when was the latest visit for a given origin/type pair? What happened?
    • when was the latest _eventful_ visit for a given origin/type pair?
    • when do we expect the next eventful visit to be?
  • a scheduling component which merges the information from listers and the visit cache, to handle the actual scheduling of tasks.

There's a good chance that some features of the origin visit cache will need to be tailor made for the policy chosen for the scheduling component.

olasd changed the status of subtask T2973: Implement a scheduler simulator from Open to Work in Progress. (Jan 18 2021, 2:12 PM)

Here's my understanding of the status of the migration to the next generation scheduler as of today:

Listers:

  • all the listers have been ported to the new API, and released. The legacy SQLAlchemy based core loader has been removed. We missed a prime opportunity for a 1.0 release ;)
  • all the listers based on the new API have been deployed in staging and in production.
    • Listers now only create entries in the listers table of the scheduler.
    • There are some (small) production issues to solve (T3032)

Scheduler journal client:

  • the scheduler journal client is deployed in staging and production.
    • it feeds the origin_visit_stats table from the origin_visit_status data in swh.journal.
  • the production instance of the journal client has a data consistency issue that is under investigation (T3000).
    • From the investigation, the issue seems to be purely within the journal client settings (i.e. the journal itself has all the expected data). Hopefully we can tune this and get it solved in the next few days.

Origin visit scheduling:

  • only fairly basic scheduling policies have been implemented: a generic FIFO policy, as well as two policies based on the visit status cache and the date of last update of the origin provided by the lister.
    • performance tuning of the existing policies, and implementation of other scheduling policies, are blocked on:
      • production-sized lister data (running all listers to completion in prod, at least once)
      • accurate origin_visit_status cache data, from the archive
    • we need to implement the "fallback" scheduling when listers aren't able to provide us with a last update value
  • scheduling of origin visits from the new data / APIs has not yet been deployed, either in staging or production.
  • the production infra is churning on the (very large) "backlog" of recurrent tasks that are available in the "legacy" scheduler, generated by the old listers. These lists of tasks aren't updated anymore, but we have plenty of work to do still.
  • once we have tested scheduling policies against production data, we can disable the old recurrent tasks and subsume them with tasks from the new listers.
  • once scheduling of origin visit tasks is deployed, we should consider removing recurrent tasks from the old-style scheduler

Scheduler simulator:

  • it's a neat tool/toy, but it hasn't been tested against production-scale data yet
  • it will be very useful as is, as an integration testing framework to tune the performance of the new components once we have some production-scale data dumps available
  • we will need to improve the set of metrics (T2974) and reporting to:
    • validate the simulator behavior against prod
    • make the simulator a more useful "management" tool allowing us to tune the parameters of the scheduling policies.

Summary of the data available in the listed_origins table, broken down by lister and "known state" of origins:

14:12 guest@softwareheritage-scheduler => select id, name, instance_name, visit_type, count(*) as total, count(*) filter (where last_scheduled is NULL) as not_scheduled, count(*) filter (where last_snapshot is null) as no_snapshot, count(*) filter(where last_update is null) as no_last_update from listed_origins left join origin_visit_stats using (url, visit_type) inner join listers on listed_origins.lister_id = listers.id group by id, visit_type;
                  id                  │     name      │        instance_name         │ visit_type │   total   │ not_scheduled │ no_snapshot │ no_last_update 
──────────────────────────────────────┼───────────────┼──────────────────────────────┼────────────┼───────────┼───────────────┼─────────────┼────────────────
 0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4 │ CRAN          │ cran                         │ tar        │     18152 │         18152 │       18152 │             11
 0d4fa765-e989-40e1-b91f-8fe217729946 │ phabricator   │ blender                      │ git        │        41 │            41 │          41 │             41
 0d4fa765-e989-40e1-b91f-8fe217729946 │ phabricator   │ blender                      │ svn        │         6 │             6 │           1 │              6
 194b1af4-ba03-438a-961d-7b0be7cdc7f3 │ cgit          │ gnu-savannah                 │ git        │      1029 │          1028 │          39 │              0
 1ac30f61-69a8-44b8-ae44-7e21514f0c2e │ cgit          │ fedora                       │ git        │       866 │           851 │          54 │            110
 25d2aec0-91ce-446c-be64-d1ff9a79dc4e │ gitea         │ git.fsfe.org                 │ git        │       401 │           400 │          51 │              0
 29c69bc1-e815-4f5a-b009-c6854697fec7 │ pypi          │ pypi                         │ pypi       │    313887 │          4523 │       11844 │         313887
 31df8830-df11-4120-8e76-323981a9c88d │ phabricator   │ swh                          │ git        │       189 │           189 │          29 │            189
 3310a3f7-5f6e-4367-b93f-659033b1e735 │ cgit          │ baserock                     │ git        │      1524 │          1500 │          70 │              1
 41f76395-d290-4610-9584-3b4a8be548c5 │ cgit          │ yoctoproject                 │ git        │       175 │            72 │          10 │              0
 43b9a56a-b6ff-4e43-b9ca-53e231715713 │ cgit          │ zx2c4                        │ git        │       159 │           159 │          14 │              0
 4d7a9674-e2c4-40ce-94b3-677e3a4e8995 │ debian        │ Debian                       │ deb        │     35100 │         35100 │          85 │          35100
 59354ffc-0a34-4140-8503-5f398a763097 │ cgit          │ git-kernel                   │ git        │      1091 │           824 │         615 │              0
 5bb5ddb7-2a78-4051-9d84-ec07a3834031 │ debian        │ Debian-Security              │ deb        │       779 │           779 │         259 │            779
 6632ef5e-322b-402b-8f28-d090f76ed6b7 │ github        │ github                       │ git        │ 180292687 │     174024061 │    62966568 │         118140
 7338c20c-ffda-4a75-88fe-099f619a0fe2 │ GNU           │ GNU                          │ tar        │       386 │           386 │          32 │              0
 7378f526-bb27-46c9-9940-25c240509dc6 │ cgit          │ git.gnu.org.ua               │ git        │       145 │           124 │         145 │              7
 75627aa9-58c6-4ca6-ae11-eee5026bcefc │ gitlab        │ framagit                     │ git        │     20283 │         19735 │        5169 │              0
 7a775770-2b2f-4139-aacb-ad715c022b9d │ cgit          │ eclipse                      │ git        │      1375 │          1312 │        1314 │              3
 7b7ef365-b065-4e46-be98-19d3ca6a1633 │ gitlab        │ common-lisp                  │ git        │       825 │           801 │          65 │              0
 7fb5da29-6b90-4ce6-af17-5b1e2a56f794 │ npm           │ npm                          │ npm        │   1629224 │       1448537 │        3335 │            209
 860d41f8-d0c0-4733-a4d8-437c386bc31f │ save-code-now │ archive.softwareheritage.org │ git        │       694 │           633 │           0 │            694
 860d41f8-d0c0-4733-a4d8-437c386bc31f │ save-code-now │ archive.softwareheritage.org │ hg         │         8 │             8 │           0 │              8
 860d41f8-d0c0-4733-a4d8-437c386bc31f │ save-code-now │ archive.softwareheritage.org │ svn        │         1 │             1 │           0 │              1
 9de8141b-e441-4ffd-b40d-d438b29c03fc │ launchpad     │ launchpad                    │ git        │     23724 │         22642 │        4667 │              0
 a60261fe-1125-4a46-bd0b-c914709ca10e │ bitbucket     │ bitbucket                    │ git        │   2779174 │       2742821 │     1096182 │              0
 a96dea47-13c0-4f11-bf96-70b576b604a0 │ gitlab        │ gite.lirmm                   │ git        │       638 │           606 │         255 │              0
 b35c74ea-b1b4-4dfc-858e-80809f6b5790 │ cgit          │ qt.io                        │ git        │       278 │           266 │          13 │             50
 b50360dc-2f66-4e40-a789-b22d9375e875 │ gitea         │ codeberg.org                 │ git        │      8233 │          8233 │        4930 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ bzr        │       290 │           290 │         290 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ cvs        │     28622 │         28622 │       28622 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ git        │    180740 │         58760 │       68531 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ hg         │     27550 │         27550 │       27541 │              0
 b678cfc3-2780-4186-9186-d78a14bd4958 │ sourceforge   │ main                         │ svn        │    101722 │         38228 │       40853 │              0
 b9b8e226-3452-4812-a6df-cab546b9ee11 │ gitlab        │ inria                        │ git        │      3286 │          3106 │        1387 │              0
 baf89663-feae-4850-a8ec-3a21e699cc0b │ gitlab        │ gitlab                       │ git        │    200200 │        177160 │       20696 │              0
 ca2fcc66-8844-480c-b0ba-a91f890b0554 │ gitlab        │ gnome                        │ git        │     13176 │         12814 │        4772 │              0
 ceecf814-da90-43a4-8de0-dd853072145f │ gitlab        │ lip6                         │ git        │        69 │            58 │          65 │              0
 cef20b61-30d4-4526-baa1-5db8a78f2b57 │ cgit          │ openembedded                 │ git        │        16 │             9 │          16 │              1
 d229771c-1610-4e2c-a67b-a8d5b6f1b43c │ cgit          │ tor                          │ git        │       519 │           514 │         519 │             26
 df07990a-5e00-4ea7-af6e-2c03bcee028a │ cgit          │ alpinelinux                  │ git        │         6 │             5 │           0 │              0
 f4b55119-837d-461b-bd3a-0e07d324aabf │ gitlab        │ riseup                       │ git        │      1255 │          1221 │         428 │              0
 f4ea15d9-97b5-4bdc-a5d6-2062fe0acfef │ cgit          │ git.joeyh.name               │ git        │        62 │            51 │          62 │              0
 f788a57f-6c9b-4c5e-8200-3359d5bf4405 │ gitlab        │ ow2                          │ git        │      1297 │          1158 │         284 │              0
 ff34a2b5-2e81-4566-9627-61fab06f8f52 │ phabricator   │ kde                          │ git        │      1036 │          1036 │        1025 │           1036
 fffaba23-b6ad-4c02-a6e7-dcff8170b6f0 │ gitlab        │ freedesktop                  │ git        │      8008 │          7828 │        3448 │              0
(46 rows)

Time: 611092.651 ms (10:11.093)

Status of the latest developments for this task: the "Baseline for the recurrence of
origin visits" chapter has been implemented in the following stacked diffs (in review):

  • D5919: Start handling of recurrent loading tasks in scheduler
  • D5950: journal_client: Compute next position for origin visit
  • D5956: Introduce new scheduling policy to grab origins without last update
  • D5978: Add successive visits counter to origin visit stats (out of D4895)
  • D5980: journal_client: Deactivate origins when too many visited attempts failed

Related to this task, some work has started to make the pypi lister list its
origins with the last_update information, in diff D5977 / T3399 (the review is done and
the implementation needs to be improved, but still ;).

Done and deployed.

And now, heads up: with the new pypi lister, the most prominent 'pypi' entry (at the
time, no_last_update at 313887) [1] decreased to 8 entries [2].

So what remains is the github entry with 118150 [1], but if I remember correctly,
@olasd mentioned that these are "old/invalid" origins from github. They will most likely
subside when the new scheduling policy [3] lands (if those origins no longer exist,
they will eventually get disabled).

[1] T2345#66559

[2]

14:25:30 softwareheritage-scheduler@belvedere:5432=> select now(), id, name, instance_name, visit_type, count(*) as total, count(*) filter (where last_scheduled is NULL) as not_scheduled, count(*) filter (where last_snapshot is null) as no_snapshot, count(*) filter(where last_update is null) as no_last_update from listed_origins left join origin_visit_stats using (url, visit_type) inner join listers on listed_origins.lister_id = listers.id group by id, visit_type having name='pypi';

+-------------------------------+--------------------------------------+------+---------------+------------+--------+---------------+-------------+----------------+
|              now              |                  id                  | name | instance_name | visit_type | total  | not_scheduled | no_snapshot | no_last_update |
+-------------------------------+--------------------------------------+------+---------------+------------+--------+---------------+-------------+----------------+
| 2021-07-09 12:52:09.438329+00 | 29c69bc1-e815-4f5a-b009-c6854697fec7 | pypi | pypi          | pypi       | 390659 |         75449 |       73223 |              8 |
+-------------------------------+--------------------------------------+------+---------------+------------+--------+---------------+-------------+----------------+
(1 row)

[3] T2345#67279

Updated stats in descending order on the no_last_update column:

15:01:21 softwareheritage-scheduler@belvedere:5432=> select now(), id, name, instance_name, visit_type, count(*) as total, count(*) filter (where last_scheduled is NULL) as not_scheduled, count(*) filter (where last_snapshot is null) as no_snapshot, count(*) filter(where last_update is null) as no_last_update from listed_origins left join origin_visit_stats using (url, visit_type) inner join listers on listed_origins.lister_id = listers.id group by id, visit_type order by no_last_update desc;
+-------------------------------+--------------------------------------+---------------+------------------------------+------------+-----------+---------------+-------------+----------------+
|              now              |                  id                  |     name      |        instance_name         | visit_type |   total   | not_scheduled | no_snapshot | no_last_update |
+-------------------------------+--------------------------------------+---------------+------------------------------+------------+-----------+---------------+-------------+----------------+
| 2021-07-09 13:01:37.678047+00 | 6632ef5e-322b-402b-8f28-d090f76ed6b7 | github        | github                       | git        | 180392112 |     170538428 |    61596806 |         118150 |
| 2021-07-09 13:01:37.678047+00 | 4d7a9674-e2c4-40ce-94b3-677e3a4e8995 | debian        | Debian                       | deb        |     35100 |         35100 |          85 |          35100 |
| 2021-07-09 13:01:37.678047+00 | 860d41f8-d0c0-4733-a4d8-437c386bc31f | save-code-now | archive.softwareheritage.org | git        |      2020 |          1855 |           0 |           2020 |
| 2021-07-09 13:01:37.678047+00 | ff34a2b5-2e81-4566-9627-61fab06f8f52 | phabricator   | kde                          | git        |      1036 |          1036 |        1025 |           1036 |
| 2021-07-09 13:01:37.678047+00 | 5bb5ddb7-2a78-4051-9d84-ec07a3834031 | debian        | Debian-Security              | deb        |       787 |           787 |         267 |            787 |
| 2021-07-09 13:01:37.678047+00 | 7fb5da29-6b90-4ce6-af17-5b1e2a56f794 | npm           | npm                          | npm        |   1629224 |       1448537 |        3302 |            209 |
| 2021-07-09 13:01:37.678047+00 | 31df8830-df11-4120-8e76-323981a9c88d | phabricator   | swh                          | git        |       189 |           189 |          29 |            189 |
| 2021-07-09 13:01:37.678047+00 | 1ac30f61-69a8-44b8-ae44-7e21514f0c2e | cgit          | fedora                       | git        |       866 |           850 |          53 |            110 |
| 2021-07-09 13:01:37.678047+00 | b35c74ea-b1b4-4dfc-858e-80809f6b5790 | cgit          | qt.io                        | git        |       278 |           234 |          13 |             50 |
| 2021-07-09 13:01:37.678047+00 | 0d4fa765-e989-40e1-b91f-8fe217729946 | phabricator   | blender                      | git        |        41 |            41 |          41 |             41 |
| 2021-07-09 13:01:37.678047+00 | d229771c-1610-4e2c-a67b-a8d5b6f1b43c | cgit          | tor                          | git        |       519 |           514 |         519 |             26 |
| 2021-07-09 13:01:37.678047+00 | 860d41f8-d0c0-4733-a4d8-437c386bc31f | save-code-now | archive.softwareheritage.org | hg         |        13 |            13 |           0 |             13 |
| 2021-07-09 13:01:37.678047+00 | 0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4 | CRAN          | cran                         | tar        |     18276 |         18276 |       18276 |             11 |
| 2021-07-09 13:01:37.678047+00 | 29c69bc1-e815-4f5a-b009-c6854697fec7 | pypi          | pypi                         | pypi       |    390659 |         75449 |       73223 |              8 |
| 2021-07-09 13:01:37.678047+00 | 7378f526-bb27-46c9-9940-25c240509dc6 | cgit          | git.gnu.org.ua               | git        |       145 |           124 |         145 |              7 |
| 2021-07-09 13:01:37.678047+00 | 0d4fa765-e989-40e1-b91f-8fe217729946 | phabricator   | blender                      | svn        |         6 |             6 |           1 |              6 |
| 2021-07-09 13:01:37.678047+00 | 860d41f8-d0c0-4733-a4d8-437c386bc31f | save-code-now | archive.softwareheritage.org | svn        |         6 |             6 |           0 |              6 |
| 2021-07-09 13:01:37.678047+00 | 7a775770-2b2f-4139-aacb-ad715c022b9d | cgit          | eclipse                      | git        |      1375 |          1305 |        1307 |              3 |
| 2021-07-09 13:01:37.678047+00 | cef20b61-30d4-4526-baa1-5db8a78f2b57 | cgit          | openembedded                 | git        |        16 |             9 |          16 |              1 |
| 2021-07-09 13:01:37.678047+00 | 3310a3f7-5f6e-4367-b93f-659033b1e735 | cgit          | baserock                     | git        |      1524 |          1500 |          70 |              1 |
| 2021-07-09 13:01:37.678047+00 | ca2fcc66-8844-480c-b0ba-a91f890b0554 | gitlab        | gnome                        | git        |     13176 |         12395 |        4771 |              0 |
| 2021-07-09 13:01:37.678047+00 | ceecf814-da90-43a4-8de0-dd853072145f | gitlab        | lip6                         | git        |        69 |            58 |          65 |              0 |
| 2021-07-09 13:01:37.678047+00 | df07990a-5e00-4ea7-af6e-2c03bcee028a | cgit          | alpinelinux                  | git        |         6 |             4 |           0 |              0 |
| 2021-07-09 13:01:37.678047+00 | f4b55119-837d-461b-bd3a-0e07d324aabf | gitlab        | riseup                       | git        |      1255 |          1207 |         428 |              0 |
| 2021-07-09 13:01:37.678047+00 | f4ea15d9-97b5-4bdc-a5d6-2062fe0acfef | cgit          | git.joeyh.name               | git        |        62 |            51 |          62 |              0 |
| 2021-07-09 13:01:37.678047+00 | f788a57f-6c9b-4c5e-8200-3359d5bf4405 | gitlab        | ow2                          | git        |      1297 |          1076 |         284 |              0 |
| 2021-07-09 13:01:37.678047+00 | fffaba23-b6ad-4c02-a6e7-dcff8170b6f0 | gitlab        | freedesktop                  | git        |      8008 |          7415 |        3448 |              0 |
| 2021-07-09 13:01:37.678047+00 | 194b1af4-ba03-438a-961d-7b0be7cdc7f3 | cgit          | gnu-savannah                 | git        |      1029 |          1027 |          39 |              0 |
| 2021-07-09 13:01:37.678047+00 | 25d2aec0-91ce-446c-be64-d1ff9a79dc4e | gitea         | git.fsfe.org                 | git        |       401 |           346 |          51 |              0 |
| 2021-07-09 13:01:37.678047+00 | 41f76395-d290-4610-9584-3b4a8be548c5 | cgit          | yoctoproject                 | git        |       175 |            26 |          10 |              0 |
| 2021-07-09 13:01:37.678047+00 | 43b9a56a-b6ff-4e43-b9ca-53e231715713 | cgit          | zx2c4                        | git        |       159 |           159 |          14 |              0 |
| 2021-07-09 13:01:37.678047+00 | 59354ffc-0a34-4140-8503-5f398a763097 | cgit          | git-kernel                   | git        |      1091 |           817 |         377 |              0 |
| 2021-07-09 13:01:37.678047+00 | 7338c20c-ffda-4a75-88fe-099f619a0fe2 | GNU           | GNU                          | tar        |       386 |           386 |          32 |              0 |
| 2021-07-09 13:01:37.678047+00 | 75627aa9-58c6-4ca6-ae11-eee5026bcefc | gitlab        | framagit                     | git        |     20399 |         19599 |        5238 |              0 |
| 2021-07-09 13:01:37.678047+00 | 7b7ef365-b065-4e46-be98-19d3ca6a1633 | gitlab        | common-lisp                  | git        |       825 |           780 |          65 |              0 |
| 2021-07-09 13:01:37.678047+00 | 9de8141b-e441-4ffd-b40d-d438b29c03fc | launchpad     | launchpad                    | git        |     23998 |         17054 |        4938 |              0 |
| 2021-07-09 13:01:37.678047+00 | a60261fe-1125-4a46-bd0b-c914709ca10e | bitbucket     | bitbucket                    | git        |   2803985 |       2762110 |     1118887 |              0 |
| 2021-07-09 13:01:37.678047+00 | a96dea47-13c0-4f11-bf96-70b576b604a0 | gitlab        | gite.lirmm                   | git        |       638 |           588 |         255 |              0 |
| 2021-07-09 13:01:37.678047+00 | b50360dc-2f66-4e40-a789-b22d9375e875 | gitea         | codeberg.org                 | git        |      8233 |          7632 |        4928 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | bzr        |       290 |           290 |         290 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | cvs        |     28622 |         28622 |       28622 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | git        |    181171 |         49365 |       57848 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | hg         |     27601 |         27601 |       27592 |              0 |
| 2021-07-09 13:01:37.678047+00 | b678cfc3-2780-4186-9186-d78a14bd4958 | sourceforge   | main                         | svn        |    101823 |          3221 |        5195 |              0 |
| 2021-07-09 13:01:37.678047+00 | b9b8e226-3452-4812-a6df-cab546b9ee11 | gitlab        | inria                        | git        |      3341 |          3056 |        1438 |              0 |
| 2021-07-09 13:01:37.678047+00 | baf89663-feae-4850-a8ec-3a21e699cc0b | gitlab        | gitlab                       | git        |    200200 |        176191 |       20643 |              0 |
+-------------------------------+--------------------------------------+---------------+------------------------------+------------+-----------+---------------+-------------+----------------+
(46 rows)

Time: 393492.776 ms (06:33.493)

Status update, after the recent refactoring done with @olasd to simplify the actual
implementation (backend and journal client). What remains to be done:

  • Refactor the journal client a bit to update a docstring and inline one function (done; those are the 2 previous commits mentioned above ^).
  • Deactivate failing visits (delegating to listers the job of re-enabling origins that come back to life). I have diffs dealing with this that need a rebase and some rework following the latest changes (I need to get back to them) [1].
  • Deploy the current scheduler implementation (master) once the previous point is done. (That's my goal before a vacation break.)
  • Later, try to implement an orchestrator of scheduling policies, driven by a feedback loop on the stats from that query [2].
  • Update the scheduler's listed_origins / scheduler_metrics tables with origins that do not come from actual listers (manual ingestion and so on).

[1] D5978 D5980

[2] Or some form of that query; it could perhaps come directly from the
scheduler_metrics table (whatever makes more sense), see T2345#67328.

ardumont changed the task status from Open to Work in Progress. Jul 30 2021, 3:55 PM

(^ for a while ;)

Deactivate failing visits (delegating to listers the job of re-enabling origins that
come back to life). I have diffs dealing with this that need a rebase and some
rework following the latest changes (I need to get back to them) [1].

Done, diffs D5978 and D5980 updated.

ardumont changed the status of subtask T3456: staging: Deploy scheduler v0.17 from Open to Work in Progress. Aug 6 2021, 12:11 PM
  • Refactor the journal client a bit to update a docstring and inline one function (done; those are the 2 previous commits mentioned above ^).
  • Deactivate failing visits (delegating to listers the job of re-enabling origins that come back to life). I have diffs dealing with this that need a rebase and some rework following the latest changes (I need to get back to them) [1].
  • Deploy the current scheduler implementation (master) once the previous point is done. (That's my goal before a vacation break.)

Done.

What's next, in summary (further subtasks should be created later):

  • T3674: Fill in the blanks for old manual ingestion
  • T3538: Send scheduler metrics to prometheus to be able to analyse ingestion trends
  • T3667: Determine the scheduling policies and ratio per visit type before actually orchestrating them
|----------------------------------------------------------------+-----------------------------------+-------|
| visit_type                                                     | scheduling policies               | ratio |
|----------------------------------------------------------------+-----------------------------------+-------|
| package-loader: archive, cran, debian, npm, nixguix, pypi, ... | already_visited_order_by_lag      |    50 |
|                                                                | never_visited_oldest_update_first |    50 |
|----------------------------------------------------------------+-----------------------------------+-------|
| git, svn, hg                                                   | already_visited_order_by_lag      |    49 |
|                                                                | never_visited_oldest_update_first |    49 |
|                                                                | origins_without_last_update       |     2 |
|----------------------------------------------------------------+-----------------------------------+-------|
  • T3667: Orchestrator of the next-gen scheduler runner; the algorithm would be something along the lines of the following:
for each visit type:
  - "queue filled-in" state: monitor the queue; when its size drops below a
    given threshold, goto "fill-in the void" state
  - "fill-in the void" state:
    for each policy (in proportion to its ratio) for that visit type:
    - fetch origins and schedule them
    - continue until the threshold is reached or the policy has no more results
    - when the threshold is reached, goto "queue filled-in" state
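
To make the idea above a bit more concrete, here is a rough Python sketch of one orchestration pass for a single visit type. It is only an illustration under assumptions, not the actual scheduler code: the function names (slots_per_policy, fill_queue), the fetch_origins callable and the way rounding leftovers are handed out are all hypothetical; only the policy names and ratios come from the table above.

```python
from typing import Callable, Dict, List

# Ratios for git, svn, hg, copied from the table above.
GIT_LIKE_RATIOS: Dict[str, int] = {
    "already_visited_order_by_lag": 49,
    "never_visited_oldest_update_first": 49,
    "origins_without_last_update": 2,
}


def slots_per_policy(ratios: Dict[str, int], free_slots: int) -> Dict[str, int]:
    """Split free_slots between policies in proportion to their ratio."""
    total = sum(ratios.values())
    shares = {
        policy: free_slots * ratio // total for policy, ratio in ratios.items()
    }
    # Slots lost to integer rounding go to the highest-ratio policies first
    # (an arbitrary choice made for this sketch).
    leftover = free_slots - sum(shares.values())
    for policy in sorted(ratios, key=lambda p: ratios[p], reverse=True)[:leftover]:
        shares[policy] += 1
    return shares


def fill_queue(
    queue_length: int,
    threshold: int,
    ratios: Dict[str, int],
    fetch_origins: Callable[[str, int], List[str]],
) -> List[str]:
    """Return the origins to schedule for one visit type.

    fetch_origins(policy, count) is a hypothetical callable returning at most
    `count` origin URLs for the given policy.
    """
    if queue_length >= threshold:
        # "queue filled-in" state: nothing to do until the queue drains.
        return []
    # "fill-in the void" state: share the free slots between policies.
    to_schedule: List[str] = []
    for policy, count in slots_per_policy(ratios, threshold - queue_length).items():
        # A policy may return fewer origins than requested (no more results).
        to_schedule.extend(fetch_origins(policy, count))
    return to_schedule
```

A real runner would call something like fill_queue periodically for each visit type, with the current queue length read from the task queue. If a policy runs dry, this sketch simply schedules fewer origins on that pass; redistributing the unused slots to the other policies is left open.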