
Direct scheduling of origin visits in celery
Abandoned, Public, Draft

Authored by ardumont on Jun 1 2021, 8:29 PM.

Details

Summary

This stack of changes builds up to a CLI endpoint allowing us to schedule origin
visits directly in Celery, bypassing the legacy scheduler entirely.

This has no dedicated test coverage, save for old tests still passing, which is already
something... It's being used on the actual production database to schedule
actual tasks for git, npm and pypi.

Included changes:

  • Drop duplicate docstring from backend
  • Make the origin visit scheduling cooldown configurable

(The two changes above are cosmetic.)

  • Add a (longer) specific cooldown for failed origin visits
  • Add a specific cooldown for notfound origins

Both of these changes prevent repeating visits on failing origins. This is
necessary because, as we're using a consistent ordering with respect to the
upstream information, we'd always be trying to load them, never reaching origins
further down the stack. Listers should eventually disable these origins.
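
As a rough illustration of the per-status cooldowns described above, here is a minimal sketch. It is not the swh.scheduler implementation: the status names, the durations and the `is_eligible` helper are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical cooldown values; the real ones are configurable and are not
# stated in this diff.
COOLDOWNS = {
    "successful": timedelta(days=7),
    "failed": timedelta(days=14),     # longer cooldown for failed visits
    "not_found": timedelta(days=31),  # specific cooldown for notfound origins
}


def is_eligible(
    last_status: Optional[str], last_visit: Optional[datetime], now: datetime
) -> bool:
    """Return True if an origin may be scheduled again (illustrative only)."""
    if last_status is None or last_visit is None:
        # Never visited: nothing to cool down from.
        return True
    # Unknown statuses fall back to the default ("successful") cooldown.
    cooldown = COOLDOWNS.get(last_status, COOLDOWNS["successful"])
    return now - last_visit >= cooldown
```

With a filter like this, a failing origin at the top of the ordering is skipped until its cooldown expires, so origins further down get a chance to be scheduled.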

  • Add table sampling option to grab_next_visits

Running common operations on all git origins is pretty intense. Using
table sampling gives us the opportunity to at least schedule some jobs
in (decently small) time.
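
For context, PostgreSQL's `TABLESAMPLE` clause lets a query read only a random fraction of a table. The sketch below shows one way an optional sampling percentage could be spliced into a candidate-selection query. The table name, column names and `build_candidates_query` helper are hypothetical and do not reflect the actual `grab_next_visits` query.

```python
from typing import List, Optional, Tuple


def build_candidates_query(
    tablesample: Optional[float], limit: int
) -> Tuple[str, tuple]:
    """Build a parametrized query, optionally sampling the table.

    `tablesample` is a percentage of the table to scan, or None to scan it all.
    """
    params: List[object] = []
    sample_clause = ""
    if tablesample is not None:
        if not 0 < tablesample <= 100:
            raise ValueError("tablesample must be a percentage in (0, 100]")
        # SYSTEM sampling picks whole table blocks, which is cheap but coarse.
        sample_clause = " TABLESAMPLE SYSTEM (%s)"
        params.append(tablesample)
    # "listed_origins", "url" and "last_update" are assumed names.
    query = (
        f"SELECT url FROM listed_origins{sample_clause} "
        "ORDER BY last_update LIMIT %s"
    )
    params.append(limit)
    return query, tuple(params)
```

The trade-off is that each run only sees a random subset of origins, so the ordering is only respected within that sample.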

  • Add a (very basic) scheduling policy for origins with no known last update

This is especially useful for pypi, as well as some git hosters that do not
provide the right info in their APIs. We will need to implement smarter
heuristics to avoid repeated uneventful visits on these origins.
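
A very basic policy of this kind could look like the sketch below. This is an assumption about the general shape, not the policy implemented in this diff: the `Origin` class, its fields and the ordering (never-visited origins first, then least recently visited) are chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Origin:
    url: str
    last_update: Optional[datetime]  # upstream info from the lister, if any
    last_visit: Optional[datetime]   # when we last visited it, if ever


def origins_without_last_update(origins: List[Origin], count: int) -> List[Origin]:
    """Pick up to `count` origins that have no known upstream last update."""
    candidates = [o for o in origins if o.last_update is None]
    # Never-visited origins sort first, then the least recently visited ones.
    candidates.sort(
        key=lambda o: (
            o.last_visit is not None,
            o.last_visit.timestamp() if o.last_visit else 0.0,
        )
    )
    return candidates[:count]
```

Without smarter heuristics, a policy like this keeps cycling through the same origins whether or not they changed, which is the repeated uneventful visit problem mentioned above.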

  • Split off the helper for available slots in a celery queue

This is needed for the send-to-celery subcommand as well, so split it off of the
runner module.
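
The idea behind such a helper can be sketched as follows. The function name, its signature and the way the queue length is obtained are assumptions; the real helper talks to the Celery broker, which is replaced here by a plain callable so the example stays self-contained.

```python
from typing import Callable, Optional


def available_slots(
    get_queue_length: Callable[[str], Optional[int]],
    queue_name: str,
    max_length: int,
) -> int:
    """Return how many more tasks fit in the queue before it reaches max_length."""
    length = get_queue_length(queue_name)
    if length is None:
        # Assumption: a queue the broker does not know about yet is empty.
        return max_length
    # Never report a negative number of slots for an overfull queue.
    return max(0, max_length - length)
```

Both the runner and a send-to-celery style command can then ask for the number of free slots and send at most that many tasks.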

  • Add a swh scheduler origin send-to-celery subcommand

Yes, finally!

Test Plan

Obviously needs at least /some/ test coverage.

Event Timeline

Build is green

Patch application report for D5809 (id=20743)

Rebasing onto 9f7ab8fcdc...

Current branch diff-target is up to date.
Changes applied before test
commit 0d9470049d9d703df2a904ba433f6ef63b3617e7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 20:04:11 2021 +0200

    Add a swh scheduler origin send-to-celery subcommand
    
    The subcommand bypasses the legacy task-based mechanism to directly send
    new origin visits to celery

commit 3a41707a404911faf211d1800cd484129bd8fe0f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 20:03:24 2021 +0200

    Split off the helper for available slots in a celery queue

commit 4c8854b6bf29f48afc3ac1b1d8ca9a23782dee3a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 19:17:16 2021 +0200

    Add a scheduling policy for origins with no known last update

commit c576ff58f2188664a5f5b59db65d53acfa00093e
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:48:05 2021 +0200

    Add table sampling option to grab_next_visits
    
    Running common operations on all git origins is pretty intense. Using
    table sampling gives us the opportunity to at least schedule some jobs
    in (decently small) time.

commit e6b384a7310fd18f16fc9e8019ea9c352d48b28b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:47:19 2021 +0200

    Add a specific cooldown for notfound origins
    
    This allows us to avoid repeating visits on them, until a next pass of
    the lister can mark them as disabled.

commit e015f17aa030dce39508b9ea909e102ad25be089
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:46:19 2021 +0200

    Add a (longer) specific cooldown for failed origin visits

commit 8b54e308271b813ebceb1940e300cc2d82e9321f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:44:59 2021 +0200

    Make the origin visit scheduling cooldown configurable

commit 66a3edb2525b488aabeee8755e48768ed132a959
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:43:32 2021 +0200

    Drop duplicate docstring from backend

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/350/ for more details.

looks like a promising start ;)

Rebase:

  • Drop duplicate docstring from backend
  • Make the origin visit scheduling cooldown configurable
  • Add a (longer) specific cooldown for failed origin visits
  • Add a specific cooldown for notfound origins
  • Add table sampling option to grab_next_visits
  • Add a scheduling policy for origins with no known last update
  • Add a swh scheduler origin send-to-celery subcommand

Build is green

Patch application report for D5809 (id=21156)

Rebasing onto c7707b5c83...

Current branch diff-target is up to date.
Changes applied before test
commit 131592079af53ca71a4287b1a23c78fc19d27eb1
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 20:04:11 2021 +0200

    Add a swh scheduler origin send-to-celery subcommand
    
    The subcommand bypasses the legacy task-based mechanism to directly send
    new origin visits to celery

commit 9281c13ba1b16958cd5bbc7416818e9dc76d2313
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 19:17:16 2021 +0200

    Add a scheduling policy for origins with no known last update

commit b12a60871073be6d68639c6605df153563c8f5bf
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:48:05 2021 +0200

    Add table sampling option to grab_next_visits
    
    Running common operations on all git origins is pretty intense. Using
    table sampling gives us the opportunity to at least schedule some jobs
    in (decently small) time.

commit ca582400fef3b4be8cc864c99679f7d7a4732da5
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:47:19 2021 +0200

    Add a specific cooldown for notfound origins
    
    This allows us to avoid repeating visits on them, until a next pass of
    the lister can mark them as disabled.

commit 9fecfd159c3ff07a19b4c41df0986e59d05c7788
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:46:19 2021 +0200

    Add a (longer) specific cooldown for failed origin visits

commit 32521b918443ae5300a19ce54d4ade7f4d2f272d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:44:59 2021 +0200

    Make the origin visit scheduling cooldown configurable

commit 9e1b4145fe178e4bb178bb21895bf294afbb4e58
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:43:32 2021 +0200

    Drop duplicate docstring from backend

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/365/ for more details.

Reduce stack on top of D5901

  • Add a (longer) specific cooldown for failed origin visits
  • Add a specific cooldown for notfound origins
  • Add table sampling option to grab_next_visits
  • Add a scheduling policy for origins with no known last update
  • Add a swh scheduler origin send-to-celery subcommand

Build is green

Patch application report for D5809 (id=21164)

Could not rebase; Attempt merge onto 9e1b4145fe...

Updating 9e1b414..8669024
Fast-forward
 swh/scheduler/backend.py              | 69 ++++++++++++++++++++++------
 swh/scheduler/cli/origin.py           | 46 +++++++++++++++++++
 swh/scheduler/interface.py            | 12 +++++
 swh/scheduler/tests/test_scheduler.py | 86 +++++++++++++++++++++++++++++++----
 4 files changed, 189 insertions(+), 24 deletions(-)
Changes applied before test
commit 866902452dd62c87fafcd47bca791a1950852756
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 20:04:11 2021 +0200

    Add a swh scheduler origin send-to-celery subcommand
    
    The subcommand bypasses the legacy task-based mechanism to directly send
    new origin visits to celery

commit 6092cfb12314788c06697ba10f535af37fd726ec
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 19:17:16 2021 +0200

    Add a scheduling policy for origins with no known last update

commit fbccf2f7138f2c5b099e0947aa6362a9f719c3d3
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:48:05 2021 +0200

    Add table sampling option to grab_next_visits
    
    Running common operations on all git origins is pretty intense. Using
    table sampling gives us the opportunity to at least schedule some jobs
    in (decently small) time.

commit 33394cffdf79d9ee8a8f7aaff20cd1425359f5ec
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:47:19 2021 +0200

    Add a specific cooldown for notfound origins
    
    This allows us to avoid repeating visits on them, until a next pass of
    the lister can mark them as disabled.

commit db31605d23a17e3e1e8049100f4d11e67c4d427f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:46:19 2021 +0200

    Add a (longer) specific cooldown for failed origin visits

commit 4027b3ef7b832036146525849faee78cfbc0091c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jun 21 16:34:21 2021 +0200

    Make the origin visit scheduling cooldown configurable

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/368/ for more details.

Build is green

Patch application report for D5809 (id=21220)

Could not rebase; Attempt merge onto 7f51f274ed...

Updating 7f51f27..4b8ab17
Fast-forward
 swh/scheduler/backend.py              |  69 ++++++++++++++++-----
 swh/scheduler/cli/origin.py           |  46 ++++++++++++++
 swh/scheduler/interface.py            |  12 ++++
 swh/scheduler/tests/test_scheduler.py | 112 +++++++++++++++++++++++++++++++---
 4 files changed, 215 insertions(+), 24 deletions(-)
Changes applied before test
commit 4b8ab179bbcb660df944e44a18908c82c92bf28f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 20:04:11 2021 +0200

    Add a swh scheduler origin send-to-celery subcommand
    
    The subcommand bypasses the legacy task-based mechanism to directly send
    new origin visits to celery

commit 049ef2704ccb4183771f5410fbe106588d3b5c79
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 19:17:16 2021 +0200

    Add a scheduling policy for origins with no known last update

commit 4f11e8edca8ad0ffe4885cc2fff737c40aa8e4a5
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:48:05 2021 +0200

    Add table sampling option to grab_next_visits
    
    Running common operations on all git origins is pretty intense. Using
    table sampling gives us the opportunity to at least schedule some jobs
    in (decently small) time.

commit ed818702c49c4c29ce8f648050a92e28873944d0
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 1 15:47:19 2021 +0200

    Add a specific cooldown for notfound origins
    
    This allows us to avoid repeating visits on them, until a next pass of
    the lister can mark them as disabled.

commit 651ddcc6cec829429f3e449e77f2250fa1ff2a24
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jun 21 17:34:00 2021 +0200

    Add a (longer) specific cooldown for failed origin visits

commit ce8608d1f8887993ae6ddf56169cb4c117243461
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jun 21 17:36:00 2021 +0200

    Make the origin visit scheduling cooldown configurable

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/378/ for more details.

ardumont added a reviewer: olasd.

I'll commandeer this as I need to rebase it on top of v0.17.

Rebase on top of v0.17 (current origin/master)

Build is green

Patch application report for D5809 (id=22034)

Rebasing onto 8281e351d6...

Current branch diff-target is up to date.
Changes applied before test
commit ecb0843c44ce37d178f058855ac64b0410092478
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 2 15:59:15 2021 +0200

    Direct scheduling of origin visits in celery
    
    Summary:
    This stack of changes builds up to a CLI endpoint allowing us to schedule origin
    visits directly in Celery, bypassing the legacy scheduler entirely.
    
    This has zero test coverage save from old tests still passing, which is already
    something... It's being used on the actual production database to schedule
    actual tasks for git, npm and pypi.
    
    Included changes:
    
    - Drop duplicate docstring from backend
    - Make the origin visit scheduling cooldown configurable
    
    (Cosmetic changes)
    
    - Add a (longer) specific cooldown for failed origin visits
    - Add a specific cooldown for notfound origins
    
    Both of these changes prevent repeating visits on failing origins. This is
    necessary because, as we're using a consistent ordering with respect to the
    upstream information, we'd always be trying to load them, never reaching origins
    further down the stack. Listers should eventually disable these origins.
    
    - Add table sampling option to grab_next_visits
    
    Running common operations on all git origins is pretty intense. Using
    table sampling gives us the opportunity to at least schedule some jobs
    in (decently small) time.
    
    - Add a (very basic) scheduling policy for origins with no known last update
    
    This is especially useful for pypi, as well as some git hosters that do not
    provide the right info in their APIs. We will need to implement smarter
    heuristics to avoid repeated uneventful visits on these origins.
    
    - Split off the helper for available slots in a celery queue
    
    This is needed for the send-to-celery subcommand as well, so split it off of the
    runner module.
    
    - Add a swh scheduler origin send-to-celery subcommand
    
    Yes, finally!
    
    Test Plan: obviously needs at least /some/ test coverage.
    
    Reviewers: #reviewers
    
    Subscribers: ardumont
    
    Differential Revision: https://forge.softwareheritage.org/D5809

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/433/ for more details.

Abandoning this in favor of dedicated diffs [1]

[1] D6145 D6146 D6147