runner: Separate scheduling tasks with and without priority concerns
ClosedPublic

Authored by ardumont on Jun 8 2021, 5:39 PM.

Details

Summary

In effect, this allows running 2 runners:

  • one for recurring tasks (no priority concern)
  • one for "save code now" tasks (with priority)

This should make it less likely that the scheduling of "save code now" tasks gets
stuck behind the main scheduler runner.

Related to T3367
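
As a rough illustration of the split described in the summary, here is a minimal sketch. It assumes a task carries an optional priority, and that each runner is started in one of two modes. The `Task` class, the `select_tasks_for_runner` helper, and the task type and priority strings are hypothetical and are not taken from the diff itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for a scheduler task (not the real swh model)."""

    id: int
    type: str
    priority: Optional[str] = None  # None means "no priority concern"


def select_tasks_for_runner(tasks: List[Task], with_priority: bool) -> List[Task]:
    """Return only the tasks a runner started in the given mode should handle.

    A runner started with with_priority=True only sees prioritized tasks
    (e.g. "save code now"); the other runner only sees the recurring ones.
    """
    if with_priority:
        return [t for t in tasks if t.priority is not None]
    return [t for t in tasks if t.priority is None]


tasks = [
    Task(1, "load-git"),
    Task(2, "load-git", priority="high"),
    Task(3, "load-pypi"),
]
recurring = select_tasks_for_runner(tasks, with_priority=False)
save_code_now = select_tasks_for_runner(tasks, with_priority=True)
```

With two processes, each calling the helper in a different mode, a large backlog of recurring tasks no longer delays the prioritized ones, which is the effect the summary is after.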

Test Plan

tox

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 21859
Build 33989: Phabricator diff pipeline on jenkins
Build 33988: arc lint + arc unit

Event Timeline

ardumont published this revision for review. Jun 8 2021, 5:40 PM
ardumont planned changes to this revision.
ardumont retitled this revision from wip/poc: runner: Separate scheduling tasks with and without priority concern to wip/poc: runner: Separate scheduling tasks with and without priority concerns.

Build is green

Patch application report for D5826 (id=20846)

Could not rebase; Attempt merge onto 9f7ab8fcdc...

Updating 9f7ab8f..b76c647
Fast-forward
 swh/scheduler/backend.py                 | 90 ++++++++++++++++++++++----------
 swh/scheduler/celery_backend/config.py   | 23 +++++++-
 swh/scheduler/celery_backend/runner.py   | 89 +++++++++++++++----------------
 swh/scheduler/cli/admin.py               | 38 ++++++++++++--
 swh/scheduler/cli/origin.py              | 65 +++++++++++++++++++++++
 swh/scheduler/interface.py               | 19 +++++++
 swh/scheduler/tests/test_celery_tasks.py | 14 +++--
 7 files changed, 252 insertions(+), 86 deletions(-)
Changes applied before test
commit b76c647b4fedb6ad3811a2f3c034b996db7c2a79
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

commit 974475fa08ebf9a31e68f89398633f97040f0d3e
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

commit 370ec4d66da913b409784bc949db402392594b0d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 2 15:59:15 2021 +0200

    Direct scheduling of origin visits in celery
    
    Summary:
    This stack of changes builds up to a CLI endpoint allowing us to schedule origin
    visits directly in Celery, bypassing the legacy scheduler entirely.
    
    This has zero test coverage save from old tests still passing, which is already
    something... It's being used on the actual production database to schedule
    actual tasks for git, npm and pypi.
    
    Included changes:
    
    - Drop duplicate docstring from backend
    - Make the origin visit scheduling cooldown configurable
    
    (Cosmetic changes)
    
    - Add a (longer) specific cooldown for failed origin visits
    - Add a specific cooldown for notfound origins
    
    Both of these changes prevent repeating visits on failing origins. This is
    necessary because, as we're using a consistent ordering with respect to the
    upstream information, we'd always be trying to load them, never reaching origins
    further down the stack. Listers should eventually disable these origins.
    
    - Add table sampling option to grab_next_visits
    
    Running common operations on all git origins is pretty intense. Using
    table sampling gives us the opportunity to at least schedule some jobs
    in (decently small) time.
    
    - Add a (very basic) scheduling policy for origins with no known last update
    
    This is especially useful for pypi, as well as some git hosters that do not
    provide the right info in their APIs. We will need to implement smarter
    heuristics to avoid repeated uneventful visits on these origins.
    
    - Split off the helper for available slots in a celery queue
    
    This is needed for the send-to-celery subcommand as well, so split it off of the
    runner module.
    
    - Add a swh scheduler origin send-to-celery subcommand
    
    Yes, finally!
    
    Test Plan: obviously needs at least /some/ test coverage.
    
    Reviewers: #reviewers
    
    Subscribers: ardumont
    
    Differential Revision: https://forge.softwareheritage.org/D5809

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/354/ for more details.

Build is green

Patch application report for D5826 (id=20886)

Rebasing onto 9d2618db8f...

Current branch diff-target is up to date.
Changes applied before test
commit 091336179afad8c4f4b97ffed18644a076893efc
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/358/ for more details.

ardumont edited the summary of this revision. (Show Details)
  • Rework docstring
  • Fetch all the task types from within the run_ready_tasks function

Build is green

Patch application report for D5826 (id=20897)

Rebasing onto 9d2618db8f...

Current branch diff-target is up to date.
Changes applied before test
commit 4a2adc01fcfe4a63bbf06ae406a87851a12b931b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    In effect, this will allow to run 2 runners:
    - one for recurring tasks
    - one for the save code now
    
    This should decrease the probability of the scheduling tasks for the save code now to be
    stuck behind the main scheduler runner.
    
    Related to T3367

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/359/ for more details.

Build is green

Patch application report for D5826 (id=20908)

Rebasing onto 21c4279b99...

Current branch diff-target is up to date.
Changes applied before test
commit 0bafdccd09333aae5bdb81e496f0a09eabe51b35
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    In effect, this will allow to run 2 runners:
    - one for recurring tasks
    - one for the save code now
    
    This should decrease the probability of the scheduling tasks for the save code now to be
    stuck behind the main scheduler runner.
    
    Related to T3367

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/361/ for more details.

ardumont retitled this revision from wip/poc: runner: Separate scheduling tasks with and without priority concerns to runner: Separate scheduling tasks with and without priority concerns.
ardumont edited the summary of this revision. (Show Details)
ardumont edited the test plan for this revision. (Show Details)

Build is green

Patch application report for D5826 (id=20917)

Rebasing onto 21c4279b99...

Current branch diff-target is up to date.
Changes applied before test
commit f71a716f478ee8bfcf7f4e26f387768a89276deb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    In effect, this will allow to run 2 runners:
    - one for recurring tasks
    - one for the save code now
    
    This should decrease the probability of the scheduling tasks for the save code now to be
    stuck behind the main scheduler runner.
    
    Related to T3367

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/362/ for more details.

ardumont edited the test plan for this revision. (Show Details)

Update tests and comments

Build is green

Patch application report for D5826 (id=20925)

Rebasing onto 21c4279b99...

Current branch diff-target is up to date.
Changes applied before test
commit c7707b5c836c3f58bace115eb398599a989845aa
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    In effect, this will allow to run 2 runners:
    - one for recurring tasks
    - one for the save code now
    
    This should decrease the probability of the scheduling tasks for the save code now to be
    stuck behind the main scheduler runner.
    
    Related to T3367

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/363/ for more details.

vsellier added a subscriber: vsellier.

LGTM, still not a big fan of the usage of random in the tests ;), but otherwise, it matches what you explained to me this morning

This revision is now accepted and ready to land. Jun 10 2021, 3:56 PM

LGTM,

\o/

still not a big fan of the usage of random in the tests ;), but otherwise, it matches what you explained to me this morning

lol, yeah, but I'm not a big fan of hard-coding, say, the first element here.
That's why I did that: I did not want to choose myself ;)
but meh ;)
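
The trade-off discussed above (random choice versus hard-coding the first element) has a possible middle ground, sketched here as an assumption and not as what the diff does. Drawing from a seeded `random.Random` avoids picking an element by hand, yet selects the same element on every run. The helper name `pick_task_type` and the task type strings are hypothetical.

```python
import random
from typing import List, Optional


def pick_task_type(task_types: List[str], seed: Optional[int] = None) -> str:
    """Pick one task type; with a fixed seed the choice is reproducible."""
    rng = random.Random(seed)  # private generator, global random state untouched
    return rng.choice(task_types)


task_types = ["load-git", "load-pypi", "load-npm"]

# Same seed, same pick: a failing test can be replayed exactly.
first_pick = pick_task_type(task_types, seed=42)
second_pick = pick_task_type(task_types, seed=42)
```

Logging the seed in the test output would let a failure seen in CI be reproduced locally with the same element.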