
send-to-celery: Add more options to allow scheduling of edge case origins
Closed · Public

Authored by ardumont on Jun 3 2021, 4:07 PM.

Details

Summary

In some cases, we may want to schedule visits for specific origins (for example, origins
that are not yet enabled, or origins coming from a specific lister).

Example use case:

swh scheduler -C $SWH_CONFIG_FILENAME  \
  origin send-to-celery \
    --policy never_visited_oldest_update_first  \
    --only-disabled \
    --lister-uuid 'b678cfc3-2780-4186-9186-d78a14bd4958' \
    --queue oneshot:swh.loader.git.tasks.UpdateGitRepository \
    git
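
To make the intent of the two new options concrete, here is a minimal in-memory sketch of the selection they describe. The `ListedOrigin` dataclass, its field names, and the `select_origins` helper are hypothetical stand-ins used only for illustration; the actual selection is done by the scheduler backend and is not shown on this page.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional
from uuid import UUID


@dataclass(frozen=True)
class ListedOrigin:
    # Hypothetical record; the field names are assumptions for this sketch.
    url: str
    visit_type: str
    lister_id: UUID
    enabled: bool


def select_origins(
    origins: Iterable[ListedOrigin],
    visit_type: str,
    enabled: bool = True,
    lister_uuid: Optional[UUID] = None,
) -> List[ListedOrigin]:
    """Keep origins of the given visit type whose enabled flag matches,
    optionally restricted to a single lister."""
    return [
        origin
        for origin in origins
        if origin.visit_type == visit_type
        and origin.enabled == enabled
        and (lister_uuid is None or origin.lister_id == lister_uuid)
    ]
```

With `enabled=False` and a `lister_uuid`, this mirrors the example command above: only disabled git origins from that one lister are kept.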

Related to T3350
Depends on D5809

Test Plan

none

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
send-to-celery
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22993
Build 35849: Phabricator diff pipeline on jenkins
Build 35848: arc lint + arc unit

Unit Tests: Failed

Time   Test
39 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.scheduler.tests.test_cli::test_cli_task_runner_no_task
       swh_scheduler = <swh.scheduler.backend.SchedulerBackend object at 0x7f20300bc4a8> storage = <swh.storage.in_memory.InMemoryStorage object at 0x7f20293ced68>
43 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.scheduler.tests.test_cli::test_cli_task_runner_unknown_task_types
       self = <AliasedGroup scheduler> args = ['start-runner', '--task-type', 'swh-test-multiping', '--task-type', 'unknown-task-type'] prog_name = 'scheduler', complete_var = None, standalone_mode = True
45 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.scheduler.tests.test_cli::test_cli_task_runner_with_known_tasks[--with-priority]
       self = <AliasedGroup scheduler> args = ['start-runner', '--with-priority', '--task-type', 'swh-test-error', '--task-type', 'swh-test-error'] prog_name = 'scheduler', complete_var = None, standalone_mode = True
41 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.scheduler.tests.test_cli::test_cli_task_runner_with_known_tasks[--without-priority]
       self = <AliasedGroup scheduler> args = ['start-runner', '--without-priority', '--task-type', 'swh-test-error', '--task-type', 'swh-test-ping'] prog_name = 'scheduler', complete_var = None, standalone_mode = True
7 ms   Jenkins > .tox.py3.lib.python3.7.site-packages.swh.scheduler.cli.task::swh.scheduler.cli.task.pretty_print_task

Full test results: 4 Failed · 322 Passed · 1 Skipped

Event Timeline

Build is green

Patch application report for D5818 (id=20791)

Could not rebase; Attempt merge onto 9f7ab8fcdc...

Updating 9f7ab8f..b39af7e
Fast-forward
 swh/scheduler/backend.py               | 90 +++++++++++++++++++++++-----------
 swh/scheduler/celery_backend/config.py | 22 ++++++++-
 swh/scheduler/celery_backend/runner.py | 24 ++-------
 swh/scheduler/cli/origin.py            | 64 ++++++++++++++++++++++++
 swh/scheduler/interface.py             | 19 +++++++
 5 files changed, 170 insertions(+), 49 deletions(-)
Changes applied before test
commit b39af7e83c2ec66089f2ba65573463144f9423f3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

commit 370ec4d66da913b409784bc949db402392594b0d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 2 15:59:15 2021 +0200

    Direct scheduling of origin visits in celery
    
    Summary:
    This stack of changes builds up to a CLI endpoint allowing us to schedule origin
    visits directly in Celery, bypassing the legacy scheduler entirely.
    
    This has zero test coverage save from old tests still passing, which is already
    something... It's being used on the actual production database to schedule
    actual tasks for git, npm and pypi.
    
    Included changes:
    
    - Drop duplicate docstring from backend
    - Make the origin visit scheduling cooldown configurable
    
    (Cosmetic changes)
    
    - Add a (longer) specific cooldown for failed origin visits
    - Add a specific cooldown for notfound origins
    
    Both of these changes prevent repeating visits on failing origins. This is
    necessary because, as we're using a consistent ordering with respect to the
    upstream information, we'd always be trying to load them, never reaching origins
    further down the stack. Listers should eventually disable these origins.
    
    - Add table sampling option to grab_next_visits
    
    Running common operations on all git origins is pretty intense. Using
    table sampling gives us the opportunity to at least schedule some jobs
    in (decently small) time.
    
    - Add a (very basic) scheduling policy for origins with no known last update
    
    This is especially useful for pypi, as well as some git hosters that do not
    provide the right info in their APIs. We will need to implement smarter
    heuristics to avoid repeated uneventful visits on these origins.
    
    - Split off the helper for available slots in a celery queue
    
    This is needed for the send-to-celery subcommand as well, so split it off of the
    runner module.
    
    - Add a swh scheduler origin send-to-celery subcommand
    
    Yes, finally!
    
    Test Plan: obviously needs at least /some/ test coverage.
    
    Reviewers: #reviewers
    
    Subscribers: ardumont
    
    Differential Revision: https://forge.softwareheritage.org/D5809

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/351/ for more details.
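
The D5809 commit message above introduces separate cooldowns for failed and notfound origins so that a fixed ordering does not keep retrying the same failing origins. A minimal sketch of that idea follows. The status names, the cooldown durations, and the `is_eligible` helper are all assumptions made for illustration; they are not taken from the scheduler code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical cooldowns, chosen only for illustration. The commit message
# only says the failed cooldown is "longer" than the regular one.
COOLDOWNS = {
    "successful": timedelta(hours=12),
    "failed": timedelta(days=14),
    "not_found": timedelta(days=31),
}


def is_eligible(
    last_status: Optional[str],
    last_visit: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True if the origin may be scheduled again at `now`."""
    if last_visit is None:
        # Never visited: nothing to cool down from.
        return True
    # Unknown statuses fall back to the regular cooldown.
    cooldown = COOLDOWNS.get(last_status, COOLDOWNS["successful"])
    return now - last_visit >= cooldown
```

Because failing origins are held back for longer, a scheduler that always walks origins in the same order can move past them and reach origins further down the list.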

Build is green

Patch application report for D5818 (id=20792)

Could not rebase; Attempt merge onto 9f7ab8fcdc...

Updating 9f7ab8f..af8b91e
Fast-forward
 swh/scheduler/backend.py               | 90 +++++++++++++++++++++++-----------
 swh/scheduler/celery_backend/config.py | 22 ++++++++-
 swh/scheduler/celery_backend/runner.py | 24 ++-------
 swh/scheduler/cli/origin.py            | 66 +++++++++++++++++++++++++
 swh/scheduler/interface.py             | 19 +++++++
 5 files changed, 172 insertions(+), 49 deletions(-)
Changes applied before test
commit af8b91e31ae52534298ab597983ebefab7563c60
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/352/ for more details.

Adapt get_available_slots so it works in those edge cases.

Build is green

Patch application report for D5818 (id=20793)

Could not rebase; Attempt merge onto 9f7ab8fcdc...

Updating 9f7ab8f..7a6f936
Fast-forward
 swh/scheduler/backend.py               | 90 +++++++++++++++++++++++-----------
 swh/scheduler/celery_backend/config.py | 23 ++++++++-
 swh/scheduler/celery_backend/runner.py | 24 ++-------
 swh/scheduler/cli/origin.py            | 66 +++++++++++++++++++++++++
 swh/scheduler/interface.py             | 19 +++++++
 5 files changed, 173 insertions(+), 49 deletions(-)
Changes applied before test
commit 7a6f936e943855001fd2da8adba05ae3303dee36
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/353/ for more details.

ardumont added a subscriber: olasd.
ardumont added inline comments.
swh/scheduler/celery_backend/config.py
246 ↗(On Diff #20793)

@olasd This ^ might be worth integrating into your diff, which this diff builds upon.

(In my case, the get_queue_length call returns None for queue_length, so the initial except clause is not triggered, and the last instruction then raises a TypeError because, as expected, you can't mix None and an integer.)

swh/scheduler/celery_backend/config.py
246 ↗(On Diff #20793)

integrated into master already through D5846
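
The failure mode described in this thread can be sketched as follows. This is not the code from `swh/scheduler/celery_backend/config.py`; `available_slots`, the `MAX_NUM_TASKS` value, and the policy of treating an unknown queue length as "schedule up to the maximum" are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical upper bound on tasks sent in one batch.
MAX_NUM_TASKS = 10000


def available_slots(queue_length: Optional[int], max_length: Optional[int]) -> int:
    """Number of tasks that may still be sent to a queue.

    `queue_length` may be None when the broker cannot report it; computing
    `max_length - None` would raise a TypeError, so that case is handled
    before doing any arithmetic.
    """
    if not max_length:
        # No configured limit: allow a full batch.
        return MAX_NUM_TASKS
    if queue_length is None:
        # Unknown queue length (assumed policy): allow a full batch.
        return MAX_NUM_TASKS
    # Clamp the result to the range [0, MAX_NUM_TASKS].
    return max(0, min(max_length - queue_length, MAX_NUM_TASKS))
```

An alternative with the same effect is to keep the subtraction inside a `try` block and catch `TypeError` alongside the error already being handled.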

Build has FAILED

Patch application report for D5818 (id=22035)

Could not rebase; Attempt merge onto 8281e351d6...

Updating 8281e35..6cbd735
Fast-forward
 swh/scheduler/backend.py                 | 26 ++++++++++++-
 swh/scheduler/celery_backend/runner.py   | 15 ++++----
 swh/scheduler/cli/admin.py               | 27 ++++++++-----
 swh/scheduler/cli/origin.py              | 65 ++++++++++++++++++++++++++++++++
 swh/scheduler/interface.py               | 10 +++++
 swh/scheduler/tests/test_celery_tasks.py | 12 ++++--
 6 files changed, 131 insertions(+), 24 deletions(-)
Changes applied before test
commit 6cbd735c86fa40ab9293f2a9baad42c9a173688d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

commit 286e8b4ecf3cc229c3cc68e10f61773fb2017503
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

commit ecb0843c44ce37d178f058855ac64b0410092478
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 2 15:59:15 2021 +0200

    Direct scheduling of origin visits in celery
    
    Summary:
    This stack of changes builds up to a CLI endpoint allowing us to schedule origin
    visits directly in Celery, bypassing the legacy scheduler entirely.
    
    This has zero test coverage save from old tests still passing, which is already
    something... It's being used on the actual production database to schedule
    actual tasks for git, npm and pypi.
    
    Included changes:
    
    - Drop duplicate docstring from backend
    - Make the origin visit scheduling cooldown configurable
    
    (Cosmetic changes)
    
    - Add a (longer) specific cooldown for failed origin visits
    - Add a specific cooldown for notfound origins
    
    Both of these changes prevent repeating visits on failing origins. This is
    necessary because, as we're using a consistent ordering with respect to the
    upstream information, we'd always be trying to load them, never reaching origins
    further down the stack. Listers should eventually disable these origins.
    
    - Add table sampling option to grab_next_visits
    
    Running common operations on all git origins is pretty intense. Using
    table sampling gives us the opportunity to at least schedule some jobs
    in (decently small) time.
    
    - Add a (very basic) scheduling policy for origins with no known last update
    
    This is especially useful for pypi, as well as some git hosters that do not
    provide the right info in their APIs. We will need to implement smarter
    heuristics to avoid repeated uneventful visits on these origins.
    
    - Split off the helper for available slots in a celery queue
    
    This is needed for the send-to-celery subcommand as well, so split it off of the
    runner module.
    
    - Add a swh scheduler origin send-to-celery subcommand
    
    Yes, finally!
    
    Test Plan: obviously needs at least /some/ test coverage.
    
    Reviewers: #reviewers
    
    Subscribers: ardumont
    
    Differential Revision: https://forge.softwareheritage.org/D5809

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/434/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/434/console

Fix missing conflict resolution

Build has FAILED

Patch application report for D5818 (id=22036)

Could not rebase; Attempt merge onto 8281e351d6...

Updating 8281e35..f3c9067
Fast-forward
 swh/scheduler/backend.py                 | 26 ++++++++++++-
 swh/scheduler/celery_backend/runner.py   | 17 +++------
 swh/scheduler/cli/admin.py               | 27 ++++++++-----
 swh/scheduler/cli/origin.py              | 65 ++++++++++++++++++++++++++++++++
 swh/scheduler/interface.py               | 10 +++++
 swh/scheduler/tests/test_celery_tasks.py | 12 ++++--
 6 files changed, 130 insertions(+), 27 deletions(-)
Changes applied before test
commit f3c9067f21f06e73ef55d4d1235905339db86899
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

commit 286e8b4ecf3cc229c3cc68e10f61773fb2017503
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/435/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/435/console

Build has FAILED

Patch application report for D5818 (id=22037)

Could not rebase; Attempt merge onto 8281e351d6...

Updating 8281e35..889457e
Fast-forward
 swh/scheduler/backend.py                 | 26 ++++++++++++-
 swh/scheduler/celery_backend/runner.py   | 17 +++------
 swh/scheduler/cli/admin.py               | 22 +++++++----
 swh/scheduler/cli/origin.py              | 65 ++++++++++++++++++++++++++++++++
 swh/scheduler/interface.py               | 10 +++++
 swh/scheduler/tests/test_celery_tasks.py | 12 ++++--
 6 files changed, 128 insertions(+), 24 deletions(-)
Changes applied before test
commit 889457e053a73ab7c19f9f874f98ea9f3d1e5318
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

commit 286e8b4ecf3cc229c3cc68e10f61773fb2017503
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/436/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/436/console

Build is green

Patch application report for D5818 (id=22038)

Could not rebase; Attempt merge onto 8281e351d6...

Updating 8281e35..24691cd
Fast-forward
 swh/scheduler/backend.py                 | 26 ++++++++++++-
 swh/scheduler/celery_backend/runner.py   | 17 +++------
 swh/scheduler/cli/admin.py               | 11 +++++-
 swh/scheduler/cli/origin.py              | 65 ++++++++++++++++++++++++++++++++
 swh/scheduler/interface.py               | 10 +++++
 swh/scheduler/tests/test_celery_tasks.py | 12 ++++--
 6 files changed, 122 insertions(+), 19 deletions(-)
Changes applied before test
commit 24691cddff108ba14d190f8cc5bad2dd0a60fe16
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

commit 286e8b4ecf3cc229c3cc68e10f61773fb2017503
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/437/ for more details.

Build is green

Patch application report for D5818 (id=22039)

Could not rebase; Attempt merge onto 8281e351d6...

Updating 8281e35..66d95e6
Fast-forward
 swh/scheduler/backend.py                 | 26 ++++++++++++-
 swh/scheduler/celery_backend/runner.py   | 17 +++------
 swh/scheduler/cli/admin.py               |  8 +++-
 swh/scheduler/cli/origin.py              | 65 ++++++++++++++++++++++++++++++++
 swh/scheduler/interface.py               | 10 +++++
 swh/scheduler/tests/test_celery_tasks.py | 12 ++++--
 6 files changed, 120 insertions(+), 18 deletions(-)
Changes applied before test
commit 66d95e69fd2b769a2911d6a73aad1121fe39fc7b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

commit 286e8b4ecf3cc229c3cc68e10f61773fb2017503
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/438/ for more details.

douardda added inline comments.
swh/scheduler/celery_backend/runner.py
23

I don't really see the purpose of renaming this variable (since it's properly type-annotated), but meh

swh/scheduler/cli/origin.py
158

The semantics of this flag option are not clear to me. What does --with-enabled mean when I use this send-to-celery command? And what does --without-enabled mean?

It looks to me like the naming/semantics of this option are very close to the implementation, but do not make much sense for the user.

swh/scheduler/celery_backend/runner.py
62

This made sense for an older implementation, which is no longer the case here.
I'll revert it as well.

swh/scheduler/cli/origin.py
158

ok, let's go with '--only-enabled/--only-disabled'.

I'll add the help message for that option which will clarify the meaning.

Thanks for the heads up.
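
A paired flag with explicit help text, as agreed above, could look like the sketch below. It uses the standard library's `argparse` rather than the project's actual CLI framework so that it stays self-contained; the help strings, the default of `--only-enabled`, and the `build_parser` helper are assumptions for illustration, not the wording that landed in `swh/scheduler/cli/origin.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser exposing the two options discussed in this diff."""
    parser = argparse.ArgumentParser(prog="send-to-celery-sketch")

    # The two flags write to the same destination and exclude each other.
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--only-enabled",
        dest="enabled",
        action="store_true",
        help="Only schedule visits for origins that are enabled (assumed default).",
    )
    group.add_argument(
        "--only-disabled",
        dest="enabled",
        action="store_false",
        help="Only schedule visits for origins that are not enabled.",
    )
    parser.set_defaults(enabled=True)

    parser.add_argument(
        "--lister-uuid",
        default=None,
        help="Only schedule visits for origins listed by this lister.",
    )
    parser.add_argument("visit_type", help="Visit type to schedule, e.g. git.")
    return parser
```

Naming the flags after what the user gets (only enabled or only disabled origins) keeps the option readable without knowledge of the underlying column.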

ardumont marked an inline comment as done.
  • Rebase on top of latest master
  • Adapt according to remarks

Use --only-enabled and --only-disabled as mentioned yesterday

Build is green

Patch application report for D5818 (id=22267)

Rebasing onto 7cc37fa233...

Current branch diff-target is up to date.
Changes applied before test
commit a5996cb9b6deb56b09b10ba7cffb6878ab8f8981
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

commit 713007976240841a66f3cc595895ab45241e6d3c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/449/ for more details.

Build is green

Patch application report for D5818 (id=22268)

Rebasing onto 7cc37fa233...

Current branch diff-target is up to date.
Changes applied before test
commit af13da0d9a691c61b9a03ffa73a9bc050611671f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jun 8 17:36:28 2021 +0200

    runner: Separate scheduling tasks with and without priority concern
    
    Related to T3367

commit 63fdda00f5f923294ebae3565c26d1741a001cab
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/450/ for more details.

The send-to-celery part LGTM, thanks.

There's a weird set of changes that seems to be mixed in to the new send-to-celery options, I'm not sure that was intended?

This revision is now accepted and ready to land. Sep 1 2021, 5:45 PM

Only target the one commit for the diff

> There's a weird set of changes that seems to be mixed in to the new send-to-celery options, I'm not sure that was intended?

It was intended, but its scope differs from that of the original diff.
I've removed it, thanks for the heads up.

Build is green

Patch application report for D5818 (id=22351)

Rebasing onto 7cc37fa233...

Current branch diff-target is up to date.
Changes applied before test
commit 63fdda00f5f923294ebae3565c26d1741a001cab
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 3 16:03:26 2021 +0200

    send-to-celery: Add more options to allow scheduling of edge cases
    
    In the non optimal case, we may want to trigger specific case (not-yet enabled origins,
    origin from specific lister...).
    
    Related to T3350

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/451/ for more details.