Replace swh-worker-control with a swh scheduler celery-monitor subcommand
Closed · Public

Authored by olasd on Jun 8 2020, 7:51 PM.

Details

Summary

This new subcommand has two commands:

  • ping: checks whether the given worker instance answers within a given timeout
  • list-running: lists running tasks on the given worker instance
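
As a rough illustration of what the two commands boil down to, here is a minimal sketch written against Celery's inspect interface (`ping()` and `active()`). The function names `ping_workers` and `list_running` and the output format are assumptions made for this example, not the code in this diff.

```python
from typing import Any, Dict, List, Optional, Protocol


class Inspector(Protocol):
    """The two inspect calls used below; Celery's Inspect object offers both."""

    def ping(self) -> Optional[Dict[str, Any]]: ...

    def active(self) -> Optional[Dict[str, List[Dict[str, Any]]]]: ...


def ping_workers(inspector: Inspector) -> List[str]:
    """Return the sorted names of the workers that answered the ping."""
    # ping() yields None when no worker answers within the timeout.
    replies = inspector.ping() or {}
    return sorted(replies)


def list_running(inspector: Inspector) -> List[str]:
    """Return one 'worker: task name (task id)' line per running task."""
    active = inspector.active() or {}
    lines = []
    for worker, tasks in sorted(active.items()):
        for task in tasks:
            lines.append(f"{worker}: {task['name']} ({task['id']})")
    return lines
```

With a real Celery application, the inspector would typically come from `app.control.inspect(destination=[...], timeout=...)`; any object exposing the same two methods works, which also makes the functions easy to exercise with a fake.
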

Test Plan

new tox tests introduced

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
feature/worker-monitoring
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 12737
Build 19372: Phabricator diff pipeline on Jenkins
Build 19371: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D3248 (id=11507)

Rebasing onto 14cd5bb5ad...

Current branch diff-target is up to date.
Changes applied before test
commit 2a57da5cb9cecae76dc34a9d8de33d48d3f2bd37
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jun 8 19:30:25 2020 +0200

    Replace swh-worker-control with a swh scheduler celery-monitor subcommand
    
    This new subcommand has two commands:
    
     - ping: checks whether the given worker instance answers within a given timeout
     - list-running: lists running tasks on the given worker instance

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/8/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/8/console

Interesting. Thanks.

I gather it's to help detect stuck workers.

(so given the recent discussion, @zack might be interested to know about this ;)

Out of curiosity, what's the (long-term?) plan as a next step?
Adding a new privileged command to try and restart the stuck workers, or using this first and seeing where it goes?

Build has FAILED

E   pkg_resources.VersionConflict: (swh.scheduler base-revision-8-D3248.post1 (/var/lib/jenkins/workspace/DSCH/tests-on-diff/.tox/py3/lib/python3.7/site-packages), Requirement.parse('swh.scheduler>=0.0.58'))

I've run into this locally from time to time, but I don't really understand
it. It usually happens with pytest rather than tox, and reinstalling the
affected dependency (here, the scheduler) makes it go away. (That won't help
here.)

Could it be related to the recent changes to the cleanup routine installed on Jenkins?

That has nothing to do with cleanup on jenkins.

This is due to vcversioner using any random tag to generate a version number for the package, while the Jenkins patch application uses tags too.

I think setuptools-scm behaves better here (it filters on tags that "look like a version number" instead of just picking the closest tag).
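
To make the difference concrete, here is a small sketch of the two tag-selection strategies. It is not the actual vcversioner or setuptools-scm code; the regular expression and the `v0.0.58` tag are assumptions chosen for illustration, while `base-revision-8-D3248` is the tag name visible in the error above.

```python
import re
from typing import List, Optional

# Assumed shape of a release tag: "v1.2.3" or "0.0.58" (illustrative only).
VERSION_TAG = re.compile(r"^v?\d+(\.\d+)*$")


def closest_tag(tags_nearest_first: List[str]) -> Optional[str]:
    """Pick the nearest tag, whatever it is called."""
    return tags_nearest_first[0] if tags_nearest_first else None


def closest_version_tag(tags_nearest_first: List[str]) -> Optional[str]:
    """Pick the nearest tag that looks like a version number."""
    for tag in tags_nearest_first:
        if VERSION_TAG.match(tag):
            return tag
    return None


# Tags reachable from HEAD, nearest first (the second one is hypothetical).
tags = ["base-revision-8-D3248", "v0.0.58"]
print(closest_tag(tags))          # base-revision-8-D3248
print(closest_version_tag(tags))  # v0.0.58
```

With the first strategy the package version is derived from the patch-application tag, which cannot satisfy `swh.scheduler>=0.0.58`; with the second, the patch-application tag is skipped.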

> Interesting. Thanks.
>
> I gather it's to help detect stuck workers.

Yeah, it's basically wrapping up what I was doing for T2335 by hand as a built-in command. Which we kinda already had.

> (so given the recent discussion, @zack might be interested to know about this ;)
>
> Out of curiosity, what's the (long-term?) plan as a next step?
> Adding a new privileged command to try and restart the stuck workers, or using this first and seeing where it goes?

Basically, adding a cronjob / timer unit that runs the ping command a few times and restarts the worker if it's supposed to be started (systemctl status says so) and doesn't respond for $timeframe.
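
The decision logic of such a timer unit could look roughly like the sketch below. Everything here is hypothetical: `should_restart`, its parameters and the default number of attempts are made up for the example, and the actual ping and systemctl calls are left to the caller.

```python
import time
from typing import Callable


def should_restart(
    ping: Callable[[], bool],
    supposed_to_run: bool,
    attempts: int = 3,
    delay: float = 0.0,
) -> bool:
    """Return True when a worker that should be running never answers."""
    if not supposed_to_run:
        return False  # stopped on purpose: leave it alone
    for attempt in range(attempts):
        if ping():
            return False  # a single answer is enough to call it healthy
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next try
    return True
```

In a real deployment, `ping` would wrap the celery-monitor ping command for one worker instance and `supposed_to_run` would be derived from the unit state reported by systemctl; the caller restarts the unit only when the function returns True.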

I'm not sure how to add tests to this stuff; I need to look at the celery scaffolding around testing the workers themselves to see if something implements the "remote control" API as a test fixture.

Build is green

Patch application report for D3248 (id=11524)

Rebasing onto 28c5b8d479...

Current branch diff-target is up to date.
Changes applied before test
commit 579b26858087fb4c668b32d5b49630c377aa61c0
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jun 8 19:30:25 2020 +0200

    Replace swh-worker-control with a swh scheduler celery-monitor subcommand
    
    This new subcommand has two commands:
    
     - ping: checks whether the given worker instance answers within a given timeout
     - list-running: lists running tasks on the given worker instance

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/10/ for more details.

douardda added inline comments.
swh/scheduler/cli/__init__.py, line 53

Why is this removed? And why in this revision? How is it related to the added celery-monitor CLI commands?

Add tests for both subcommands

swh/scheduler/cli/__init__.py, line 53

Yeah, I wanted to do a separate commit for that, and then promptly forgot about it.

Build is green

Patch application report for D3248 (id=11529)

Rebasing onto 28c5b8d479...

Current branch diff-target is up to date.
Changes applied before test
commit 4cdbf4be7e36f3f318ba155b64c93f51ecd3ee07
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Mon Jun 8 19:30:25 2020 +0200

    Replace swh-worker-control with a swh scheduler celery-monitor subcommand
    
    This new subcommand has two commands:
    
     - ping: checks whether the given worker instance answers within a given timeout
     - list-running: lists running tasks on the given worker instance

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/11/ for more details.

Split unrelated commits out of the main feature addition

olasd added inline comments.
swh/scheduler/cli/__init__.py, line 53

Now done.

Build is green

Patch application report for D3248 (id=11530)

Rebasing onto 28c5b8d479...

Current branch diff-target is up to date.
Changes applied before test
commit a7778955dbbb12eca534d325c82c282e3c3ccf2e
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:31:45 2020 +0200

    Replace swh-worker-control with a swh scheduler celery-monitor subcommand
    
    This new subcommand has two commands:
    
     - ping: checks whether the given worker instance answers within a given timeout
     - list-running: lists running tasks on the given worker instance

commit 8411335a351c2bc5a4b6ae1feb7e1b8f2dcf6130
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:30:31 2020 +0200

    Remove double logging setup in cli
    
    The logging module is already initialized by the main swh.core cli; This only
    creates double logging with no advantages whatsoever.

commit 873cdacfaf2bd4127f6378358504feefb3a47fd4
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:28:19 2020 +0200

    Handle psycopg2 OperationalError in cli initialization
    
    When running the cli with default settings (i.e. pointing to a
    softwareheritage-scheduler-dev database), and the database doesn't exist, an
    OperationalError is raised.
    
    This shouldn't prevent (some of the) cli subcommands from working, so catch this
    error and ignore it as one of the scheduler backend setup failure modes.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/12/ for more details.

Replace --host/--type with a more generic --pattern
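
For illustration, a single pattern can cover both the "by host" and the "by type" selections. The sketch below assumes shell-style glob semantics and made-up worker names; neither is confirmed by this diff.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_workers(names: Iterable[str], pattern: str) -> List[str]:
    """Return the worker names matching a shell-style glob pattern."""
    return sorted(name for name in names if fnmatchcase(name, pattern))


# Hypothetical worker names of the form celery@<type>.<host>.
workers = [
    "celery@loader_git.worker01",
    "celery@loader_git.worker02",
    "celery@lister.worker01",
]
print(select_workers(workers, "celery@loader_git.*"))  # every loader_git worker
print(select_workers(workers, "celery@*.worker01"))    # everything on worker01
```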

Drop unused destination_filter

Build is green

Patch application report for D3248 (id=11533)

Rebasing onto 28c5b8d479...

Current branch diff-target is up to date.
Changes applied before test
commit f8be919cfc38a10c93ecb0162e7fe2aa618f5a49
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:31:45 2020 +0200

    Replace swh-worker-control with a swh scheduler celery-monitor subcommand
    
    This new subcommand has two commands:
    
     - ping: checks whether the given worker instance answers within a given timeout
     - list-running: lists running tasks on the given worker instance

commit 8411335a351c2bc5a4b6ae1feb7e1b8f2dcf6130
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:30:31 2020 +0200

    Remove double logging setup in cli
    
    The logging module is already initialized by the main swh.core cli; This only
    creates double logging with no advantages whatsoever.

commit 873cdacfaf2bd4127f6378358504feefb3a47fd4
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:28:19 2020 +0200

    Handle psycopg2 OperationalError in cli initialization
    
    When running the cli with default settings (i.e. pointing to a
    softwareheritage-scheduler-dev database), and the database doesn't exist, an
    OperationalError is raised.
    
    This shouldn't prevent (some of the) cli subcommands from working, so catch this
    error and ignore it as one of the scheduler backend setup failure modes.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/13/ for more details.

This revision is now accepted and ready to land. (Jun 10 2020, 12:19 PM)

Build is green

Patch application report for D3248 (id=11534)

Rebasing onto 28c5b8d479...

Current branch diff-target is up to date.
Changes applied before test
commit aedd323f1ea8b8cb6838b71ad59098d5402b127d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:31:45 2020 +0200

    Replace swh-worker-control with a swh scheduler celery-monitor subcommand
    
    This new subcommand has two commands:
    
     - ping: checks whether the given worker instance answers within a given timeout
     - list-running: lists running tasks on the given worker instance

commit 8411335a351c2bc5a4b6ae1feb7e1b8f2dcf6130
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:30:31 2020 +0200

    Remove double logging setup in cli
    
    The logging module is already initialized by the main swh.core cli; This only
    creates double logging with no advantages whatsoever.

commit 873cdacfaf2bd4127f6378358504feefb3a47fd4
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Wed Jun 10 11:28:19 2020 +0200

    Handle psycopg2 OperationalError in cli initialization
    
    When running the cli with default settings (i.e. pointing to a
    softwareheritage-scheduler-dev database), and the database doesn't exist, an
    OperationalError is raised.
    
    This shouldn't prevent (some of the) cli subcommands from working, so catch this
    error and ignore it as one of the scheduler backend setup failure modes.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/14/ for more details.