Page MenuHomeSoftware Heritage

scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
ClosedPublic

Authored by ardumont on Jan 18 2021, 3:36 PM.

Details

Summary

This adds the cli entrypoint to actually process origin_visit_status topic (which writes
to the origin_visit_stats db table).

Note that:

  • the cli name visit-stats could be changed if someone has a better name in mind
  • this does not open the swh.journal as optional dependency for the scheduler

which may be a tad cumbersome

Related to T2967
Depends on D4873

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18449
Build 28525: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 28524: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D4876 (id=17307)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 113 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 5 files changed, 329 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 7e85a958b3c566560d920e8dd7f69e3226f838b4
Merge: d3afd14 6bd95aa
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 14:36:53 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 6bd95aa1a09e183a00e5e7712724e0fb703ef083
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/143/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/143/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 18 2021, 3:37 PM
Harbormaster failed remote builds in B18446: Diff 17307!

Fix warnings which fail the build

Build has FAILED

Patch application report for D4876 (id=17308)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 113 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 332 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit a59af2ef6d4b1cbaebe799661817d6b4f053a7cf
Merge: d3afd14 3a0504d
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 14:45:12 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 3a0504d1615b16182036708433e966ba8750f6db
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/144/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/144/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 18 2021, 3:47 PM
Harbormaster failed remote builds in B18447: Diff 17308!

Fix test

(kafka-server init failure)

Build has FAILED

Patch application report for D4876 (id=17309)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 114 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 333 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 4cb1d7962948d7124fbe0fe092deb198a507cbab
Merge: d3afd14 e6c4bf5
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 15:00:22 2021 +0000

    Merge branch 'diff-target' into HEAD

commit e6c4bf5b2f67d0a5007f35cac2297dd0e1724985
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/145/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/145/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 18 2021, 4:02 PM
Harbormaster failed remote builds in B18448: Diff 17309!

Build has FAILED

Patch application report for D4876 (id=17309)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 114 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 333 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 16f5c26c323ebcf8088f2a2bc905b7e3f74de603
Merge: d3afd14 e6c4bf5
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 15:07:37 2021 +0000

    Merge branch 'diff-target' into HEAD

commit e6c4bf5b2f67d0a5007f35cac2297dd0e1724985
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/146/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/146/console

Build has FAILED

Patch application report for D4876 (id=17309)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 114 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 333 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 0e321723b7fbc215495931a6d8b42b93416bc404
Merge: d3afd14 e6c4bf5
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 15:51:11 2021 +0000

    Merge branch 'diff-target' into HEAD

commit e6c4bf5b2f67d0a5007f35cac2297dd0e1724985
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/147/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/147/console

Fix test (missing a producer.flush call ¯\_(ツ)_/¯)

Build is green

Patch application report for D4876 (id=17310)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 115 ++++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 334 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 9bb39fed16246d14d634b2809534700e97d23dc7
Merge: d3afd14 ca5a77d
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 16:29:44 2021 +0000

    Merge branch 'diff-target' into HEAD

commit ca5a77d303ec47b59cf4d30ef53f3659bf3b0b5c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/148/ for more details.

olasd added a subscriber: olasd.

I guess we could just call this swh scheduler journal-client (single command at the top level of swh scheduler), until we need another one of these (in which case we can move them both to a subcommand of swh scheduler journal-client).

The rest looks like journal client boilerplate, which is totally fine by me!

This revision is now accepted and ready to land.Jan 18 2021, 5:43 PM

I guess we could just call this swh scheduler journal-client (single command at the top level of swh scheduler),
until we need another one of these (in which case we can move them both to a subcommand of swh scheduler journal-client).

The only small thing with this is that we kinda lose the notion of visit-stats as we
pass no object type in that current implementation (granted that can be fixed
with a docstring though).

I also forgot to mention here that @douardda suggested to invert the command
subcommand names to swh scheduler visit-stats journal --help as journal is an
implementation detail.

Given that, we could go a tad further and even drop the journal notion and replace it
with compute or process or some such.

swh scheduler visit-stats process

which seems to the point of what we do here, we are processing visit-status to aggregate
them into visit-stats.

+1 for swh scheduler journal-client, for consistency with other packages + we may want to add other stuff in that client.

And we can rename it later if needed

Hey,

I guess we could just call this swh scheduler journal-client (single command at the top level of swh scheduler),
until we need another one of these (in which case we can move them both to a subcommand of swh scheduler journal-client).

The only small thing with this is that we kinda lose the notion of visit-stats as we
pass no object type in that current implementation (granted that can be fixed
with a docstring though).

I also forgot to mention here that @douardda suggested to invert the command
subcommand names to swh scheduler visit-stats journal --help as journal is an
implementation detail.

Given that, we could go a tad further and even drop the journal notion and replace it
with compute or process or some such.

swh scheduler visit-stats process

which seems to the point of what we do here, we are processing visit-status to aggregate
them into visit-stats.

I think journal-client would be more explicit than process (and consistent with other modules), even if it "leaks" an implementation detail (but in any case, you'll need a journal client configuration to run the tool, so I don't think that's really an issue).

I guess my main objection to that command name is that I don't like the "visit-stats" name (there's not a single statistic in any field of that table!). I don't have any better suggestion for it anyway, so that's not a very relevant objection to naming the thing that keeps that table up to date. I'm not a huge fan of having a command group with a single sub-command either, which is another objection, but I guess we will want to add a query function or two as subcommands there anyway.

All in all, I:

  • strongly prefer that this command be named journal-client
  • don't much care if it ends up being a visit-stats subcommand, as long as it's consistent with the table name and it doesn't stay "alone" forever.

Adapt according to review (thanks ;)

Rename cli to swh scheduler journal-client

Build is green

Patch application report for D4876 (id=17328)

Rebasing onto 58ca79622d...

Current branch diff-target is up to date.
Changes applied before test
commit 9395aa0763b79cdde7cefbfdb1f93b03d2cb8c65
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal-client` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/153/ for more details.