Page MenuHomeSoftware Heritage

scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
ClosedPublic

Authored by ardumont on Jan 18 2021, 3:36 PM.

Details

Summary

This adds the cli entrypoint to actually process origin_visit_status topic (which writes
to the origin_visit_stats db table).

Note that:

  • the cli name visit-stats could be changed if someone has a better name in mind
  • this does not open the swh.journal as optional dependency for the scheduler

which may be a tad cumbersome

Related to T2967
Depends on D4873

Diff Detail

Repository
rDSCH Scheduling utilities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D4876 (id=17307)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 113 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 5 files changed, 329 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 7e85a958b3c566560d920e8dd7f69e3226f838b4
Merge: d3afd14 6bd95aa
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 14:36:53 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 6bd95aa1a09e183a00e5e7712724e0fb703ef083
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/143/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/143/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 18 2021, 3:37 PM
Harbormaster failed remote builds in B18446: Diff 17307!

Fix warnings which fail the build

Build has FAILED

Patch application report for D4876 (id=17308)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 113 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 332 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit a59af2ef6d4b1cbaebe799661817d6b4f053a7cf
Merge: d3afd14 3a0504d
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 14:45:12 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 3a0504d1615b16182036708433e966ba8750f6db
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/144/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/144/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 18 2021, 3:47 PM
Harbormaster failed remote builds in B18447: Diff 17308!

Fix test

(kafka-server init failure)

Build has FAILED

Patch application report for D4876 (id=17309)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 114 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 333 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 4cb1d7962948d7124fbe0fe092deb198a507cbab
Merge: d3afd14 e6c4bf5
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 15:00:22 2021 +0000

    Merge branch 'diff-target' into HEAD

commit e6c4bf5b2f67d0a5007f35cac2297dd0e1724985
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/145/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/145/console

Harbormaster returned this revision to the author for changes because remote builds failed.Jan 18 2021, 4:02 PM
Harbormaster failed remote builds in B18448: Diff 17309!

Build has FAILED

Patch application report for D4876 (id=17309)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 114 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 333 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 16f5c26c323ebcf8088f2a2bc905b7e3f74de603
Merge: d3afd14 e6c4bf5
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 15:07:37 2021 +0000

    Merge branch 'diff-target' into HEAD

commit e6c4bf5b2f67d0a5007f35cac2297dd0e1724985
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/146/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/146/console

Build has FAILED

Patch application report for D4876 (id=17309)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 114 +++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 333 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 0e321723b7fbc215495931a6d8b42b93416bc404
Merge: d3afd14 e6c4bf5
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 15:51:11 2021 +0000

    Merge branch 'diff-target' into HEAD

commit e6c4bf5b2f67d0a5007f35cac2297dd0e1724985
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

Link to build: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/147/
See console output for more information: https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/147/console

Fix test (missing a producer.flush call ¯\_(ツ)_/¯)

Build is green

Patch application report for D4876 (id=17310)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  27 ++++--
 swh/scheduler/tests/test_cli_journal.py    | 115 ++++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 334 insertions(+), 13 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit 9bb39fed16246d14d634b2809534700e97d23dc7
Merge: d3afd14 ca5a77d
Author: Jenkins user <jenkins@localhost>
Date:   Mon Jan 18 16:29:44 2021 +0000

    Merge branch 'diff-target' into HEAD

commit ca5a77d303ec47b59cf4d30ef53f3659bf3b0b5c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/148/ for more details.

olasd added a subscriber: olasd.

I guess we could just call this swh scheduler journal-client (single command at the top level of swh scheduler), until we need another one of these (in which case we can move them both to a subcommand of swh scheduler journal-client).

The rest looks like journal client boilerplate, which is totally fine by me!

This revision is now accepted and ready to land.Jan 18 2021, 5:43 PM

I guess we could just call this swh scheduler journal-client (single command at the top level of swh scheduler),
until we need another one of these (in which case we can move them both to a subcommand of swh scheduler journal-client).

The only small thing with this is that we kinda lose the notion of visit-stats as we
pass no object type in that current implementation (granted that can be fixed
with a docstring though).

I also forgot to mention here that @douardda suggested to invert the command
subcommand names to swh scheduler visit-stats journal --help as journal is an
implementation detail.

Given that, we could go a tad further and even drop the journal notion and replace it
with compute or process or some such.

swh scheduler visit-stats process

which seems to the point of what we do here, we are processing visit-status to aggregate
them into visit-stats.

+1 for swh scheduler journal-client, for consistency with other packages + we may want to add other stuff in that client.

And we can rename it later if needed

Hey,

I guess we could just call this swh scheduler journal-client (single command at the top level of swh scheduler),
until we need another one of these (in which case we can move them both to a subcommand of swh scheduler journal-client).

The only small thing with this is that we kinda lose the notion of visit-stats as we
pass no object type in that current implementation (granted that can be fixed
with a docstring though).

I also forgot to mention here that @douardda suggested to invert the command
subcommand names to swh scheduler visit-stats journal --help as journal is an
implementation detail.

Given that, we could go a tad further and even drop the journal notion and replace it
with compute or process or some such.

swh scheduler visit-stats process

which seems to the point of what we do here, we are processing visit-status to aggregate
them into visit-stats.

I think journal-client would be more explicit than process (and consistent with other modules), even if it "leaks" an implementation detail (but in any case, you'll need a journal client configuration to run the tool, so I don't think that's really an issue).

I guess my main objection to that command name is that I don't like the "visit-stats" name (there's not a single statistic in any field of that table!). I don't have any better suggestion for it anyway, so that's not a very relevant objection to naming the thing that keeps that table up to date. I'm not a huge fan of having a command group with a single sub-command either, which is another objection, but I guess we will want to add a query function or two as subcommands there anyway.

All in all, I:

  • strongly prefer that this command be named journal-client
  • don't much care if it ends up being a visit-stats subcommand, as long as it's consistent with the table name and it doesn't stay "alone" forever.

Adapt according to review (thanks ;)

Rename cli to swh scheduler journal-client

Build is green

Patch application report for D4876 (id=17328)

Rebasing onto 58ca79622d...

Current branch diff-target is up to date.
Changes applied before test
commit 9395aa0763b79cdde7cefbfdb1f93b03d2cb8c65
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal-client` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/153/ for more details.