Page MenuHomeSoftware Heritage

journal_client: Improve stats detection
ClosedPublic

Authored by ardumont on Jan 15 2021, 4:57 PM.

Details

Summary

This adds an integration test which permutes inputs to ensure out of order renders the
same result.

This also improves the current algorithm which revealed some hit-and-miss cases:

  • Initialization of the first visit detection (through the "last_snapshot" absence field,

the previous implementation check could fail otherwise).

  • out of order policy (ignore old event) in case of supposedly "eventful" event was done too early

which ignored too much messages (those new test cases failed in some permutations).

  • This is now specifically checked in case of referenced snapshots which led to cases of possibly

changing eventful event into uneventful one. For example, the case of an anterior eventful event is
caught which means that the current most-up-to-date eventful event is actually an uneventful one).

  • ...

Related to T2967

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18459
Build 28541: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 28540: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4873 (id=17278)

Rebasing onto a5fb291703...

Current branch diff-target is up to date.
Changes applied before test
commit 2733c357a52c47016733352ec5418360363a98ff
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/139/ for more details.

Can you be more explicit about what the improvements are, in the commit msg / diff description?

Can you be more explicit about what the improvements are, in the commit msg / diff description?

Description updated.
I'll amend the commit message when we are fine with such a description ;)

swh/scheduler/journal_client.py
88 ↗(On Diff #17278)

this was too dismissive a filter so this conditional was moved in the subsequent conditional with different behavior changes as explained in the diff description.

swh/scheduler/tests/test_journal_client.py
310 ↗(On Diff #17278)

Those tests are not real changes, they were deemed clearer inlined.

olasd added a subscriber: olasd.

The new logic is somewhat elaborate, but it's well documented and seems to be properly tested, so this looks fine to me.

For now there's nothing that I see that would prevent this logic from being implemented in SQL to reduce the amount of database ping-pong, when we're done refining it. Alternatively, or as a stop-gap, we could consider retrieving the "current recorded status" in a single batch for all the messages being processed now (that is, collapsing all the scheduler.origin_visit_stats_get calls into a single call to a new backend endpoint).

swh/scheduler/journal_client.py
87 ↗(On Diff #17278)

This is bikeshed-ish, but I think the logic would be even easier to follow with:

  • date renamed to latest_recorded_visit(_date?)
  • msg_dict["date"] given a more explicit name like current_status_date
101–104 ↗(On Diff #17278)

Good catch!

This revision is now accepted and ready to land.Jan 18 2021, 5:58 PM

Adapt according to review, rename variables to improve readability

Build is green

Patch application report for D4873 (id=17320)

Could not rebase; Attempt merge onto d3afd144af...

Merge made by the 'recursive' strategy.
 mypy.ini                                   |   3 +
 requirements-swh.txt                       |   1 +
 swh/scheduler/cli/journal.py               |  62 +++++++++++++
 swh/scheduler/journal_client.py            |  38 ++++++--
 swh/scheduler/tests/test_cli_journal.py    | 115 ++++++++++++++++++++++++
 swh/scheduler/tests/test_journal_client.py | 139 +++++++++++++++++++++++++++--
 6 files changed, 343 insertions(+), 15 deletions(-)
 create mode 100644 swh/scheduler/cli/journal.py
 create mode 100644 swh/scheduler/tests/test_cli_journal.py
Changes applied before test
commit ca8be2930a95f7bff9fc41d00370858987e5ad02
Merge: d3afd14 9a9db55
Author: Jenkins user <jenkins@localhost>
Date:   Tue Jan 19 08:53:44 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 9a9db555a03e10305c183591e66a6dac5e760a33
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Mon Jan 18 14:46:57 2021 +0100

    scheduler.cli.journal: Add `swh scheduler journal visit-stats` cli
    
    This adds the cli entrypoint to actually process origin_visit_status topics and write to
    the origin_visit_stats db table.
    
    Related to T2967

commit dea576ef295210f9d772ff5fda8d44e5202fb5af
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/149/ for more details.

Use the right commit range for the diff update with the previous description

Build is green

Patch application report for D4873 (id=17321)

Rebasing onto d3afd144af...

First, rewinding head to replay your work on top of it...
Applying: journal_client: Improve stats detection
Changes applied before test
commit 7b3a64e68e79f5165ceccb96679e07413ebb4d19
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/150/ for more details.

Rework commit message as promised (I almost forgot ¯\_(ツ)_/¯)

Build is green

Patch application report for D4873 (id=17322)

Rebasing onto d3afd144af...

Current branch diff-target is up to date.
Changes applied before test
commit c4f12d097ce23ef4cb748cf5412bf5f3d808a58d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/151/ for more details.

Build is green

Patch application report for D4873 (id=17323)

Rebasing onto d3afd144af...

Current branch diff-target is up to date.
Changes applied before test
commit 58ca79622db077d9a2a897b4671a6013bf70f97d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jan 15 15:49:41 2021 +0100

    journal_client: Improve stats detection
    
    This adds an integration test which permutes input to ensure out of order renders the
    same result.
    
    This also improves the current algorithm which revealed some hit-and-miss cases:
    
    - Initialization of the first visit detection (through the "last_snapshot" absence
      field, the previous implementation check could fail otherwise).
    
    - out of order policy (ignore old event) in case of supposedly "eventful" event was done
      too early which ignored too much messages (those new test cases failed in some
      permutations). This is now specifically checked in case of referenced snapshots which
      led to cases of possibly changing eventful event into uneventful one. For example, the
      case of an anterior eventful event is caught which means that the current
      most-up-to-date eventful event is actually an uneventful one).
    
    ...
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/152/ for more details.

This revision was automatically updated to reflect the committed changes.