Page MenuHomeSoftware Heritage

Only record last_visited and last_successful in origin_visit_stats
ClosedPublic

Authored by olasd on Jul 23 2021, 11:56 AM.

Details

Summary

After using this schema for a while, all queries can be implemented in
terms of these two timestamps, instead of the four original
last_eventful, last_uneventful, last_failed and last_notfound
timestamps.

This ends up simplifying the logic within the journal client, as well as
that of the grab_next_visits query builder.

To make this change work, we also stop considering out of order messages
altogether in journal_client. This welcome simplification is an accuracy
tradeoff that is explained in the updated documentation of the journal
client:

.. [1] Ignoring out of order messages makes the initialization of the

origin_visit_status table (from a full journal) less deterministic: only the
`last_visit`, `last_visit_state` and `last_successful` fields are guaranteed
to be exact, the `next_position_offset` field is a best effort estimate
(which should converge once the client has run for a while on in-order
messages).
Test Plan

everything is awesome

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
database-cleanup
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22724
Build 35438: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 35437: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6018 (id=21757)

Could not rebase; Attempt merge onto d58776ab0b...

Updating d58776a..87e66fa
Fast-forward
 sql/updates/30.sql                         |  79 ++++++++
 swh/scheduler/api/serializers.py           |  10 +-
 swh/scheduler/backend.py                   |  52 +++--
 swh/scheduler/interface.py                 |   6 +-
 swh/scheduler/journal_client.py            | 162 ++++++++--------
 swh/scheduler/model.py                     |  49 +++--
 swh/scheduler/sql/30-schema.sql            |  18 +-
 swh/scheduler/sql/40-func.sql              |   2 +-
 swh/scheduler/tests/test_journal_client.py | 298 ++++++++++++++---------------
 swh/scheduler/tests/test_scheduler.py      | 149 +++++----------
 10 files changed, 420 insertions(+), 405 deletions(-)
 create mode 100644 sql/updates/30.sql
Changes applied before test
commit 87e66faa300df8e1e3cb8c82660d8155e877564f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Fri Jul 23 11:48:23 2021 +0200

    Only record last_visited and last_successful in origin_visit_stats
    
    After using this schema for a while, all queries can be implemented in
    terms of these two timestamps, instead of the four original
    last_eventful, last_uneventful, last_failed and last_notfound
    timestamps.
    
    This ends up simplifying the logic within the journal client, as well as
    that of the grab_next_visits query builder.
    
    To make this change work, we also stop considering out of order messages
    altogether in journal_client. This welcome simplification is an accuracy
    tradeoff that is explained in the updated documentation of the journal
    client:
    
    .. [1] Ignoring out of order messages makes the initialization of the
          origin_visit_status table (from a full journal) less deterministic: only the
          `last_visit`, `last_visit_state` and `last_successful` fields are guaranteed
          to be exact, the `next_position_offset` field is a best effort estimate
          (which should converge once the client has run for a while on in-order
          messages).

commit 3ca0d659503ffce208d643e26d5712f46c28ffae
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Jul 23 09:22:46 2021 +0200

    test_journal_client: Unify test assertion like the rest
    
    Related to D5917

commit 8cf2238eac8d9d14c2e510d4f3a0c7dd0e48d9a8
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jul 22 11:42:24 2021 +0200

    test: Refactor assert_visit_stats_ok to ignore_fields
    
    This simplifies and unifies properly the utility test function to compare visit stats.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/419/ for more details.

Awesome indeed.

Couple of comments but nothing blocking.

swh/scheduler/journal_client.py
111

We forget to mention that when called at this point, the message are in order.
Thus the rather simple implementation here.

Again (as per the other comment), i think it's fine to just push the diff as is.
I intend to update docstings and refactor a bit the tests prior to actually
implement the last part (about too much successive failing visits which disable origins).

226

This call could be inlined as:

  • there are only 2 instructions in that function
  • that function actually mutates the input (which is not that good)
  • it's only called once (after that refactoring)

(i'll do this later, no need to bother with it now ;)

This revision is now accepted and ready to land.Jul 23 2021, 12:17 PM