Page MenuHomeSoftware Heritage

Populate origin_visit_stats table out of the origin_visit_status topic
ClosedPublic

Authored by ardumont on Jan 14 2021, 5:46 PM.

Details

Summary

The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
status.

When no snapshot is provided, the visit is considered as failed so the last_failed
column is updated.

As there is no time guarantee when reading message from the topic, the code tries to
keep the data in the most timely ordered as possible. Only most recent information is
kept.

TODO: @vlorentz mentioned in D4852 that the api origin_visit_stats_upsert should
be a batch api. That concern shall be addressed in D4868.

Related to T2967

Test Plan

tox

Diff Detail

Repository
rDSCH Scheduling utilities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4866 (id=17235)

Rebasing onto ca45d40f2a...

Current branch diff-target is up to date.
Changes applied before test
commit 7f4d22a42213378b1e5e85e921c61bfa62be2868
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/125/ for more details.

vlorentz added inline comments.
sql/updates/19.sql
21
swh/scheduler/backend.py
805

isn't this line redundant?

swh/scheduler/journal_client.py
29–31

Why not swh.model.model objects?

39
46–85

Make last_notfound/last_failed/... local variables, and build OriginVisitStats with explicit arguments at the end, so that mypy can type-check them.

swh/scheduler/tests/test_journal_client.py
26–28

Could you rename them to DATE1, DATE2, and DATE3?

31–32

use the Python, Luke

swh/scheduler/tests/test_scheduler.py
865–867
This revision now requires changes to proceed.Jan 15 2021, 11:22 AM

incremental review

swh/scheduler/backend.py
805

It seems the conditional is bogus.
Is it ovi.last_eventful > excluded.last_eventful?
But even then there would be redundancy in doing 3 cases on this comparison.

swh/scheduler/journal_client.py
20
22

may seem useless after a raise, but using else/elif clauses may make mypy happy AND be less redundant

swh/scheduler/backend.py
805

as mentioned in the previous diff, that could be collapsed, yes.
It was clearer as an incremental step.

and lol at the bogus conditional, tenma, you are right.

swh/scheduler/journal_client.py
22

it's totally for mypy here...

29–31

Are you implying we can receive model objects here?

It'd be neater.

46–85

next diff about batch upsert modifies this part so i won't touch it yet.

swh/scheduler/tests/test_journal_client.py
44

Adapt according to review
(again, minus the model object remark)

Build is green

Patch application report for D4866 (id=17254)

Rebasing onto ca45d40f2a...

Current branch diff-target is up to date.
Changes applied before test
commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/129/ for more details.

vlorentz added inline comments.
swh/scheduler/journal_client.py
29–31

my bad, we can't

swh/scheduler/tests/test_scheduler.py
865–867

forgot this one

This revision is now accepted and ready to land.Jan 15 2021, 2:17 PM
swh/scheduler/tests/test_scheduler.py
865–867

right.

Change missing assertion change

Build is green

Patch application report for D4866 (id=17269)

Rebasing onto ca45d40f2a...

Current branch diff-target is up to date.
Changes applied before test
commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/135/ for more details.

Actually fix the missing assertion change to a < b < c

Build is green

Patch application report for D4866 (id=17272)

Rebasing onto ca45d40f2a...

Current branch diff-target is up to date.
Changes applied before test
commit 608aa2075619bbc9f5c7dab56d1c37a560f13e64
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/138/ for more details.