Page MenuHomeSoftware Heritage

Populate origin_visit_stats table out of the origin_visit_status topic
ClosedPublic

Authored by ardumont on Jan 14 2021, 5:46 PM.

Details

Summary

The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
status.

When no snapshot is provided, the visit is considered as failed so the last_failed
column is updated.

As there is no time guarantee when reading message from the topic, the code tries to
keep the data in the most timely ordered as possible. Only most recent information is
kept.

TODO: @vlorentz mentioned in D4852 that the api origin_visit_stats_upsert should
be a batch api. That concern shall be addressed in D4868.

Related to T2967

Test Plan

tox

Diff Detail

Repository
rDSCH Scheduling utilities
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18379
Build 28397: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 28396: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D4866 (id=17235)

Rebasing onto ca45d40f2a...

Current branch diff-target is up to date.
Changes applied before test
commit 7f4d22a42213378b1e5e85e921c61bfa62be2868
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/125/ for more details.

vlorentz added inline comments.
sql/updates/19.sql
22
swh/scheduler/backend.py
806

isn't this line redundant?

swh/scheduler/journal_client.py
29–31

Why not swh.model.model objects?

39
46–85

Make last_notfound/last_failed/... local variables, and build OriginVisitStats with explicit arguments at the end, so that mypy can type-check them.

swh/scheduler/tests/test_journal_client.py
26–28

Could you rename them to DATE1, DATE2, and DATE3?

31–32

use the Python, Luke

swh/scheduler/tests/test_scheduler.py
866–868
This revision now requires changes to proceed.Jan 15 2021, 11:22 AM

incremental review

swh/scheduler/backend.py
806

It seems the conditional is bogus.
Is it ovi.last_eventful > excluded.last_eventful?
But even then there would be redundancy in doing 3 cases on this comparison.

swh/scheduler/journal_client.py
20
22

may seem useless after a raise, but using else/elif clauses may make mypy happy AND be less redundant

swh/scheduler/backend.py
806

as mentioned in the previous diff, that could be collapsed, yes.
It was clearer as an incremental step.

and lol at the bogus conditional, tenma, you are right.

swh/scheduler/journal_client.py
22

it's totally for mypy here...

29–31

Are you implying we can receive model objects here?

It'd be neater.

46–85

next diff about batch upsert modifies this part so i won't touch it yet.

swh/scheduler/tests/test_journal_client.py
44

Adapt according to review
(again, minus the model object remark)

Build is green

Patch application report for D4866 (id=17254)

Rebasing onto ca45d40f2a...

Current branch diff-target is up to date.
Changes applied before test
commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/129/ for more details.

vlorentz added inline comments.
swh/scheduler/journal_client.py
29–31

my bad, we can't

swh/scheduler/tests/test_scheduler.py
866–868

forgot this one

This revision is now accepted and ready to land.Jan 15 2021, 2:17 PM
swh/scheduler/tests/test_scheduler.py
866–868

right.

Change missing assertion change

Build is green

Patch application report for D4866 (id=17269)

Rebasing onto ca45d40f2a...

Current branch diff-target is up to date.
Changes applied before test
commit a916b4a8d3fbca565a9ec5abead5bdeb0e346845
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/135/ for more details.

Actually fix the missing assertion change to a < b < c

Build is green

Patch application report for D4866 (id=17272)

Rebasing onto ca45d40f2a...

Current branch diff-target is up to date.
Changes applied before test
commit 608aa2075619bbc9f5c7dab56d1c37a560f13e64
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jan 13 13:03:28 2021 +0100

    Populate origin_visit_stats table out of the origin_visit_status topic
    
    The snapshot is used to determine the "eventful/uneventful" nature of the origin visit
    status.
    
    When no snapshot is provided, the visit is considered as failed so the last_failed
    column is updated.
    
    As there is no time guarantee when reading message from the topic, the code tries to
    keep the data in the most timely ordered as possible. Only most recent information is
    kept.
    
    Related to T2967

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/138/ for more details.