
Implement storage of listed origins
Closed, Public

Authored by olasd on Jun 16 2020, 11:01 AM.

Details

Summary

This new API endpoint allows listers to record the origins they have seen during
their current run.

Origins are identified by the lister instance, the url of the origin, and the
type of loader that should be used to load this origin.

The implementation lets listers simply send the list of origins they've
seen (with some lightweight extra information), leaving the backend to decide
whether to insert a new origin or update an existing one.
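
As a rough illustration of the insert-or-update behavior described above, here is a minimal sketch using an in-memory SQLite database. It is not the code from this diff: the table layout, the column names other than `last_seen`, and the `record_listed_origins` helper are assumptions made for the example, and the actual backend may be structured differently.

```python
import sqlite3

# Hypothetical, simplified table. The real schema lives in
# swh/scheduler/sql/30-swh-schema.sql and may differ.
SCHEMA = """
create table listed_origins (
    lister_id   text not null,
    url         text not null,
    loader_type text not null,
    extra       text,
    first_seen  text not null,
    last_seen   text not null,
    primary key (lister_id, url, loader_type)
)
"""

# An origin is identified by (lister instance, url, loader type).
# On a key collision only the mutable fields are refreshed.
UPSERT = """
insert into listed_origins
    (lister_id, url, loader_type, extra, first_seen, last_seen)
values (?, ?, ?, ?, ?, ?)
on conflict (lister_id, url, loader_type) do update
set extra = excluded.extra,
    last_seen = excluded.last_seen
"""


def record_listed_origins(db, origins, now):
    """Insert unseen origins; refresh `extra` and `last_seen` on known ones."""
    rows = [
        (o["lister_id"], o["url"], o["loader_type"], o.get("extra"), now, now)
        for o in origins
    ]
    with db:  # commit on success, roll back on error
        db.executemany(UPSERT, rows)


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)

origin = {
    "lister_id": "lister-1",
    "url": "https://example.org/repo.git",
    "loader_type": "git",
}
# Two listing runs report the same origin: one insert, then one update.
record_listed_origins(db, [origin], now="2020-06-16T10:00:00+00:00")
record_listed_origins(db, [origin], now="2020-06-17T10:00:00+00:00")

rows = db.execute("select first_seen, last_seen from listed_origins").fetchall()
```

After both runs the table holds a single row whose `first_seen` is unchanged and whose `last_seen` reflects the second run. The lister never has to know whether the origin already existed.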

The current implementation doesn't disable origins that have disappeared when
doing a full listing run. This step will be done by a separate "origin garbage
collection" endpoint, which will use the `last_seen` field.
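
The garbage collection step is not part of this diff, so the following is only a sketch of how `last_seen` could drive it. The `ListedOrigin` dataclass shown here, its `enabled` flag, and the `disable_unseen_origins` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ListedOrigin:
    # Hypothetical, trimmed-down record; only `last_seen` comes from the summary.
    url: str
    last_seen: datetime
    enabled: bool = True


def disable_unseen_origins(origins, run_started_at):
    """Disable origins not seen since a full listing run started.

    Returns the URLs that were disabled by this call.
    """
    disabled = []
    for origin in origins:
        if origin.enabled and origin.last_seen < run_started_at:
            origin.enabled = False
            disabled.append(origin.url)
    return disabled


run_started_at = datetime(2020, 6, 16, 10, 0, tzinfo=timezone.utc)
origins = [
    # Seen during the run: stays enabled.
    ListedOrigin("https://example.org/a.git", datetime(2020, 6, 16, 10, 5, tzinfo=timezone.utc)),
    # Last seen a day before the run: gets disabled.
    ListedOrigin("https://example.org/b.git", datetime(2020, 6, 15, 9, 0, tzinfo=timezone.utc)),
]
disabled = disable_unseen_origins(origins, run_started_at)
```

This only makes sense after a full listing run, since an incremental run would not refresh `last_seen` on origins that still exist.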

Depends on D3288.
Related to T2442

Test Plan

tox tests added for both the insert and update behaviors

Diff Detail

Repository
rDSCH Scheduling utilities
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D3289 (id=11662)

Could not rebase; Attempt merge onto 1c93e553a1...

Updating 1c93e55..d107a55
Fast-forward
 swh/scheduler/backend.py               | 40 +++++++++++++++-
 swh/scheduler/interface.py             | 16 ++++++-
 swh/scheduler/model.py                 | 88 ++++++++++++++++++++++++++++++----
 swh/scheduler/sql/30-swh-schema.sql    | 33 +++++++++++++
 swh/scheduler/tests/conftest.py        | 26 +++++++++-
 swh/scheduler/tests/test_api_client.py |  1 +
 swh/scheduler/tests/test_model.py      | 19 +++++++-
 swh/scheduler/tests/test_scheduler.py  | 42 ++++++++++++----
 8 files changed, 241 insertions(+), 24 deletions(-)
Changes applied before test
commit d107a5553414ec7f2745a739dbc82e56eb62514e
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 16 10:25:08 2020 +0200

    Implement storage of listed origins
    
    This new API endpoint allows listers to record the origins they have seen during
    their current run.
    
    Origins are identified by the lister instance, the url of the origin, and the
    type of loader that should be used to load this origin.
    
    The implementation allows listers just send the list of origins they've
    seen (with some lightweight extra information), leaving the backend to handle
    whether to do an insertion or an update to an existing origin.
    
    The current implementation doesn't disable origins that have disappeared when
    doing a full listing run. This step will be done by a separate "origin garbage
    collection" endpoint, which will peruse the `last_seen` field.

commit e0fa5c58d38c2cbe39fe1f8e0fbb36591c29b661
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 16 10:24:03 2020 +0200

    Move lister addition in scheduler tests to a pytest fixture
    
    This lets us keep the tests a little DRYer.

commit 04894bd7fb6a1c4d658587395cbbe4f2d60c2a2a
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 16 10:22:23 2020 +0200

    Lister.instance_name doesn't need a factory/default value

commit f520108a8d0abefec3a91967aedbc29fb1a808f8
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Jun 16 10:08:59 2020 +0200

    Improve support of primary keys
    
    This splits primary keys across "automatic" primary keys (handled by the
    database) and manual primary keys (managed by the user). Use the opportunity to
    improve/clarify the documentation of field metadata attributes.

See https://jenkins.softwareheritage.org/job/DSCH/job/tests-on-diff/31/ for more details.

This revision is now accepted and ready to land. Jun 16 2020, 2:57 PM
This revision was automatically updated to reflect the committed changes.