
Add a backfiller cli command
Needs Review · Public

Authored by douardda on Dec 16 2022, 3:14 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

This command backfills a Kafka journal from an existing
PostgreSQL provenance storage.

The command runs a given number of workers in parallel. The state of
the backfilling process is saved in a LevelDB store, so an interrupted
backfill can be restarted, with one limitation: resuming will not work
properly if the range generation has been modified in between.
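
To make the resume limitation concrete, here is a minimal sketch of a range-based, resumable backfill. It is not the code from this diff: `make_ranges`, `range_key`, `backfill` and the integer ranges are hypothetical names chosen for illustration, a plain dict stands in for the LevelDB store, and threads stand in for whatever worker mechanism the command actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, MutableMapping, Tuple

Range = Tuple[int, int]


def make_ranges(total: int, step: int) -> Iterator[Range]:
    """Yield right-open ranges [start, end) covering [0, total)."""
    for start in range(0, total, step):
        yield (start, min(start + step, total))


def range_key(rng: Range) -> bytes:
    # The key encodes the range bounds, so changing `step` between runs
    # produces keys that no longer match previously saved state.
    return b"%d:%d" % rng


def backfill(
    ranges: Iterator[Range],
    state: MutableMapping[bytes, bytes],  # assumption: dict-like state store
    process: Callable[[Range], None],
    workers: int = 4,
) -> int:
    """Process every range not yet marked done; return how many were run."""
    pending = [r for r in ranges if state.get(range_key(r)) != b"done"]

    def run(rng: Range) -> None:
        process(rng)
        state[range_key(rng)] = b"done"  # recorded only after success

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, pending))  # list() re-raises worker exceptions
    return len(pending)
```

In this sketch, a second call with the same `step` finds every key already marked and does nothing, whereas a call with a different `step` generates keys that match none of the saved ones and reprocesses everything. That is one way the "range generation is modified" limitation can show up.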

Diff Detail

Repository
rDPROV Provenance database
Branch
diff/D8964
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 33325
Build 52233: Phabricator diff pipeline on jenkins
Build 52232: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D8964 (id=32301)

Rebasing onto c626cc21b3...

Current branch diff-target is up to date.
Changes applied before test
commit a98494dafd797ae6c1fb8e509d53fa17afa07374
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Dec 16 15:09:11 2022 +0100

    Add a backfiller cli command
    
    This command allowd to backfill a kafka journal from an existing
    Postgresql provenance storage.
    
    The command will run a given number of workers in parallel. The state of
    the backfilling process is saved in a leveldb store, so interrupting and
    restarting a backfilling process is possible, with limitations: it won't
    work properly if the range generation is modified.

commit 0b9df1a11c798767beacf09dfed6179ddc593419
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Dec 9 15:06:16 2022 +0100

    Extract the journal writer part from the ProvenanceStorageJournal class
    
    This allows to use the journal writing part independently from the
    ProvenanceStorage proxy class, eg. for the backfiller mechanism.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/711/ for more details.

You might want to remove your calls to logger.error() and print() before re-raising
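
One possible reading of this comment, sketched below: the worker code lets the exception propagate untouched, and the outermost caller logs it once with its traceback. `process_range` and `main` are hypothetical names for illustration, not functions from backfill.py.

```python
import logging

logger = logging.getLogger(__name__)


def process_range(start: int, end: int) -> None:
    # Hypothetical worker body: no try/except that logs and then re-raises.
    if start >= end:
        raise ValueError(f"empty range [{start}, {end})")


def main() -> int:
    try:
        process_range(5, 5)
    except ValueError:
        # Logged once, with traceback, at the outermost caller.
        logger.exception("backfill failed")
        return 1
    return 0
```

Logging at a single place avoids reporting the same failure twice, once at the raise site and again wherever the exception is finally handled.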

swh/provenance/storage/backfill.py:206

Add a call to notify("WATCHDOG=1") here; you currently don't have any.
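
For context, a hedged sketch of what such a watchdog ping could look like. The `notify` in the diff is presumably an existing helper such as `systemd.daemon.notify`; that is an assumption, and the stand-in below only implements the sd_notify datagram protocol directly so the example stays self-contained. `process_ranges` and `handle` are hypothetical names.

```python
import os
import socket


def notify(message: str) -> bool:
    """Send `message` to the socket named by $NOTIFY_SOCKET, if any."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not running under systemd supervision
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


def process_ranges(ranges, handle):
    for rng in ranges:
        handle(rng)
        notify("WATCHDOG=1")  # tell systemd the worker is still alive
```

The ping only has an effect when the service unit sets `WatchdogSec=`, and it needs to be sent more often than that timeout, which is why it sits inside the per-range loop.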

swh/provenance/storage/backfill.py:195

In English, right-open intervals are written [%s, %s); [%s, %s[ is the French notation.

olasd added a subscriber: olasd.

Apply @vlorentz's comments

Build is green

Patch application report for D8964 (id=32334)

Could not rebase; Attempt merge onto c626cc21b3...

Updating c626cc2..e66a2bf
Fast-forward
 mypy.ini                                           |   3 +
 requirements.txt                                   |   1 +
 swh/provenance/cli.py                              |  91 +++++-
 swh/provenance/storage/backfill.py                 | 344 +++++++++++++++++++++
 swh/provenance/storage/journal.py                  | 104 ++++---
 .../tests/test_provenance_journal_writer.py        |  64 ++--
 6 files changed, 539 insertions(+), 68 deletions(-)
 create mode 100644 swh/provenance/storage/backfill.py
Changes applied before test
commit e66a2bf98615a59ffbea30f1269c364bdf4db57e
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Dec 16 15:09:11 2022 +0100

    Add a backfiller cli command
    
    This command allowd to backfill a kafka journal from an existing
    Postgresql provenance storage.
    
    The command will run a given number of workers in parallel. The state of
    the backfilling process is saved in a leveldb store, so interrupting and
    restarting a backfilling process is possible, with limitations: it won't
    work properly if the range generation is modified.

commit 0b9df1a11c798767beacf09dfed6179ddc593419
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Dec 9 15:06:16 2022 +0100

    Extract the journal writer part from the ProvenanceStorageJournal class
    
    This allows to use the journal writing part independently from the
    ProvenanceStorage proxy class, eg. for the backfiller mechanism.

See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/712/ for more details.

You might want to remove your calls to logger.error() and print() before re-raising

what about these?