swh.journal: Add backfiller implementation
Closed Public

Authored by ardumont on Apr 4 2019, 4:17 PM.

Details

Summary

Backfill for the following object types:

  • content
  • skipped_content
  • directory
  • release
  • revision
  • snapshot
  • origin
  • origin_visit

Support for:

  • sending ranges
  • starting back from a given boundary (depending on the object type, could be a hash or an id).
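To illustrate the two features above for integer-keyed object types, here is a minimal sketch. It assumes ids are processed in fixed-size blocks; `integer_ranges` and `block_size` are hypothetical names for illustration and not necessarily what the diff implements.

```python
from typing import Iterator, Tuple


def integer_ranges(
    start: int, end: int, block_size: int = 1000
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (low, high) id ranges covering [start, end).

    Hypothetical helper: the name and the block size are assumptions.
    """
    for low in range(start, end, block_size):
        # The last block is clipped so that it never goes past `end`.
        yield low, min(low + block_size, end)


# A full run over ids 0..2499, then a run resumed from boundary id 1500.
full_run = list(integer_ranges(0, 2500))
resumed_run = list(integer_ranges(1500, 2500))
```

Resuming from a boundary then only means passing a different `start`; the range generation itself is unchanged.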

This is not complete yet (missing directory and snapshot).

We'd like to make sure the journal replayer is able to reuse the data sent to the journal.

Test Plan
  • tox
  • run in docker environment (D1346)

Diff Detail

Repository
rDJNL Journal infrastructure
Branch
backfiller
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5238
Build 7081: tox-on-jenkins (Jenkins)
Build 7080: arc lint + arc unit

Event Timeline

Add support for origin, origin_visit, content, revision, release

(with pagination)

Snapshot and directory support remain to be added

And checks with the KafkaReplayer

ardumont edited the test plan for this revision.
ardumont added projects: Journal, Storage manager.
ardumont edited the test plan for this revision.

Add support for snapshot and directory

Fix pep8 violation

Apparently the 100_000_000 literal is not supported in Python 3.5, but it is in later versions
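For context, underscores in numeric literals arrived with Python 3.6 (PEP 515), so a module that must still parse on 3.5 needs another spelling. A small sketch of two portable alternatives follows; the constant names are hypothetical and not taken from the diff.

```python
# Both spell one hundred million without the underscore literal syntax,
# so the module still parses on Python 3.5.
# Constant names are hypothetical, chosen for illustration only.
LIMIT_AS_POWER = 10 ** 8
LIMIT_AS_PRODUCT = 100 * 1000 * 1000
```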

  • backfill: Fix origin_visit support to use the right format
  • backfill: Flush the messages to the journal when done with the batch
  • backfill: Make partition key a singleton
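The "flush when done with the batch" change can be pictured with the sketch below. It uses a stand-in producer class rather than a real Kafka client, and every name in it (`RecordingProducer`, `send_batch`, the topic string) is an assumption made for illustration.

```python
from typing import Iterable, List, Tuple

Message = Tuple[str, bytes, bytes]  # (topic, key, value)


class RecordingProducer:
    """Stand-in for a journal producer with a send/flush interface.

    Hypothetical class: it only buffers messages in memory.
    """

    def __init__(self) -> None:
        self.pending: List[Message] = []
        self.delivered: List[Message] = []

    def send(self, topic: str, key: bytes, value: bytes) -> None:
        # Messages are only buffered here, not delivered yet.
        self.pending.append((topic, key, value))

    def flush(self) -> None:
        # Deliver everything buffered so far.
        self.delivered.extend(self.pending)
        self.pending.clear()


def send_batch(
    producer: RecordingProducer, topic: str, objects: Iterable[Tuple[bytes, bytes]]
) -> None:
    """Send one batch of (key, value) pairs, then flush once at the end."""
    for key, value in objects:
        producer.send(topic, key, value)
    producer.flush()


producer = RecordingProducer()
send_batch(producer, "example.topic", [(b"k1", b"v1"), (b"k2", b"v2")])
```

Flushing once per batch, rather than once per message, keeps the number of blocking calls low while still guaranteeing nothing stays buffered when the batch is reported as done.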

Build has FAILED

Yes, replayer tests are broken, unrelated to the backfiller
Apparently D1367 will fix that.

Rebase on cleanup_publisher branch

That gets some unrelated publisher fix commits out of the way

(tests will still fail)

Rebase on master

That gets the CLI's logging setup fix out of the way (it is
unrelated to this branch and needed by other people anyway)

oops, the rebase should have been on cleanup-publisher branch

done now

  • Remove publisher notion
anlambert added inline comments.
swh/journal/backfill.py
8

I do not really understand what the journal backfiller is supposed to do by reading that first sentence.

49

I think it should simply be "id" here. The defined alias is the same as the column name.

74

same here

78

same here

215–217

This could be moved into a dedicated _compute_shift_bits function, as it is also used in the byte_ranges method below.

423

s/Reads/Read/, as the imperative form is used elsewhere

swh/journal/backfill.py
49

We use those to disambiguate the SQL query we generate (when a join is involved).

For example, release and revision join on person, which also has name and id columns.

So we need those.
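The point can be reproduced with a toy schema. The sketch below uses an in-memory SQLite database rather than the real storage backend, and the table layout is a simplified assumption (the `message` and `author` columns are hypothetical); it only shows why qualifying and aliasing the columns keeps the result unambiguous once person is joined in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.executescript(
    """
    -- Toy schema (assumption): both tables expose an "id" column.
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE revision (
        id INTEGER PRIMARY KEY,
        message TEXT,
        author INTEGER REFERENCES person(id)
    );
    INSERT INTO person VALUES (1, 'alice');
    INSERT INTO revision VALUES (10, 'initial commit', 1);
    """
)

# Each selected column is qualified by its table and given an explicit
# alias, so "id" always means the revision id even with person joined.
row = conn.execute(
    """
    SELECT revision.id AS id,
           revision.message AS message,
           person.id AS author_id,
           person.name AS author_name
    FROM revision
    LEFT JOIN person ON revision.author = person.id
    """
).fetchone()
result = dict(row)
conn.close()
```

Without the `revision.id AS id` form, a generated query that selects a bare `id` would become ambiguous as soon as the join is added.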

215–217

Indeed!
I will adapt.

  • backfill: Improve module docstrings
  • backfill: Flush when data is sent
  • backfill: Define a _compute_shift_bits function to reuse code
ardumont marked an inline comment as done.

Rebase on latest master

swh/journal/backfill.py
49

Oh I see, that was not straightforward to understand.

  • backfill: Use the right logging instruction
  • backfill: Fix logging statement
swh/journal/backfill.py
49

Indeed!

ardumont added inline comments.
swh/journal/backfill.py
340

@faux here :)

Fix pep8 violation (missing one blank line)

This revision is now accepted and ready to land. Apr 12 2019, 2:08 PM

Rebase to latest master and branch diff to master

This revision was automatically updated to reflect the committed changes.
ardumont retitled this revision from swh.journal: Bootstrap backfiller to swh.journal: Add backfiller implementation.