Page MenuHomeSoftware Heritage

journal: Use swh-model objects instead of dicts in replay and writer
ClosedPublic

Authored by ardumont on Mar 11 2020, 4:46 PM.

Details

Summary

@douardda that might break your configuration once merged and tagged (if that happens ;).

One step closer to let the BaseModel flow through the journal.

Depends on D2800

Test Plan

tox

Diff Detail

Repository
rDJNL Journal infrastructure
Branch
remove-validate-proxy
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11075
Build 16690: tox-on-jenkinsJenkins
Build 16689: arc lint + arc unit

Event Timeline

ardumont created this revision.Mar 11 2020, 4:46 PM
ardumont added inline comments.Mar 11 2020, 5:35 PM
swh/journal/replay.py
151

Working on this.

ardumont edited the summary of this revision. (Show Details)Mar 11 2020, 5:40 PM
ardumont added inline comments.
swh/journal/replay.py
151

Related to D2813

vlorentz requested changes to this revision.Mar 12 2020, 2:20 PM
vlorentz added a subscriber: vlorentz.

Bad commit message; you're describing a consequence of the config instead of what the change is about.

I suggest "Use swh-model objects instead of dicts in replay and writer".

And if it's not too much trouble, you should split this into one diff for the replayer, and one for the writer.

swh/journal/__init__.py
17–25

why is this not in replay.py?

swh/journal/replay.py
156–170

If you merge D2813 first, you don't need to return List[Dict].

And instead of Tuple[List[SkippedContent], List[Content]], you can return List[BaseContent] call isinstance when iterating the list.

339

I don't understand this comment

This revision now requires changes to proceed.Mar 12 2020, 2:20 PM
ardumont added inline comments.Mar 12 2020, 3:08 PM
swh/journal/__init__.py
17–25

most possibly because i thought it'd be shared and in the end it was not.
so yes, needs to be moved back in replay.

swh/journal/replay.py
156–170

I don't want to iterate again on the list...
I want the iteration to take place once and for all at the same location so here.

339

upserts accepts dict for now, it's D2813 again.

vlorentz added inline comments.Mar 12 2020, 3:12 PM
swh/journal/replay.py
156–170

Instead of giving fix_and_convert_objects three completely different behavior according to object types, you should split it into three different functions, one for each of the behaviors.

ardumont updated this revision to Diff 10034.Mar 12 2020, 6:08 PM

Adapt according to review

ardumont retitled this revision from journal: Remove 'validate' proxy to journal: Use swh-model objects instead of dicts in replay and writer.Mar 12 2020, 6:19 PM
ardumont added inline comments.Mar 12 2020, 8:23 PM
swh/journal/replay.py
156–170

In that regard, i was tempted to:

  • remove completely this function indirection
  • do the match case on object type directly in _insert_objects with the necessary calls on fix_stuff there

and i did.

ardumont updated this revision to Diff 10041.Mar 13 2020, 9:55 AM

Fix origin_visits must be fixed prior to create origins

vlorentz accepted this revision.Mar 13 2020, 11:34 AM

Nice!

Next step is replacing fix_* functions returning dicts with read_* (or similar name) returning swh-model objects, right?

This revision is now accepted and ready to land.Mar 13 2020, 11:34 AM
ardumont updated this revision to Diff 10050.Mar 13 2020, 3:12 PM

Rebase on latest master dev

Next step is replacing fix_* functions returning dicts with read_* (or similar name) returning swh-model objects, right?

I'm unsure of what you mean. Read model object from the kafka topics?

For sure first, i'd like we manipulate model objects here and not a mix of dict and model objects.