Page MenuHomeSoftware Heritage

journal: Use swh-model objects instead of dicts in replay and writer
ClosedPublic

Authored by ardumont on Mar 11 2020, 4:46 PM.

Details

Summary

@douardda that might break your configuration once merged and tagged (if that happens ;).

One step closer to let the BaseModel flow through the journal.

Depends on D2800

Test Plan

tox

Diff Detail

Repository
rDJNL Journal infrastructure
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

swh/journal/replay.py
124

Working on this.

ardumont added inline comments.
swh/journal/replay.py
124

Related to D2813

vlorentz added a subscriber: vlorentz.

Bad commit message; you're describing a consequence of the config instead of what the change is about.

I suggest "Use swh-model objects instead of dicts in replay and writer".

And if it's not too much trouble, you should split this into one diff for the replayer, and one for the writer.

swh/journal/__init__.py
17–25 ↗(On Diff #10003)

why is this not in replay.py?

swh/journal/replay.py
120–121

If you merge D2813 first, you don't need to return List[Dict].

And instead of Tuple[List[SkippedContent], List[Content]], you can return List[BaseContent] call isinstance when iterating the list.

310

I don't understand this comment

This revision now requires changes to proceed.Mar 12 2020, 2:20 PM
swh/journal/__init__.py
17–25 ↗(On Diff #10003)

most possibly because i thought it'd be shared and in the end it was not.
so yes, needs to be moved back in replay.

swh/journal/replay.py
120–121

I don't want to iterate again on the list...
I want the iteration to take place once and for all at the same location so here.

310

upserts accepts dict for now, it's D2813 again.

swh/journal/replay.py
120–121

Instead of giving fix_and_convert_objects three completely different behavior according to object types, you should split it into three different functions, one for each of the behaviors.

ardumont retitled this revision from journal: Remove 'validate' proxy to journal: Use swh-model objects instead of dicts in replay and writer.Mar 12 2020, 6:19 PM
swh/journal/replay.py
120–121

In that regard, i was tempted to:

  • remove completely this function indirection
  • do the match case on object type directly in _insert_objects with the necessary calls on fix_stuff there

and i did.

Fix origin_visits must be fixed prior to create origins

Nice!

Next step is replacing fix_* functions returning dicts with read_* (or similar name) returning swh-model objects, right?

This revision is now accepted and ready to land.Mar 13 2020, 11:34 AM

Rebase on latest master dev

Next step is replacing fix_* functions returning dicts with read_* (or similar name) returning swh-model objects, right?

I'm unsure of what you mean. Read model object from the kafka topics?

For sure first, i'd like we manipulate model objects here and not a mix of dict and model objects.