Page MenuHomeSoftware Heritage

Add a new JournalWriter interface, which is notified by swh-storage before writing to pgsql.
ClosedPublic

Authored by vlorentz on Mar 26 2019, 4:50 PM.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

  • Add support for importing the future DirectKafkaWriter class from the journal
olasd requested changes to this revision.Mar 27 2019, 10:54 AM

I think all the self.assertEquals should be assertCountEquals (we don't really care about the order of objects there).

Is the id really needed in the journal writer interface? If so, the id of the objects should rather match what we're using in kafka (so, for contents, it's not just the sha1, it'd be a sorted tuple of all the hashes), but I think we can just drop it.

This revision now requires changes to proceed.Mar 27 2019, 10:54 AM
In D1294#27689, @olasd wrote:

I think all the self.assertEquals should be assertCountEquals (we don't really care about the order of objects there).

Scratch that, we do care that origin_visit updates get applied in order here.

In D1294#27689, @olasd wrote:

Is the id really needed in the journal writer interface?

Yes, for compaction.

If so, the id of the objects should rather match what we're using in kafka (so, for contents, it's not just the sha1, it'd be a sorted tuple of all the hashes)

will do

In D1294#27689, @olasd wrote:

(so, for contents, it's not just the sha1, it'd be a sorted tuple of all the hashes)

According to both the publisher's code and the tests, it appears that the sha1 is the only key.

drop id handling logic from the storage.

olasd requested changes to this revision.Mar 27 2019, 4:31 PM
  • the snapshot_add methods should actually record two changes:
    • the addition of the snapshot (with full data)
    • the update of the origin visit (with full data as well)
  • every write of a change to an origin visit (origin_visit_add, origin_visit_update *and* snapshot_add) should send the full data currently associated with the origin visit:
    • origin data
    • visit id
    • date
    • status
    • snapshot id
    • metadata

If we don't do that, we run the risk of clobbering existing data with an incomplete object.

This revision now requires changes to proceed.Mar 27 2019, 4:31 PM
  • Use origin dictionaries instead of origin ids when dealing with origin_visits
  • On origin_visit updates, send all the data of the visit to the journal writer.
In D1294#28007, @olasd wrote:
  • the snapshot_add methods should actually record two changes:
    • the addition of the snapshot (with full data)
    • the update of the origin visit (with full data as well)

Should it really update the origin visit? A reader of the journal can infer that the origin_visit must be updated when reading the message in the snapshot topic.

In D1294#28007, @olasd wrote:
  • the snapshot_add methods should actually record two changes:
    • the addition of the snapshot (with full data)
    • the update of the origin visit (with full data as well)

Should it really update the origin visit? A reader of the journal can infer that the origin_visit must be updated when reading the message in the snapshot topic.

Hmm... I didn't realize snapshot_add actually does two very different things: add a snapshot (which does not have an FK to the origin visit) AND update the origin_visit to point to that snapshot.

Hmm... I didn't realize snapshot_add actually does two very different things: add a snapshot (which does not have an FK to the origin visit) AND update the origin_visit to point to that snapshot.

Yeah, the method should probably be split in two ("add a snapshot", "update the origin_visit to record that it points to the snapshot"). This is an artefact of the old way occurrences were handled.

In any case, we're going to need that split when we write the code that reads all the graph leaves from kafka separately, and tries to store them back into PostgreSQL.

In D1294#28133, @olasd wrote:

Hmm... I didn't realize snapshot_add actually does two very different things: add a snapshot (which does not have an FK to the origin visit) AND update the origin_visit to point to that snapshot.

Yeah, the method should probably be split in two ("add a snapshot", "update the origin_visit to record that it points to the snapshot"). This is an artefact of the old way occurrences were handled.

That refactoring should be fairly easy as the only users of that method are:

  • swh.loader.core
  • the tests for the indexer
  • Write origin_visit updates when a snapshot is added.
This revision is now accepted and ready to land.Mar 28 2019, 11:35 AM
This revision was landed with ongoing or failed builds.Mar 28 2019, 11:42 AM
This revision was automatically updated to reflect the committed changes.