Is the id really needed in the journal writer interface? If so, the id of the objects should rather match what we're using in kafka (so, for contents, it's not just the sha1, it'd be a sorted tuple of all the hashes), but I think we can just drop it.

This revision now requires changes to proceed. Mar 27 2019, 10:54 AM

In D1294#27689, @olasd wrote:

I think all the self.assertEquals calls should be assertCountEqual (we don't really care about the order of objects there).

Scratch that, we do care that origin_visit updates get applied in order here.
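For reference, a minimal sketch of the difference between the two assertions, using only the standard unittest module. The visit dictionaries and the class name are made-up sample data, not taken from the swh-storage tests.

```python
import unittest


class OrderingExample(unittest.TestCase):
    # Hypothetical journal entries: two updates to the same origin visit.
    expected = [
        ("origin_visit", {"visit": 1, "status": "ongoing"}),
        ("origin_visit", {"visit": 1, "status": "full"}),
    ]

    def test_count_equal_ignores_order(self):
        # assertCountEqual only checks that both sequences hold the same
        # elements the same number of times; a reordering goes unnoticed.
        self.assertCountEqual(list(reversed(self.expected)), self.expected)

    def test_equal_checks_order(self):
        # assertEqual compares lists position by position, so a swapped
        # pair of updates is reported as a failure.
        with self.assertRaises(AssertionError):
            self.assertEqual(list(reversed(self.expected)), self.expected)
```

Since origin_visit updates must be applied in order, the order-sensitive assertion is the one that would catch a regression here.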

In D1294#27689, @olasd wrote:

Is the id really needed in the journal writer interface?

Yes, for compaction.

If so, the id of the objects should rather match what we're using in kafka (so, for contents, it's not just the sha1, it'd be a sorted tuple of all the hashes)

Will do.

In D1294#27689, @olasd wrote:

(so, for contents, it's not just the sha1, it'd be a sorted tuple of all the hashes)

According to both the publisher's code and the tests, it appears that the sha1 is the only key.
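To make the compaction argument concrete, here is a minimal sketch of how a compacted log keeps only the latest message per key, and of the two candidate keys discussed above for contents. The two key functions and the `compact` helper are hypothetical illustrations, not the publisher's actual code; the hash names in the default argument and the reading of "sorted tuple of all the hashes" as pairs sorted by hash name are assumptions as well.

```python
def key_sha1_only(content):
    # Candidate 1: the sha1 alone is the message key.
    return content["sha1"]


def key_all_hashes(content, hash_names=("sha1", "sha1_git", "sha256")):
    # Candidate 2: a tuple of every hash, sorted by hash name so the key
    # is stable regardless of dictionary ordering.
    return tuple((name, content[name]) for name in sorted(hash_names))


def compact(messages, key_func):
    # Simulate log compaction: for each key, only the most recent message
    # survives. The dict keeps keys in order of first appearance.
    latest = {}
    for message in messages:
        latest[key_func(message)] = message
    return list(latest.values())
```

With the sha1-only key, two contents that share a sha1 but differ in another hash collapse into one message after compaction; with the all-hashes key they are kept as distinct entries.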

Drop id handling logic from the storage.

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/224/ for more details.

Harbormaster completed remote builds in B4896: Diff 4141. Mar 27 2019, 4:00 PM

The snapshot_add methods should actually record two changes:
- the addition of the snapshot (with full data)
- the update of the origin visit (with full data as well)

Every write of a change to an origin visit (origin_visit_add, origin_visit_update *and* snapshot_add) should send the full data currently associated with the origin visit:
- origin data
- visit id
- date
- status
- snapshot id
- metadata

If we don't do that, we run the risk of clobbering existing data with an incomplete object.
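A minimal sketch of the "always send the full visit" rule, assuming a journal that, like a compacted topic, keeps only the last message per key. `InMemoryJournal`, `VisitStorage`, the `(origin url, visit id)` key, and the field names are hypothetical and only mirror the list above; they are not the swh-storage implementation.

```python
import copy


class InMemoryJournal:
    """Hypothetical journal keeping the last message written for each key."""

    def __init__(self):
        self.latest = {}

    def write(self, object_type, key, value):
        # Copy the value so later changes in the storage do not leak in.
        self.latest[(object_type, key)] = copy.deepcopy(value)


class VisitStorage:
    """Hypothetical storage holding origin visits and notifying a journal."""

    def __init__(self, journal):
        self.journal = journal
        self.visits = {}

    def origin_visit_add(self, origin, visit, date):
        key = (origin["url"], visit)
        self.visits[key] = {
            "origin": origin, "visit": visit, "date": date,
            "status": "ongoing", "snapshot": None, "metadata": None,
        }
        self._notify(key)

    def origin_visit_update(self, origin, visit, **changes):
        key = (origin["url"], visit)
        self.visits[key].update(changes)
        self._notify(key)

    def _notify(self, key):
        # Always send the complete visit, never just the changed fields.
        self.journal.write("origin_visit", key, self.visits[key])
```

If `origin_visit_update` wrote only `changes` to the journal, the last message for that key would lack the origin, date, and snapshot, and a reader replaying the journal would overwrite the complete visit with the partial one.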

This revision now requires changes to proceed. Mar 27 2019, 4:31 PM

Use origin dictionaries instead of origin ids when dealing with origin_visits.
On origin_visit updates, send all the data of the visit to the journal writer.

vlorentz planned changes to this revision. Mar 27 2019, 4:55 PM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/225/ for more details.

Harbormaster completed remote builds in B4899: Diff 4144. Mar 27 2019, 4:57 PM

In D1294#28007, @olasd wrote:

the snapshot_add methods should actually record two changes:
- the addition of the snapshot (with full data)
- the update of the origin visit (with full data as well)

Should it really update the origin visit? A reader of the journal can infer that the origin_visit must be updated when reading the message in the snapshot topic.

Revert doc change.

Harbormaster completed remote builds in B4901: Diff 4146. Mar 27 2019, 5:41 PM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/226/ for more details.

In D1294#28022, @vlorentz wrote:

In D1294#28007, @olasd wrote:

the snapshot_add methods should actually record two changes:
- the addition of the snapshot (with full data)
- the update of the origin visit (with full data as well)

Should it really update the origin visit? A reader of the journal can infer that the origin_visit must be updated when reading the message in the snapshot topic.

Hmm... I didn't realize snapshot_add actually does two very different things: add a snapshot (which does not have an FK to the origin visit) AND update the origin_visit to point to that snapshot.

In D1294#28022, @vlorentz wrote:

Hmm... I didn't realize snapshot_add actually does two very different things: add a snapshot (which does not have an FK to the origin visit) AND update the origin_visit to point to that snapshot.

Yeah, the method should probably be split in two ("add a snapshot", "update the origin_visit to record that it points to the snapshot"). This is an artefact of the old way occurrences were handled.

In any case, we're going to need that split when we write the code that reads all the graph leaves from kafka separately, and tries to store them back into PostgreSQL.
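A rough sketch of what the split could look like, assuming a storage that notifies its journal before storing each object. `SplitStorage`, its attributes, and the method signatures are hypothetical; the real swh-storage signatures may differ.

```python
class SplitStorage:
    """Hypothetical storage where the two effects are separate methods."""

    def __init__(self):
        self.snapshots = {}
        self.visits = {}
        self.journal = []  # (object_type, object) pairs, in write order

    def snapshot_add(self, snapshot):
        # Effect 1: record and store the snapshot itself; it carries no
        # reference to any origin visit.
        self.journal.append(("snapshot", dict(snapshot)))
        self.snapshots[snapshot["id"]] = snapshot

    def origin_visit_update(self, origin_url, visit, snapshot_id):
        # Effect 2: point an existing visit at an already stored snapshot.
        if snapshot_id not in self.snapshots:
            raise ValueError("unknown snapshot: %r" % (snapshot_id,))
        key = (origin_url, visit)
        updated = dict(self.visits[key], snapshot=snapshot_id)
        # The journal receives the full visit, not just the new field.
        self.journal.append(("origin_visit", dict(updated)))
        self.visits[key] = updated
```

A caller that previously relied on the combined behaviour would call `snapshot_add` and then `origin_visit_update`, which yields one journal entry per object type and lets a journal reader replay snapshots independently of visits.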

In D1294#28133, @olasd wrote:

In D1294#28022, @vlorentz wrote:

Hmm... I didn't realize snapshot_add actually does two very different things: add a snapshot (which does not have an FK to the origin visit) AND update the origin_visit to point to that snapshot.

Yeah, the method should probably be split in two ("add a snapshot", "update the origin_visit to record that it points to the snapshot"). This is an artefact of the old way occurrences were handled.

That refactoring should be fairly easy as the only users of that method are:
- swh.loader.core
- the tests for the indexer

Write origin_visit updates when a snapshot is added.

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/233/ for more details.

Harbormaster completed remote builds in B4920: Diff 4163. Mar 28 2019, 11:17 AM

olasd accepted this revision. Mar 28 2019, 11:35 AM

This revision is now accepted and ready to land. Mar 28 2019, 11:35 AM

Rebase/squash.

This revision was landed with ongoing or failed builds. Mar 28 2019, 11:42 AM

Closed by commit rDSTO246855c813a2: Add a new JournalWriter interface, which is notified by swh-storage before… (authored by vlorentz).

This revision was automatically updated to reflect the committed changes.

Harbormaster failed remote builds in B4922: Diff 4166! Mar 28 2019, 11:43 AM

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/234/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/234/console

Add a new JournalWriter interface, which is notified by swh-storage before writing to pgsql.
Diff 4167

swh/storage/in_memory.py

swh/storage/journal_writer.py

swh/storage/storage.py

swh/storage/tests/storage_testing.py

swh/storage/tests/test_api_client.py

swh/storage/tests/test_in_memory.py

swh/storage/tests/test_storage.py
