
Rewrite of the export pipeline using Exporters
Closed, Public

Authored by seirl on Dec 10 2020, 7:38 PM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Summary

This rewrite enables several things:

  1. Each export can now run multiple exporters, so we can read the journal a single time and export the objects we read in different formats, without having to re-read them each time.
  2. We use a shared on-disk set for the nodes, to avoid storing them redundantly in each exporter.
  3. The SQLite files are sharded by the partition ID of the incoming messages. This reduces the performance issues we had with a single large set per process. It also makes it easier to rewrite the on-disk set logic to use a different set backend, or to change the sharding.
  4. The new abstractions make it a lot nicer to write exporters: you just override the methods corresponding to each object type, and you can do your setup and teardown in the enter and exit methods of your exporter, which is used as a context manager (see the sketch after this list). Exporters also don't have to worry about duplicates, since deduplication is already handled by the journal processor itself.
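
To make item 4 concrete, here is a minimal sketch of what such an exporter could look like. This is a guess reconstructed from the description above, not the actual code in swh/dataset/exporter.py: the names `Exporter`, `process_object`, and `RevisionEdgeExporter` are illustrative.

```
import pathlib


class Exporter:
    """Base class sketch: exporters are context managers, so
    subclasses do their setup/teardown in __enter__/__exit__."""

    def __init__(self, config: dict, export_path: pathlib.Path):
        self.config = config
        self.export_path = export_path

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        pass

    def process_object(self, object_type: str, obj: dict) -> None:
        # Dispatch each journal object to a per-type handler
        # (process_revision, process_origin, ...) if the subclass
        # defines one; ignore the object otherwise.
        handler = getattr(self, f"process_{object_type}", None)
        if handler is not None:
            handler(obj)


class RevisionEdgeExporter(Exporter):
    """Toy exporter writing revision->directory edges to a CSV file."""

    def __enter__(self):
        self.fobj = (self.export_path / "edges.csv").open("w")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.fobj.close()

    def process_revision(self, revision: dict) -> None:
        # No deduplication needed here: the journal processor only
        # hands each object to the exporters once.
        self.fobj.write(
            f"{revision['id'].hex()},{revision['directory'].hex()}\n"
        )
```

Since the journal is read once and each message is fanned out to every registered exporter, adding a new output format is just a matter of writing another such subclass.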

Diff Detail

Repository
rDDATASET Datasets
Branch
rewrite_exporter
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17971
Build 27760: arc lint + arc unit

Event Timeline

zack added a subscriber: zack.
zack added inline comments.
swh/dataset/exporter.py
10–12

please add a brief docstring here (I guess you can reuse some of the stuff you already have in the diff description for this)

swh/dataset/graph.py
23

"to a Zstandard-compressed"

swh/dataset/journalprocessor.py
113–116

is this still relevant? I thought you mentioned that, with the new sharding approach based on intrinsic IDs, this wasn't a problem anymore. Or was it something else?

This revision is now accepted and ready to land. Dec 11 2020, 12:19 PM
olasd added inline comments.
swh/dataset/journalprocessor.py
113–116

It's still relevant: sharding based on intrinsic IDs would force multiple worker threads to write the SQLite files concurrently, which is a problem; so it was scrapped in favor of partition-based sharding, which (by design of Kafka clients) means each database is written by a single worker.

However, I'd argue that's not a hack, in the sense that the deserialize_message hook /is/ the blessed way for a swh.journal client to get the API to "leak" the partition info to a consumer that would care about it.
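
For readers unfamiliar with that hook, here is a rough sketch of how a consumer might use it to see the partition of each message. It assumes swh.journal's `JournalClient` exposes a `deserialize_message` method receiving the raw Kafka message, as the comment above suggests; the exact signature may differ between swh.journal versions, and `PartitionAwareClient` is a hypothetical name.

```
from swh.journal.client import JournalClient


class PartitionAwareClient(JournalClient):
    """Sketch: surface the Kafka partition of each message so the
    consumer can shard its on-disk sets per partition."""

    def deserialize_message(self, message):
        # `message` is the raw Kafka message at this point, so its
        # partition is still visible; remember it before handing the
        # deserialized value to the worker processing callback.
        self.current_partition = message.partition()
        return super().deserialize_message(message)
```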

Fix various coding errors and make minor improvements

  • Exporter documentation fixes
  • Journal processor: fetch offsets in parallel
  • journalprocessor: also partition sqlite files by first byte
  • SQLite on-disk set: disable journalling and synchronous mode (sketched after this list)
  • tests: fix test_export_origin
  • journalprocessor: remove comment about deserialize_message overload being a 'hack'
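
A rough illustration of the last two SQLite-related points, combining sharding by partition and first byte with the disabled journalling/synchronous pragmas. The class name, table name, and file-naming scheme are hypothetical; only the pragmas and the INSERT OR IGNORE idiom are standard SQLite.

```
import sqlite3


class SQLiteSet:
    """Sketch of an on-disk set of node IDs backed by one SQLite file."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        # Trade durability for speed: these sets are throwaway
        # deduplication state, safe to rebuild if the export restarts.
        self.db.execute("PRAGMA journal_mode = OFF")
        self.db.execute("PRAGMA synchronous = OFF")
        self.db.execute("CREATE TABLE IF NOT EXISTS tmpset (val BLOB UNIQUE)")

    def add(self, value: bytes) -> bool:
        """Insert value; return True if it was not already present."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO tmpset (val) VALUES (?)", (value,)
        )
        return cur.rowcount > 0


def shard_path(base: str, partition: int, node_id: bytes) -> str:
    # Illustrative naming: one set per (Kafka partition, first byte
    # of the node ID), so each worker writes many small databases.
    return f"{base}/nodes-p{partition}-{node_id[0]:02x}.sqlite3"
```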

Landed, but Phabricator doesn't seem to see it.