
Rewrite of the export pipeline using Exporters
Closed, Public

Authored by seirl on Dec 10 2020, 7:38 PM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Summary

This rewrite enables several things:

  1. Each export can now run multiple exporters, so we can read the journal a single time and export the objects we read in different formats, without having to re-read them each time.
  2. We use a shared on-disk set for the nodes, to avoid storing them redundantly in each exporter.
  3. The SQLite files are sharded by the partition ID of the incoming messages. This reduces the performance issues we had with a single large set per process. It also makes it easier to rewrite the on-disk set logic to use a different set backend, or to change the sharding.
  4. The new abstractions make it a lot nicer to write exporters: you just override the methods corresponding to each object type, and you can do your setup and teardown in the enter and exit methods of your exporter, which is used as a context manager (see the sketch after this list). Exporters also don't have to worry about duplicates, since deduplication is already handled by the journal processor itself.
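
To make item 4 concrete, here is a minimal sketch of what such an exporter could look like. This is a guess reconstructed from the description above, not the actual code in swh/dataset/exporter.py: the names `Exporter`, `process_object`, and `RevisionEdgeExporter` are illustrative.

```
import pathlib


class Exporter:
    """Base class sketch: exporters are context managers, so
    subclasses do their setup/teardown in __enter__/__exit__."""

    def __init__(self, config: dict, export_path: pathlib.Path):
        self.config = config
        self.export_path = export_path

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        pass

    def process_object(self, object_type: str, obj: dict) -> None:
        # Dispatch each journal object to a per-type handler
        # (process_revision, process_origin, ...) if the subclass
        # defines one; ignore the object otherwise.
        handler = getattr(self, f"process_{object_type}", None)
        if handler is not None:
            handler(obj)


class RevisionEdgeExporter(Exporter):
    """Toy exporter writing revision->directory edges to a CSV file."""

    def __enter__(self):
        self.fobj = (self.export_path / "edges.csv").open("w")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.fobj.close()

    def process_revision(self, revision: dict) -> None:
        # No deduplication needed here: the journal processor only
        # hands each object to the exporters once.
        self.fobj.write(
            f"{revision['id'].hex()},{revision['directory'].hex()}\n"
        )
```

Since the journal is read once and each message is fanned out to every registered exporter, adding a new output format is just a matter of writing another such subclass.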

Diff Detail

Repository
rDDATASET Datasets
Branch
rewrite_exporter
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17971
Build 27760: arc lint + arc unit

Event Timeline

zack added a subscriber: zack.
zack added inline comments.
swh/dataset/exporter.py
10–12

please add a brief docstring here (I guess you can reuse some of the stuff you already have in the diff description for this)

swh/dataset/graph.py
23

"to a Zstandard-compressed"

swh/dataset/journalprocessor.py
113–116

is this still relevant? I thought you mentioned that, with the new sharding approach based on intrinsic IDs, this wasn't a problem anymore. Or was it something else?

This revision is now accepted and ready to land. Dec 11 2020, 12:19 PM
olasd added inline comments.
swh/dataset/journalprocessor.py
113–116

It's still relevant: sharding based on intrinsic IDs would force multiple worker threads to write the SQLite files concurrently, which is a problem; so it was scrapped in favor of partition-based sharding, which (by design of Kafka clients) means each database is written by a single worker.

However, I'd argue that's not a hack, in the sense that the deserialize_message hook /is/ the blessed way for a swh.journal client to get the API to "leak" the partition info to a consumer that would care about it.
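
For readers unfamiliar with that hook, here is a rough sketch of how a consumer might use it to see the partition of each message. It assumes swh.journal's `JournalClient` exposes a `deserialize_message` method receiving the raw Kafka message, as the comment above suggests; the exact signature may differ between swh.journal versions, and `PartitionAwareClient` is a hypothetical name.

```
from swh.journal.client import JournalClient


class PartitionAwareClient(JournalClient):
    """Sketch: surface the Kafka partition of each message so the
    consumer can shard its on-disk sets per partition."""

    def deserialize_message(self, message):
        # `message` is the raw Kafka message at this point, so its
        # partition is still visible; remember it before handing the
        # deserialized value to the worker processing callback.
        self.current_partition = message.partition()
        return super().deserialize_message(message)
```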

Fix various coding errors and make minor improvements

  • Exporter documentation fixes
  • Journal processor: fetch offsets in parallel
  • journalprocessor: also partition sqlite files by first byte
  • SQLite on-disk set: disable journalling and synchronous mode (sketched after this list)
  • tests: fix test_export_origin
  • journalprocessor: remove comment about deserialize_message overload being a 'hack'
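
A rough illustration of the last two SQLite-related points, combining sharding by partition and first byte with the disabled journalling/synchronous pragmas. The class name, table name, and file-naming scheme are hypothetical; only the pragmas and the INSERT OR IGNORE idiom are standard SQLite.

```
import sqlite3


class SQLiteSet:
    """Sketch of an on-disk set of node IDs backed by one SQLite file."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        # Trade durability for speed: these sets are throwaway
        # deduplication state, safe to rebuild if the export restarts.
        self.db.execute("PRAGMA journal_mode = OFF")
        self.db.execute("PRAGMA synchronous = OFF")
        self.db.execute("CREATE TABLE IF NOT EXISTS tmpset (val BLOB UNIQUE)")

    def add(self, value: bytes) -> bool:
        """Insert value; return True if it was not already present."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO tmpset (val) VALUES (?)", (value,)
        )
        return cur.rowcount > 0


def shard_path(base: str, partition: int, node_id: bytes) -> str:
    # Illustrative naming: one set per (Kafka partition, first byte
    # of the node ID), so each worker writes many small databases.
    return f"{base}/nodes-p{partition}-{node_id[0]:02x}.sqlite3"
```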

Landed, but Phabricator doesn't seem to see it.