Enable the journal-writer for the swh-idx-storage in production
Open, Normal, Public

Description

The config is like the one for swh-storage, so something like this:

journal_writer:
  cls: kafka
  args:
    brokers: "%{alias('swh::deploy::journal::brokers')}"
    prefix: "%{alias('swh::deploy::journal::prefix')}"
    client_id: "swh.indexer.storage.journal_writer.%{::swh_hostname.short}"
    producer_config:
      message.max.bytes: 1000000000

It's unclear what the prefix should be. swh.storage uses swh.journal.objects; we can either use that one too, or a new one, e.g. swh.journal.indexed

Event Timeline

vlorentz triaged this task as Normal priority. Mon, Nov 16, 1:31 PM
vlorentz created this task.
olasd added a subscriber: olasd. Thu, Nov 26, 5:53 PM

Is this supposed to be persistent (and keep the full history of all messages), or transient (and used for "real-time" clients)? IOW, what are the storage requirements for this?

Where is the list of topics that need to be created?

I think we should definitely use a different prefix from swh.storage, as the ACLs for third parties should be separate.

It's unclear what the prefix should be. swh.storage uses swh.journal.objects; we can either use that one too, or a new one, e.g. swh.journal.indexed

I think we should definitely use a different prefix from swh.storage, as the ACLs for third parties should be separate.

So, heads up: the topic prefix swh.journal.indexed has been chosen and declared in the current staging diff, D4620.

Where is the list of topics that need to be created?

I'd say in swh/indexer/storage/__init__.py:

./__init__.py:        self.journal_writer.write_additions("content_mimetype", mimetypes)
./__init__.py:        self.journal_writer.write_additions("content_language", languages)
./__init__.py:        self.journal_writer.write_additions("content_ctags", ctags)
./__init__.py:        self.journal_writer.write_additions("content_fossology_license", licenses)
./__init__.py:        self.journal_writer.write_additions("content_metadata", metadata)
./__init__.py:        self.journal_writer.write_additions("revision_intrinsic_metadata", metadata)
./__init__.py:        self.journal_writer.write_additions("origin_intrinsic_metadata", metadata)
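
For reference, a minimal sketch of the topic list this would give with the swh.journal.indexed prefix. It assumes each topic is named "<prefix>.<object_type>", which is not confirmed anywhere in this task, and topic_names is a hypothetical helper, not part of swh.indexer:

```python
# Object types passed to journal_writer.write_additions() in the grep output above.
INDEXER_OBJECT_TYPES = [
    "content_mimetype",
    "content_language",
    "content_ctags",
    "content_fossology_license",
    "content_metadata",
    "revision_intrinsic_metadata",
    "origin_intrinsic_metadata",
]


def topic_names(prefix: str, object_types: list[str]) -> list[str]:
    """Build one topic name per object type.

    Assumption: topics are named "<prefix>.<object_type>".
    """
    return [f"{prefix}.{object_type}" for object_type in object_types]


topics = topic_names("swh.journal.indexed", INDEXER_OBJECT_TYPES)
for topic in topics:
    print(topic)
```

If the naming assumption holds, that is seven topics to create, one per write_additions() call.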

In T2780#53415, @olasd wrote:

Is this supposed to be persistent (and keep the full history of all messages), or transient (and used for "real-time" clients)? IOW, what are the storage requirements for this?

I'd say transient, as we can always recompute it. But this means backfilling the journal every time we add a new client that needs to get all the messages, so I don't know.

Where is the list of topics that need to be created?

Answered by @ardumont

I think we should definitely use a different prefix from swh.storage, as the ACLs for third parties should be separate.

Agreed

I propose meeting in the middle and having the following policies:

  • content topics: transient, bound by volume
  • revision / origin topics: persistent

I expect the content topics to be the most "volatile" and heavy, and the revision / origin topics to be the most useful to keep in the long term for third party clients.

Does that make sense?
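
A rough sketch of how the two policies could translate into Kafka topic-level settings. The setting names (cleanup.policy, retention.ms, retention.bytes) are standard Kafka topic configs, but the size cap and the retention_config helper are placeholders made up for illustration; nothing here has been agreed on or deployed:

```python
from typing import Dict

# Hypothetical size cap for the transient topics; this task gives no figure.
# Kafka applies retention.bytes per partition, not per topic.
TRANSIENT_RETENTION_BYTES = 100 * 1024**3


def retention_config(object_type: str) -> Dict[str, str]:
    """Return Kafka topic-level settings for one indexer object type."""
    if object_type.startswith("content_"):
        # Transient, bound by volume: no time limit, but old log segments
        # are deleted once a partition grows past the size cap.
        return {
            "cleanup.policy": "delete",
            "retention.ms": "-1",
            "retention.bytes": str(TRANSIENT_RETENTION_BYTES),
        }
    if object_type.startswith(("revision_", "origin_")):
        # Persistent: -1 disables both time-based and size-based deletion.
        return {
            "cleanup.policy": "delete",
            "retention.ms": "-1",
            "retention.bytes": "-1",
        }
    raise ValueError(f"no retention policy defined for {object_type!r}")


for name in ("content_mimetype", "origin_intrinsic_metadata"):
    print(name, retention_config(name))
```

Splitting on the object type prefix keeps the policy in one place, so a new content_* or origin_* topic would pick up the matching retention without further changes.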