
Publish origin_intrinsic_metadata to Kafka.
Abandoned, Public

Authored by vlorentz on Sep 4 2019, 4:29 PM.

Details

Summary

So swh-search can use a Kafka client to fill its DB with metadata.

Controversial points:

  • It adds a new Kafka topic, of objects that are not part of the data model
  • Only this idx-storage endpoint writes to Kafka (because I don't think we need the other ones) -> inconsistency

Depends on D1958
^ implementation detail (some tests are currently failing because of data validation issues, which are being fixed in D1958)
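
For readers who want a concrete picture of the change described in the summary, here is a minimal sketch, not the code of this diff. It assumes a hypothetical `InMemoryWriter` in place of a real Kafka producer, a hypothetical topic name, and made-up field names (`origin_url`, `tool_id`, `metadata`). Only the `origin_intrinsic_metadata_add` endpoint publishes, which mirrors the inconsistency listed under the controversial points.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


class InMemoryWriter:
    """Hypothetical stand-in for a Kafka producer; it only records messages."""

    def __init__(self) -> None:
        self.messages: List[Tuple[str, bytes, bytes]] = []

    def send(self, topic: str, key: bytes, value: bytes) -> None:
        self.messages.append((topic, key, value))


class IndexerStorageSketch:
    """Toy idx-storage in which a single endpoint writes to the journal."""

    # Assumed topic name; the actual prefix is a deployment choice.
    TOPIC = "swh.journal.metadata.origin_intrinsic_metadata"

    def __init__(self, writer: Optional[InMemoryWriter] = None) -> None:
        # The "database" is a dict keyed by (origin_url, tool_id).
        self.rows: Dict[Tuple[str, int], Dict[str, Any]] = {}
        self._writer = writer

    def origin_intrinsic_metadata_add(self, items: List[Dict[str, Any]]) -> None:
        for item in items:
            # Store first, then publish the same row to the journal.
            self.rows[(item["origin_url"], item["tool_id"])] = item
            if self._writer is not None:
                self._writer.send(
                    self.TOPIC,
                    key=item["origin_url"].encode(),
                    value=json.dumps(item, sort_keys=True).encode(),
                )
```

A consumer such as swh-search would then read the recorded messages from that topic instead of querying idx-storage directly.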

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
int-metadata-journal
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 7668
Build 11004: tox-on-jenkins (Jenkins)
Build 11003: arc lint + arc unit

Event Timeline

I think this approach is perfectly fine.

We can just configure this topic to be under another hierarchy (e.g. swh.journal.metadata. instead of swh.journal.objects.) so it doesn't interfere with the mirroring infrastructure.

The only issue with that idea is that it would force us to use two journal clients if we want to process messages about both objects and metadata. I believe this is something we can do, as we might end up separating the Kafka clusters between objects and metadata anyway.
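
To illustrate the two-hierarchy idea, here is a rough sketch under stated assumptions: `PrefixClient` is a hypothetical client bound to exactly one topic prefix (the real journal client may differ), and the message log is a plain list rather than a Kafka cluster. It shows why a consumer interested in both objects and metadata ends up with two clients, while a mirror that only knows the objects prefix never sees the metadata topic.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, dict]  # (topic, payload)

OBJECTS_PREFIX = "swh.journal.objects."
METADATA_PREFIX = "swh.journal.metadata."


class PrefixClient:
    """Hypothetical journal client bound to exactly one topic prefix."""

    def __init__(self, prefix: str, object_types: List[str]) -> None:
        self.topics = [prefix + object_type for object_type in object_types]

    def process(self, log: List[Message], worker: Callable[[str, dict], None]) -> int:
        """Hand every message from a subscribed topic to the worker."""
        handled = 0
        for topic, payload in log:
            if topic in self.topics:
                worker(topic, payload)
                handled += 1
        return handled


# Illustrative log; the payloads are made up for the example.
log: List[Message] = [
    (OBJECTS_PREFIX + "origin", {"url": "https://example.org/repo"}),
    (METADATA_PREFIX + "origin_intrinsic_metadata", {"origin_url": "https://example.org/repo"}),
]

seen: List[str] = []
objects_client = PrefixClient(OBJECTS_PREFIX, ["origin"])
metadata_client = PrefixClient(METADATA_PREFIX, ["origin_intrinsic_metadata"])

# Processing both kinds of messages takes one client per hierarchy.
for client in (objects_client, metadata_client):
    client.process(log, lambda topic, payload: seen.append(topic))
```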

In D1959#45436, @olasd wrote:

> I think this approach is perfectly fine.
>
> We can just configure this topic to be under another hierarchy (e.g. swh.journal.metadata. instead of swh.journal.objects.) so it doesn't interfere with the mirroring infrastructure.

Agreed

> The only issue with that idea is that it would force us to use two journal clients

Which I do not see as an issue at all...

swh/indexer/storage/__init__.py
749

Do we really want to leak this id? Should we add a cache to avoid querying that data all the time?
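
One possible shape for the cache mentioned above, as a sketch only: it assumes a hypothetical in-memory tool table (`_TOOL_ROWS`, with a made-up tool name and version) in place of the real database query, and it publishes the tool description rather than the internal id.

```python
from functools import lru_cache
from typing import Dict, NamedTuple, Optional


class Tool(NamedTuple):
    name: str
    version: str


# Hypothetical rows; the tool name and version are made up for the example.
_TOOL_ROWS: Dict[int, Tool] = {1: Tool("example-metadata-tool", "0.0.1")}


@lru_cache(maxsize=128)
def get_tool(tool_id: int) -> Optional[Tool]:
    """Resolve a tool id; the lookup below only runs on a cache miss."""
    return _TOOL_ROWS.get(tool_id)


def to_message(origin_url: str, metadata: dict, tool_id: int) -> dict:
    """Build a message that carries the tool description instead of its id."""
    tool = get_tool(tool_id)
    if tool is None:
        raise KeyError(f"unknown tool id {tool_id}")
    return {
        "origin_url": origin_url,
        "metadata": metadata,
        "tool": {"name": tool.name, "version": tool.version},
    }
```

With this shape, publishing many rows computed by the same tool costs one lookup, and the internal id never leaves the storage.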

swh/indexer/storage/__init__.py
749

I don't know. I'm not even sure publishing the tool to Kafka is relevant.

swh/indexer/storage/__init__.py
749

But it shouldn't hurt to publish too much and remove fields later if we don't use them.

> Controversial points:
>
> It adds a new Kafka topic, of objects that are not part of the data model

FWIW, fine with me too. I don't see this one as controversial at all. We want to have the full archive (the graph) in Kafka to be able to replay its creation, but I see no particular reason for not *also* having additional stuff in it. As long as it is clear what's permanent record and what's ephemeral/derived data (e.g., in the doc), this is perfectly fine.

swh/indexer/storage/__init__.py
749

> I'm not even sure publishing the tool to Kafka is relevant

Factually, that's the tool which computed the result so it should be part of the message.

Technically, in the Postgres model, the tool id is part of the primary key, which ensures uniqueness of indexer data per indexer (well, it was like that initially, and I don't think we diverged from that).
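
The primary-key argument can be illustrated with a small sketch. It uses SQLite from the Python standard library as a stand-in for Postgres, and the table is a simplified assumption rather than the actual idx-storage schema; only the composite key on origin plus tool id matters here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE origin_intrinsic_metadata (
        origin_url TEXT NOT NULL,
        indexer_configuration_id INTEGER NOT NULL,  -- the tool id
        metadata TEXT NOT NULL,
        PRIMARY KEY (origin_url, indexer_configuration_id)
    )
    """
)

insert = "INSERT INTO origin_intrinsic_metadata VALUES (?, ?, ?)"
origin = "https://example.org/repo"

# Same origin indexed by two different tools: both rows are accepted.
conn.execute(insert, (origin, 1, '{"name": "demo"}'))
conn.execute(insert, (origin, 2, '{"name": "demo"}'))

# Same origin and same tool again: the composite key rejects the row.
try:
    conn.execute(insert, (origin, 1, '{"name": "other"}'))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute(
    "SELECT COUNT(*) FROM origin_intrinsic_metadata"
).fetchone()[0]
```

A consumer that drops the tool from the message would lose the ability to tell these two rows apart.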