
Publish origin_intrinsic_metadata to Kafka.
Needs Review · Public

Authored by vlorentz on Wed, Sep 4, 4:29 PM.

Details

Reviewers
olasd
ardumont
zack
Group Reviewers
Reviewers
Summary

So swh-search can use a Kafka client to fill its DB with metadata.

Controversial points:

  • It adds a new Kafka topic for objects that are not part of the data model.
  • Only this idx-storage endpoint writes to Kafka (I don't think we need the other ones), which introduces an inconsistency.

Depends on D1958
^ implementation detail (some tests are currently failing in this diff because of data validation issues, which are being fixed in D1958)
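To illustrate the write path under discussion, here is a minimal sketch: the idx-storage endpoint stores results and also publishes them to a journal writer. The `InMemoryJournalWriter` class, the method names, and the message shape are illustrative assumptions, not the actual swh-indexer API.

```python
# Hypothetical sketch of the write path in this diff: one idx-storage
# endpoint both stores rows and publishes them to the journal.
from collections import defaultdict


class InMemoryJournalWriter:
    """Stand-in for a Kafka-backed journal writer, keyed by object type."""

    def __init__(self):
        self.written = defaultdict(list)

    def write_addition(self, object_type, obj):
        self.written[object_type].append(obj)


class IndexerStorage:
    """Toy idx-storage; only this endpoint publishes to the journal."""

    def __init__(self, journal_writer):
        self.journal_writer = journal_writer
        self.rows = []

    def origin_intrinsic_metadata_add(self, metadata_rows):
        for row in metadata_rows:
            self.rows.append(row)
            self.journal_writer.write_addition(
                "origin_intrinsic_metadata", row
            )


writer = InMemoryJournalWriter()
storage = IndexerStorage(writer)
storage.origin_intrinsic_metadata_add(
    [{"id": "https://example.org/repo", "metadata": {"name": "repo"}}]
)
print(len(writer.written["origin_intrinsic_metadata"]))  # 1
```

A reader of the metadata topic (such as swh-search) would then see every addition in order, without polling the idx-storage database.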

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
int-metadata-journal
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 7668
Build 11004: tox-on-jenkins (Jenkins)
Build 11003: arc lint + arc unit

Event Timeline

vlorentz created this revision.Wed, Sep 4, 4:29 PM
olasd added a comment.Wed, Sep 4, 6:18 PM

I think this approach is perfectly fine.

We can just configure this topic to be under another hierarchy (e.g. swh.journal.metadata. instead of swh.journal.objects.) so it doesn't interfere with the mirroring infrastructure.

The only issue with that idea is that it would force us to use two journal clients if we want to process both messages about objects and messages about metadata. I believe this is something we can do anyway, as we might end up separating the Kafka clusters between objects and metadata in any case.
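The topic layout suggested above can be sketched as follows; the exact prefixes and object-type names are assumptions for illustration, not the actual swh.journal configuration.

```python
# Sketch of the two-hierarchy topic layout: data-model objects under
# swh.journal.objects., derived metadata under swh.journal.metadata.,
# so mirroring (which follows the objects prefix) is unaffected.

def topics_for(prefix, object_types):
    """Build fully qualified topic names from a prefix and object types."""
    return [f"{prefix}.{t}" for t in object_types]


# A mirror's journal client follows only the data-model hierarchy...
mirror_topics = topics_for("swh.journal.objects", ["origin", "revision"])

# ...while swh-search would run a second journal client on the
# metadata hierarchy:
search_topics = topics_for(
    "swh.journal.metadata", ["origin_intrinsic_metadata"]
)

print(mirror_topics)
print(search_topics)
```

With this split, a consumer that needs both kinds of messages simply subscribes to topics from both prefixes (the "two journal clients" mentioned above).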

In D1959#45436, @olasd wrote:

I think this approach is perfectly fine.
We can just configure this topic to be under another hierarchy (e.g. swh.journal.metadata. instead of swh.journal.objects.) so it doesn't interfere with the mirroring infrastructure.

Agreed

The only issue with that idea is that it would force us to use two journal clients

Which I do not see as an issue at all...

olasd added inline comments.Tue, Sep 10, 2:14 PM
swh/indexer/storage/__init__.py
749

Do we really want to leak this id? Should we add a cache to avoid querying that data all the time?

vlorentz added inline comments.Tue, Sep 10, 2:19 PM
swh/indexer/storage/__init__.py
749

I don't know. I'm not even sure publishing the tool to Kafka is relevant.

vlorentz added inline comments.Tue, Sep 10, 2:22 PM
swh/indexer/storage/__init__.py
749

But it shouldn't hurt to publish too much and remove fields later if we don't use them.
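The cache suggested in the inline comment above could be as simple as memoizing the tool lookup, so the row is not re-queried for every published message. This is a sketch; `fetch_tool`, the tool row shape, and the tool name are illustrative assumptions.

```python
# Sketch of caching the tool lookup with functools.lru_cache, so the
# same tool row is fetched from the database only once.
from functools import lru_cache

calls = {"count": 0}


def fetch_tool(tool_id: int) -> tuple:
    """Stand-in for the database query; returns an immutable row
    (a mutable dict would let callers corrupt the cached value)."""
    calls["count"] += 1
    return (tool_id, "swh-metadata-detector", "0.0.2")


@lru_cache(maxsize=128)
def get_tool(tool_id: int) -> tuple:
    return fetch_tool(tool_id)


get_tool(7)
get_tool(7)
print(calls["count"])  # 1 -- the second lookup is served from the cache
```

One caveat with a process-level cache like this: if a tool row can ever change, the cache needs explicit invalidation (`get_tool.cache_clear()`), so it is safest when tool rows are immutable.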

zack added a comment.Thu, Sep 12, 10:07 AM

Controversial points:
It adds a new Kafka topic, of objects that are not part of the data model

FWIW, fine with me too. I don't see this one as controversial at all. We want to have the full archive (the graph) in Kafka to be able to replay its creation, but I see no particular reason for not *also* having additional stuff in it. As long as it is clear what's permanent record and what's ephemeral/derived data (e.g., in the doc), this is perfectly fine.

ardumont added inline comments.Thu, Sep 12, 10:13 AM
swh/indexer/storage/__init__.py
749

I'm not even sure publishing the tool to kafka is relevant

Factually, that's the tool which computed the result, so it should be part of the message.

Technically, in the Postgres model, the tool id is part of the primary key, which ensures the uniqueness of indexer data per indexer (well, it was like that initially and I don't think we diverged from that).
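Following that point, if (origin id, tool id) is the logical primary key, the published message could carry the tool alongside the result, and the Kafka message key could be derived from both so that re-indexations by the same tool land on the same partition. This is a sketch only; the field names and key scheme are assumptions, not the actual swh.journal serialization.

```python
# Sketch: derive a stable Kafka message key from the logical primary
# key (origin URL, tool id) discussed above.
import hashlib


def message_key(origin_url: str, tool_id: int) -> bytes:
    """Hash the (origin, tool) pair into a fixed-size key; the NUL
    separator avoids ambiguity between the two fields."""
    raw = f"{origin_url}\x00{tool_id}".encode()
    return hashlib.sha1(raw).digest()


message = {
    "id": "https://example.org/repo",
    "tool": {"id": 7, "name": "swh-metadata-detector"},
    "metadata": {"name": "repo"},
}
key = message_key(message["id"], message["tool"]["id"])
print(len(key))  # 20 -- a SHA-1 digest is 20 bytes
```

Keying on the pair means a new result from the same (origin, tool) overwrites its predecessor under log compaction, matching the Postgres uniqueness constraint.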