Page MenuHomeSoftware Heritage

Add support for "flags" in object_type used for kafka topics
AbandonedPublic

Authored by douardda on May 13 2020, 11:40 AM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

Kafka topics are currentlt built as `{kafka_prefix}.{object_type}`. This
is expecially used in the `swh.journal.serializers.object_key()` function.
This revision makes this function ignore everything past a ":" in the
object_type argument, so one can serialize say `Revision` objects in
a "revision:anonymized" topic.

Depends on D3148.

Event Timeline

Build is green

Patch application report for D3149 (id=11177)

Could not rebase; Attempt merge onto 25339889f8...

Updating 2533988..769226c
Fast-forward
 swh/journal/serializers.py            |  1 +
 swh/journal/tests/journal_data.py     | 30 +++++++++++++++++++++++++++---
 swh/journal/tests/test_serializers.py | 12 +++++++++++-
 swh/journal/writer/kafka.py           | 23 -----------------------
 4 files changed, 39 insertions(+), 27 deletions(-)
Changes applied before test
commit 769226cf37703a25f69d475e282e0f0c9e79d371
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 12 15:49:16 2020 +0200

    Add support for "flags" in object_type used for kafka topics
    
    Kafka topics are currentlt built as ``{kafka_prefix}.{object_type}``. This
    is expecially used in the ``swh.journal.serializers.object_key()`` function.
    This revision makes this function ignore everything past a ":" in the
    object_type argument, so one can serialize say ``Revision`` objects in
    a "revision:anonymized" topic.

commit 89cd8f7bea879b989ee0609fb93923a8578cd132
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 12 16:06:04 2020 +0200

    Fix test data: only create aware datetime objects

commit bad6e3d67d500ceffeabb308dcf4723d7a469729
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 12 15:47:54 2020 +0200

    Move OBJECT_TYPES global variable to tests/journal_data.py
    
    since it's the only user of this variable left.

See https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/70/ for more details.

Did we already decide anonymized topics should use this name scheme?

ardumont added a subscriber: ardumont.

I'm ok, but for one question.

Shouldn't this also accept no : in the object type (the prior behavior if i understood correctly)?

This revision now requires changes to proceed.May 13 2020, 11:51 AM

Did we already decide anonymized topics should use this name scheme?

Nope, this is a base for discussion :-)

Did we already decide anonymized topics should use this name scheme?

Nope, this is a base for discussion :-)

A particular point we need to focus on for this is ACLs capabilities in Kafka so we can apply different authorization rules easily on those topics.

I'm ok, but for one question.

Shouldn't this also accept no : in the object type (the prior behavior if i understood correctly)?

Not sure yet what we need, want and NOT want... now and tomorrow. So well, I've made a first prototype of this (with D3150) so we can start discussing all this.

What about using a different prefix instead of adding a suffix, eg. swh.journal.anonymized_objects.?

It would require less special handling/configuration

What about using a different prefix instead of adding a suffix, eg. swh.journal.anonymized_objects.?

It would require less special handling/configuration

It might be also possible, but I'm not sure it would imply less special handling or config.

Since we do not really want to duplicate all the topics, it's not just a matter of subscribing to <prefix>_anonymized topics.

But indeed it would make it a bit easier since it this very diff would then be unnecessary, which is a nice property.

The idea here was that a user, with authorizations properly set in kafka's ACLs, would be able just use a basic client config, and the broker would then only list the topics he is allowed to subscribe to.

So an "anonymized user" would see all the regular topics and the anonymized versions of these 2 topics, whereas a "full-mirror user" would only see the regular topics by default. But I'm not sure this is possible /how ACL work in kafka.

ardumont accepted this revision.
This revision is now accepted and ready to land.May 13 2020, 4:13 PM
This revision now requires review to proceed.May 13 2020, 4:13 PM

What about using a different prefix instead of adding a suffix, eg. swh.journal.anonymized_objects.?

It would require less special handling/configuration

It might be also possible, but I'm not sure it would imply less special handling or config.

Since we do not really want to duplicate all the topics, it's not just a matter of subscribing to <prefix>_anonymized topics.

But indeed it would make it a bit easier since it this very diff would then be unnecessary, which is a nice property.

The idea here was that a user, with authorizations properly set in kafka's ACLs, would be able just use a basic client config, and the broker would then only list the topics he is allowed to subscribe to.

So an "anonymized user" would see all the regular topics and the anonymized versions of these 2 topics, whereas a "full-mirror user" would only see the regular topics by default. But I'm not sure this is possible /how ACL work in kafka.

Kafka ACLs can only match resources (in this case, topics) according to:

  • an exact match
  • a prefix match
  • or a wildcard ("any") match

So, while interesting in theory, the "tag" suffix idea doesn't provide a benefit over putting anonymous objects in a separate prefix: we'd still have to handle ACLs for all topics by hand.

I have another suggestion: switching the idea around, and making the main topics anonymous by default, and adding a set of "privileged", full-data topics.

This allows us to give _baseline_ access to the swh.journal.objects. prefix to all clients, and to have an additional, _explicit_ access list to swh.journal.objects_privileged for full-data clients. We could add a privileged flag in the journal client to transparently handle the overlay of both prefixes (this would subscribe, transparently, the client to the _privileged topics if available).

Doing things the other way around makes me a bit nervous.