Page MenuHomeSoftware Heritage

Add support for a priviledged "channel" of topics for non-anonymized objects
ClosedPublic

Authored by douardda on May 20 2020, 11:21 AM.

Details

Summary

The idea is to publish on a "public" prefix anonymized objects and have a
priviledged prefix for non-anonymized ones. Every model object which
".anonymize()" method returns a(n anonymized) model object will be sent
to both channels (anonymized version on the regular channel, and
original version on the privileged channel).

Currently only Release and Revision objects are anonymizable.

For anonymizable objects:

  • regular (non-anonymized) objects are sent to the topic

    "{prefix}_privileged.{object_type}"
  • anonymized version of these objects are sent to

    "{prefix}.{object_type}"

On the client side, a new boolean "privileged" config parameter is used to
select privileged topics if vivible on the Kafka broker at subscrition time.

Replaces D3160.

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D3172 (id=11260)

Rebasing onto 89cd8f7bea...

Current branch diff-target is up to date.
Changes applied before test
commit 12f224e11a00827db38f9328b022ccf565416bb8
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 18 11:35:22 2020 +0200

    Add support for a priviledged "channel" of topics for non-anonymized objects
    
    The idea is to publish on a "public" prefix anonymized objects and have a
    priviledged prefix for non-anonymized ones. Every model object which
    ".anonymize()" method returns a(n anonymized) model object will be sent
    to both channels (anonymized version on the regular channel, and
    original version on the privileged channel).
    
    Currently only Release and Revision objects are anonymizable.
    
    For anonymizable objects:
    
    - regular (non-anonymized) objects are sent to the topic
    
      "{prefix}_privileged.{object_type}"
    
    - anonymized version of these objects are sent to
    
      "{prefix}.{object_type}"
    
    On the client side, a new boolean "privileged" config parameter is used to
    select privileged topics if vivible on the Kafka broker at subscrition time.

Link to build: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/73/
See console output for more information: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/73/console

ardumont added inline comments.
swh/journal/pytest_plugin.py
51

why not fstring for those changes?
It's nicer to the eyes plus you already used it before in that very diff.

swh/journal/tests/test_client.py
248

Maybe mentions that this tests only the client subscription to anonymized topics (and respectively priviledged topic for the next one).

maybe make that loop over the priviledged set of object types (revision, release)?

swh/journal/tests/test_client.py
327

curious me, what's expected to happen if we require priviledged=True to something other than revision and release?

Shouldn't it be tested as well?

swh/journal/tests/test_kafka_writer.py
45

couldn't our strategy generate those once in a while nonetheless (which would result in a false negative test then)?

Looks good.

I got questions in the diff, nothing blocking.

For my last remark though, i think there is an issue, i don't think you are
sending the stock object when the anonymization is on. Thus required changed to
ensure that.

swh/journal/writer/kafka.py
221

mmm if you keep that reference, here, i fear you are sending the anonymized object to the priviledged topic...

you should remove that last sentence.

This revision now requires changes to proceed.May 20 2020, 12:06 PM
swh/journal/tests/test_client.py
248

Will do (not sure yet which fix, either rename the test or consume all the topics).

327

The behavior of the client (which should be clear according to commit messages and docstrings, but...) is that if this config flag is set, it will subscribe to existing privileged topics rather than regular ones (if both privileged and regular exists in the advertised topics list), but will apply this logic for object_types one by one.

So for other object types than anonymizable ones, it will subscribe to regular topics.

swh/journal/tests/test_kafka_writer.py
45

Yes, I had this in mind and wanted to modify the strategy accordingly, but then I forgot :-)

Thanks

swh/journal/writer/kafka.py
221

I don't understand your point here.

swh/journal/writer/kafka.py
221

I meant instruction not sentence sorry.
Explicitely, i meant, object_ = anon_object_ should go away.

my understanding is that:

  • when anon_objects exists, we send this to the anonymized topic
  • we send nonetheless to the priviledge topic the full original object (what you called the stock object).

That last instruction overrode the stock object reference to be the anonymized
object. so the line below (225 in the current diff state).

dict_ = self._sanitize_object(object_type, object_)

sends the anonymized object to the priviledged topic instead of the stock
object. (which contradicts the diff's description [1]

Or am i misunderstanding something?

[1] I do read the descriptions when reviewing, it just takes some time to sink
in sometimes

swh/journal/tests/test_kafka_writer.py
45

Updated the hypothesis strategy in D3171.

swh/journal/writer/kafka.py
221

[1] I do read the descriptions when reviewing, it just takes some time to sink in sometimes

no kidding :-)

221

my understanding is that:

when anon_objects exists, we send this to the anonymized topic

 we send nonetheless to the priviledge topic the full original object (what you called the stock object).

Correct, with "anonymized topic" being the "standard one". So in this code, the non-anonymized version of say a Release will be first sent to the privileged topic. Then, thanks to this object_ = anon_object_, the remaining of the code, which is executed unconditionally, will send the anonymized object to the main channel.

So this anonymized version will be sent to the main channel if and only if the privileged one has been used for the stock version of the object.

That last instruction overrode the stock object reference to be the anonymized
object. so the line below (225 in the current diff state).

dict_ = self._sanitize_object(object_type, object_)

sends the anonymized object to the priviledged topic instead of the stock
object.

Nope, as explained above, it will send the anonymized object to the main topic, not the privileged one (this has already been dealt in the if block.)

swh/journal/writer/kafka.py
221

oh yeah, totally.

I misread the topic = f"{self._prefix_privileged}.{object_type}" at the start of the conditional!

sorry for the noise then.

This revision is now accepted and ready to land.May 20 2020, 3:08 PM

typos, rename a couple of tests, and better comments

Build has FAILED

Patch application report for D3172 (id=11269)

Rebasing onto 89cd8f7bea...

Current branch diff-target is up to date.
Changes applied before test
commit cc4af0e0baece37aa2b0b9054e3c647ffa4b84b5
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 18 11:35:22 2020 +0200

    Add support for a priviledged "channel" of topics for non-anonymized objects
    
    The idea is to publish on a "public" prefix anonymized objects and have a
    priviledged prefix for non-anonymized ones. Every model object which
    ".anonymize()" method returns a(n anonymized) model object will be sent
    to both channels (anonymized version on the regular channel, and
    original version on the privileged channel).
    
    Currently only Release and Revision objects are anonymizable.
    
    For anonymizable objects:
    
    - regular (non-anonymized) objects are sent to the topic
    
      "{prefix}_privileged.{object_type}"
    
    - anonymized version of these objects are sent to
    
      "{prefix}.{object_type}"
    
    On the client side, a new boolean "privileged" config parameter is used to
    select privileged topics if vivible on the Kafka broker at subscrition time.

Link to build: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/74/
See console output for more information: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/74/console

bump dep on swh-model to 0.2

Build is green

Patch application report for D3172 (id=11281)

Rebasing onto 89cd8f7bea...

Current branch diff-target is up to date.
Changes applied before test
commit 5210495975af9a7a0320d8f6f60986a3a495d427
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 18 11:35:22 2020 +0200

    Add support for a priviledged "channel" of topics for non-anonymized objects
    
    The idea is to publish on a "public" prefix anonymized objects and have a
    priviledged prefix for non-anonymized ones. Every model object which
    ".anonymize()" method returns a(n anonymized) model object will be sent
    to both channels (anonymized version on the regular channel, and
    original version on the privileged channel).
    
    Currently only Release and Revision objects are anonymizable.
    
    For anonymizable objects:
    
    - regular (non-anonymized) objects are sent to the topic
    
      "{prefix}_privileged.{object_type}"
    
    - anonymized version of these objects are sent to
    
      "{prefix}.{object_type}"
    
    On the client side, a new boolean "privileged" config parameter is used to
    select privileged topics if vivible on the Kafka broker at subscrition time.

See https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/75/ for more details.

Build is green

Patch application report for D3172 (id=11282)

Rebasing onto 89cd8f7bea...

Current branch diff-target is up to date.
Changes applied before test
commit 3cac9cffe243b0893f408606005c464f24e84412
Author: David Douard <david.douard@sdfa3.org>
Date:   Mon May 18 11:35:22 2020 +0200

    Add support for a priviledged "channel" of topics for non-anonymized objects
    
    The idea is to publish on a "public" prefix anonymized objects and have a
    priviledged prefix for non-anonymized ones. Every model object which
    ".anonymize()" method returns a(n anonymized) model object will be sent
    to both channels (anonymized version on the regular channel, and
    original version on the privileged channel).
    
    Currently only Release and Revision objects are anonymizable.
    
    For anonymizable objects:
    
    - regular (non-anonymized) objects are sent to the topic
    
      "{prefix}_privileged.{object_type}"
    
    - anonymized version of these objects are sent to
    
      "{prefix}.{object_type}"
    
    On the client side, a new boolean "privileged" config parameter is used to
    select privileged topics if vivible on the Kafka broker at subscrition time.

See https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/76/ for more details.