
Add support for anonymized journal topics
Abandoned · Public

Authored by douardda on May 18 2020, 1:53 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

This is another approach for implementing anonymized topics. Together with its counterpart in swh-journal (D3160), it replaces D3149 and D3150.

This uses the new `privileged` argument of the
`KafkaJournalWriter.write_addition(s)` methods.

Namely, for anonymizable objects (Revision and Release), this will fill the
following topics with unmodified objects:

  • `{kafka_prefix}_privileged.release` and
  • `{kafka_prefix}_privileged.revision`

whereas the regular topics will be filled with anonymized versions of these
objects.
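
The routing described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from this diff: `ANONYMIZABLE_TYPES`, `topic_for` and `route` are hypothetical names, objects are plain dicts, the `anonymize` callable is supplied by the caller, and the regular topic name `{kafka_prefix}.{object_type}` is assumed (only the `_privileged` form is given above).

```python
from typing import Any, Callable, Dict, List, Tuple

# Object types that embed Person data, per the summary above.
ANONYMIZABLE_TYPES = {"release", "revision"}


def topic_for(kafka_prefix: str, object_type: str, privileged: bool = False) -> str:
    """Build a topic name. The non-privileged format is an assumption."""
    prefix = f"{kafka_prefix}_privileged" if privileged else kafka_prefix
    return f"{prefix}.{object_type}"


def route(
    kafka_prefix: str,
    object_type: str,
    obj: Dict[str, Any],
    anonymize: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> List[Tuple[str, Dict[str, Any]]]:
    """Return the (topic, payload) pairs that one object should produce."""
    if object_type not in ANONYMIZABLE_TYPES:
        # Nothing to hide: a single message on the regular topic.
        return [(topic_for(kafka_prefix, object_type), obj)]
    return [
        # Unmodified object goes to the privileged topic...
        (topic_for(kafka_prefix, object_type, privileged=True), obj),
        # ...and its anonymized version goes to the regular topic.
        (topic_for(kafka_prefix, object_type), anonymize(obj)),
    ]
```

With a prefix of `swh`, a revision would yield one message on `swh_privileged.revision` and one on `swh.revision`, while any other object type would yield a single message on its regular topic.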

The anonymization process simply consists of building a new Person whose
`fullname` is a hash of the triplet (fullname, name, email) of the
original Person found in Release and Revision entities.
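
A minimal sketch of that anonymization step, under assumptions the summary does not settle: the `Person` class below is a stand-in and not the swh.model class, SHA-256 is an assumed hash function, the length-prefixed encoding of the triplet is one arbitrary choice, and clearing `name` and `email` on the result is assumed.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    """Hypothetical stand-in for the model's Person entity."""

    fullname: bytes
    name: Optional[bytes]
    email: Optional[bytes]


def anonymize_person(person: Person) -> Person:
    """Return a Person whose fullname is a hash of (fullname, name, email)."""
    h = hashlib.sha256()  # assumed hash function
    for field in (person.fullname, person.name, person.email):
        data = field or b""  # a missing field is treated as empty
        # Length-prefix each field so that distinct triplets cannot
        # produce the same byte stream by shifting field boundaries.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    # Assumption: name and email are dropped from the anonymized Person.
    return Person(fullname=h.digest(), name=None, email=None)
```

Because the hash is deterministic, the same original Person always maps to the same anonymized Person, so objects by the same author can still be grouped in the anonymized topics.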

The replayer process can therefore be used as is; it just must not replay
both the standard and the anonymized topics at once.

Diff Detail

Repository
rDSTO Storage manager
Branch
anonymized_topics_2
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 12410
Build 18829: Phabricator diff pipeline on jenkins
Build 18828: arc lint + arc unit

Unit Tests: Failed

Time      Test
5,052 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_kafka_writer::test_storage_direct_writer
          kafka_prefix = 'gnevdbzliv', kafka_server = '127.0.0.1:46949' consumer = <cimpl.Consumer object at 0x7f165a715620>
1,049 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_kafka_writer::test_storage_direct_writer_anonymized
          kafka_prefix = 'zxqtrkdvpr', kafka_server = '127.0.0.1:46949' consumer = <cimpl.Consumer object at 0x7f165a715950>
2,052 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_replay::test_storage_play_anonymized
          kafka_prefix = 'dsjtwgsdpn', kafka_consumer_group = 'test-consumer-dsjtwgsdpn' kafka_server = '127.0.0.1:46949' caplog = <_pytest.logging.LogCaptureFixture object at 0x7f165a5a39e8>
4 ms      Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.fixer::swh.storage.fixer._fix_content
2 ms      Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.fixer::swh.storage.fixer._fix_origin
View Full Test Results (3 Failed · 914 Passed · 24 Skipped)

Event Timeline

Build has FAILED

Patch application report for D3161 (id=11223)

Rebasing onto 87f7bee693...

Current branch diff-target is up to date.
Changes applied before test
commit 406ee2c212544c9b64c1ecbeb1a0ed84ec0b2727
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu May 7 15:33:35 2020 +0200

    Add support for anonymized journal topics
    
    This uses the new ``privileged`` argument of the
    KakfaJournalWriter.write_addition(s) methods.
    
    Namely, for anonymizable objects (Revision and Released), this will fill the
    following topics with unmodified objects:
    
    - ``{kafka_prefix}_privileged.release`` and
    - ``{kafka_prefix}_privileged.revision``
    
    whereas the regular topics will be filled with anonymized versions of these
    objects.
    
    The anonymization process consists simply in forging a Person with the
    ``fullname`` being a hash of the triplet (fullname, name, email) of the
    original Person in Release and Revision entities.
    
    So the replayer process can be used as is (just have to not replay both
    standard and anonymized topics at once).

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/175/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/175/console

Replaced by D3171 + D3172

Some tests in this diff may be worth keeping.