
Add support for anonymized journal topics
Abandoned · Public

Authored by douardda on Wed, May 13, 12:23 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

This adds two new topics (when enabled in the config):

  • `{kafka_prefix}.release:anonymized`
  • `{kafka_prefix}.revision:anonymized`

These topics are filled alongside their original, non-anonymized counterparts.
Anonymization simply replaces each Person in Release and Revision entities
with a forged Person whose `fullname` is a hash of the original triplet
(fullname, name, email).
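
To make the anonymization step concrete, here is a minimal sketch of forging such a Person. It is not the code from this diff: the `Person` dataclass is a hypothetical stand-in for the model class, SHA-256 is an assumption (the summary does not name the hash), and setting `name` and `email` to `None` on the forged Person is also an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    """Hypothetical stand-in for the model's Person (byte fields assumed)."""

    fullname: bytes
    name: Optional[bytes]
    email: Optional[bytes]


def anonymize_person(person: Person) -> Person:
    """Return a forged Person whose fullname is a digest of the original triplet."""
    digest = hashlib.sha256()  # assumption: the hash function is not given in the summary
    for field in (person.fullname, person.name, person.email):
        if field is None:
            # Encode a missing field differently from an empty one.
            digest.update(b"\xff")
        else:
            # Length-prefix each field so (b"ab", b"c") and (b"a", b"bc") differ.
            digest.update(b"\x00" + len(field).to_bytes(8, "big") + field)
    # Assumption: the forged Person carries no name or email at all.
    return Person(fullname=digest.digest(), name=None, email=None)
```

Because the digest is deterministic, the same original author always maps to the same forged Person, so authorship can still be grouped in the anonymized topics without exposing the name or email.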

The replayer process can therefore be used as is, as long as the standard and
anonymized topics are not replayed at the same time.
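
The replayer constraint can be illustrated with a short sketch. The commit this diff builds on describes ignoring anything after a ":" in the object_type part of the topic; the helper names below (`parse_topic`, `check_no_mixed_topics`) are hypothetical and only show one way such parsing and a guard against mixing topics might look.

```python
from typing import Dict, Iterable, Set, Tuple


def parse_topic(topic: str, kafka_prefix: str) -> Tuple[str, Set[str]]:
    """Split '{kafka_prefix}.{object_type}[:flag...]' into (object_type, flags)."""
    prefix = kafka_prefix + "."
    if not topic.startswith(prefix):
        raise ValueError(f"topic {topic!r} does not start with {prefix!r}")
    # Everything after the first ':' is treated as flags, not as the object type.
    object_type, *flags = topic[len(prefix):].split(":")
    return object_type, set(flags)


def check_no_mixed_topics(topics: Iterable[str], kafka_prefix: str) -> None:
    """Raise if an object type is subscribed in both standard and anonymized form."""
    seen: Dict[str, bool] = {}
    for topic in topics:
        object_type, flags = parse_topic(topic, kafka_prefix)
        anonymized = "anonymized" in flags
        if seen.setdefault(object_type, anonymized) != anonymized:
            raise ValueError(
                f"both standard and anonymized topics requested for {object_type!r}"
            )
```

With this shape, `release` and `release:anonymized` resolve to the same object type, which is why replaying both at once would write the same entities twice with conflicting Person values.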

Depends on D3140.

Diff Detail

Repository
rDSTO Storage manager
Branch
skipped_content
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 12371
Build 18765: Phabricator diff pipeline on jenkins (Jenkins console · Jenkins)
Build 18764: arc lint + arc unit

Unit Tests: Failed

Time      Test
4,080 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_kafka_writer::test_storage_direct_writer
          kafka_prefix = 'yclriuzlwn', kafka_server = '127.0.0.1:49293' consumer = <cimpl.Consumer object at 0x7f8972852598>
2,006 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_kafka_writer::test_storage_direct_writer_anonymized
          kafka_prefix = 'rrnxafkqlg', kafka_server = '127.0.0.1:49293' consumer = <cimpl.Consumer object at 0x7f8972852400>
3,005 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_replay::test_storage_play_anonymized
          kafka_prefix = 'ldilvdqmiv', kafka_consumer_group = 'test-consumer-ldilvdqmiv' kafka_server = '127.0.0.1:49293' caplog = <_pytest.logging.LogCaptureFixture object at 0x7f89727dada0>
6 ms      Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.fixer::swh.storage.fixer._fix_content
2 ms      Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.fixer::swh.storage.fixer._fix_origin

Full test results: 3 Failed · 914 Passed · 24 Skipped

Event Timeline

douardda created this revision. Wed, May 13, 12:23 PM

Build has FAILED

Patch application report for D3150 (id=11179)

Could not rebase; Attempt merge onto 306aa69b26...

Updating 306aa69..be5c939
Fast-forward
 swh/storage/replay.py                  |   9 +-
 swh/storage/tests/test_kafka_writer.py | 105 ++++++++++++++++++++++--
 swh/storage/tests/test_replay.py       | 145 ++++++++++++++++++++++++++++++++-
 swh/storage/writer.py                  |  70 ++++++++++------
 4 files changed, 292 insertions(+), 37 deletions(-)
Changes applied before test
commit be5c93901012700455776fcfcc7dc7c616d06744
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu May 7 15:33:35 2020 +0200

    Add support for anonymized journal topics
    
    This adds 2 new topics (if activated by config), namely
    - ``{kafka_prefix}.release:anonymized`` and
    - ``{kafka_prefix}.revision:anonymized``
    
    These topics are filled aside from their original non-anonymized version.
    The anonymization process consists simply in forging a Person with the
    ``fullename`` being a hash of the triplet (fullname, name, email) of the
    original Person in Release and Revision entities.
    
    So the replayer process can be used as is (just have to not replay both
    standard and anonymized topics at once).

commit e0ad4f46a50bb45ceb679d176068f5e478624163
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 13 11:55:44 2020 +0200

    replay: add support for "flags" in topics
    
    simply ignore anything after a ":" in the object_type part of the topic.

commit 87f7bee6935b831a3c300cadd9afb5ce890f2292
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Apr 7 16:20:35 2020 +0200

    journal: add a skipped_content topic dedicated to SkippedContent objects
    
    instead of mixing them with Content in the content topic.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/169/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/169/console

ardumont edited the summary of this revision. Wed, May 13, 3:33 PM

Looks promising ;)

One typo to fix in the commit message: `fullename` instead of `fullname` (I
fixed the diff description).

Remaining comments in the diff.

swh/storage/tests/test_replay.py, line 527:

no need for the while

swh/storage/writer.py, line 23:

recursively

douardda updated this revision to Diff 11189. Wed, May 13, 4:21 PM

typos (thx ardumont)

Build was aborted

Patch application report for D3150 (id=11189)

Could not rebase; Attempt merge onto 306aa69b26...

Updating 306aa69..13351cd
Fast-forward
 swh/storage/replay.py                  |   9 ++-
 swh/storage/tests/test_kafka_writer.py | 104 ++++++++++++++++++++++--
 swh/storage/tests/test_replay.py       | 143 ++++++++++++++++++++++++++++++++-
 swh/storage/writer.py                  |  70 ++++++++++------
 4 files changed, 289 insertions(+), 37 deletions(-)
Changes applied before test
commit 13351cdb37a87a3597b8de044db8fe30aac6a4e4
Author: David Douard <david.douard@sdfa3.org>
Date:   Thu May 7 15:33:35 2020 +0200

    Add support for anonymized journal topics
    
    This adds 2 new topics (if activated by config), namely
    - ``{kafka_prefix}.release:anonymized`` and
    - ``{kafka_prefix}.revision:anonymized``
    
    These topics are filled aside from their original non-anonymized version.
    The anonymization process consists simply in forging a Person with the
    ``fullname`` being a hash of the triplet (fullname, name, email) of the
    original Person in Release and Revision entities.
    
    So the replayer process can be used as is (just have to not replay both
    standard and anonymized topics at once).

commit e0ad4f46a50bb45ceb679d176068f5e478624163
Author: David Douard <david.douard@sdfa3.org>
Date:   Wed May 13 11:55:44 2020 +0200

    replay: add support for "flags" in topics
    
    simply ignore anything after a ":" in the object_type part of the topic.

commit 87f7bee6935b831a3c300cadd9afb5ce890f2292
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue Apr 7 16:20:35 2020 +0200

    journal: add a skipped_content topic dedicated to SkippedContent objects
    
    instead of mixing them with Content in the content topic.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/170/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/170/console

Build has FAILED

Patch application report for D3150 (id=11189)

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/171/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/171/console

douardda abandoned this revision. Wed, May 20, 11:16 AM