Add support for model object anonymization
Closed, Public

Authored by douardda on Wed, May 20, 11:20 AM.

Details

Summary

Simply add a BaseModel.anonymize() method. Default implementation returns
None, meaning the object is not anonymizable.

For Revision, Release and Person, the method returns an anonymized version of
the object.
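
The behaviour described above can be illustrated with a minimal sketch. It is not the code under review: it uses standard-library dataclasses so that it stays self-contained, the fields are reduced to the few needed here, and `Directory` is only a placeholder for "any non-anonymizable class". Following the later revisions of this diff, Person.anonymize() hashes the fullname and unsets name and email; the choice of SHA-256 is an assumption.

```python
import hashlib
from dataclasses import dataclass, replace
from typing import Optional


class BaseModel:
    def anonymize(self):
        """Return an anonymized copy, or None if the object is not anonymizable."""
        return None


@dataclass(frozen=True)
class Person(BaseModel):
    fullname: bytes
    name: Optional[bytes] = None
    email: Optional[bytes] = None

    def anonymize(self) -> "Person":
        # Assumption: the fullname is replaced by its SHA-256 digest;
        # name and email are display helpers and are simply unset.
        return Person(
            fullname=hashlib.sha256(self.fullname).digest(),
            name=None,
            email=None,
        )


@dataclass(frozen=True)
class Revision(BaseModel):
    # Reduced, illustrative field set.
    message: bytes
    author: Person
    committer: Person

    def anonymize(self) -> "Revision":
        # Same object, with every Person replaced by its anonymized version.
        return replace(
            self,
            author=self.author.anonymize(),
            committer=self.committer.anonymize(),
        )


@dataclass(frozen=True)
class Directory(BaseModel):
    # Placeholder class that keeps the default anonymize(), i.e. returns None.
    name: bytes
```

Callers can then treat a None result as "nothing to anonymize" and keep the original object.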

Partially replaces the pair D3160/D3161.

See D3172 for the part in swh.journal.

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

douardda created this revision. Wed, May 20, 11:20 AM

Build is green

Patch application report for D3171 (id=11258)

Rebasing onto cce3036634...

Current branch diff-target is up to date.
Changes applied before test
commit 0292f52f53294f2a5e809218d67557940abeb34e
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 19 16:04:30 2020 +0200

    Add support for model object anonymization
    
    Simply add a BaseModel.anonymize() method. Default implementation returns
    None, meaning the object is not anonymizable.
    
    For Revision, Release and Person, the method returns an anonymized version of
    the object.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/60/ for more details.

ardumont accepted this revision. Wed, May 20, 11:28 AM
ardumont added a subscriber: ardumont.

Neat.

Some typos to fix.

swh/model/model.py
394

consists in

455

consists in

This revision is now accepted and ready to land. Wed, May 20, 11:28 AM
olasd added a subscriber: olasd. Wed, May 20, 12:42 PM
olasd added inline comments.
swh/model/model.py
149–150

name and email are just display helpers. The anonymous version should probably only hash the fullname data.

Shouldn't we make anonymized objects error when their compute_hash() method is called?

> Shouldn't we make anonymized objects error when their compute_hash() method is called?

Maybe, but that would require we keep the info "this is an anonymized object" somewhere, which is not the case for now. This idea can be dealt with later, maybe?

douardda added inline comments. Wed, May 20, 2:29 PM
swh/model/model.py
149–150

Right, I read the code in identifier.py too fast and thought all 3 were concatenated for hash computation, but they are not, as you point out.

> Maybe, but that would require we keep the info "this is an anonymized object" somewhere, which is not the case for now. This idea can be dealt with later, maybe?

That's why I'm asking now, so you/we don't have to do some code changes later. But if you're comfortable with it, then fine.
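
For illustration only, the idea discussed in this thread could look like the sketch below. It assumes a hypothetical `anonymized` marker stored on the object, which this diff does not add. The `Revision` class here is a toy stand-in, and its compute_hash() is a placeholder, not the real identifier computation.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Revision:
    """Toy stand-in for a hashable model object (not the real class)."""

    message: bytes
    author_fullname: bytes
    # Hypothetical marker: the diff under review does not store this.
    anonymized: bool = False

    def anonymize(self) -> "Revision":
        # Replace the fullname by its SHA-256 digest (an assumption) and
        # record that the original data is no longer available.
        digest = hashlib.sha256(self.author_fullname).digest()
        return replace(self, author_fullname=digest, anonymized=True)

    def compute_hash(self) -> bytes:
        if self.anonymized:
            # The hash would be computed over altered data, so refuse.
            raise ValueError("cannot compute the hash of an anonymized object")
        # Placeholder hash; the real computation is more involved.
        return hashlib.sha1(self.message + b"\0" + self.author_fullname).digest()
```

The cost pointed out above is visible here: the marker becomes part of the object, so it has to be carried along wherever the object is copied or serialized.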

douardda updated this revision to Diff 11267. Wed, May 20, 2:47 PM

Typos + comments/docstrings + hash on the fullname in Person.anonymize()

Also ensures the persons_d() strategy does not generate data that looks like
an anonymized person.

douardda edited the summary of this revision. Wed, May 20, 2:48 PM

Build is green

Patch application report for D3171 (id=11267)

Rebasing onto cce3036634...

Current branch diff-target is up to date.
Changes applied before test
commit e40fe471031bc85f9d40be163cba9d7351a02888
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 19 16:04:30 2020 +0200

    Add support for model object anonymization
    
    Simply add a BaseModel.anonymize() method. Default implementation returns
    None, meaning the object is not anonymizable.
    
    For Person, the method returns a Person with hashed fullname (and unset
    name and email).
    
    For Revision and Release, the method returns an anonymized version of
    the object, i.e. with instances of Person replaced by anonymized ones.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/61/ for more details.

ardumont accepted this revision. Wed, May 20, 3:04 PM
ardumont added inline comments.
swh/model/hypothesis_strategies.py
102

\o/

douardda updated this revision to Diff 11268. Wed, May 20, 3:37 PM
douardda edited the summary of this revision.

properly annotate BaseModel.anonymize()

Build is green

Patch application report for D3171 (id=11268)

Rebasing onto cce3036634...

Current branch diff-target is up to date.
Changes applied before test
commit 0f3af381835fc2f1e3e420519d0bba7aef3d8ce6
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 19 16:04:30 2020 +0200

    Add support for model object anonymization
    
    Simply add a BaseModel.anonymize() method. Default implementation returns
    None, meaning the object is not anonymizable.
    
    For Person, the method returns a Person with hashed fullname (and unset
    name and email).
    
    For Revision and Release, the method returns an anonymized version of
    the object, i.e. with instances of Person replaced by anonymized ones.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/62/ for more details.

douardda updated this revision to Diff 11270. Wed, May 20, 4:30 PM

use ModelType instead of T for type annotation
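
As a rough sketch of what such an annotation could look like (the exact definition in the diff is not shown on this page, so the bound and the `Content` subclass below are assumptions), a type variable bound to the base class lets type checkers infer that anonymize() returns either None or an object of the caller's own class:

```python
from typing import Optional, TypeVar

# Type variable bound to BaseModel, so that each subclass gets a precise
# return type instead of a generic BaseModel.
ModelType = TypeVar("ModelType", bound="BaseModel")


class BaseModel:
    def anonymize(self: ModelType) -> Optional[ModelType]:
        """Return an anonymized copy of the object, or None if the object
        is not anonymizable (the default)."""
        return None


class Content(BaseModel):
    """Hypothetical model class that keeps the default behaviour."""
```

A descriptive name such as ModelType also reads better in signatures than a bare T.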

Build is green

Patch application report for D3171 (id=11270)

Rebasing onto cce3036634...

Current branch diff-target is up to date.
Changes applied before test
commit 29312dff6d96ac1c9bc18bf98de1d2e27a76c334
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 19 16:04:30 2020 +0200

    Add support for model object anonymization
    
    Simply add a BaseModel.anonymize() method. Default implementation returns
    None, meaning the object is not anonymizable.
    
    For Person, the method returns a Person with hashed fullname (and unset
    name and email).
    
    For Revision and Release, the method returns an anonymized version of
    the object, i.e. with instances of Person replaced by anonymized ones.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/63/ for more details.

This revision was automatically updated to reflect the committed changes.