
Add support for model object anonymization
Closed, Public

Authored by douardda on May 20 2020, 11:20 AM.

Details

Summary

Simply add a BaseModel.anonymize() method. The default implementation returns
None, meaning the object is not anonymizable.

For Revision, Release and Person, the method returns an anonymized version of
the object.

Partially replaces the pair D3160/D3161.

See D3172 for the part in swh.journal.
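The contract described in the summary can be sketched as follows. This is a minimal illustration, not the code in the diff: it uses stdlib dataclasses, the field names on Revision and the Content class are hypothetical, and SHA-256 is an assumed choice of hash. The Person behaviour (hashed fullname, unset name and email) follows the later version of the commit message in this thread.

```python
import hashlib
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class BaseModel:
    def anonymize(self) -> Optional["BaseModel"]:
        # Default: None signals that this object type is not anonymizable.
        return None


@dataclass(frozen=True)
class Person(BaseModel):
    fullname: bytes
    name: Optional[bytes] = None
    email: Optional[bytes] = None

    def anonymize(self) -> "Person":
        # Assumption: SHA-256 is used as the hash; name and email stay unset.
        return Person(fullname=hashlib.sha256(self.fullname).digest())


@dataclass(frozen=True)
class Revision(BaseModel):
    # Hypothetical, reduced set of fields for illustration.
    message: bytes
    author: Person
    committer: Person

    def anonymize(self) -> "Revision":
        # Same object, with every Person replaced by its anonymized version.
        return replace(
            self,
            author=self.author.anonymize(),
            committer=self.committer.anonymize(),
        )


@dataclass(frozen=True)
class Content(BaseModel):
    # Hypothetical non-anonymizable object: it inherits the default.
    data: bytes
```

A caller can then treat a None result as "nothing to anonymize here" and handle that case in whatever way fits its own pipeline.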

Event Timeline

Build is green

Patch application report for D3171 (id=11258)

Rebasing onto cce3036634...

Current branch diff-target is up to date.
Changes applied before test
commit 0292f52f53294f2a5e809218d67557940abeb34e
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 19 16:04:30 2020 +0200

    Add support for model object anonymization
    
    Simply add a BaseModel.anonymize() method. Default implementation returns
    None, meaning the object is not anonymizable.
    
    For Revision, Release and Person, the method returns an anonymized version of
    the object.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/60/ for more details.

ardumont added a subscriber: ardumont.

Neat.

Some typos to fix.

swh/model/model.py
393

consists in

454

consists in

This revision is now accepted and ready to land. May 20 2020, 11:28 AM
olasd added inline comments.
swh/model/model.py
148–149

name and email are just display helpers. The anonymous version should probably only hash the fullname data.

Shouldn't we make anonymized objects error when their compute_hash() method is called?

> Shouldn't we make anonymized objects error when their compute_hash() method is called?

Maybe, but that would require keeping the information "this is an anonymized object" somewhere, which is not the case for now. This idea can be dealt with later, maybe?

swh/model/model.py
148–149

Right, I read the code in identifier.py too fast and thought all three were concatenated for the hash computation, but they are not, as you point out.

> Maybe, but that would require keeping the information "this is an anonymized object" somewhere, which is not the case for now. This idea can be dealt with later, maybe?

That's why I'm asking now, so you/we don't have to make code changes later. But if you're comfortable with it, then fine.
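For reference, the idea raised in this exchange could look roughly like the sketch below. It is not part of this diff: the `anonymized` field is a hypothetical marker (exactly the extra piece of information the author says is not kept for now), the Release fields are reduced for illustration, and the SHA-1 call is only a stand-in for the real identifier computation.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Person:
    fullname: bytes

    def anonymize(self) -> "Person":
        # Assumption: SHA-256 of the fullname.
        return Person(fullname=hashlib.sha256(self.fullname).digest())


@dataclass(frozen=True)
class Release:
    name: bytes
    author: Person
    anonymized: bool = False  # hypothetical marker, not in the diff

    def anonymize(self) -> "Release":
        # Remember that the result no longer matches the original data.
        return replace(self, author=self.author.anonymize(), anonymized=True)

    def compute_hash(self) -> bytes:
        if self.anonymized:
            # The intrinsic hash would be computed over altered data.
            raise ValueError("cannot compute the hash of an anonymized object")
        # Stand-in for the real identifier computation.
        return hashlib.sha1(self.name + b"\0" + self.author.fullname).digest()
```

The cost is the one pointed out in the reply: every anonymizable model would need to carry and serialize such a marker.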

Typos + comments/docstrings + hash on the fullname in Person.anonymize()

Also ensure the persons_d() strategy does not generate data that looks like
an anonymized person.
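One way to read "looks like an anonymized person" is: a fullname that has the length of a hash digest, with neither name nor email set. The sketch below encodes that reading as a predicate and uses it for rejection sampling. Everything here is an assumption for illustration: the 32-byte length follows from assuming SHA-256, `looks_anonymized` and `random_person_d` are hypothetical names, and the stdlib `random` generator stands in for the actual Hypothesis strategy, where the same predicate would typically be applied through `assume()` or `.filter()`.

```python
import hashlib
import random
from typing import Dict, Optional

# Assumption: anonymized fullnames are SHA-256 digests (32 bytes).
DIGEST_SIZE = hashlib.sha256().digest_size

PersonDict = Dict[str, Optional[bytes]]


def looks_anonymized(person: PersonDict) -> bool:
    """True if the dict is indistinguishable from an anonymized person."""
    fullname = person["fullname"]
    return (
        fullname is not None
        and len(fullname) == DIGEST_SIZE
        and person["name"] is None
        and person["email"] is None
    )


def random_person_d(rng: random.Random) -> PersonDict:
    """Draw random person dicts, rejecting any that look anonymized."""
    while True:
        length = rng.randint(0, 40)
        candidate: PersonDict = {
            "fullname": bytes(rng.getrandbits(8) for _ in range(length)),
            "name": rng.choice([None, b"Jane Doe"]),
            "email": rng.choice([None, b"jane@example.org"]),
        }
        if not looks_anonymized(candidate):
            return candidate
```

Without such a guard, a property test comparing a generated person with its anonymized version could occasionally hit an input that is already shaped like the output.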

Build is green

Patch application report for D3171 (id=11267)

Rebasing onto cce3036634...

Current branch diff-target is up to date.
Changes applied before test
commit e40fe471031bc85f9d40be163cba9d7351a02888
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 19 16:04:30 2020 +0200

    Add support for model object anonymization
    
    Simply add a BaseModel.anonymize() method. Default implementation returns
    None, meaning the object is not anonymizable.
    
    For Person, the method returns a Person with hashed fullname (and unset
    name and email).
    
    For Revision and Release, the method returns an anonymized version of
    the object, i.e. with instances of Person replaced by anonymized ones.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/61/ for more details.

ardumont added inline comments.
swh/model/hypothesis_strategies.py
101

\o/

douardda edited the summary of this revision.

properly annotate BaseModel.anonymize()

Build is green

Patch application report for D3171 (id=11268)

Rebasing onto cce3036634...

Current branch diff-target is up to date.
Changes applied before test
commit 0f3af381835fc2f1e3e420519d0bba7aef3d8ce6
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 19 16:04:30 2020 +0200

    Add support for model object anonymization
    
    Simply add a BaseModel.anonymize() method. Default implementation returns
    None, meaning the object is not anonymizable.
    
    For Person, the method returns a Person with hashed fullname (and unset
    name and email).
    
    For Revision and Release, the method returns an anonymized version of
    the object, i.e. with instances of Person replaced by anonymized ones.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/62/ for more details.

use ModelType instead of T for type annotation
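The annotation change mentioned in the last two updates can be illustrated with a bound TypeVar on `self`, so that a type checker infers `Optional[Person]` for a Person and `Optional[Revision]` for a Revision. This is a sketch under assumptions: only the names `ModelType` and `BaseModel.anonymize()` come from this thread; the exact signature in the diff and the reduced Person class below are assumed.

```python
import hashlib
from typing import Optional, TypeVar

# Bound TypeVar: stands for "whatever BaseModel subclass self is".
ModelType = TypeVar("ModelType", bound="BaseModel")


class BaseModel:
    def anonymize(self: ModelType) -> Optional[ModelType]:
        # Default: not anonymizable.
        return None


class Person(BaseModel):
    def __init__(self, fullname: bytes) -> None:
        self.fullname = fullname

    def anonymize(self) -> "Person":
        # Narrower return type than the base class, which is compatible
        # with Optional[Person]. SHA-256 is an assumption.
        return Person(fullname=hashlib.sha256(self.fullname).digest())
```

A descriptive name such as ModelType reads better in signatures and error messages than a bare T, which is presumably the motivation for the rename.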

Build is green

Patch application report for D3171 (id=11270)

Rebasing onto cce3036634...

Current branch diff-target is up to date.
Changes applied before test
commit 29312dff6d96ac1c9bc18bf98de1d2e27a76c334
Author: David Douard <david.douard@sdfa3.org>
Date:   Tue May 19 16:04:30 2020 +0200

    Add support for model object anonymization
    
    Simply add a BaseModel.anonymize() method. Default implementation returns
    None, meaning the object is not anonymizable.
    
    For Person, the method returns a Person with hashed fullname (and unset
    name and email).
    
    For Revision and Release, the method returns an anonymized version of
    the object, i.e. with instances of Person replaced by anonymized ones.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/63/ for more details.