Page MenuHomeSoftware Heritage

Refactor identifiers & model to make *_git_object() functions work on model classes instead of dicts
ClosedPublic

Authored by vlorentz on Sep 22 2021, 5:47 PM.

Details

Summary

Since we now use these classes everywhere, computing hashes required
using to_dict() just to compute identifiers, which can be a performance
bottleneck in code computing many checksums.

Depends on D6323 and D6324

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6325 (id=22979)

Could not rebase; Attempt merge onto b6f5e30b53...

Updating b6f5e30..54e6a7f
Fast-forward
 swh/model/identifiers.py            | 739 ++++++------------------------------
 swh/model/model.py                  | 140 +++++--
 swh/model/swhids.py                 | 448 ++++++++++++++++++++++
 swh/model/tests/test_identifiers.py |  69 ++--
 4 files changed, 694 insertions(+), 702 deletions(-)
 create mode 100644 swh/model/swhids.py
Changes applied before test
commit 54e6a7fbf8c0b9f77da02d32ddccf55b4f9a389c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:32:48 2021 +0200

    Refactor identifiers & model to make *_git_object() functions work on model classes instead of dicts
    
    Since we now use these classes everywhere, computing hashes required
    using to_dict() just to compute identifiers, which can be a performance
    bottleneck in code computing many checksums.

commit 209356fc6704070bbac0b5f09b9c2e0c9aa2f12c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:19:10 2021 +0200

    test_identifiers.py: Fix/update malformed data dicts
    
    A future commit will make identifier computation use the attrs classes,
    which are strict about what they accept.

commit d08b45c8766ec82ef9b47c30e2d574e279a0e37e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 16:36:30 2021 +0200

    Move SWHID classes and functions from identifiers.py to swhids.py
    
    identifiers.py initially worked only on bare sha1_git.
    
    I chose to add the SWHID classes in that module because
    of the name, but the SWHID code didn't actually interact
    with the other functions in the module, so it now feels
    out of place to me.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/374/ for more details.

shouldn't have staged this change

Build is green

Patch application report for D6325 (id=22983)

Could not rebase; Attempt merge onto b6f5e30b53...

Updating b6f5e30..b4764ee
Fast-forward
 swh/model/identifiers.py            | 739 ++++++------------------------------
 swh/model/model.py                  | 138 +++++--
 swh/model/swhids.py                 | 448 ++++++++++++++++++++++
 swh/model/tests/test_identifiers.py |  69 ++--
 4 files changed, 693 insertions(+), 701 deletions(-)
 create mode 100644 swh/model/swhids.py
Changes applied before test
commit b4764ee68e5a4a71c90e81482ed133ca8b22eb12
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:32:48 2021 +0200

    Refactor identifiers & model to make *_git_object() functions work on model classes instead of dicts
    
    Since we now use these classes everywhere, computing hashes required
    using to_dict() just to compute identifiers, which can be a performance
    bottleneck in code computing many checksums.

commit 209356fc6704070bbac0b5f09b9c2e0c9aa2f12c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:19:10 2021 +0200

    test_identifiers.py: Fix/update malformed data dicts
    
    A future commit will make identifier computation use the attrs classes,
    which are strict about what they accept.

commit d08b45c8766ec82ef9b47c30e2d574e279a0e37e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 16:36:30 2021 +0200

    Move SWHID classes and functions from identifiers.py to swhids.py
    
    identifiers.py initially worked only on bare sha1_git.
    
    I chose to add the SWHID classes in that module because
    of the name, but the SWHID code didn't actually interact
    with the other functions in the module, so it now feels
    out of place to me.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/377/ for more details.

Thanks!

I believe most annotations for the *_git_object functions should formally be Unions. I think the fallbacks from dict should raise DeprecationWarnings too?

Feels like it might be a good time to trim the complexity out of the Timestamp.from_dict function, but I'm not sure what the shape of existing timestamps in the journal, for instance, is, so I don't know how much of the scope we should be narrowing... I guess we'll keep that for later.

swh/model/identifiers.py
13–15

"Hmm, this is going to do a circular import, isn't it?"

227–228

Ah, yes. Didn't we have a bug for negative timestamps with microseconds somewhere?

361–362

needs an update

410–411

That doesn't sound accurate.

swh/model/model.py
17

"Ah, I guess it doesn't!"

198

I think this fallback should raise a DeprecationWarning. We shouldn't be building Person objects out of ambiguous name/email specifications, except in very narrow legacy cases. We definitely don't want new code to do that.

259

needs a s/normalize_timestamp/Timestamp.from_dict/g

678–679

Wrong diff :-)

1187–1188

Hmm. I think I've missed what caused this change to be necessary?

vlorentz marked 4 inline comments as done.

apply comments

swh/model/identifiers.py
227–228

Yes. Either timestamps between 0 and -1 (exclusive) are not representable, or converting a Timestamp to float should actually be:

if date.seconds < 0:
    float_value = "%d.%06d" % (date.seconds+1, 1000000 - date.microseconds)
else:
    float_value = "%d.%06d" % (date.seconds, date.microseconds)

I don't know which one it is.

361–362

I'll just remove them; type annotations are explicit enough.

410–411

Why not?

swh/model/model.py
17

It does, but Python allows circular imports as long as modules don't access each other's objects while they are being imported.

198

I'll do it in a future diff.

259

That would be inaccurate, Timestamp.from_dict only accepts dicts.

But I see your point, I'll move the docstring from normalize_timestamp to this one when I remove normalize_timestamp

678–679

oops

1187–1188

Good catch, it was bogus test data.

Build is green

Patch application report for D6325 (id=22998)

Could not rebase; Attempt merge onto b6f5e30b53...

Updating b6f5e30..2563ab4
Fast-forward
 swh/model/identifiers.py            | 745 ++++++------------------------------
 swh/model/model.py                  | 129 +++++--
 swh/model/swhids.py                 | 448 ++++++++++++++++++++++
 swh/model/tests/test_identifiers.py |  81 ++--
 4 files changed, 692 insertions(+), 711 deletions(-)
 create mode 100644 swh/model/swhids.py
Changes applied before test
commit 2563ab4b076790160015727bb25fde59fe9e6f90
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:32:48 2021 +0200

    Refactor identifiers & model to make *_git_object() functions work on model classes instead of dicts
    
    Since we now use these classes everywhere, computing hashes required
    using to_dict() just to compute identifiers, which can be a performance
    bottleneck in code computing many checksums.

commit 209356fc6704070bbac0b5f09b9c2e0c9aa2f12c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:19:10 2021 +0200

    test_identifiers.py: Fix/update malformed data dicts
    
    A future commit will make identifier computation use the attrs classes,
    which are strict about what they accept.

commit d08b45c8766ec82ef9b47c30e2d574e279a0e37e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 16:36:30 2021 +0200

    Move SWHID classes and functions from identifiers.py to swhids.py
    
    identifiers.py initially worked only on bare sha1_git.
    
    I chose to add the SWHID classes in that module because
    of the name, but the SWHID code didn't actually interact
    with the other functions in the module, so it now feels
    out of place to me.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/379/ for more details.

swh/model/identifiers.py
410–411

This should say "author and committer are serialized using the fullname attributes of the corresponding Person objects" or something to that effect, rather than defer to an optional behavior of the Person object which doesn't talk about formatting at all.

swh/model/model.py
259

I meant that this substitution should happen in the ValueErrors etc., sorry that wasn't clear.

swh/model/model.py
259

oh, indeed

fix spec of author/committer formatting in revision_identifier + update function name in ValueErrors raised by TimestampWithTimezone.from_dict

Build is green

Patch application report for D6325 (id=23006)

Could not rebase; Attempt merge onto 0dd33cdf7d...

Updating 0dd33cd..57ae405
Fast-forward
 swh/model/identifiers.py            | 744 ++++++------------------------------
 swh/model/model.py                  | 129 +++++--
 swh/model/swhids.py                 | 448 ++++++++++++++++++++++
 swh/model/tests/test_identifiers.py |  81 ++--
 4 files changed, 691 insertions(+), 711 deletions(-)
 create mode 100644 swh/model/swhids.py
Changes applied before test
commit 57ae405d312879bec19107d29a20c2c290d7861d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:32:48 2021 +0200

    Refactor identifiers & model to make *_git_object() functions work on model classes instead of dicts
    
    Since we now use these classes everywhere, computing hashes required
    using to_dict() just to compute identifiers, which can be a performance
    bottleneck in code computing many checksums.

commit 6a72f88c5687f5b6c05f9b25eb73bce2d772f97c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:19:10 2021 +0200

    test_identifiers.py: Fix/update malformed data dicts
    
    A future commit will make identifier computation use the attrs classes,
    which are strict about what they accept.

commit 9ec683264c415731286005dff823e1099ef358c3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 16:36:30 2021 +0200

    Move SWHID classes and functions from identifiers.py to swhids.py
    
    identifiers.py initially worked only on bare sha1_git.
    
    I chose to add the SWHID classes in that module because
    of the name, but the SWHID code didn't actually interact
    with the other functions in the module, so it now feels
    out of place to me.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/381/ for more details.

Any chance for the signatures of the backward-compatible functions to be updated to take Union[]s? Else it'll make mypy in other modules sad.

swh/model/identifiers.py
227–228

Should be the latter. I think. It's pretty awful.

This revision is now accepted and ready to land.Sep 23 2021, 3:30 PM
swh/model/identifiers.py
227–228

Let's add it on the TODO-list for SWHIDv2 :)

In D6325#164133, @olasd wrote:

Else it'll make mypy in other modules sad.

Hmm I was about to say it's the point; but it's going to raise hard CI errors, so I'll do it.

This revision was landed with ongoing or failed builds.Sep 23 2021, 7:55 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D6325 (id=23020)

Could not rebase; Attempt merge onto 0dd33cdf7d...

Updating 0dd33cd..57ae405
Fast-forward
 swh/model/identifiers.py            | 744 ++++++------------------------------
 swh/model/model.py                  | 129 +++++--
 swh/model/swhids.py                 | 448 ++++++++++++++++++++++
 swh/model/tests/test_identifiers.py |  81 ++--
 4 files changed, 691 insertions(+), 711 deletions(-)
 create mode 100644 swh/model/swhids.py
Changes applied before test
commit 57ae405d312879bec19107d29a20c2c290d7861d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:32:48 2021 +0200

    Refactor identifiers & model to make *_git_object() functions work on model classes instead of dicts
    
    Since we now use these classes everywhere, computing hashes required
    using to_dict() just to compute identifiers, which can be a performance
    bottleneck in code computing many checksums.

commit 6a72f88c5687f5b6c05f9b25eb73bce2d772f97c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 17:19:10 2021 +0200

    test_identifiers.py: Fix/update malformed data dicts
    
    A future commit will make identifier computation use the attrs classes,
    which are strict about what they accept.

commit 9ec683264c415731286005dff823e1099ef358c3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Sep 22 16:36:30 2021 +0200

    Move SWHID classes and functions from identifiers.py to swhids.py
    
    identifiers.py initially worked only on bare sha1_git.
    
    I chose to add the SWHID classes in that module because
    of the name, but the SWHID code didn't actually interact
    with the other functions in the module, so it now feels
    out of place to me.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/386/ for more details.