Page MenuHomeSoftware Heritage

identifiers: Properly define the behavior of raw_extrinsic_metadata on negative timestamps.
ClosedPublic

Authored by vlorentz on Feb 4 2021, 12:26 PM.

Details

Summary

The rounding algorithm wasn't specified

Depends on D5004.

Diff Detail

Repository
rDMOD Data model
Branch
metadata-identifier
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 19634
Build 30472: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 30471: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D5008 (id=17878)

Could not rebase; Attempt merge onto 0c16581283...

Updating 0c16581..e338317
Fast-forward
 swh/model/hashutil.py               |   9 +-
 swh/model/identifiers.py            |  94 ++++++++++++++
 swh/model/model.py                  |   8 +-
 swh/model/tests/test_identifiers.py | 243 ++++++++++++++++++++++++++++++++++++
 swh/model/tests/test_model.py       |   6 +-
 5 files changed, 357 insertions(+), 3 deletions(-)
Changes applied before test
commit e3383172e94518217be3905ed5e3f7acd0dd80fd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 12:24:46 2021 +0100

    identifiers: Properly define the behavior of raw_extrinsic_metadata on negative timestamps.

commit 03d282b9af891b6d48e83bd5301e57c0472d5b03
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit 266b88dcaaa0cab48c67e62ebca51f0a4599c435
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit 272468f3b5a96c8854a26efe333c32cba4504aff
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/216/ for more details.

anlambert added a subscriber: anlambert.

Looks good to me.

This revision is now accepted and ready to land.Feb 8 2021, 2:23 PM
olasd added a subscriber: olasd.

I still find slightly unfortunate that this logic is separate from the one used for revisions/releases, but I guess it makes sense as the "input types" are so dissimilar.

swh/model/identifiers.py
781

I'm not sure under which conditions this assert would trigger? Could you turn it into a ValueError instead?

swh/model/tests/test_identifiers.py
974

TIL 00 is still a valid Python 3 integer. (should probably be a 0 anyway)

swh/model/identifiers.py
781

I'm not sure under which conditions this assert would trigger?

>>> dt = datetime.datetime.now().replace(tzinfo=datetime.timezone(datetime.timedelta(microseconds=1000)))
>>> dt.replace(microsecond=0).timestamp()
1612795046.999

but I don't see any reason for it to happen

Could you turn it into a ValueError instead?

Why?

swh/model/tests/test_identifiers.py
974

TIL too

olasd requested changes to this revision.Feb 8 2021, 4:06 PM
olasd added inline comments.
swh/model/identifiers.py
781

So this means the way we're using the .replace() operation is buggy.

We should first normalize the timezone to utc (.astimezone(datetime.timezone.utc)), then replace the microseconds value with zero.

ValueError is the right exception to raise when a value is wrong, rather than an assert that can be optimized away.

This revision now requires changes to proceed.Feb 8 2021, 4:06 PM

Build is green

Patch application report for D5008 (id=17977)

Could not rebase; Attempt merge onto 0c16581283...

Updating 0c16581..35aab96
Fast-forward
 swh/model/hashutil.py               |   9 +-
 swh/model/identifiers.py            |  99 ++++++++++++++
 swh/model/model.py                  |   8 +-
 swh/model/tests/test_identifiers.py | 263 ++++++++++++++++++++++++++++++++++++
 swh/model/tests/test_model.py       |   6 +-
 5 files changed, 382 insertions(+), 3 deletions(-)
Changes applied before test
commit 35aab964f65f25ba6c3acf161a49682d100aa450
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 12:24:46 2021 +0100

    identifiers: Properly define the behavior of raw_extrinsic_metadata on negative timestamps.

commit 03d282b9af891b6d48e83bd5301e57c0472d5b03
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit 266b88dcaaa0cab48c67e62ebca51f0a4599c435
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit 272468f3b5a96c8854a26efe333c32cba4504aff
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/220/ for more details.

swh/model/identifiers.py
781

Ok. But now that I'm replacing with .astimezone, an assert is the right one to use

it would be nice to have an explanation in the commit message of why this is actually needed. Explain why the original implementation is incorrect, and how it fixes this old incorrect behavior.

it would be nice to have an explanation in the commit message of why this is actually needed. Explain why the original implementation is incorrect, and how it fixes this old incorrect behavior.

still relevant ;)

big rebase + explain in the commit msg what wasn't specified

Build is green

Patch application report for D5008 (id=18555)

Could not rebase; Attempt merge onto 8e0119962b...

Updating 8e01199..a0d2933
Fast-forward
 swh/model/cli.py                    |  92 +++---
 swh/model/hashutil.py               |   9 +-
 swh/model/identifiers.py            | 283 ++++++------------
 swh/model/model.py                  |   8 +-
 swh/model/tests/test_cli.py         |   6 +-
 swh/model/tests/test_identifiers.py | 575 ++++++++++++++++--------------------
 swh/model/tests/test_model.py       |   7 +-
 7 files changed, 434 insertions(+), 546 deletions(-)
Changes applied before test
commit a0d29337130e4d063e25f51a8b8bbce0401039fc
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 12:24:46 2021 +0100

    identifiers: Properly define the behavior of raw_extrinsic_metadata on negative timestamps.
    
    The rounding algorithm wasn't specified

commit ff1423f8fab417a7e146c036ae09c0e6dd0fe4e2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit 8cbe59ee6251ac834a3d98b04146d2dfeb8da075
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit d88a5e13f2ffea5c0ebfad24e29ba41e86af20c0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

commit bf4ab4336f7b43d442988c47d3dd70bb82b595c5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 3 10:44:48 2021 +0100

    identifiers: Remove the deprecated SWHID class
    
    Other packages don't use it anymore.

commit 1e924e84198a895003d6f649b8e3471cd93a7c7b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 3 10:44:27 2021 +0100

    cli: stop using the deprecated SWHID class

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/276/ for more details.

This revision is now accepted and ready to land.Mar 4 2021, 11:06 AM

Build is green

Patch application report for D5008 (id=18589)

Could not rebase; Attempt merge onto bf4ab4336f...

Updating bf4ab43..4386330
Fast-forward
 swh/model/hashutil.py               |   9 +-
 swh/model/identifiers.py            |  99 ++++++++++++++
 swh/model/model.py                  |   8 +-
 swh/model/tests/test_identifiers.py | 258 ++++++++++++++++++++++++++++++++++++
 swh/model/tests/test_model.py       |   7 +-
 5 files changed, 378 insertions(+), 3 deletions(-)
Changes applied before test
commit 43863306859ea90d7701db7ee5ea68c2dbff95ce
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 12:24:46 2021 +0100

    identifiers: Properly define the behavior of raw_extrinsic_metadata on negative timestamps.
    
    The rounding algorithm wasn't specified

commit 3ce412506cd942d1a5e7340bc309f43da2b2adeb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit fc808e1fc9e59ffae1e82ea483d529025d7d1436
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit f6eab95253f13f28fe4d4652fc471e3e8a0b5565
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/280/ for more details.