Page MenuHomeSoftware Heritage

identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
ClosedPublic

Authored by vlorentz on Feb 4 2021, 11:06 AM.

Details

Summary

Serializing as ISO8601 makes the hash brittle, because the database may
change the timezone silently and/or lose precision in the microseconds.

As we do not need precise timestamp, using an integer is good enough,
and is consistant with the git format.

The manifest also does not need to contain a timezone, as it only
represents the timezone of the system that fetched this metadata,
which is useless data.

Depends on D4970.

Diff Detail

Repository
rDMOD Data model
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 18990
Build 29422: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 29421: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D5004 (id=17868)

Could not rebase; Attempt merge onto 0c16581283...

Updating 0c16581..a5a7971
Fast-forward
 swh/model/hashutil.py               |   9 ++-
 swh/model/identifiers.py            |  91 ++++++++++++++++++++++++
 swh/model/model.py                  |   8 ++-
 swh/model/tests/test_identifiers.py | 137 ++++++++++++++++++++++++++++++++++++
 swh/model/tests/test_model.py       |   6 +-
 5 files changed, 248 insertions(+), 3 deletions(-)
Changes applied before test
commit a5a7971eca0a979de1d460c2f171e2456f35c846
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit 266b88dcaaa0cab48c67e62ebca51f0a4599c435
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit 272468f3b5a96c8854a26efe333c32cba4504aff
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/212/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/212/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 4 2021, 11:09 AM
Harbormaster failed remote builds in B18990: Diff 17868!
  • fix test
  • don't error on non-zero microsecond, just silently truncate it

Build is green

Patch application report for D5004 (id=17869)

Could not rebase; Attempt merge onto 0c16581283...

Updating 0c16581..8c257d1
Fast-forward
 swh/model/hashutil.py               |   9 ++-
 swh/model/identifiers.py            |  87 +++++++++++++++++++++++
 swh/model/model.py                  |   8 ++-
 swh/model/tests/test_identifiers.py | 137 ++++++++++++++++++++++++++++++++++++
 swh/model/tests/test_model.py       |   6 +-
 5 files changed, 244 insertions(+), 3 deletions(-)
Changes applied before test
commit 8c257d192bad503e5843ca9c13e533ee442a90c5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit 266b88dcaaa0cab48c67e62ebca51f0a4599c435
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit 272468f3b5a96c8854a26efe333c32cba4504aff
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/213/ for more details.

add test for microsecond-insensitivity.

Build is green

Patch application report for D5004 (id=17870)

Could not rebase; Attempt merge onto 0c16581283...

Updating 0c16581..b5142ae
Fast-forward
 swh/model/hashutil.py               |   9 ++-
 swh/model/identifiers.py            |  87 ++++++++++++++++++++
 swh/model/model.py                  |   8 +-
 swh/model/tests/test_identifiers.py | 156 ++++++++++++++++++++++++++++++++++++
 swh/model/tests/test_model.py       |   6 +-
 5 files changed, 263 insertions(+), 3 deletions(-)
Changes applied before test
commit b5142aee2500be8fd250f9a398eb5035105d89e5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit 266b88dcaaa0cab48c67e62ebca51f0a4599c435
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit 272468f3b5a96c8854a26efe333c32cba4504aff
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/214/ for more details.

add test for negative timestamp

Build is green

Patch application report for D5004 (id=17871)

Could not rebase; Attempt merge onto 0c16581283...

Updating 0c16581..03d282b
Fast-forward
 swh/model/hashutil.py               |   9 +-
 swh/model/identifiers.py            |  87 +++++++++++++++++
 swh/model/model.py                  |   8 +-
 swh/model/tests/test_identifiers.py | 185 ++++++++++++++++++++++++++++++++++++
 swh/model/tests/test_model.py       |   6 +-
 5 files changed, 292 insertions(+), 3 deletions(-)
Changes applied before test
commit 03d282b9af891b6d48e83bd5301e57c0472d5b03
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit 266b88dcaaa0cab48c67e62ebca51f0a4599c435
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit 272468f3b5a96c8854a26efe333c32cba4504aff
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/215/ for more details.

lgtm

The reasoning is sound :)

One suggestion to decrease repetition in test inline.

Note that it's "consistent" in english (not consistant :)

swh/model/tests/test_identifiers.py
882

To decrease the repetition on those 2 tests (timezone and microsecond insensitive), you could use make the "equivalent" datetime parametric (one naive, one with timezone, one with microseconds, one with both timezone and microseconds, etc... (using pytest.mark.parametrize).

Expectations being the same hash in the end.

This revision is now accepted and ready to land.Feb 5 2021, 9:11 AM

big rebase, plz review again

Build is green

Patch application report for D5004 (id=18554)

Could not rebase; Attempt merge onto 8e0119962b...

Updating 8e01199..ff1423f
Fast-forward
 swh/model/cli.py                    |  92 ++++---
 swh/model/hashutil.py               |   9 +-
 swh/model/identifiers.py            | 271 ++++++--------------
 swh/model/model.py                  |   8 +-
 swh/model/tests/test_cli.py         |   6 +-
 swh/model/tests/test_identifiers.py | 499 +++++++++++++-----------------------
 swh/model/tests/test_model.py       |   7 +-
 7 files changed, 346 insertions(+), 546 deletions(-)
Changes applied before test
commit ff1423f8fab417a7e146c036ae09c0e6dd0fe4e2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit 8cbe59ee6251ac834a3d98b04146d2dfeb8da075
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit d88a5e13f2ffea5c0ebfad24e29ba41e86af20c0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

commit bf4ab4336f7b43d442988c47d3dd70bb82b595c5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 3 10:44:48 2021 +0100

    identifiers: Remove the deprecated SWHID class
    
    Other packages don't use it anymore.

commit 1e924e84198a895003d6f649b8e3471cd93a7c7b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 3 10:44:27 2021 +0100

    cli: stop using the deprecated SWHID class

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/275/ for more details.

This revision is now accepted and ready to land.Mar 4 2021, 11:06 AM

Build is green

Patch application report for D5004 (id=18588)

Could not rebase; Attempt merge onto bf4ab4336f...

Updating bf4ab43..3ce4125
Fast-forward
 swh/model/hashutil.py               |   9 +-
 swh/model/identifiers.py            |  87 +++++++++++++++++
 swh/model/model.py                  |   8 +-
 swh/model/tests/test_identifiers.py | 182 ++++++++++++++++++++++++++++++++++++
 swh/model/tests/test_model.py       |   7 +-
 5 files changed, 290 insertions(+), 3 deletions(-)
Changes applied before test
commit 3ce412506cd942d1a5e7340bc309f43da2b2adeb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Feb 4 11:02:39 2021 +0100

    identifiers: Change the manifest format of raw_extrinsic_metadata to use integer instead of ISO8601
    
    Serializing as ISO8601 makes the hash brittle, because the database may
    change the timezone silently and/or lose precision in the microseconds.
    
    As we do not need precise timestamp, using an integer is good enough,
    and is consistant with the git format.
    
    The manifest also does not need to contain a timezone, as it only
    represents the timezone of the system that fetched this metadata,
    which is useless data.

commit fc808e1fc9e59ffae1e82ea483d529025d7d1436
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 29 15:08:49 2021 +0100

    model: Add 'id' field to RawExtrinsicMetadata
    
    So that they can be properly deduplicated and referenced.

commit f6eab95253f13f28fe4d4652fc471e3e8a0b5565
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Jan 25 12:31:12 2021 +0100

    identifiers: Add raw_extrinsic_metadata_identifier
    
    This will be used to compute an intrisic identifier for RawExtrinsicMetadata;
    which can be used for deduplication and refering to it like any other sha1_git
    instead of needed to use a tuple of its fields.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/279/ for more details.