Page MenuHomeSoftware Heritage

Deduplicate resolve_revision_from across package loaders
ClosedPublic

Authored by vlorentz on Mar 19 2021, 2:21 PM.

Details

Summary

All package loaders but deposit had logic to compute some object
from the new packageinfo, some other objects from the known artifacts,
and compare them.

This commit moves the comparison logic to the base class, and unifies
the two computation interfaces, respectively as an extid() method
on TPackageInfo and a method on the loader.

This unified object for comparison is a byte string,
which is internal to each loader for now, but a future commit
will read and write it from/to the ExtID storage instead of
computing it from the 'original_artifacts' present in
revision metadata.

Unit TestsFailed

TimeTest
272 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.package.archive.tests.test_archive::test_archive_2_visits_with_new_artifact
swh_storage = <swh.storage.retry.RetryingProxyStorage object at 0x7faf15de9f98> requests_mock_datadir = <requests_mock.mocker.Mocker object at 0x7faf15e346a0>
245 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.package.archive.tests.test_archive::test_archive_2_visits_without_change
swh_storage = <swh.storage.retry.RetryingProxyStorage object at 0x7faf15ef47f0> requests_mock_datadir = <requests_mock.mocker.Mocker object at 0x7faf15dc4dd8>
211 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.package.archive.tests.test_archive::test_archive_2_visits_without_change_not_gnu
swh_storage = <swh.storage.retry.RetryingProxyStorage object at 0x7faf15f99588> requests_mock_datadir = <requests_mock.mocker.Mocker object at 0x7faf15dc4710>
217 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.package.cran.tests.test_cran::test_cran_2_visits_same_origin
swh_storage = <swh.storage.retry.RetryingProxyStorage object at 0x7faf14af7fd0> requests_mock_datadir = <requests_mock.mocker.Mocker object at 0x7faf14b24978>
344 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.loader.package.nixguix.tests.test_nixguix::test_load_nixguix_one_common_artifact_from_other_loader
swh_storage = <swh.storage.retry.RetryingProxyStorage object at 0x7faf140f3e10> datadir = '/var/lib/jenkins/workspace/DLDBASE/tests-on-diff/.tox/py3/lib/python3.7/site-packages/swh/loader/package/nixguix/tests/data' requests_mock_datadir_visits = <requests_mock.mocker.Mocker object at 0x7faf140ad4a8>
View Full Test Results (5 Failed · 153 Passed)

Event Timeline

Build has FAILED

Patch application report for D5290 (id=18937)

Could not rebase; Attempt merge onto 132522e42e...

Auto-merging swh/loader/package/deposit/loader.py
Merge made by the 'recursive' strategy.
 swh/loader/package/archive/loader.py             | 57 ++++++++--------
 swh/loader/package/archive/tests/test_archive.py | 32 +++++----
 swh/loader/package/cran/loader.py                | 28 ++------
 swh/loader/package/debian/loader.py              | 85 +++++++++++++-----------
 swh/loader/package/debian/tests/test_debian.py   | 67 +++++++++++++++++--
 swh/loader/package/deposit/loader.py             |  5 ++
 swh/loader/package/loader.py                     | 50 ++++++++++++--
 swh/loader/package/nixguix/loader.py             | 40 +++++------
 swh/loader/package/nixguix/tests/test_nixguix.py | 33 +++++----
 swh/loader/package/npm/loader.py                 | 46 +++++++------
 swh/loader/package/npm/tests/test_npm.py         | 74 +++++++--------------
 swh/loader/package/pypi/loader.py                | 47 ++++++-------
 swh/loader/package/pypi/tests/test_pypi.py       | 79 ++++++----------------
 swh/loader/package/tests/test_loader.py          | 61 +++++++++++++++--
 14 files changed, 390 insertions(+), 314 deletions(-)
Changes applied before test
commit be873a0340160ad6fc57f8f7697558b2993b99f2
Merge: 132522e 80a3166
Author: Jenkins user <jenkins@localhost>
Date:   Fri Mar 19 13:30:43 2021 +0000

    Merge branch 'diff-target' into HEAD

commit 80a31665ad5c6faa9b5386531d10d4417d84b839
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 13:22:30 2021 +0100

    Deduplicate resolve_revision_from across package loaders
    
    All package loaders but deposit had logic to compute some object
    from the new packageinfo, some other objects from the known artifacts,
    and compare them.
    
    This commit moves the comparison logic to the base class, and unifies
    the two computation interfaces, respectively as an extid() method
    on TPackageInfo and a method on the loader.
    
    This unified object for comparison is a byte string,
    which is internal to each loader for now, but a future commit
    will read and write it from/to the ExtID storage instead of
    computing it from the 'original_artifacts' present in
    revision metadata.

commit d764a783e772d639a34b513b3e7d2dad68d68b72
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 10:51:43 2021 +0100

    archive, cran: Replace 'artifact_identity' with extid to detect known packages
    
    We want to store these identifiers in the ExtID storage, which expects
    a (preferably short) bytearray; but the 'artifact_identity' was a
    list of (possibly long) strings and ints.
    
    While this commit does not write them to the ExtID storage yet,
    it makes these two loaders use them internally.
    
    Assuming no sha256 collision, this does not change their behavior
    when seen from the outside, with two exceptions:
    
    * the list of keys to use is now configured with a template string
    * configuring an unknown key now raises a KeyError instead of silently
      using a None value.
    
    But we never use this configuration setting, so in practice there is no
    change at all.

commit 3357e642e5676aa1ace50ecffe2a777f51b251f7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 17:47:44 2021 +0100

    nixguix: Split 'integrity' extraction out of resolve_revision_from
    
    We will need it independently in a future commit

commit 9dc175c2aef95b0e5a2a83a71ec63e2c1f075c7b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 09:32:37 2021 +0100

    npm, pypi: Split original_artifact parsing out of artifact_to_revision_id
    
    We will need it independently in a future commit

commit 93c9aa8e014c19b02887e6c05a671ad7430235dd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 16:18:40 2021 +0100

    debian: Split original_artifact parsing out of resolve_revision_from
    
    We will need it independently in a future commit

commit 9a4991b391e6b825f9dee3008a410650d434ca6d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 10:02:21 2021 +0100

    debian: Make resolve_revision_from use the sha256 of the .dsc
    
    Instead of the sha256 + name + ... of all the files of a package.
    
    This will be needed to transition to ExtID, as we can't reasonably write
    this large set in the ExtID storage; and the sha256 of the .dsc is good
    enough, as the .dsc contains hashes and names of other files.

commit 455e05837dc2fb59b4adcf01df27c6f7d64781dd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Mar 16 14:44:33 2021 +0100

    test_load_nixguix_one_common_artifact_from_other_loader: Test all log lines, not just the last one
    
    It's stricter and more readable.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/408/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/408/console

Harbormaster returned this revision to the author for changes because remote builds failed.Mar 19 2021, 2:32 PM
Harbormaster failed remote builds in B20017: Diff 18937!

Build is green

Patch application report for D5290 (id=18938)

Could not rebase; Attempt merge onto 132522e42e...

Auto-merging swh/loader/package/deposit/loader.py
Merge made by the 'recursive' strategy.
 swh/loader/package/archive/loader.py             | 63 ++++++++++--------
 swh/loader/package/archive/tests/test_archive.py | 32 +++++----
 swh/loader/package/cran/loader.py                | 28 ++------
 swh/loader/package/debian/loader.py              | 85 +++++++++++++-----------
 swh/loader/package/debian/tests/test_debian.py   | 67 +++++++++++++++++--
 swh/loader/package/deposit/loader.py             |  5 ++
 swh/loader/package/loader.py                     | 53 +++++++++++++--
 swh/loader/package/nixguix/loader.py             | 40 +++++------
 swh/loader/package/nixguix/tests/test_nixguix.py | 27 ++++----
 swh/loader/package/npm/loader.py                 | 46 +++++++------
 swh/loader/package/npm/tests/test_npm.py         | 74 +++++++--------------
 swh/loader/package/pypi/loader.py                | 47 ++++++-------
 swh/loader/package/pypi/tests/test_pypi.py       | 79 ++++++----------------
 swh/loader/package/tests/test_loader.py          | 61 +++++++++++++++--
 14 files changed, 392 insertions(+), 315 deletions(-)
Changes applied before test
commit bea9c87ba8d1e067d7a418773546a9534b0db6f4
Merge: 132522e bcc4cb2
Author: Jenkins user <jenkins@localhost>
Date:   Fri Mar 19 13:55:42 2021 +0000

    Merge branch 'diff-target' into HEAD

commit bcc4cb28ed01bb8d41f364be24dab128dcfc3f03
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 13:22:30 2021 +0100

    Deduplicate resolve_revision_from across package loaders
    
    All package loaders but deposit had logic to compute some object
    from the new packageinfo, some other objects from the known artifacts,
    and compare them.
    
    This commit moves the comparison logic to the base class, and unifies
    the two computation interfaces, respectively as an extid() method
    on TPackageInfo and a method on the loader.
    
    This unified object for comparison is a byte string,
    which is internal to each loader for now, but a future commit
    will read and write it from/to the ExtID storage instead of
    computing it from the 'original_artifacts' present in
    revision metadata.

commit d764a783e772d639a34b513b3e7d2dad68d68b72
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 10:51:43 2021 +0100

    archive, cran: Replace 'artifact_identity' with extid to detect known packages
    
    We want to store these identifiers in the ExtID storage, which expects
    a (preferably short) bytearray; but the 'artifact_identity' was a
    list of (possibly long) strings and ints.
    
    While this commit does not write them to the ExtID storage yet,
    it makes these two loaders use them internally.
    
    Assuming no sha256 collision, this does not change their behavior
    when seen from the outside, with two exceptions:
    
    * the list of keys to use is now configured with a template string
    * configuring an unknown key now raises a KeyError instead of silently
      using a None value.
    
    But we never use this configuration setting, so in practice there is no
    change at all.

commit 3357e642e5676aa1ace50ecffe2a777f51b251f7
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 17:47:44 2021 +0100

    nixguix: Split 'integrity' extraction out of resolve_revision_from
    
    We will need it independently in a future commit

commit 9dc175c2aef95b0e5a2a83a71ec63e2c1f075c7b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 09:32:37 2021 +0100

    npm, pypi: Split original_artifact parsing out of artifact_to_revision_id
    
    We will need it independently in a future commit

commit 93c9aa8e014c19b02887e6c05a671ad7430235dd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 16:18:40 2021 +0100

    debian: Split original_artifact parsing out of resolve_revision_from
    
    We will need it independently in a future commit

commit 9a4991b391e6b825f9dee3008a410650d434ca6d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 10:02:21 2021 +0100

    debian: Make resolve_revision_from use the sha256 of the .dsc
    
    Instead of the sha256 + name + ... of all the files of a package.
    
    This will be needed to transition to ExtID, as we can't reasonably write
    this large set in the ExtID storage; and the sha256 of the .dsc is good
    enough, as the .dsc contains hashes and names of other files.

commit 455e05837dc2fb59b4adcf01df27c6f7d64781dd
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Mar 16 14:44:33 2021 +0100

    test_load_nixguix_one_common_artifact_from_other_loader: Test all log lines, not just the last one
    
    It's stricter and more readable.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/409/ for more details.

Build has FAILED

Patch application report for D5290 (id=19001)

Could not rebase; Attempt merge onto 132522e42e...

Updating 132522e..2b41f01
Fast-forward
 swh/loader/package/archive/loader.py             | 63 ++++++++--------
 swh/loader/package/archive/tests/test_archive.py | 32 ++++-----
 swh/loader/package/cran/loader.py                | 28 ++------
 swh/loader/package/debian/loader.py              | 85 +++++++++++-----------
 swh/loader/package/debian/tests/test_debian.py   | 67 +++++++++++++++--
 swh/loader/package/deposit/loader.py             |  5 ++
 swh/loader/package/loader.py                     | 53 ++++++++++++--
 swh/loader/package/nixguix/loader.py             | 40 ++++-------
 swh/loader/package/nixguix/tests/test_nixguix.py | 27 +++----
 swh/loader/package/npm/loader.py                 | 46 ++++++------
 swh/loader/package/npm/tests/test_npm.py         | 74 ++++++-------------
 swh/loader/package/pypi/loader.py                | 55 +++++++-------
 swh/loader/package/pypi/tests/test_pypi.py       | 92 +++++++++---------------
 swh/loader/package/tests/test_loader.py          | 61 ++++++++++++++--
 14 files changed, 411 insertions(+), 317 deletions(-)
Changes applied before test
commit 2b41f01d3755ba74527f6e3405c88a84d73b7f8b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 13:22:30 2021 +0100

    Deduplicate resolve_revision_from across package loaders
    
    All package loaders but deposit had logic to compute some object
    from the new packageinfo, some other objects from the known artifacts,
    and compare them.
    
    This commit moves the comparison logic to the base class, and unifies
    the two computation interfaces, respectively as an extid() method
    on TPackageInfo and a method on the loader.
    
    This unified object for comparison is a byte string,
    which is internal to each loader for now, but a future commit
    will read and write it from/to the ExtID storage instead of
    computing it from the 'original_artifacts' present in
    revision metadata.

commit e61f507cdea82a01be93cb792abe486ad0e5a596
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 10:51:43 2021 +0100

    archive, cran: Replace 'artifact_identity' with extid to detect known packages
    
    We want to store these identifiers in the ExtID storage, which expects
    a (preferably short) bytearray; but the 'artifact_identity' was a
    list of (possibly long) strings and ints.
    
    While this commit does not write them to the ExtID storage yet,
    it makes these two loaders use them internally.
    
    Assuming no sha256 collision, this does not change their behavior
    when seen from the outside, with two exceptions:
    
    * the list of keys to use is now configured with a template string
    * configuring an unknown key now raises a KeyError instead of silently
      using a None value.
    
    But we never use this configuration setting, so in practice there is no
    change at all.

commit fd70fe0e475a4f5d339d2a69ce3eb99aad009d76
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 17:47:44 2021 +0100

    nixguix: Split 'integrity' extraction out of resolve_revision_from
    
    We will need it independently in a future commit

commit 6da306159137c309ea3aa987589415c4b8f81de3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 09:32:37 2021 +0100

    npm, pypi: Split original_artifact parsing out of artifact_to_revision_id
    
    We will need it independently in a future commit

commit 3190406239d231d62b8583bb1aeb74def1016a08
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 16:18:40 2021 +0100

    debian: Split original_artifact parsing out of resolve_revision_from
    
    We will need it independently in a future commit

commit 4960aa38b10b29d249b28b92e78e0308f482e551
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 10:02:21 2021 +0100

    debian: Make resolve_revision_from use the sha256 of the .dsc
    
    Instead of the sha256 + name + ... of all the files of a package.
    
    This will be needed to transition to ExtID, as we can't reasonably write
    this large set in the ExtID storage; and the sha256 of the .dsc is good
    enough, as the .dsc contains hashes and names of other files.

commit cf8fc9e9bd19b4f303b33ddeb73dc285d835bfcc
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Tue Mar 16 14:44:33 2021 +0100

    test_load_nixguix_one_common_artifact_from_other_loader: Test all log lines, not just the last one
    
    It's stricter and more readable.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/416/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/416/console

Build is green

Patch application report for D5290 (id=19028)

Could not rebase; Attempt merge onto 9a339e981a...

Updating 9a339e9..93e7795
Fast-forward
 swh/loader/package/archive/loader.py             | 63 ++++++++--------
 swh/loader/package/archive/tests/test_archive.py | 32 ++++-----
 swh/loader/package/cran/loader.py                | 28 ++------
 swh/loader/package/debian/loader.py              | 85 +++++++++++-----------
 swh/loader/package/debian/tests/test_debian.py   | 67 +++++++++++++++--
 swh/loader/package/deposit/loader.py             |  5 ++
 swh/loader/package/loader.py                     | 53 ++++++++++++--
 swh/loader/package/nixguix/loader.py             | 40 ++++-------
 swh/loader/package/nixguix/tests/test_nixguix.py | 14 +---
 swh/loader/package/npm/loader.py                 | 46 ++++++------
 swh/loader/package/npm/tests/test_npm.py         | 74 ++++++-------------
 swh/loader/package/pypi/loader.py                | 55 +++++++-------
 swh/loader/package/pypi/tests/test_pypi.py       | 92 +++++++++---------------
 swh/loader/package/tests/test_loader.py          | 61 ++++++++++++++--
 14 files changed, 400 insertions(+), 315 deletions(-)
Changes applied before test
commit 93e7795830af987904bc9a67c8c2837ddaef5955
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 13:22:30 2021 +0100

    Deduplicate resolve_revision_from across package loaders
    
    All package loaders but deposit had logic to compute some object
    from the new packageinfo, some other objects from the known artifacts,
    and compare them.
    
    This commit moves the comparison logic to the base class, and unifies
    the two computation interfaces, respectively as an extid() method
    on TPackageInfo and a method on the loader.
    
    This unified object for comparison is a byte string,
    which is internal to each loader for now, but a future commit
    will read and write it from/to the ExtID storage instead of
    computing it from the 'original_artifacts' present in
    revision metadata.

commit 4cfb64bb8875109e77ff3ba29cb9cf4fc76bba88
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 10:51:43 2021 +0100

    archive, cran: Replace 'artifact_identity' with extid to detect known packages
    
    We want to store these identifiers in the ExtID storage, which expects
    a (preferably short) bytearray; but the 'artifact_identity' was a
    list of (possibly long) strings and ints.
    
    While this commit does not write them to the ExtID storage yet,
    it makes these two loaders use them internally.
    
    Assuming no sha256 collision, this does not change their behavior
    when seen from the outside, with two exceptions:
    
    * the list of keys to use is now configured with a template string
    * configuring an unknown key now raises a KeyError instead of silently
      using a None value.
    
    But we never use this configuration setting, so in practice there is no
    change at all.

commit d565e5b307dbcf5763171fcce828d4305dccf5de
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 17:47:44 2021 +0100

    nixguix: Split 'integrity' extraction out of resolve_revision_from
    
    We will need it independently in a future commit

commit 749dd287432d116be5f4a69335ef35c244923ca6
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 09:32:37 2021 +0100

    npm, pypi: Split original_artifact parsing out of artifact_to_revision_id
    
    We will need it independently in a future commit

commit e7ae6364c06a4d1e1863e8c1e948cd3a82ca4286
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 16:18:40 2021 +0100

    debian: Split original_artifact parsing out of resolve_revision_from
    
    We will need it independently in a future commit

commit 268f83e943a439958840b4d8cdcff5c9eb16429e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 10:02:21 2021 +0100

    debian: Make resolve_revision_from use the sha256 of the .dsc
    
    Instead of the sha256 + name + ... of all the files of a package.
    
    This will be needed to transition to ExtID, as we can't reasonably write
    this large set in the ExtID storage; and the sha256 of the .dsc is good
    enough, as the .dsc contains hashes and names of other files.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/428/ for more details.

Build is green

Patch application report for D5290 (id=19032)

Could not rebase; Attempt merge onto 9a339e981a...

Updating 9a339e9..9430bfa
Fast-forward
 swh/loader/package/archive/loader.py             | 62 ++++++++--------
 swh/loader/package/archive/tests/test_archive.py | 32 ++++-----
 swh/loader/package/cran/loader.py                | 28 ++------
 swh/loader/package/debian/loader.py              | 85 +++++++++++-----------
 swh/loader/package/debian/tests/test_debian.py   | 67 +++++++++++++++--
 swh/loader/package/deposit/loader.py             |  5 ++
 swh/loader/package/loader.py                     | 53 ++++++++++++--
 swh/loader/package/nixguix/loader.py             | 40 ++++-------
 swh/loader/package/nixguix/tests/test_nixguix.py | 14 +---
 swh/loader/package/npm/loader.py                 | 46 ++++++------
 swh/loader/package/npm/tests/test_npm.py         | 74 ++++++-------------
 swh/loader/package/pypi/loader.py                | 55 +++++++-------
 swh/loader/package/pypi/tests/test_pypi.py       | 92 +++++++++---------------
 swh/loader/package/tests/test_loader.py          | 61 ++++++++++++++--
 14 files changed, 399 insertions(+), 315 deletions(-)
Changes applied before test
commit 9430bfab677d96ea55285e891be6f4fde660eb5d
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 13:22:30 2021 +0100

    Deduplicate resolve_revision_from across package loaders
    
    All package loaders but deposit had logic to compute some object
    from the new packageinfo, some other objects from the known artifacts,
    and compare them.
    
    This commit moves the comparison logic to the base class, and unifies
    the two computation interfaces, respectively as an extid() method
    on TPackageInfo and a method on the loader.
    
    This unified object for comparison is a byte string,
    which is internal to each loader for now, but a future commit
    will read and write it from/to the ExtID storage instead of
    computing it from the 'original_artifacts' present in
    revision metadata.

commit 0226f5c509bdc9d882cd310dcbd6dc0643bcb2ef
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 10:51:43 2021 +0100

    archive, cran: Replace 'artifact_identity' with extid to detect known packages
    
    We want to store these identifiers in the ExtID storage, which expects
    a (preferably short) bytearray; but the 'artifact_identity' was a
    list of (possibly long) strings and ints.
    
    While this commit does not write them to the ExtID storage yet,
    it makes these two loaders use them internally.
    
    Assuming no sha256 collision, this does not change their behavior
    when seen from the outside, with two exceptions:
    
    * the list of keys to use is now configured with a template string
    * configuring an unknown key now raises a KeyError instead of silently
      using a None value.
    
    But we never use this configuration setting, so in practice there is no
    change at all.

commit 589d97562923155d4e51de6367e8985bbfc92e5a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 17:47:44 2021 +0100

    nixguix: Split 'integrity' extraction out of resolve_revision_from
    
    We will need it independently in a future commit

commit dc197af60c60f061df3ac35d7e7f94b0a0262140
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 09:32:37 2021 +0100

    npm, pypi: Split original_artifact parsing out of artifact_to_revision_id
    
    We will need it independently in a future commit

commit e7ae6364c06a4d1e1863e8c1e948cd3a82ca4286
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 16:18:40 2021 +0100

    debian: Split original_artifact parsing out of resolve_revision_from
    
    We will need it independently in a future commit

commit 268f83e943a439958840b4d8cdcff5c9eb16429e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 10:02:21 2021 +0100

    debian: Make resolve_revision_from use the sha256 of the .dsc
    
    Instead of the sha256 + name + ... of all the files of a package.
    
    This will be needed to transition to ExtID, as we can't reasonably write
    this large set in the ExtID storage; and the sha256 of the .dsc is good
    enough, as the .dsc contains hashes and names of other files.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/432/ for more details.

Build is green

Patch application report for D5290 (id=19036)

Could not rebase; Attempt merge onto 9a339e981a...

Updating 9a339e9..c5d5de2
Fast-forward
 swh/loader/package/archive/loader.py             | 62 ++++++++--------
 swh/loader/package/archive/tests/test_archive.py | 32 ++++-----
 swh/loader/package/cran/loader.py                | 28 ++------
 swh/loader/package/debian/loader.py              | 85 +++++++++++-----------
 swh/loader/package/debian/tests/test_debian.py   | 67 +++++++++++++++--
 swh/loader/package/deposit/loader.py             |  5 ++
 swh/loader/package/loader.py                     | 53 ++++++++++++--
 swh/loader/package/nixguix/loader.py             | 40 ++++-------
 swh/loader/package/nixguix/tests/test_nixguix.py | 14 +---
 swh/loader/package/npm/loader.py                 | 46 ++++++------
 swh/loader/package/npm/tests/test_npm.py         | 74 ++++++-------------
 swh/loader/package/pypi/loader.py                | 55 +++++++-------
 swh/loader/package/pypi/tests/test_pypi.py       | 92 +++++++++---------------
 swh/loader/package/tests/test_loader.py          | 61 ++++++++++++++--
 14 files changed, 399 insertions(+), 315 deletions(-)
Changes applied before test
commit c5d5de207dc925154171cbfd7d6d5970b663de58
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 13:22:30 2021 +0100

    Deduplicate resolve_revision_from across package loaders
    
    All package loaders but deposit had logic to compute some object
    from the new packageinfo, some other objects from the known artifacts,
    and compare them.
    
    This commit moves the comparison logic to the base class, and unifies
    the two computation interfaces, respectively as an extid() method
    on TPackageInfo and a method on the loader.
    
    This unified object for comparison is a byte string,
    which is internal to each loader for now, but a future commit
    will read and write it from/to the ExtID storage instead of
    computing it from the 'original_artifacts' present in
    revision metadata.

commit 20a3c9c809c0ca4bacc892e3f983ac56a20f503b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 10:51:43 2021 +0100

    archive, cran: Replace 'artifact_identity' with extid to detect known packages
    
    We want to store these identifiers in the ExtID storage, which expects
    a (preferably short) bytearray; but the 'artifact_identity' was a
    list of (possibly long) strings and ints.
    
    While this commit does not write them to the ExtID storage yet,
    it makes these two loaders use them internally.
    
    Assuming no sha256 collision, this does not change their behavior
    when seen from the outside, with two exceptions:
    
    * the list of keys to use is now configured with a template string
    * configuring an unknown key now raises a KeyError instead of silently
      using a None value.
    
    But we never use this configuration setting, so in practice there is no
    change at all.

commit fdb41abd76b2d683c2fd44fcef0d267ee2788a25
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 17:47:44 2021 +0100

    nixguix: Split 'integrity' extraction out of resolve_revision_from
    
    We will need it independently in a future commit

commit f827d04cb538038cc38aceb7ee55c5465c444f52
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 09:32:37 2021 +0100

    npm, pypi: Split original_artifact parsing out of artifact_to_revision_id
    
    We will need it independently in a future commit

commit e7ae6364c06a4d1e1863e8c1e948cd3a82ca4286
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 16:18:40 2021 +0100

    debian: Split original_artifact parsing out of resolve_revision_from
    
    We will need it independently in a future commit

commit 268f83e943a439958840b4d8cdcff5c9eb16429e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 10:02:21 2021 +0100

    debian: Make resolve_revision_from use the sha256 of the .dsc
    
    Instead of the sha256 + name + ... of all the files of a package.
    
    This will be needed to transition to ExtID, as we can't reasonably write
    this large set in the ExtID storage; and the sha256 of the .dsc is good
    enough, as the .dsc contains hashes and names of other files.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/436/ for more details.

Shouldn't the extid() methods all return a tuple (extid_type, extid_value) rather than a plain bytes value? I can imagine a point where, for the same loader, we might want to change the extid_type, and the current implementation wouldn't be able to distinguish them.

(On the other hand, the likelihood of extid_values clashing for different extid_types is pretty darn low?)

Yes of course. I plan to do this in the near future

Not quite sure about the split between BasePackageInfo / BaseManifestPackageInfo (and I really don't like the new name).

I'd prefer just making the implementation of extid in the base class return None if the MANIFEST_FORMAT is None (the NotImplementedError feels pretty overkill, even more so when we're testing the extid implementation of most derived loaders).

Overall the changes look sound.

This revision is now accepted and ready to land.Mar 23 2021, 3:12 PM

great idea, done. please review again (I mostly just removed code)

Build is green

Patch application report for D5290 (id=19038)

Could not rebase; Attempt merge onto 9a339e981a...

Updating 9a339e9..eef74bb
Fast-forward
 swh/loader/package/archive/loader.py             | 58 ++++++++-------
 swh/loader/package/archive/tests/test_archive.py | 32 ++++-----
 swh/loader/package/cran/loader.py                | 24 ++-----
 swh/loader/package/debian/loader.py              | 85 +++++++++++-----------
 swh/loader/package/debian/tests/test_debian.py   | 67 +++++++++++++++--
 swh/loader/package/deposit/loader.py             |  5 ++
 swh/loader/package/loader.py                     | 47 ++++++++++--
 swh/loader/package/nixguix/loader.py             | 40 ++++-------
 swh/loader/package/nixguix/tests/test_nixguix.py | 14 +---
 swh/loader/package/npm/loader.py                 | 46 ++++++------
 swh/loader/package/npm/tests/test_npm.py         | 74 ++++++-------------
 swh/loader/package/pypi/loader.py                | 55 +++++++-------
 swh/loader/package/pypi/tests/test_pypi.py       | 92 +++++++++---------------
 swh/loader/package/tests/test_loader.py          | 57 +++++++++++++--
 14 files changed, 385 insertions(+), 311 deletions(-)
Changes applied before test
commit eef74bbfacc0f00f227763e39059636b0643e3f5
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 13:22:30 2021 +0100

    Deduplicate resolve_revision_from across package loaders
    
    All package loaders but deposit had logic to compute some object
    from the new packageinfo, some other objects from the known artifacts,
    and compare them.
    
    This commit moves the comparison logic to the base class, and unifies
    the two computation interfaces, respectively as an extid() method
    on TPackageInfo and a method on the loader.
    
    This unified object for comparison is a byte string,
    which is internal to each loader for now, but a future commit
    will read and write it from/to the ExtID storage instead of
    computing it from the 'original_artifacts' present in
    revision metadata.

commit 20a3c9c809c0ca4bacc892e3f983ac56a20f503b
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Mar 19 10:51:43 2021 +0100

    archive, cran: Replace 'artifact_identity' with extid to detect known packages
    
    We want to store these identifiers in the ExtID storage, which expects
    a (preferably short) bytearray; but the 'artifact_identity' was a
    list of (possibly long) strings and ints.
    
    While this commit does not write them to the ExtID storage yet,
    it makes these two loaders use them internally.
    
    Assuming no sha256 collision, this does not change their behavior
    when seen from the outside, with two exceptions:
    
    * the list of keys to use is now configured with a template string
    * configuring an unknown key now raises a KeyError instead of silently
      using a None value.
    
    But we never use this configuration setting, so in practice there is no
    change at all.

commit fdb41abd76b2d683c2fd44fcef0d267ee2788a25
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 17:47:44 2021 +0100

    nixguix: Split 'integrity' extraction out of resolve_revision_from
    
    We will need it independently in a future commit

commit f827d04cb538038cc38aceb7ee55c5465c444f52
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 09:32:37 2021 +0100

    npm, pypi: Split original_artifact parsing out of artifact_to_revision_id
    
    We will need it independently in a future commit

commit e7ae6364c06a4d1e1863e8c1e948cd3a82ca4286
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Mar 18 16:18:40 2021 +0100

    debian: Split original_artifact parsing out of resolve_revision_from
    
    We will need it independently in a future commit

commit 268f83e943a439958840b4d8cdcff5c9eb16429e
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Mar 17 10:02:21 2021 +0100

    debian: Make resolve_revision_from use the sha256 of the .dsc
    
    Instead of the sha256 + name + ... of all the files of a package.
    
    This will be needed to transition to ExtID, as we can't reasonably write
    this large set in the ExtID storage; and the sha256 of the .dsc is good
    enough, as the .dsc contains hashes and names of other files.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/437/ for more details.

This revision is now accepted and ready to land.Mar 23 2021, 4:24 PM