Make package loaders write releases instead of revisions
Closed, Public

Authored by vlorentz on Nov 5 2021, 1:27 PM.

Details

Summary

The artifacts they load match the semantics of a Release, but we have used Revisions
so far because of a technical detail (we needed the 'metadata' field of Revision,
which Release lacks) that is no longer relevant, thanks to the metadata storage.

Packages that were loaded by previous versions of the package loader (as revs)
will be converted to releases. In order to avoid fetching them from the origin,
the loader will look for an existing extid pointing to a revision (like it used
to), fetch that revision, extract some fields (directory id, author, date, ...)
and build a new release using this information.
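
The lookup described above could be sketched roughly as follows. This is only an illustration using plain dataclasses and in-memory dicts: `ExtID`, `Revision`, `Release`, `resolve_release` and their fields are hypothetical stand-ins, not the actual swh.model or swh.storage API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Revision:  # hypothetical stand-in for a stored revision
    id: str
    directory: str
    author: str
    date: str


@dataclass(frozen=True)
class Release:  # hypothetical stand-in for a release
    name: str
    target_directory: str
    author: str
    date: str


@dataclass(frozen=True)
class ExtID:  # hypothetical stand-in for an extid entry
    target_type: str  # "revision" (written by older loaders) or "release"
    target_id: str


def resolve_release(
    extid_key: str,
    version: str,
    extids: Dict[str, ExtID],
    revisions: Dict[str, Revision],
    releases: Dict[str, Release],
) -> Optional[Release]:
    """Return a release for an already-loaded package, or None if the
    package has to be fetched from the origin."""
    extid = extids.get(extid_key)
    if extid is None:
        return None  # never loaded before: fetch from the origin
    if extid.target_type == "release":
        return releases.get(extid.target_id)
    # Legacy case: the extid points to a revision written by an older loader.
    rev = revisions.get(extid.target_id)
    if rev is None:
        return None
    # Reuse the fields of the old revision instead of downloading again.
    return Release(
        name=version,
        target_directory=rev.directory,
        author=rev.author,
        date=rev.date,
    )


old_rev = Revision(id="rev1", directory="dir1", author="Jane", date="2021-11-05")
converted = resolve_release(
    "sha256:abc",
    "1.0",
    extids={"sha256:abc": ExtID("revision", "rev1")},
    revisions={"rev1": old_rev},
    releases={},
)
```

In this sketch, a missing extid or a missing revision simply falls back to the normal path of loading from the origin.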

This commit is unfortunately very large because of all changes in tests, mostly
just new hashes and renaming 'revision' to 'release' (and various abbreviations
and capitalizations).

The only meaningful changes are in:

  • swh/loader/package/deposit/loader.py
  • swh/loader/package/tests/test_loader.py
  • swh/loader/package/loader.py

To keep this commit as short as possible, I did not yet change individual loaders
to create releases: they still create revisions, which the base loader converts
to releases. This conversion is implemented in rev2rel; the next commit will
refactor the loaders to remove this conversion layer.
Note that rev2rel drops the committer/committer_date, which were always equal to
the author/date anyway, except in the deposit and opam loaders.
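
As a rough illustration of what such a conversion layer does, here is a minimal sketch. Only the name rev2rel comes from this revision; the signature, the dataclasses and their fields below are assumptions made for the example and do not reproduce the real swh.model classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Revision:  # simplified, hypothetical model
    message: bytes
    directory: bytes
    author: str
    date: Optional[str]
    committer: str
    committer_date: Optional[str]


@dataclass(frozen=True)
class Release:  # simplified, hypothetical model
    name: bytes
    message: bytes
    target: bytes
    target_type: str
    author: str
    date: Optional[str]


def rev2rel(rev: Revision, version: str) -> Release:
    """Build a release that points at the revision's directory.

    committer and committer_date have no counterpart on the release,
    so they are dropped.
    """
    return Release(
        name=version.encode(),
        message=rev.message,
        target=rev.directory,
        target_type="directory",
        author=rev.author,
        date=rev.date,
    )


rev = Revision(
    message=b"Synthetic revision message",
    directory=b"\x01" * 20,
    author="Jane",
    date="2021-11-05",
    committer="Jane",
    committer_date="2021-11-05",
)
rel = rev2rel(rev, "1.0")
```

When committer and author differ, as the summary says can happen for the deposit and opam loaders, the committer information is lost in this conversion.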

Depends on D6600, D6606, D6607, D6613

Event Timeline

Build is green

Patch application report for D6616 (id=24019)

Could not rebase; Attempt merge onto 5063082e7d...

Updating 5063082..7d7c168
Fast-forward
 docs/package-loader-tutorial.rst                 |  25 ++
 swh/loader/package/archive/tests/test_archive.py |  40 +--
 swh/loader/package/cran/tests/test_cran.py       |  37 ++-
 swh/loader/package/debian/tests/test_debian.py   | 106 ++++---
 swh/loader/package/deposit/loader.py             |  25 +-
 swh/loader/package/deposit/tests/test_deposit.py | 134 ++++++---
 swh/loader/package/loader.py                     | 237 +++++++++++----
 swh/loader/package/nixguix/tests/test_nixguix.py |  61 ++--
 swh/loader/package/npm/tests/test_npm.py         | 145 +++++-----
 swh/loader/package/opam/loader.py                |  67 +++--
 swh/loader/package/opam/tests/test_opam.py       | 197 +++++++++++--
 swh/loader/package/pypi/tests/test_pypi.py       | 354 +++++++++--------------
 swh/loader/package/tests/test_loader.py          | 261 +++++++++++++----
 swh/loader/package/tests/test_loader_metadata.py |  13 +-
 swh/loader/tests/__init__.py                     |  18 +-
 swh/loader/tests/test_init.py                    |   4 +-
 16 files changed, 1084 insertions(+), 640 deletions(-)
Changes applied before test
commit 7d7c16822e36eea06f4185521029dbbfd73b5ddb
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Nov 5 13:18:01 2021 +0100

    Make package loaders write releases instead of revisions
    
    The artifacts they load match the semantics of a Release, but we used Revisions
    so far because of technical details (we needed the 'metadata' field of Revision
    that Release lacks) that is no longer relevant (thanks to the metadata storage).
    
    Packages that were loaded by previous versions of the package loader (as revs)
    will be converted to releases. In order to avoid fetching them from the origin,
    the loader will look for an existing extid pointing to a revision (like it used
    to), fetch that revision, extract some fields (directory id, author, date, ...)
    and build a new release using this information.
    
    This commit is unfortunately very large because of all changes in tests, mostly
    just new hashes and renaming 'revision' to 'release' (and various abbreviations
    and capitalizations).
    
    The only meaningful changes are in swh/loader/package/tests/test_loader.py and
    swh/loader/package/loader.py.
    
    To keep this commit as short as possible, I did not yet change individual loaders
    to create releases: they still create revisions, but are converted by the base
    loader. The next commit will refactor them to remove this conversion layer.

commit c0a98a5c4cfe4beac5c0b03bf61459d244b6f132
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 4 17:58:37 2021 +0100

    tests: Remove duplicate checks
    
    All the '*_missing' tests are already done automatically by check_snapshot
    (it recursively checks all objects are present in the storage).

commit 2311ad9b365e23d1a3c532aa005e32db46bc06f3
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 4 17:53:20 2021 +0100

    tests: Hide utilities from stack traces
    
    They clutter the test output because pytest prints the whole code
    of the function raising the AssertionError.
    
    With this magic variable, the error is shown as if it was raised
    directly in the caller's body.

commit 551c55ff04774b2c4e50e54f701c514064607f6a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 4 15:49:45 2021 +0100

    package loaders: Make test failures more helpful
    
    Some tests did the following:
    
    1. build a snapshot
    2. get the snapshot from the storage
    3. compare it with the expected snapshot
    4. get the origin visit from the storage and check it
    
    If the loader built a wrong snapshot, the test fails at step 2,
    and the only information displayed is that the expected snapshot id
    does not exist, which is very unhelpful.
    
    Instead, I reordered them as: 1, 4, 2, 3. This way, if a wrong
    snapshot is built by the loader, it is detected when comparing
    the visit, and pytest shows the two hashes.
    Then, the test can be modified to use the hash that is actually
    generated to show the actual snapshot.
    
    This is consistent with what was already done in the pypi loader.
    
    Additionally, I made the following changes:
    
    1. always check stats last (because a difference in numbers is
       hardly actionable without testing other objects)
    2. add a few more snapshot id checks in visits
    3. deduplicated a hardcoded snapshot id.

commit 89a0bfee48ca1663faddaa08b8fb3f163d5cfc6f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Oct 21 14:35:47 2021 +0200

    deposit: Remove 'parent' deposit
    
    The parent is computed by the deposit as the revision of the latest deposit
    in the same origin before the current one.
    Therefore, it is redundant, as it can be recomputed from metadata
    + revision date.
    
    This is a preliminary change needed to make package loaders produce
    releases instead of revisions, as releases don't have parent relationships

commit aeffe01a2b3918262e4bd715ea52c7d7da27807c
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 4 11:45:36 2021 +0100

    opam: Write package definitions to the extrinsic metadata storage

commit 18bbbae719fc9d165d5b543e54d449f5befc083a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Nov 4 11:43:53 2021 +0100

    Add missing documentation for `get_metadata_authority`.

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/600/ for more details.
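
Two of the test changes listed in the commits above (2311ad9 and 551c55f) can be illustrated together. The "magic variable" is presumably pytest's `__tracebackhide__` (the commit message does not name it), and the reordering puts the visit check before the snapshot lookup. The helper name `check_visit`, the dict-based "storage" and the ids are hypothetical and only serve the example.

```python
from typing import Dict, Optional


def check_visit(
    visits: Dict[str, Dict[str, Optional[str]]],
    url: str,
    status: str,
    snapshot: Optional[str],
) -> None:
    """Check the status and snapshot id of the visit recorded for `url`."""
    # pytest skips frames that define this local when rendering tracebacks,
    # so a failure is reported at the call site in the test body.
    __tracebackhide__ = True
    visit = visits[url]
    assert visit["status"] == status
    # On mismatch, pytest prints both ids: the expected and the generated one.
    assert visit["snapshot"] == snapshot


def test_load() -> None:
    url = "https://example.org/pkg"
    expected_snapshot_id = "aaaa"
    # Step 1: run the loader (simulated here with in-memory dicts).
    visits = {url: {"status": "full", "snapshot": "aaaa"}}
    snapshots = {"aaaa": {"branches": {"HEAD": "releases/1.0"}}}
    # Step 4 comes first: comparing the visit exposes the generated id.
    check_visit(visits, url, "full", expected_snapshot_id)
    # Steps 2 and 3: fetch the snapshot and compare its content.
    snapshot = snapshots[expected_snapshot_id]
    assert snapshot["branches"] == {"HEAD": "releases/1.0"}
```

If the loader produced a different snapshot, the failure would come from the id comparison in `check_visit` rather than from a lookup of an id that does not exist.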

vlorentz edited the summary of this revision.
ardumont added inline comments.
swh/loader/package/loader.py:859

or sthg?

swh/loader/package/pypi/tests/test_pypi.py:337

this looks weird.
Should be one of the other branch name instead, ain't it?

ardumont added 1 blocking reviewer(s): Reviewers.
vlorentz marked an inline comment as done.

fix bad copy-paste in test example

swh/loader/package/loader.py:859

later; this needs a few more changes.

swh/loader/package/pypi/tests/test_pypi.py:337

my bad

swh/loader/package/loader.py:859

sure ;)

swh/loader/package/pypi/tests/test_pypi.py:337

clearer ;)
Thanks.

Build has FAILED

Patch application report for D6616 (id=24029)

Could not rebase; Attempt merge onto 5063082e7d...

Updating 5063082..89417bb
Fast-forward
 docs/package-loader-tutorial.rst                 |  25 ++
 swh/loader/package/archive/tests/test_archive.py |  40 +--
 swh/loader/package/cran/tests/test_cran.py       |  37 ++-
 swh/loader/package/debian/tests/test_debian.py   | 106 ++++---
 swh/loader/package/deposit/loader.py             |  25 +-
 swh/loader/package/deposit/tests/test_deposit.py | 134 ++++++---
 swh/loader/package/loader.py                     | 237 +++++++++++----
 swh/loader/package/nixguix/tests/test_nixguix.py |  61 ++--
 swh/loader/package/npm/tests/test_npm.py         | 145 +++++-----
 swh/loader/package/opam/loader.py                |  67 +++--
 swh/loader/package/opam/tests/test_opam.py       | 197 +++++++++++--
 swh/loader/package/pypi/tests/test_pypi.py       | 353 +++++++++--------------
 swh/loader/package/tests/test_loader.py          | 261 +++++++++++++----
 swh/loader/package/tests/test_loader_metadata.py |  13 +-
 swh/loader/tests/__init__.py                     |  18 +-
 swh/loader/tests/test_init.py                    |   4 +-
 16 files changed, 1083 insertions(+), 640 deletions(-)
Changes applied before test
commit 89417bb072aade6b68402dbc10eaee61d98e9ae4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Nov 5 13:18:01 2021 +0100

    Make package loaders write releases instead of revisions
    
    The artifacts they load match the semantics of a Release, but we used Revisions
    so far because of technical details (we needed the 'metadata' field of Revision
    that Release lacks) that is no longer relevant (thanks to the metadata storage).
    
    Packages that were loaded by previous versions of the package loader (as revs)
    will be converted to releases. In order to avoid fetching them from the origin,
    the loader will look for an existing extid pointing to a revision (like it used
    to), fetch that revision, extract some fields (directory id, author, date, ...)
    and build a new release using this information.
    
    This commit is unfortunately very large because of all changes in tests, mostly
    just new hashes and renaming 'revision' to 'release' (and various abbreviations
    and capitalizations).
    
    The only meaningful changes are in swh/loader/package/tests/test_loader.py and
    swh/loader/package/loader.py.
    
    To keep this commit as short as possible, I did not yet change individual loaders
    to create releases: they still create revisions, but are converted by the base
    loader. The next commit will refactor them to remove this conversion layer.

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/602/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/602/console

Build is green

Patch application report for D6616 (id=24029)

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/603/ for more details.

This revision was not accepted when it landed; it landed in state Needs Review. Nov 10 2021, 2:21 PM
This revision was automatically updated to reflect the committed changes.