Page MenuHomeSoftware Heritage

{Content|Directory}Loader: Adapt support for checksums
ClosedPublic

Authored by ardumont on Sep 30 2022, 3:16 PM.

Details

Summary

This moves away the loaders from the 'integrity' fields which is now dealt with lister
side. Instead the lister is send a dictionary of hex hashes.

This also improves the loader to check those checksums when retrieving the
artifact (content or tarball). Thanks to a bump in the swh.model version, this is now
able to deal with sha512 checksums checks as well.

This echoes with the ongoing work for the package loaders [1].

Related to T3781
Depends on D8584

[1] D8595

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D8587 (id=31000)

Could not rebase; Attempt merge onto f774aba59e...

Updating f774aba..1672fed
Fast-forward
 swh/loader/core/loader.py                          | 212 +++++++++++++++++----
 .../https_example.org/archives_dummy-hello.tar.gz  | Bin 0 -> 221 bytes
 swh/loader/core/tests/test_loader.py               | 130 ++++++++++++-
 3 files changed, 297 insertions(+), 45 deletions(-)
 create mode 100644 swh/loader/core/tests/data/https_example.org/archives_dummy-hello.tar.gz
Changes applied before test
commit 1672fed607576399dcd1aec67a452606a6427fe6
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 15:10:49 2022 +0200

    ContentLoader: Fix integrity check
    
    Related to T3781

commit 497f74f3225e4ccf11adce0d6a2bb50b2a471fab
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to allow tarball ingestion as Directory
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/927/ for more details.

vlorentz added inline comments.
swh/loader/core/loader.py
717

why is self.expected_checksum hexadecimal bytes?

swh/loader/core/loader.py
717

good question...

That's coming from the instruction line 677 which was the only way i saw that matched types:

self.expected_checksum: bytes = base64.decodebytes(checksum_value_b64.encode())
swh/loader/core/loader.py
717

I must have made a mistake during my hash computation switch when i changed to a simpler content file (than the original asdf one).
Currently fixing it.

ardumont retitled this revision from ContentLoader: Fix integrity check to ContentLoader: Improve integrity check support.Sep 30 2022, 5:04 PM
ardumont edited the summary of this revision. (Show Details)
  • Build content out of MultiHash
  • Add support for sha512 hash
  • Fix input test checksum

Build is green

Patch application report for D8587 (id=31015)

Could not rebase; Attempt merge onto f774aba59e...

Updating f774aba..3083a3b
Fast-forward
 requirements-swh.txt                               |   2 +-
 swh/loader/core/loader.py                          | 221 +++++++++++++++++----
 .../https_example.org/archives_dummy-hello.tar.gz  | Bin 0 -> 221 bytes
 swh/loader/core/tests/test_loader.py               | 182 +++++++++++++++--
 4 files changed, 350 insertions(+), 55 deletions(-)
 create mode 100644 swh/loader/core/tests/data/https_example.org/archives_dummy-hello.tar.gz
Changes applied before test
commit 3083a3b23fa118158a3e1bf54087492086603e63
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 15:10:49 2022 +0200

    ContentLoader: Improve integrity check support
    
    This can deal with sha512 with the new swh.model version.
    
    Related to T3781

commit 497f74f3225e4ccf11adce0d6a2bb50b2a471fab
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to allow tarball ingestion as Directory
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/932/ for more details.

Adapt both Content and Directory loaders to deal with checksums dict instead of
integrity.

ardumont retitled this revision from ContentLoader: Improve integrity check support to {Content|Directory}Loader: Adapt support for checksums.Oct 2 2022, 11:36 AM
ardumont edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D8587 (id=31027)

Could not rebase; Attempt merge onto f774aba59e...

Updating f774aba..da7cc37
Fast-forward
 requirements-swh.txt                               |   2 +-
 swh/loader/core/loader.py                          | 234 +++++++++++++++++----
 .../https_example.org/archives_dummy-hello.tar.gz  | Bin 0 -> 221 bytes
 swh/loader/core/tests/test_loader.py               | 194 +++++++++++++++--
 4 files changed, 372 insertions(+), 58 deletions(-)
 create mode 100644 swh/loader/core/tests/data/https_example.org/archives_dummy-hello.tar.gz
Changes applied before test
commit da7cc372fe5a9cdf139de7cf51a2c6cb6dc8b8ed
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 15:10:49 2022 +0200

    {Content|Directory}Loader: Adapt support for checksums
    
    This moves away the loaders from the 'integrity' fields which is now dealt with lister
    side. Instead the lister is send a dictionary of hex hashes.
    
    This also improves the loader to check those checksums when retrieving the
    artifact (content or tarball). Thanks to a bump in the swh.model version, this is now
    able to deal with sha512 checksums checks as well.
    
    Related to T3781

commit 497f74f3225e4ccf11adce0d6a2bb50b2a471fab
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to allow tarball ingestion as Directory
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/937/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/937/console

Build has FAILED

Build error unrelated to this diff [1] locally ok with my old tox generation [2]
Maybe a new upstream version bump on importlib-metadata?

[1]

...
importlib-metadata==5.0.0
...
11:37:49    File "/var/lib/jenkins/workspace/DLDBASE/tests-on-diff/.tox/py3/lib/python3.7/site-packages/kombu/utils/compat.py", line 82, in entrypoints
11:37:49      for ep in importlib_metadata.entry_points().get(namespace, [])
11:37:49  AttributeError: 'EntryPoints' object has no attribute 'get'

[2]

$ tox
...
importlib-metadata==4.12.0
...
================================================================ 272 passed, 122 warnings in 62.62s (0:01:02) ================================================================

Build has FAILED

Patch application report for D8587 (id=31027)

Could not rebase; Attempt merge onto f774aba59e...

Updating f774aba..da7cc37
Fast-forward
 requirements-swh.txt                               |   2 +-
 swh/loader/core/loader.py                          | 234 +++++++++++++++++----
 .../https_example.org/archives_dummy-hello.tar.gz  | Bin 0 -> 221 bytes
 swh/loader/core/tests/test_loader.py               | 194 +++++++++++++++--
 4 files changed, 372 insertions(+), 58 deletions(-)
 create mode 100644 swh/loader/core/tests/data/https_example.org/archives_dummy-hello.tar.gz
Changes applied before test
commit da7cc372fe5a9cdf139de7cf51a2c6cb6dc8b8ed
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 15:10:49 2022 +0200

    {Content|Directory}Loader: Adapt support for checksums
    
    This moves away the loaders from the 'integrity' fields which is now dealt with lister
    side. Instead the lister is send a dictionary of hex hashes.
    
    This also improves the loader to check those checksums when retrieving the
    artifact (content or tarball). Thanks to a bump in the swh.model version, this is now
    able to deal with sha512 checksums checks as well.
    
    Related to T3781

commit 497f74f3225e4ccf11adce0d6a2bb50b2a471fab
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 11:54:13 2022 +0200

    Add Directory Loader to allow tarball ingestion as Directory
    
    In some marginal listing cases (Nix or Guix for now), we can receive raw tarball to
    ingest. This commit adds a loader to ingest those. The output of the ingestion is a
    snapshot with 1 branch, one HEAD branch targetting the ingested directory (contained
    within the tarball).
    
    This expects to receive a mandatory 'integrity' field. It is used to check the tarball
    received out of the origin.
    
    This can also optionally receive a list of mirror urls in case the main origin url is no
    longer available. Those mirror urls are solely used as fallback to retrieve the tarball.
    
    Related to T3781

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/938/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/938/console

Maybe a new upstream version bump on importlib-metadata?

[1]

...
importlib-metadata==5.0.0

Definitely something about that release.
I reproduced it in the docker dev environment.
Adding a workaround in requirements.txt (swh.{lister|loader}: importlib_metadata!=5.0.0).
Those containers are then able to start ^.

anlambert added inline comments.
swh/loader/core/loader.py
740–741

I think you could reuse the download function (whose documentation is incorrect, need to fix this) instead of using get_url_body. This way you do not have to reimplement checking of hash values.

740–741

set(checksum.keys()) is more readable imho

swh/loader/core/loader.py
740–741

I tried that but iirc that failed (i don't recall why).
As we are downloading raw file here (and not tarball), i thought this was related.

Will give it another go (as i really don't remember what the issue was ¯\_(ツ)_/¯).

740–741

sure (idk why i did it that way, must have been more convoluted at some point).

This revision is now accepted and ready to land.Oct 3 2022, 2:08 PM
swh/loader/core/loader.py
740–741

For the record, this seems to work:

diff --git a/swh/loader/core/loader.py b/swh/loader/core/loader.py
index b55a85f..9af55cf 100644
--- a/swh/loader/core/loader.py
+++ b/swh/loader/core/loader.py
@@ -716,7 +716,6 @@ class ContentLoader(NodeLoader):
 
     def fetch_data(self) -> bool:
         """Retrieve the content file as a Content Object"""
-        data: Optional[bytes] = None
         for url in self.mirror_urls:
             url_ = urlparse(url)
             self.log.debug(
@@ -727,36 +726,19 @@ class ContentLoader(NodeLoader):
                 url_.path,
             )
             try:
-                data = get_url_body(url)
+                with tempfile.TemporaryDirectory() as tmpdir:
+                    file_path, _ = download(url, dest=tmpdir, hashes=self.checksums)
+                    with open(file_path, "rb") as file:
+                        self.content = Content.from_data(file.read())
                 # otherwise continue
-            except NotFound:
-                self.log.debug("Not found %s, continue on next mirror url if any", url)
-                continue
-
-            checksum_algos = {c for c in self.checksums.keys()}
-            content_d = MultiHash.from_data(
-                data, hash_names=DEFAULT_ALGORITHMS | checksum_algos
-            ).digest()
-
-            # We now need to determine if we found a matching content (matching the
-            # checksums provided). As an implementation detail, we must drop the
-            # checksums that are unsupported by the Content Model (e.g. sha512)
-            found: bool = False
-            for checksum_algo in checksum_algos:
-                if checksum_algo not in DEFAULT_ALGORITHMS:
-                    # We must drop this unssuported checksum algo from the Content model
-                    actual_checksum = content_d.pop(checksum_algo)
-                else:
-                    actual_checksum = content_d[checksum_algo]
-
-                # and check whether we found a correct content matching the checksum
-                found = actual_checksum == hash_to_bytes(self.checksums[checksum_algo])
-
-            if found:
-                # We have a match, we have our content to ingest
-                content_d["data"] = data
-                content_d["length"] = len(data)
-                self.content = Content.from_dict(content_d)
+            except HTTPError as http_error:
+                if http_error.response.status_code == 404:
+                    self.log.debug(
+                        "Not found %s, continue on next mirror url if any", url
+                    )
+                    continue
+                raise
+            else:
                 # we are done, no more data to fetch
                 return False
swh/loader/core/loader.py
740–741

Awesome, i was rebasing my stuff and getting back to it.
You gain me some time, thx.

ardumont marked an inline comment as done.

Adapt according to latest review

Build is green

Patch application report for D8587 (id=31058)

Rebasing onto dbf7f3dca0...

Current branch diff-target is up to date.
Changes applied before test
commit 39c33a66c27c030c7ae3cfba4a393c2ced468fbc
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Fri Sep 30 15:10:49 2022 +0200

    {Content|Directory}Loader: Adapt support for checksums
    
    This adapts the content/directory loader implementations to use directly a checksums
    dict which is now sent by the listers.
    
    This improves the loader to check those checksums when retrieving the artifact (content
    or tarball). Thanks to a bump in the swh.model version, this is now able to deal with
    sha512 checksums checks as well.
    
    This also aligns with the current package loaders which now are also checking the
    integrity of the tarballs they ingest.
    
    Related to T3781

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/943/ for more details.