Conda: Anaconda packages archive loader
ClosedPublic

Authored by franckbret on Sep 28 2022, 4:32 PM.

Details

Summary

For each origin, it takes advantage of the 'artifacts' data sent through
'extra_loader_arguments' by the conda lister, which provides versions,
archive URL, checksum, etc.
The author is extracted from intrinsic metadata.

Related T4579

Diff Detail

Event Timeline

Build has FAILED

Patch application report for D8566 (id=30899)

Rebasing onto 798f749e66...

Current branch diff-target is up to date.
Changes applied before test
commit f3aa7802b6c99025ab26358a97e849ca4f5aaa6f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/902/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/902/console

Harbormaster returned this revision to the author for changes because remote builds failed. Sep 28 2022, 4:36 PM
Harbormaster failed remote builds in B31872: Diff 30899!

Fix test_task, 'artifacts' was missing from initialization

Build is green

Patch application report for D8566 (id=30901)

Rebasing onto 798f749e66...

Current branch diff-target is up to date.
Changes applied before test
commit 3e6bb413be3e9ebd34f9aa8949e45b76039557de
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/903/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/loader/package/conda/loader.py
143–147

that's simpler IMO

(Also, it will use metadata["about"]["summary"] if metadata["summary"] is empty. I don't know if it matters for any package, but there's no harm in doing that.)
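
The fallback described in this comment could look roughly like the sketch below. It is only an illustration: the helper name `extract_summary` is hypothetical and not taken from the diff, and the only keys assumed are the two quoted above, `metadata["summary"]` and `metadata["about"]["summary"]`.

```python
from typing import Any, Dict


def extract_summary(metadata: Dict[str, Any]) -> str:
    """Return the top-level summary, falling back to about.summary.

    Hypothetical helper: an empty or missing top-level "summary" falls
    through to the nested value, and an empty string is returned when
    neither is available.
    """
    about = metadata.get("about")
    # Only look inside "about" when it is actually a mapping.
    nested = about.get("summary") if isinstance(about, dict) else None
    return metadata.get("summary") or nested or ""
```

Chaining with `or` treats an empty string the same as a missing key, which is the behaviour the comment asks for.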

150–154

Simpler.

Also, you should add a check that maintainers is indeed a list, or we will silently end up with the first character as author.
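
A minimal sketch of such a check, assuming the maintainers value comes straight out of parsed metadata and may therefore be of any type. The helper name `first_maintainer` is hypothetical. Indexing a string such as `"Jane"` with `[0]` yields `"J"`, which is the silent failure described above, so the sketch rejects anything that is not a non-empty list whose first entry is a non-empty string.

```python
from typing import Any, Optional


def first_maintainer(maintainers: Any) -> Optional[bytes]:
    """Return the first maintainer as bytes, or None if unusable.

    Hypothetical helper: rejects non-list values (a bare string would
    otherwise yield its first character) and non-string first entries.
    """
    if not isinstance(maintainers, list) or not maintainers:
        return None
    first = maintainers[0]
    if not isinstance(first, str) or not first:
        return None
    return first.encode()
```

A caller could pass a non-None result on to build the author and fall back to an empty author otherwise.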

This revision now requires changes to proceed. Sep 30 2022, 11:39 AM
franckbret marked 2 inline comments as done.

Artifacts are now a list

Various code simplifications after review

It's a follow-up to the lister evolution in D8588

Build is green

Patch application report for D8566 (id=31011)

Rebasing onto f774aba59e...

First, rewinding head to replay your work on top of it...
Applying: Conda: Anaconda packages archive loader
Changes applied before test
commit ec63af47373bec4b7ee92d8b27a97cb5cc2db096
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/929/ for more details.

anlambert added a subscriber: anlambert.

LGTM, two small comments to handle before I can accept it.

swh/loader/package/conda/loader.py
55–61

You should return the first metadata file content you find here:

meta_json_path = dir_path / "info" / "about.json"
meta_yml_path = dir_path / "info" / "recipe" / "meta.yaml"
if meta_json_path.exists():
    metadata = json.loads(meta_json_path.read_text())
elif meta_yml_path.exists():
    metadata = yaml.safe_load(meta_yml_path.read_text())
157

If the description is empty, we do not need the two line breaks; I'd rather do:

message = (
    f"Synthetic release for Conda source package {p_info.name} "
    f"version {p_info.version}"
)

if description:
    message += f"\n\n{description}"
This revision now requires changes to proceed. Sep 30 2022, 4:26 PM

shorter code and empty description handling

Build is green

Patch application report for D8566 (id=31014)

Rebasing onto f774aba59e...

First, rewinding head to replay your work on top of it...
Applying: Conda: Anaconda packages archive loader
Changes applied before test
commit 8988f68e225c58ee9c88527ed2f94bd69244022f
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/931/ for more details.

Build is green

Patch application report for D8566 (id=31018)

Rebasing onto f774aba59e...

First, rewinding head to replay your work on top of it...
Applying: Conda: Anaconda packages archive loader
Changes applied before test
commit fb68f497dbbb50578c96e9cec6a7b18002f429ac
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/934/ for more details.

Build is green

Patch application report for D8566 (id=31019)

Rebasing onto f774aba59e...

First, rewinding head to replay your work on top of it...
Applying: Conda: Anaconda packages archive loader
Changes applied before test
commit 3d9585a0fa0a99dbef262ae67799d5c831af3c0a
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/935/ for more details.

swh/loader/package/conda/loader.py
127–134

I landed D8595 for that feature; you can simply pass the checksums as a parameter here to check the integrity of the downloaded tarball:

p_info = CondaPackageInfo(
    name=pkgname,
    filename=filename,
    url=url,
    version=version,
    last_modified=last_modified,
    checksums=data["checksums"],
)
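
For readers unfamiliar with what such an integrity check amounts to, here is a standalone sketch using only the standard library. It is not the code from D8595 or from swh-loader-core. The function name `verify_checksums` is hypothetical, and the only assumption carried over from the thread is that checksums are a mapping from algorithm name to hex digest, as in the `checksums={'md5': ..., 'sha256': ...}` values visible in the loader logs.

```python
import hashlib
from pathlib import Path
from typing import Dict


def verify_checksums(path: Path, checksums: Dict[str, str]) -> bool:
    """Return True if every expected hex digest matches the file content.

    Hypothetical helper: algorithm names are handed to hashlib.new(),
    so they must be names that hashlib supports (e.g. "sha256").
    """
    hashers = {algo: hashlib.new(algo) for algo in checksums}
    with path.open("rb") as f:
        # Read in chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return all(
        hashers[algo].hexdigest() == expected.lower()
        for algo, expected in checksums.items()
    )
```

Note that an empty mapping makes the function return True, so a caller that requires at least one checksum would need to test for that separately.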
franckbret marked an inline comment as done.

Make use of checksums after D8595 landed

Build is green

Patch application report for D8566 (id=31082)

Rebasing onto c631349aea...

Current branch diff-target is up to date.
Changes applied before test
commit b9f5e3b6ef2e516782c5d386c9d2474e463ee899
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author and description are extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/953/ for more details.

@anlambert Can we merge this one, do you have more feedback?

@anlambert Wait, I just realized it still makes use of the description. I'm going to make a new commit.

Remove description from message

Build is green

Patch application report for D8566 (id=31516)

Rebasing onto 6d8e2abac5...

Current branch diff-target is up to date.
Changes applied before test
commit 5838ddb1aa0c7a42665b3d87a4b63e34faf42e07
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1002/ for more details.

Looks good to me. Only the intrinsic metadata parsing code needs better error handling, and one test file needs updating after recent changes in swh-loader-core; see inline comments.

swh/loader/package/conda/loader.py
53–63

We should return the first parsed metadata file and handle parsing errors here:

metadata: Dict[str, Any] = {}

meta_json_path = dir_path / "info" / "about.json"
meta_yml_path = dir_path / "info" / "recipe" / "meta.yaml"

if meta_json_path.exists():
    try:
        metadata = json.loads(meta_json_path.read_text())
    except json.JSONDecodeError:
        pass

if meta_yml_path.exists() and not metadata:
    try:
        metadata = yaml.safe_load(meta_yml_path.read_text())
    except yaml.YAMLError:
        pass

return metadata
swh/loader/package/conda/tests/test_tasks.py
1–24

I recently refactored the loading task creation tests for all loaders; please replace the file content with this instead:

import uuid

import pytest

from swh.scheduler.model import ListedOrigin, Lister

NAMESPACE = "swh.loader.package.conda"


@pytest.fixture
def conda_lister():
    return Lister(name="conda", instance_name="example", id=uuid.uuid4())


@pytest.fixture
def conda_listed_origin(conda_lister):
    return ListedOrigin(
        lister_id=conda_lister.id,
        url="https://anaconda.org/channel/some-package",
        visit_type="conda",
        extra_loader_arguments={
            "artifacts": [{"version": "0.0.1", "url": "some-package-0.0.1.tar.bz2"}],
        },
    )


def test_conda_loader_task_for_listed_origin(
    loading_task_creation_for_listed_origin_test,
    conda_lister,
    conda_listed_origin,
):
    loading_task_creation_for_listed_origin_test(
        loader_class_name=f"{NAMESPACE}.loader.CondaLoader",
        task_function_name=f"{NAMESPACE}.tasks.LoadConda",
        lister=conda_lister,
        listed_origin=conda_listed_origin,
    )
This revision now requires changes to proceed. Oct 19 2022, 3:51 PM
swh/loader/package/conda/loader.py
148

Looks like maintainers[0] can be None for some edge cases:

docker-swh-loader-1  | [2022-10-19 13:46:51,264: DEBUG/ForkPoolWorker-3] package_info: CondaPackageInfo(url='https://repo.anaconda.com/pkgs/main/linux-64/mkl-dpcpp-2021.3.0-h66538d2_521.tar.bz2', directory_extrinsic_metadata=[], checksums={'md5': 'e637920edc32a328881369008fc95203', 'sha256': '6c6e9a64ccbe997135297b5d0c59c255246d52d5f589e096a0e2db8ad73243be'}, name='mkl-dpcpp', filename='mkl-dpcpp-2021.3.0-h66538d2_521.tar.bz2', version='linux-64/2021.3.0-h66538d2_521', last_modified='2022-02-21T22:17:58.589000+00:00')
docker-swh-loader-1  | [2022-10-19 13:46:51,524: DEBUG/ForkPoolWorker-3] filename: mkl-dpcpp-2021.3.0-h66538d2_521.tar.bz2
docker-swh-loader-1  | [2022-10-19 13:46:51,524: DEBUG/ForkPoolWorker-3] filepath: /tmp/tmpdvk30qj0/mkl-dpcpp-2021.3.0-h66538d2_521.tar.bz2
docker-swh-loader-1  | [2022-10-19 13:47:13,982: DEBUG/ForkPoolWorker-3] extrinsic_metadata
docker-swh-loader-1  | [2022-10-19 13:48:08,558: DEBUG/ForkPoolWorker-3] uncompressed_path: /tmp/tmpdvk30qj0/src
docker-swh-loader-1  | [2022-10-19 13:48:18,532: DEBUG/ForkPoolWorker-3] Filtered out 1 contents, 0 skipped contents and 0 directories
docker-swh-loader-1  | [2022-10-19 13:48:18,532: DEBUG/ForkPoolWorker-3] Number of skipped contents: 1
docker-swh-loader-1  | [2022-10-19 13:48:18,532: DEBUG/ForkPoolWorker-3] Number of contents: 17
docker-swh-loader-1  | [2022-10-19 13:48:18,532: DEBUG/ForkPoolWorker-3] Number of directories: 10
docker-swh-loader-1  | [2022-10-19 13:48:18,689: ERROR/ForkPoolWorker-3] Failed to load branch releases/linux-64/2021.3.0-h66538d2_521 for https://anaconda.org/main/mkl-dpcpp
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 688, in load
docker-swh-loader-1  |     res = self._load_release(p_info, origin)
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 877, in _load_release
docker-swh-loader-1  |     p_info, uncompressed_path, directory=directory.hash
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/conda/loader.py", line 148, in build_release
docker-swh-loader-1  |     author = Person.from_fullname(maintainers[0].encode())
docker-swh-loader-1  | AttributeError: 'NoneType' object has no attribute 'encode'
docker-swh-loader-1  | [2022-10-19 13:48:18,692: DEBUG/ForkPoolWorker-3] package_info: CondaPackageInfo(url='https://repo.anaconda.com/pkgs/main/linux-64/mkl-dpcpp-2021.4.0-h66538d2_640.tar.bz2', directory_extrinsic_metadata=[], checksums={'md5': 'fe3fcabdefbf7fa7c6b7a603c71d3aff', 'sha256': 'b6f25a926bc6b94fb8fa38ca9c7caba894a5bd0b4d6fa948d8038f3dafc65eaa'}, name='mkl-dpcpp', filename='mkl-dpcpp-2021.4.0-h66538d2_640.tar.bz2', version='linux-64/2021.4.0-h66538d2_640', last_modified='2022-02-21T22:23:28.671000+00:00')
docker-swh-loader-1  | [2022-10-19 13:48:18,904: DEBUG/ForkPoolWorker-3] filename: mkl-dpcpp-2021.4.0-h66538d2_640.tar.bz2
docker-swh-loader-1  | [2022-10-19 13:48:18,904: DEBUG/ForkPoolWorker-3] filepath: /tmp/tmphnski1iw/mkl-dpcpp-2021.4.0-h66538d2_640.tar.bz2
docker-swh-loader-1  | [2022-10-19 13:48:36,899: DEBUG/ForkPoolWorker-3] extrinsic_metadata
docker-swh-loader-1  | [2022-10-19 13:49:26,037: DEBUG/ForkPoolWorker-3] uncompressed_path: /tmp/tmphnski1iw/src
docker-swh-loader-1  | [2022-10-19 13:49:37,059: DEBUG/ForkPoolWorker-3] Filtered out 1 contents, 0 skipped contents and 0 directories
docker-swh-loader-1  | [2022-10-19 13:49:37,059: DEBUG/ForkPoolWorker-3] Number of skipped contents: 1
docker-swh-loader-1  | [2022-10-19 13:49:37,059: DEBUG/ForkPoolWorker-3] Number of contents: 17
docker-swh-loader-1  | [2022-10-19 13:49:37,059: DEBUG/ForkPoolWorker-3] Number of directories: 10
docker-swh-loader-1  | [2022-10-19 13:49:37,272: ERROR/ForkPoolWorker-3] Failed to load branch releases/linux-64/2021.4.0-h66538d2_640 for https://anaconda.org/main/mkl-dpcpp
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 688, in load
docker-swh-loader-1  |     res = self._load_release(p_info, origin)
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 877, in _load_release
docker-swh-loader-1  |     p_info, uncompressed_path, directory=directory.hash
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/conda/loader.py", line 148, in build_release
docker-swh-loader-1  |     author = Person.from_fullname(maintainers[0].encode())
docker-swh-loader-1  | AttributeError: 'NoneType' object has no attribute 'encode'
docker-swh-loader-1  | [2022-10-19 13:49:37,272: DEBUG/ForkPoolWorker-3] package_info: CondaPackageInfo(url='https://repo.anaconda.com/pkgs/main/linux-64/mkl-dpcpp-2022.0.1-h66538d2_117.tar.bz2', directory_extrinsic_metadata=[], checksums={'md5': 'a858b82a575e3bc331abf2a49d3e9289', 'sha256': '8f6c3946a80e64a2ea703e97bbe1c80e50dc237021e4167feed4c061b7b5774a'}, name='mkl-dpcpp', filename='mkl-dpcpp-2022.0.1-h66538d2_117.tar.bz2', version='linux-64/2022.0.1-h66538d2_117', last_modified='2022-02-22T14:16:39.634000+00:00')
docker-swh-loader-1  | [2022-10-19 13:49:37,522: DEBUG/ForkPoolWorker-3] filename: mkl-dpcpp-2022.0.1-h66538d2_117.tar.bz2
docker-swh-loader-1  | [2022-10-19 13:49:37,522: DEBUG/ForkPoolWorker-3] filepath: /tmp/tmpuijki00g/mkl-dpcpp-2022.0.1-h66538d2_117.tar.bz2
docker-swh-loader-1  | [2022-10-19 13:49:56,567: DEBUG/ForkPoolWorker-3] extrinsic_metadata
docker-swh-loader-1  | [2022-10-19 13:50:39,400: DEBUG/ForkPoolWorker-3] uncompressed_path: /tmp/tmpuijki00g/src
docker-swh-loader-1  | [2022-10-19 13:50:49,083: DEBUG/ForkPoolWorker-3] Filtered out 1 contents, 0 skipped contents and 0 directories
docker-swh-loader-1  | [2022-10-19 13:50:49,084: DEBUG/ForkPoolWorker-3] Number of skipped contents: 1
docker-swh-loader-1  | [2022-10-19 13:50:49,084: DEBUG/ForkPoolWorker-3] Number of contents: 17
docker-swh-loader-1  | [2022-10-19 13:50:49,084: DEBUG/ForkPoolWorker-3] Number of directories: 10
docker-swh-loader-1  | [2022-10-19 13:50:49,313: ERROR/ForkPoolWorker-3] Failed to load branch releases/linux-64/2022.0.1-h66538d2_117 for https://anaconda.org/main/mkl-dpcpp
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 688, in load
docker-swh-loader-1  |     res = self._load_release(p_info, origin)
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 877, in _load_release
docker-swh-loader-1  |     p_info, uncompressed_path, directory=directory.hash
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/conda/loader.py", line 148, in build_release
docker-swh-loader-1  |     author = Person.from_fullname(maintainers[0].encode())
docker-swh-loader-1  | AttributeError: 'NoneType' object has no attribute 'encode'
swh/loader/package/conda/loader.py
125

The date can be missing from listed artifacts; this needs to be handled:

[2022-10-19 13:52:59,918: ERROR/ForkPoolWorker-4] Failed to get package info for version linux-64/0.20.0-py27_0 of https://anaconda.org/main/llvmlite
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 634, in load
docker-swh-loader-1  |     for branch_name, p_info in self.get_package_info(version):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/conda/loader.py", line 125, in get_package_info
docker-swh-loader-1  |     last_modified = iso8601.parse_date(data["date"]).isoformat()
docker-swh-loader-1  | KeyError: 'date'
docker-swh-loader-1  | [2022-10-19 13:52:59,919: ERROR/ForkPoolWorker-4] Failed to get package info for version linux-64/0.20.0-py34_0 of https://anaconda.org/main/llvmlite
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 634, in load
docker-swh-loader-1  |     for branch_name, p_info in self.get_package_info(version):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/conda/loader.py", line 125, in get_package_info
docker-swh-loader-1  |     last_modified = iso8601.parse_date(data["date"]).isoformat()
docker-swh-loader-1  | KeyError: 'date'
docker-swh-loader-1  | [2022-10-19 13:52:59,919: ERROR/ForkPoolWorker-4] Failed to get package info for version linux-64/0.20.0-py35_0 of https://anaconda.org/main/llvmlite
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 634, in load
docker-swh-loader-1  |     for branch_name, p_info in self.get_package_info(version):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/conda/loader.py", line 125, in get_package_info
docker-swh-loader-1  |     last_modified = iso8601.parse_date(data["date"]).isoformat()
docker-swh-loader-1  | KeyError: 'date'
docker-swh-loader-1  | [2022-10-19 13:52:59,920: ERROR/ForkPoolWorker-4] Failed to get package info for version linux-64/0.20.0-py36_0 of https://anaconda.org/main/llvmlite
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/loader.py", line 634, in load
docker-swh-loader-1  |     for branch_name, p_info in self.get_package_info(version):
docker-swh-loader-1  |   File "/tmp/tmp.0uo1le4WBL/swh-loader-core/swh/loader/package/conda/loader.py", line 125, in get_package_info
docker-swh-loader-1  |     last_modified = iso8601.parse_date(data["date"]).isoformat()
docker-swh-loader-1  | KeyError: 'date'
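
One way to avoid the `KeyError` above is to treat the date as optional. The sketch below is an assumption-laden illustration rather than the fix that landed: the helper name `parse_last_modified` is hypothetical, and it uses the standard library's `datetime.fromisoformat` in place of `iso8601.parse_date` to stay dependency-free. The two parsers do not accept exactly the same inputs.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def parse_last_modified(data: Dict[str, Any]) -> Optional[str]:
    """Return the artifact date in ISO format, or None when it is absent.

    Hypothetical helper: a missing or empty "date" key yields None
    instead of raising KeyError.
    """
    raw = data.get("date")
    if not raw:
        return None
    return datetime.fromisoformat(raw).isoformat()
```

The package info field receiving this value would then have to accept None as well.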
franckbret marked 4 inline comments as done.

Handle the case where author or last_update is empty

Build is green

Patch application report for D8566 (id=31532)

Rebasing onto 6d8e2abac5...

Current branch diff-target is up to date.
Changes applied before test
commit e7ba6316315dde37207526c21409068d1d95b2b6
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Sep 28 16:23:45 2022 +0200

    Conda: Anaconda packages archive loader
    
    For each origin it takes advantage of 'artifacts' data send through
    'extra_loader_arguments' of the conda lister, providing versions,
    archive url, checksum, etc.
    Author extracted from intrinsic metadata.
    
    Related T4579

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1003/ for more details.

Looks good to me, thanks!

This revision is now accepted and ready to land. Oct 21 2022, 11:35 AM