diff --git a/docs/package-loader-specifications.rst b/docs/package-loader-specifications.rst index 64250a8..da3b91a 100644 --- a/docs/package-loader-specifications.rst +++ b/docs/package-loader-specifications.rst @@ -1,196 +1,205 @@ .. _package-loader-specifications: Package loader specifications ============================= Release fields -------------- Here is an overview of the fields (+ internal version name + branch name) used by each package loader, after D6616: .. list-table:: Fields used by each package loader :header-rows: 1 * - Loader - internal version - branch name - name - message - synthetic - author - date - Notes * - arch - ``p_info.​version`` - ``release_name(​version, filename)`` - =version - Synthetic release for Arch Linux source package {p_info.name} version {p_info.version} {description} - true - from intrinsic metadata - from extra_loader_arguments['arch_metadata'] - Intrinsic metadata extracted from .PKGINFO file of the package * - archive - passed as arg - ``release_name(​version)`` - =version - "Synthetic release for archive at {p_info.url}\n" - true - "" - passed as arg - * - aur - ``p_info.​version`` - ``release_name(​version, filename)`` - =version - Synthetic release for Aur source package {p_info.name} version {p_info.version} {description} - true - "" - from extra_loader_arguments['aur_metadata'] - Intrinsic metadata extracted from .SRCINFO file of the package * - cpan - ``p_info.​version`` - ``release_name(​version)`` - =version - Synthetic release for Perl source package {name} version {version} {description} - true - from intrinsic metadata if any else from extrinsic - from extrinsic metadata - name, version and description from intrinsic metadata * - cran - ``metadata.get(​"Version", passed as arg)`` - ``release_name(​version)`` - =version - standard message - true - ``metadata.get(​"Maintainer", "")`` - ``metadata.get(​"Date")`` - metadata is intrinsic * - conda - ``p_info.​version`` - ``release_name(​version)`` - =version - 
Synthetic release for Conda source package {p_info.name} version {p_info.version} - true - from intrinsic metadata - from extrinsic metadata - "" * - crates - ``p_info.​version`` - ``release_name(​version, filename) + "\n\n" + i_metadata.description + "\n"`` - =version - Synthetic release for Crate source package {p_info.name} version {p_info.version} {description} - true - from int metadata - from ext metadata - ``i_metadata`` for intrinsic metadata, ``e_metadata`` for extrinsic metadata * - debian - =``version`` - ``release_name(​version)`` - =``i_version`` - standard message (using ``i_version``) - true - ``metadata​.changelog​.person`` - ``metadata​.changelog​.date`` - metadata is intrinsic. Old revisions have ``dsc`` as type ``i_version`` is the intrinsic version (eg. ``0.7.2-3``) while ``version`` contains the debian suite name (eg. ``stretch/contrib/0.7.2-3``) and is passed as arg * - golang - ``p_info.​version`` - ``release_name(version)`` - =version - Synthetic release for Golang source package {p_info.name} version {p_info.version} - true - "" - from ext metadata - Golang offers basically no metadata outside of version and timestamp * - deposit - HEAD - only HEAD - HEAD - "{client}: Deposit {id} in collection {collection}\n" - true - original author - ```` from SWORD XML - revisions had parents * - maven-loader - passed as arg - HEAD - ``release_name(version)`` - "Synthetic release for archive at {p_info.url}\n" - true - "" - passed as arg - Only one artefact per url (jar/zip src) * - nixguix - URL - URL - URL - None - true - "" - None - it's the URL of the artifact referenced by the derivation * - npm - ``metadata​["version"]`` - ``release_name(​version)`` - =version - standard message - true - from int metadata or "" - from ext metadata or None - * - opam - as given by opam - "{opam_package}​.{version}" - =version - standard message - true - from metadata - None - "{self.opam_package}​.{version}" matches the version names used by opam's backend. 
metadata is extrinsic * - pubdev - ``p_info.​version`` - ``release_name(​version)`` - =version - Synthetic release for pub.dev source package {p_info.name} version {p_info.version} - true - from extrinsic metadata - from extrinsic metadata - name and version from extrinsic metadata * - puppet - ``p_info.​version`` - ``release_name(​version)`` - =version - Synthetic release for Puppet source package {p_info.name} version {version} {description} - true - from intrinsic metadata - from extrinsic metadata - version and description from intrinsic metadata * - pypi - ``metadata​["version"]`` - ``release_name(​version)`` or ``release_name(​version, filename)`` - =version - ``metadata[​'comment_text']}`` or standard message - true - from int metadata or "" - from ext metadata or None - metadata is intrinsic + * - rubygems + - ``p_info.version`` + - ``release_name(​version)`` + - =version + - Synthetic release for RubyGems source package {p_info.name} version {p_info.version} + - true + - from ext metadata + - from ext metadata + - The source code is extracted from a tarball nested within the gem file using this function:: def release_name(version: str, filename: Optional[str] = None) -> str: if filename: return "releases/%s/%s" % (version, filename) return "releases/%s" % version and "standard message" being:: msg = ( f"Synthetic release for {PACKAGE_MANAGER} source package {name} " f"version {version}\n" ) The ``target_type`` field is always ``dir``, and the target the id of a directory loaded by unpacking a tarball/zip file/... 
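As a rough illustration of the two helpers quoted in the documentation hunk above, the sketch below reproduces ``release_name`` and the "standard message" and applies them to the new rubygems row. The gem name and version come from the test fixtures in this diff; ``standard_message`` is a hypothetical wrapper name introduced here for illustration, not a function confirmed to exist in swh.loader.core.

```python
from typing import Optional


def release_name(version: str, filename: Optional[str] = None) -> str:
    # Same logic as the helper quoted in the documentation above.
    if filename:
        return "releases/%s/%s" % (version, filename)
    return "releases/%s" % version


def standard_message(package_manager: str, name: str, version: str) -> str:
    # Hypothetical wrapper around the "standard message" f-string.
    return (
        f"Synthetic release for {package_manager} source package {name} "
        f"version {version}\n"
    )


# The rubygems row uses release_name(version) without a filename.
branch = release_name("0.8.5")
message = standard_message("RubyGems", "mercurial-wrapper", "0.8.5")
print(branch)
print(message, end="")
```

Loaders whose table row lists ``release_name(version, filename)`` would pass the artifact filename as the second argument, which yields one branch per artifact rather than one per version.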
diff --git a/setup.py b/setup.py index 25ecef1..faccf93 100755 --- a/setup.py +++ b/setup.py @@ -1,91 +1,92 @@ #!/usr/bin/env python3 # Copyright (C) 2015-2018 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information from io import open from os import path from setuptools import find_packages, setup here = path.abspath(path.dirname(__file__)) # Get the long description from the README file with open(path.join(here, "README.rst"), encoding="utf-8") as f: long_description = f.read() def parse_requirements(name=None): if name: reqf = "requirements-%s.txt" % name else: reqf = "requirements.txt" requirements = [] if not path.exists(reqf): return requirements with open(reqf) as f: for line in f.readlines(): line = line.strip() if not line or line.startswith("#"): continue requirements.append(line) return requirements setup( name="swh.loader.core", description="Software Heritage Base Loader", long_description=long_description, long_description_content_type="text/markdown", python_requires=">=3.7", author="Software Heritage developers", author_email="swh-devel@inria.fr", url="https://forge.softwareheritage.org/diffusion/DLDBASE", packages=find_packages(), # packages's modules scripts=[], # scripts to package install_requires=parse_requirements() + parse_requirements("swh"), setup_requires=["setuptools-scm"], use_scm_version=True, extras_require={"testing": parse_requirements("test")}, include_package_data=True, entry_points=""" [swh.cli.subcommands] loader=swh.loader.cli [swh.workers] loader.content=swh.loader.core:register_content loader.directory=swh.loader.core:register_directory loader.arch=swh.loader.package.arch:register loader.archive=swh.loader.package.archive:register loader.aur=swh.loader.package.aur:register loader.conda=swh.loader.package.conda:register loader.cpan=swh.loader.package.cpan:register 
loader.cran=swh.loader.package.cran:register loader.crates=swh.loader.package.crates:register loader.debian=swh.loader.package.debian:register loader.deposit=swh.loader.package.deposit:register loader.golang=swh.loader.package.golang:register loader.nixguix=swh.loader.package.nixguix:register loader.npm=swh.loader.package.npm:register loader.opam=swh.loader.package.opam:register loader.pubdev=swh.loader.package.pubdev:register loader.puppet=swh.loader.package.puppet:register loader.pypi=swh.loader.package.pypi:register loader.maven=swh.loader.package.maven:register + loader.rubygems=swh.loader.package.rubygems:register """, classifiers=[ "Programming Language :: Python :: 3", "Intended Audience :: Developers", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: OS Independent", "Development Status :: 5 - Production/Stable", ], project_urls={ "Bug Reports": "https://forge.softwareheritage.org/maniphest", "Funding": "https://www.softwareheritage.org/donate", "Source": "https://forge.softwareheritage.org/source/swh-loader-core", "Documentation": "https://docs.softwareheritage.org/devel/swh-loader-core/", }, ) diff --git a/swh/loader/package/rubygems/__init__.py b/swh/loader/package/rubygems/__init__.py new file mode 100644 index 0000000..863552b --- /dev/null +++ b/swh/loader/package/rubygems/__init__.py @@ -0,0 +1,17 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + + +from typing import Any, Mapping + + +def register() -> Mapping[str, Any]: + """Register the current worker module's definition""" + from .loader import RubyGemsLoader + + return { + "task_modules": [f"{__name__}.tasks"], + "loader": RubyGemsLoader, + } diff --git a/swh/loader/package/rubygems/loader.py b/swh/loader/package/rubygems/loader.py new file mode 100644 
index 0000000..21155ff --- /dev/null +++ b/swh/loader/package/rubygems/loader.py @@ -0,0 +1,135 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +import json +import logging +import os +from typing import Any, Dict, Iterator, List, Mapping, Optional, Sequence, Tuple + +import attr + +from swh.loader.package.loader import BasePackageInfo, PackageLoader +from swh.loader.package.utils import cached_method, get_url_body, release_name +from swh.model import from_disk +from swh.model.model import ObjectType, Person, Release, Sha1Git, TimestampWithTimezone +from swh.storage.interface import StorageInterface + +logger = logging.getLogger(__name__) + + +@attr.s +class RubyGemsPackageInfo(BasePackageInfo): + name = attr.ib(type=str) + """Name of the package""" + + version = attr.ib(type=str) + """Current version""" + + built_at = attr.ib(type=Optional[TimestampWithTimezone]) + """Version build date""" + + authors = attr.ib(type=List[Person]) + """Authors""" + + +class RubyGemsLoader(PackageLoader[RubyGemsPackageInfo]): + """Load ``.gem`` files from ``RubyGems.org`` into the SWH archive.""" + + visit_type = "rubygems" + + def __init__( + self, + storage: StorageInterface, + url: str, + max_content_size: Optional[int] = None, + **kwargs, + ): + super().__init__(storage, url, max_content_size=max_content_size, **kwargs) + # Lister URLs are in the ``https://rubygems.org/gems/{pkgname}`` format + assert url.startswith("https://rubygems.org/gems/"), ( + "Expected rubygems.org url, got '%s'" % url + ) + self.gem_name = url[len("https://rubygems.org/gems/") :] + # API docs at ``https://guides.rubygems.org/rubygems-org-api/`` + self.api_base_url = "https://rubygems.org/api/v1" + # Mapping of version number to corresponding metadata from the API + self.versions_info: Dict[str, 
Dict[str, Any]] = {} + + def get_versions(self) -> Sequence[str]: + """Return all versions for the gem being loaded. + + Also stores the detailed information for each version since everything + is present in this API call.""" + versions_info = get_url_body( + f"{self.api_base_url}/versions/{self.gem_name}.json" + ) + versions = [] + + for version_info in json.loads(versions_info): + number = version_info["number"] + self.versions_info[number] = version_info + versions.append(number) + + return versions + + @cached_method + def get_default_version(self) -> str: + latest = get_url_body( + f"{self.api_base_url}/versions/{self.gem_name}/latest.json" + ) + return json.loads(latest)["version"] + + def _load_directory( + self, dl_artifacts: List[Tuple[str, Mapping[str, Any]]], tmpdir: str + ) -> Tuple[str, from_disk.Directory]: + """Override the directory loading to point it to the actual code. + + Gem files are uncompressed tarballs containing: + - ``metadata.gz``: the metadata about this gem + - ``data.tar.gz``: the code and possible binary artifacts + - ``checksums.yaml.gz``: checksums + """ + logger.debug("Unpacking gem file to point to the actual code") + uncompressed_path = self.uncompress(dl_artifacts, dest=tmpdir) + source_code_tarball = os.path.join(uncompressed_path, "data.tar.gz") + + return super()._load_directory([(source_code_tarball, {})], tmpdir) + + def get_package_info( + self, version: str + ) -> Iterator[Tuple[str, RubyGemsPackageInfo]]: + + info = self.versions_info[version] + + authors = info["authors"].split(", ") + p_info = RubyGemsPackageInfo( + url=f"https://rubygems.org/downloads/{self.gem_name}-{version}.gem", + # See format of gem files in ``_load_directory`` + filename=f"{self.gem_name}-{version}.tar", + version=version, + built_at=TimestampWithTimezone.from_iso8601(info["built_at"]), + name=self.gem_name, + authors=[Person.from_fullname(person.encode()) for person in authors], + ) + yield release_name(version), p_info + + def build_release( 
+ self, p_info: RubyGemsPackageInfo, uncompressed_path: str, directory: Sha1Git + ) -> Optional[Release]: + msg = ( + f"Synthetic release for RubyGems source package {p_info.name} " + f"version {p_info.version}\n" + ) + + return Release( + name=p_info.version.encode(), + message=msg.encode(), + date=p_info.built_at, + # TODO multiple authors (T3887) + author=p_info.authors[0], + target_type=ObjectType.DIRECTORY, + target=directory, + synthetic=True, + ) diff --git a/swh/loader/package/rubygems/tasks.py b/swh/loader/package/rubygems/tasks.py new file mode 100644 index 0000000..f1ec50b --- /dev/null +++ b/swh/loader/package/rubygems/tasks.py @@ -0,0 +1,14 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +from celery import shared_task + +from swh.loader.package.rubygems.loader import RubyGemsLoader + + +@shared_task(name=__name__ + ".LoadRubyGems") +def load_rubygems(**kwargs): + """Load ruby gems""" + return RubyGemsLoader.from_configfile(**kwargs).load() diff --git a/swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v1_versions_mercurial-wrapper.json b/swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v1_versions_mercurial-wrapper.json new file mode 100644 index 0000000..ddc6e33 --- /dev/null +++ b/swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v1_versions_mercurial-wrapper.json @@ -0,0 +1,36 @@ +[ + { + "authors": "Fabio Neves", + "built_at": "2014-09-11T00:00:00.000Z", + "created_at": "2014-09-25T09:02:44.313Z", + "description": "A simple wrapper around HG command line tool", + "downloads_count": 2770, + "metadata": {}, + "number": "0.8.5", + "summary": "Mercurial command line ruby wrapper", + "platform": "ruby", + "rubygems_version": "\u003e= 0", + "ruby_version": "\u003e= 0", + "prerelease": false, + "licenses": [], + 
"requirements": [], + "sha": "cee62e168ffd7d36c565e00f29fa6a0b57ef15c4c14055345b1e01148ec4fab8" + }, + { + "authors": "Fabio Neves", + "built_at": "2014-09-11T00:00:00.000Z", + "created_at": "2014-09-18T08:59:42.895Z", + "description": "A simple wrapper around HG command line tool", + "downloads_count": 2415, + "metadata": {}, + "number": "0.8.4", + "summary": "Mercurial command line ruby wrapper", + "platform": "ruby", + "rubygems_version": "\u003e= 0", + "ruby_version": "\u003e= 0", + "prerelease": false, + "licenses": [], + "requirements": [], + "sha": "ec60f0568f4f8744a0da78089a05e51d1c0e9799a1abfb37f63cdf7ed019c862" + } +] \ No newline at end of file diff --git a/swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v1_versions_mercurial-wrapper_latest.json b/swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v1_versions_mercurial-wrapper_latest.json new file mode 100644 index 0000000..00a8210 --- /dev/null +++ b/swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v1_versions_mercurial-wrapper_latest.json @@ -0,0 +1 @@ +{"version":"0.8.5"} \ No newline at end of file diff --git a/swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_mercurial-wrapper-0.8.4.gem b/swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_mercurial-wrapper-0.8.4.gem new file mode 100644 index 0000000..8eb6d8b Binary files /dev/null and b/swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_mercurial-wrapper-0.8.4.gem differ diff --git a/swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_mercurial-wrapper-0.8.5.gem b/swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_mercurial-wrapper-0.8.5.gem new file mode 100644 index 0000000..6c4141f Binary files /dev/null and b/swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_mercurial-wrapper-0.8.5.gem differ diff --git a/swh/loader/package/rubygems/tests/test_rubygems.py 
b/swh/loader/package/rubygems/tests/test_rubygems.py new file mode 100644 index 0000000..255ed7c --- /dev/null +++ b/swh/loader/package/rubygems/tests/test_rubygems.py @@ -0,0 +1,26 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + +from swh.loader.package.rubygems.loader import RubyGemsLoader +from swh.loader.tests import get_stats + + +def test_rubygems_loader(swh_storage, requests_mock_datadir): + url = "https://rubygems.org/gems/mercurial-wrapper" + loader = RubyGemsLoader(swh_storage, url) + + assert loader.load()["status"] == "eventful" + + stats = get_stats(swh_storage) + assert { + "content": 8, + "directory": 4, + "origin": 1, + "origin_visit": 1, + "release": 2, + "revision": 0, + "skipped_content": 0, + "snapshot": 1, + } == stats diff --git a/swh/loader/package/rubygems/tests/test_tasks.py b/swh/loader/package/rubygems/tests/test_tasks.py new file mode 100644 index 0000000..ad8dba9 --- /dev/null +++ b/swh/loader/package/rubygems/tests/test_tasks.py @@ -0,0 +1,21 @@ +# Copyright (C) 2022 The Software Heritage developers +# See the AUTHORS file at the top-level directory of this distribution +# License: GNU General Public License version 3, or any later version +# See top-level LICENSE file for more information + + +def test_tasks_rubygems_loader( + mocker, swh_scheduler_celery_app, swh_scheduler_celery_worker, swh_config +): + mock_load = mocker.patch("swh.loader.package.rubygems.loader.RubyGemsLoader.load") + mock_load.return_value = {"status": "eventful"} + + res = swh_scheduler_celery_app.send_task( + "swh.loader.package.rubygems.tasks.LoadRubyGems", + kwargs={"url": "https://rubygems.org/gems/whatever-package"}, + ) + assert res + res.wait() + assert res.successful() + assert mock_load.called + assert res.result == {"status": "eventful"}
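For reviewers who want to follow the loader's data flow without network access, here is a minimal sketch of the pure-Python steps in ``RubyGemsLoader``: deriving the gem name from the origin URL, indexing the versions API response by version number, and splitting the ``authors`` string. The function names ``gem_name_from_url``, ``index_versions`` and ``download_url`` are hypothetical and exist only in this sketch; the sample JSON is trimmed from the ``mercurial-wrapper`` fixture in this diff. Unlike the loader, which uses ``assert``, the sketch raises ``ValueError`` on an unexpected URL; that is a choice made for the example, not the patch's behaviour.

```python
import json
from typing import Any, Dict, List, Tuple

GEMS_PREFIX = "https://rubygems.org/gems/"


def gem_name_from_url(url: str) -> str:
    # Lister URLs are expected in the https://rubygems.org/gems/{pkgname} format.
    if not url.startswith(GEMS_PREFIX):
        raise ValueError("Expected rubygems.org url, got '%s'" % url)
    return url[len(GEMS_PREFIX):]


def index_versions(body: str) -> Tuple[List[str], Dict[str, Dict[str, Any]]]:
    # Mirror get_versions(): keep the API order and remember each entry
    # under its version number.
    versions: List[str] = []
    versions_info: Dict[str, Dict[str, Any]] = {}
    for version_info in json.loads(body):
        number = version_info["number"]
        versions_info[number] = version_info
        versions.append(number)
    return versions, versions_info


def download_url(gem_name: str, version: str) -> str:
    # Same URL pattern as in get_package_info().
    return f"https://rubygems.org/downloads/{gem_name}-{version}.gem"


# Two entries trimmed from the mercurial-wrapper fixture.
SAMPLE = json.dumps(
    [
        {"authors": "Fabio Neves", "built_at": "2014-09-11T00:00:00.000Z", "number": "0.8.5"},
        {"authors": "Fabio Neves", "built_at": "2014-09-11T00:00:00.000Z", "number": "0.8.4"},
    ]
)

gem = gem_name_from_url("https://rubygems.org/gems/mercurial-wrapper")
versions, info = index_versions(SAMPLE)
authors = info["0.8.5"]["authors"].split(", ")
print(gem, versions, authors, download_url(gem, versions[0]))
```

Because ``build_release`` only uses ``p_info.authors[0]``, a gem whose ``authors`` string lists several comma-separated names would be attributed to the first one, as the TODO in the patch notes.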