diff --git a/docs/package-loader-specifications.rst b/docs/package-loader-specifications.rst index ec1c51c..01abe7d 100644 --- a/docs/package-loader-specifications.rst +++ b/docs/package-loader-specifications.rst @@ -1,187 +1,187 @@ .. _package-loader-specifications: Package loader specifications ============================= Release fields -------------- Here is an overview of the fields (+ internal version name + branch name) used by each package loader, after D6616: .. list-table:: Fields used by each package loader :header-rows: 1 * - Loader - internal version - branch name - name - message - synthetic - author - date - Notes * - arch - ``p_info.​version`` - ``release_name(​version, filename)`` - =version - Synthetic release for Arch Linux source package {p_info.name} version {p_info.version} {description} - true - from intrinsic metadata - from extra_loader_arguments['arch_metadata'] - Intrinsic metadata extracted from .PKGINFO file of the package * - archive - passed as arg - ``release_name(​version)`` - =version - "Synthetic release for archive at {p_info.url}\n" - true - "" - passed as arg - * - aur - ``p_info.​version`` - ``release_name(​version, filename)`` - =version - Synthetic release for Aur source package {p_info.name} version {p_info.version} {description} - true - "" - from extra_loader_arguments['aur_metadata'] - Intrinsic metadata extracted from .SRCINFO file of the package * - cpan - ``p_info.​version`` - ``release_name(​version)`` - =version - Synthetic release for Perl source package {name} version {version} {description} - true - from intrinsic metadata if any else from extrinsic - from extrinsic metadata - name, version and description from intrinsic metadata * - cran - ``metadata.get(​"Version", passed as arg)`` - ``release_name(​version)`` - =version - standard message - true - ``metadata.get(​"Maintainer", "")`` - ``metadata.get(​"Date")`` - metadata is intrinsic * - crates - ``p_info.​version`` - ``release_name(​version, filename) + 
"\n\n" + i_metadata.description + "\n"`` - =version - Synthetic release for Crate source package {p_info.name} version {p_info.version} {description} - true - from int metadata - from ext metadata - ``i_metadata`` for intrinsic metadata, ``e_metadata`` for extrinsic metadata * - debian - =``version`` - ``release_name(​version)`` - =``i_version`` - standard message (using ``i_version``) - true - ``metadata​.changelog​.person`` - ``metadata​.changelog​.date`` - metadata is intrinsic. Old revisions have ``dsc`` as type ``i_version`` is the intrinsic version (eg. ``0.7.2-3``) while ``version`` contains the debian suite name (eg. ``stretch/contrib/0.7.2-3``) and is passed as arg * - golang - ``p_info.​version`` - ``release_name(version)`` - =version - Synthetic release for Golang source package {p_info.name} version {p_info.version} - true - "" - from ext metadata - Golang offers basically no metadata outside of version and timestamp * - deposit - HEAD - only HEAD - HEAD - "{client}: Deposit {id} in collection {collection}\n" - true - original author - ```` from SWORD XML - revisions had parents * - maven-loader - passed as arg - HEAD - ``release_name(version)`` - "Synthetic release for archive at {p_info.url}\n" - true - "" - passed as arg - Only one artefact per url (jar/zip src) * - nixguix - URL - URL - URL - None - true - "" - None - it's the URL of the artifact referenced by the derivation * - npm - ``metadata​["version"]`` - ``release_name(​version)`` - =version - standard message - true - from int metadata or "" - from ext metadata or None - * - opam - as given by opam - "{opam_package}​.{version}" - =version - standard message - true - from metadata - None - "{self.opam_package}​.{version}" matches the version names used by opam's backend. 
metadata is extrinsic * - pubdev - ``p_info.​version`` - ``release_name(​version)`` - =version - - Synthetic release for pub.dev source package {name} version {version} {description} + - Synthetic release for pub.dev source package {p_info.name} version {p_info.version} - true - from extrinsic metadata - from extrinsic metadata - - name, version and description from intrinsic metadata + - name and version from extrinsic metadata * - puppet - ``p_info.​version`` - ``release_name(​version)`` - =version - Synthetic release for Puppet source package {p_info.name} version {version} {description} - true - from intrinsic metadata - from extrinsic metadata - version and description from intrinsic metadata * - pypi - ``metadata​["version"]`` - ``release_name(​version)`` or ``release_name(​version, filename)`` - =version - ``metadata[​'comment_text']}`` or standard message - true - from int metadata or "" - from ext metadata or None - metadata is intrinsic using this function:: def release_name(version: str, filename: Optional[str] = None) -> str: if filename: return "releases/%s/%s" % (version, filename) return "releases/%s" % version and "standard message" being:: msg = ( f"Synthetic release for {PACKAGE_MANAGER} source package {name} " f"version {version}\n" ) The ``target_type`` field is always ``dir``, and the target the id of a directory loaded by unpacking a tarball/zip file/... 
diff --git a/swh/loader/package/pubdev/loader.py b/swh/loader/package/pubdev/loader.py index d78fe9b..4bffa3b 100644 --- a/swh/loader/package/pubdev/loader.py +++ b/swh/loader/package/pubdev/loader.py @@ -1,195 +1,152 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import json -from pathlib import Path -from typing import Any, Dict, Iterator, Optional, Sequence, Tuple +from typing import Dict, Iterator, Optional, Sequence, Tuple import attr from packaging.version import parse as parse_version -import yaml from swh.loader.package.loader import BasePackageInfo, PackageLoader from swh.loader.package.utils import ( EMPTY_AUTHOR, Person, cached_method, get_url_body, release_name, ) from swh.model.model import ObjectType, Release, Sha1Git, TimestampWithTimezone from swh.storage.interface import StorageInterface @attr.s class PubDevPackageInfo(BasePackageInfo): name = attr.ib(type=str) """Name of the package""" version = attr.ib(type=str) """Current version""" last_modified = attr.ib(type=str) """Last modified date as release date""" author = attr.ib(type=Person) """Author""" - description = attr.ib(type=str) - """Description""" - - -def extract_intrinsic_metadata(dir_path: Path) -> Dict[str, Any]: - """Extract intrinsic metadata from pubspec.yaml file at dir_path. - - Each pub.dev package version has a pubspec.yaml file at the root of the archive. - - See https://dart.dev/tools/pub/pubspec for pubspec specifications. 
- - Args: - dir_path: A directory on disk where a pubspec.yaml must be present - - Returns: - A dict mapping from yaml parser - """ - pubspec_path = dir_path / "pubspec.yaml" - return yaml.safe_load(pubspec_path.read_text()) - class PubDevLoader(PackageLoader[PubDevPackageInfo]): visit_type = "pubdev" PUBDEV_BASE_URL = "https://pub.dev/" def __init__( self, storage: StorageInterface, url: str, **kwargs, ): super().__init__(storage=storage, url=url, **kwargs) self.url = url assert url.startswith(self.PUBDEV_BASE_URL) self.package_info_url = url.replace( self.PUBDEV_BASE_URL, f"{self.PUBDEV_BASE_URL}api/" ) - def _raw_info(self) -> bytes: - return get_url_body(self.package_info_url) - @cached_method def info(self) -> Dict: """Return the project metadata information (fetched from pub.dev registry)""" # Use strict=False in order to correctly manage case where \n is present in a string - info = json.loads(self._raw_info(), strict=False) + info = json.loads(get_url_body(self.package_info_url), strict=False) # Arrange versions list as a new dict with `version` as key versions = {v["version"]: v for v in info["versions"]} info["versions"] = versions return info def get_versions(self) -> Sequence[str]: """Get all released versions of a PubDev package Returns: A sequence of versions Example:: ["0.1.1", "0.10.2"] """ versions = list(self.info()["versions"].keys()) versions.sort(key=parse_version) return versions def get_default_version(self) -> str: """Get the newest release version of a PubDev package Returns: A string representing a version Example:: "0.1.2" """ latest = self.info()["latest"] return latest["version"] def get_package_info(self, version: str) -> Iterator[Tuple[str, PubDevPackageInfo]]: """Get release name and package information from version Package info comes from extrinsic metadata (from self.info()) Args: version: Package version (e.g: "0.1.0") Returns: Iterator of tuple (release_name, p_info) """ v = self.info()["versions"][version] assert v["version"] == 
version url = v["archive_url"] name = v["pubspec"]["name"] filename = f"{name}-{version}.tar.gz" last_modified = v["published"] + checksums = {"sha256": v["archive_sha256"]} if v.get("archive_sha256") else {} - if "authors" in v["pubspec"]: + authors = v.get("pubspec", {}).get("authors") + if authors and isinstance(authors, list): # TODO: here we have a list of author, see T3887 - author = Person.from_fullname(v["pubspec"]["authors"][0].encode()) - elif "author" in v["pubspec"] and v["pubspec"]["author"] is not None: + author = Person.from_fullname(authors[0].encode()) + elif v.get("pubspec", {}).get("author"): author = Person.from_fullname(v["pubspec"]["author"].encode()) else: author = EMPTY_AUTHOR - description = v["pubspec"]["description"] - p_info = PubDevPackageInfo( name=name, filename=filename, url=url, version=version, last_modified=last_modified, author=author, - description=description, - checksums={"sha256": v["archive_sha256"]}, + checksums=checksums, ) yield release_name(version), p_info def build_release( self, p_info: PubDevPackageInfo, uncompressed_path: str, directory: Sha1Git ) -> Optional[Release]: - # Extract intrinsic metadata from uncompressed_path/pubspec.yaml - intrinsic_metadata = extract_intrinsic_metadata(Path(uncompressed_path)) - - name: str = intrinsic_metadata["name"] - version: str = intrinsic_metadata["version"] - assert version == p_info.version - - # author from intrinsic_metadata should not take precedence over the one - # returned by the api, see https://dart.dev/tools/pub/pubspec#authorauthors - author: Person = p_info.author - - if "description" in intrinsic_metadata and intrinsic_metadata["description"]: - description = intrinsic_metadata["description"] - else: - description = p_info.description - message = ( - f"Synthetic release for pub.dev source package {name} " - f"version {version}\n\n" - f"{description}\n" + f"Synthetic release for pub.dev source package {p_info.name} " + f"version {p_info.version}\n" ) return 
Release( - name=version.encode(), - author=author, + name=p_info.version.encode(), + author=p_info.author, date=TimestampWithTimezone.from_iso8601(p_info.last_modified), message=message.encode(), target_type=ObjectType.DIRECTORY, target=directory, synthetic=True, ) diff --git a/swh/loader/package/pubdev/tests/test_pubdev.py b/swh/loader/package/pubdev/tests/test_pubdev.py index 9267c24..757b143 100644 --- a/swh/loader/package/pubdev/tests/test_pubdev.py +++ b/swh/loader/package/pubdev/tests/test_pubdev.py @@ -1,326 +1,325 @@ # Copyright (C) 2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import pytest from swh.loader.package.pubdev.loader import PubDevLoader from swh.loader.package.utils import EMPTY_AUTHOR from swh.loader.tests import assert_last_visit_matches, check_snapshot, get_stats from swh.model.hashutil import hash_to_bytes from swh.model.model import ( ObjectType, Person, Release, Snapshot, SnapshotBranch, TargetType, TimestampWithTimezone, ) EXPECTED_PACKAGES = [ { "url": "https://pub.dev/packages/Autolinker", # one version }, { "url": "https://pub.dev/packages/pdf", # multiple versions }, { "url": "https://pub.dev/packages/bezier", # multiple authors }, { "url": "https://pub.dev/packages/authentication", # empty author }, { "url": "https://pub.dev/packages/abstract_io", # loose versions names }, { "url": "https://pub.dev/packages/audio_manager", # loose ++ versions names }, ] def test_get_versions(requests_mock_datadir, swh_storage): loader = PubDevLoader( swh_storage, url=EXPECTED_PACKAGES[1]["url"], ) assert loader.get_versions() == [ "1.0.0", "3.8.2", ] def test_sort_loose_versions(requests_mock_datadir, swh_storage): """Sometimes version name does not follow semver""" loader = PubDevLoader( swh_storage, url=EXPECTED_PACKAGES[4]["url"], ) assert loader.get_versions() == 
["0.1.2+4", "0.1.2+5", "0.1.2+6"] def test_sort_loose_versions_1(requests_mock_datadir, swh_storage): """Sometimes version name does not follow semver and mix patterns""" loader = PubDevLoader( swh_storage, url=EXPECTED_PACKAGES[5]["url"], ) assert loader.get_versions() == [ "0.0.1", "0.0.2", "0.1.1", "0.1.2", "0.1.3", "0.1.4", "0.1.5", "0.2.1", "0.2.1+hotfix.1", "0.2.1+hotfix.2", "0.2.1+3", "0.3.1", "0.3.1+1", "0.5.1", "0.5.1+1", "0.5.1+2", "0.5.1+3", "0.5.1+4", "0.5.1+5", "0.5.2", "0.5.2+1", "0.5.3", "0.5.3+1", "0.5.3+2", "0.5.3+3", "0.5.4", "0.5.4+1", "0.5.5", "0.5.5+1", "0.5.5+2", "0.5.5+3", "0.5.6", "0.5.7", "0.5.7+1", "0.6.1", "0.6.2", "0.7.1", "0.7.2", "0.7.3", "0.8.1", "0.8.2", ] def test_get_default_version(requests_mock_datadir, swh_storage): loader = PubDevLoader( swh_storage, url=EXPECTED_PACKAGES[1]["url"], ) assert loader.get_default_version() == "3.8.2" def test_pubdev_loader_load_one_version(datadir, requests_mock_datadir, swh_storage): loader = PubDevLoader( swh_storage, url=EXPECTED_PACKAGES[0]["url"], ) load_status = loader.load() assert load_status["status"] == "eventful" assert load_status["snapshot_id"] is not None - expected_snapshot_id = "245092931ba809e6c54ebda8f865fb5a969a4134" - expected_release_id = "919f267ea050539606344d49d14bf594c4386e5a" + expected_snapshot_id = "dffca49aec93fcf1fa63fa25bf9a04c833a30d73" + expected_release_id = "1e2e7226ac9136f2eb7ce28f32ca08fff28590b1" assert expected_snapshot_id == load_status["snapshot_id"] expected_snapshot = Snapshot( id=hash_to_bytes(load_status["snapshot_id"]), branches={ b"releases/0.1.1": SnapshotBranch( target=hash_to_bytes(expected_release_id), target_type=TargetType.RELEASE, ), b"HEAD": SnapshotBranch( target=b"releases/0.1.1", target_type=TargetType.ALIAS, ), }, ) check_snapshot(expected_snapshot, swh_storage) stats = get_stats(swh_storage) assert { "content": 1, "directory": 1, "origin": 1, "origin_visit": 1, "release": 1, "revision": 0, "skipped_content": 0, "snapshot": 1, } == stats 
assert swh_storage.release_get([hash_to_bytes(expected_release_id)])[0] == Release( name=b"0.1.1", - message=b"Synthetic release for pub.dev source package Autolinker version" - b" 0.1.1\n\nPort of Autolinker.js to dart\n", + message=b"Synthetic release for pub.dev source package Autolinker version 0.1.1\n", target=hash_to_bytes("3fb6d4f2c0334d1604357ae92b2dd38a55a78194"), target_type=ObjectType.DIRECTORY, synthetic=True, author=Person( fullname=b"hackcave <hackers@hackcave.org>", name=b"hackcave", email=b"hackers@hackcave.org", ), date=TimestampWithTimezone.from_iso8601("2014-12-24T22:34:02.534090+00:00"), id=hash_to_bytes(expected_release_id), ) assert_last_visit_matches( swh_storage, url=EXPECTED_PACKAGES[0]["url"], status="full", type="pubdev", snapshot=expected_snapshot.id, ) def test_pubdev_loader_load_multiple_versions( datadir, requests_mock_datadir, swh_storage ): loader = PubDevLoader( swh_storage, url=EXPECTED_PACKAGES[1]["url"], ) load_status = loader.load() assert load_status["status"] == "eventful" assert load_status["snapshot_id"] is not None - expected_snapshot_id = "43d5b68a9fa973aa95e56916aaef70841ccbc2a0" + expected_snapshot_id = "b03a4ef56b1a3bd4812f8e37f439c261cf4fd2c7" assert expected_snapshot_id == load_status["snapshot_id"] expected_snapshot = Snapshot( id=hash_to_bytes(load_status["snapshot_id"]), branches={ b"releases/1.0.0": SnapshotBranch( - target=hash_to_bytes("fbf8e40af675096681954553d737861e10b57216"), + target=hash_to_bytes("6f6eecd1ced321778d6a4bc60af4fb0e93178307"), target_type=TargetType.RELEASE, ), b"releases/3.8.2": SnapshotBranch( - target=hash_to_bytes("627a5d586e3fb4e7319b17f1aee268fe2fb8e01c"), + target=hash_to_bytes("012bac381e2b9cda7de2da0391bc2969bf80ff97"), target_type=TargetType.RELEASE, ), b"HEAD": SnapshotBranch( target=b"releases/3.8.2", target_type=TargetType.ALIAS, ), }, ) check_snapshot(expected_snapshot, swh_storage) stats = get_stats(swh_storage) assert { "content": 1 + 1, "directory": 1 + 1, "origin": 1, "origin_visit": 1, 
"release": 1 + 1, "revision": 0, "skipped_content": 0, "snapshot": 1, } == stats assert_last_visit_matches( swh_storage, url=EXPECTED_PACKAGES[1]["url"], status="full", type="pubdev", snapshot=expected_snapshot.id, ) def test_pubdev_loader_multiple_authors(datadir, requests_mock_datadir, swh_storage): loader = PubDevLoader( swh_storage, url=EXPECTED_PACKAGES[2]["url"], ) load_status = loader.load() assert load_status["status"] == "eventful" assert load_status["snapshot_id"] is not None - expected_snapshot_id = "4fa9f19d1d6ccc70921c8c50b278f510db63aa36" - expected_release_id = "538c98fd69a42d8d0561a7ca95b354de2143a3ab" + expected_snapshot_id = "2af571a302514bf17807dc114fff15501f8c1387" + expected_release_id = "87331a7804673cb00a339b504d2345769b7ae34a" assert expected_snapshot_id == load_status["snapshot_id"] expected_snapshot = Snapshot( id=hash_to_bytes(load_status["snapshot_id"]), branches={ b"releases/1.1.5": SnapshotBranch( target=hash_to_bytes(expected_release_id), target_type=TargetType.RELEASE, ), b"HEAD": SnapshotBranch( target=b"releases/1.1.5", target_type=TargetType.ALIAS, ), }, ) check_snapshot(expected_snapshot, swh_storage) release = swh_storage.release_get([hash_to_bytes(expected_release_id)])[0] assert release.author == Person( fullname=b"Aaron Barrett ", name=b"Aaron Barrett", email=b"aaron@aaronbarrett.com", ) def test_pubdev_loader_empty_author(datadir, requests_mock_datadir, swh_storage): loader = PubDevLoader( swh_storage, url=EXPECTED_PACKAGES[3]["url"], ) load_status = loader.load() assert load_status["status"] == "eventful" assert load_status["snapshot_id"] is not None - expected_snapshot_id = "0c7fa6b9fced23c648d2093ad5597622683f8aed" - expected_release_id = "7d8c05181069aa1049a3f0bc1d13bedc34625d47" + expected_snapshot_id = "8b86c9fb49bbf3e2b4513dc35a2838c67e8895bc" + expected_release_id = "d6ba845e28fba2a51e2ed358664cad645a2591ca" assert expected_snapshot_id == load_status["snapshot_id"] expected_snapshot = Snapshot( 
id=hash_to_bytes(load_status["snapshot_id"]), branches={ b"releases/0.0.1": SnapshotBranch( target=hash_to_bytes(expected_release_id), target_type=TargetType.RELEASE, ), b"HEAD": SnapshotBranch( target=b"releases/0.0.1", target_type=TargetType.ALIAS, ), }, ) check_snapshot(expected_snapshot, swh_storage) release = swh_storage.release_get([hash_to_bytes(expected_release_id)])[0] assert release.author == EMPTY_AUTHOR def test_pubdev_invalid_origin(swh_storage): with pytest.raises(AssertionError): PubDevLoader( swh_storage, "http://nowhere/api/packages/42", )
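Reviewer note: the extrinsic-only selection logic introduced in ``get_package_info`` and ``build_release`` can be exercised in isolation. The sketch below is a stdlib-only approximation, not the loader's code: the helper names (``pick_author_fullname``, ``pick_checksums``, ``release_message``) are hypothetical, and it returns plain strings and dicts where the loader builds ``Person`` and ``PubDevPackageInfo`` objects. It assumes each version entry has the shape the diff reads from ``self.info()["versions"][version]``.

```python
from typing import Any, Dict, Optional


def pick_author_fullname(version_info: Dict[str, Any]) -> Optional[str]:
    """Hypothetical helper mirroring the author fallback in the diff.

    Prefers the first entry of a non-empty ``authors`` list, then a truthy
    ``author`` string, else None (the loader uses EMPTY_AUTHOR there).
    """
    pubspec = version_info.get("pubspec", {})
    authors = pubspec.get("authors")
    if authors and isinstance(authors, list):
        return authors[0]
    if pubspec.get("author"):
        return pubspec["author"]
    return None


def pick_checksums(version_info: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: only pass a sha256 when the API provides one."""
    sha256 = version_info.get("archive_sha256")
    return {"sha256": sha256} if sha256 else {}


def release_message(name: str, version: str) -> str:
    """Message format after this change: no description paragraph."""
    return (
        f"Synthetic release for pub.dev source package {name} "
        f"version {version}\n"
    )


if __name__ == "__main__":
    # Illustrative entry; the author value is made up for the example.
    entry = {
        "version": "0.1.1",
        "pubspec": {"name": "Autolinker", "authors": ["A <a@example.org>"]},
    }
    print(pick_author_fullname(entry))
    print(pick_checksums(entry))
    print(release_message(entry["pubspec"]["name"], entry["version"]), end="")
```

The ``isinstance(authors, list)`` guard matters because a registry entry could carry ``authors`` as something other than a list; in that case the sketch, like the diff, falls through to ``author`` and then to the empty case.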