Introduce RPM loader for ingesting .rpm files from Fedora archives
Related to T4648
Differential D8753
feat: Introduce RPM loader
Authored by KShivendu on Oct 23 2022, 8:23 AM.
Event Timeline

Build has FAILED

Patch application report for D8753 (id=31662)

Rebasing onto e6847f3616...

First, rewinding head to replay your work on top of it...
Applying: feat: Bare minimum implementation of RPM loader
Applying: fix expected format of extra loader args packages

Changes applied before test

commit 6395a16c63ff664859937d1b36c42cc550b7e93f
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Oct 27 15:49:23 2022 +0530

    fix expected format of extra loader args packages

commit 7faa3fa4b076c8418263595de38be99e50851477
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Bare minimum implementation of RPM loader

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1007/

Build has FAILED

Patch application report for D8753 (id=31662)

Rebasing onto e6847f3616...

First, rewinding head to replay your work on top of it...
Applying: feat: Bare minimum implementation of RPM loader
Applying: fix expected format of extra loader args packages

Changes applied before test

commit 4b4a9edcbb4a6ceae1f77ea5284eecad8eea9aa4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Oct 27 15:49:23 2022 +0530

    fix expected format of extra loader args packages

commit 0cacbf42c96ebc6216785219793a2becea15c6e3
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Bare minimum implementation of RPM loader

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1008/

Build is green

Patch application report for D8753 (id=31669)

Rebasing onto e6847f3616...

First, rewinding head to replay your work on top of it...
Applying: feat: Bare minimum implementation of RPM loader
Applying: fix expected format of extra loader args packages

Changes applied before test

commit c0e079ebf7daad9eb55c52b2867915ec3857f534
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Oct 27 15:49:23 2022 +0530

    fix expected format of extra loader args packages

commit 7ac2627d3bdd9d5906a33a63e389a0d51ed5433d
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Bare minimum implementation of RPM loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1009/ for more details.

Build is green

Patch application report for D8753 (id=31722)

Rebasing onto e6847f3616...

First, rewinding head to replay your work on top of it...
Applying: feat: Bare minimum implementation of RPM loader
Applying: fix expected format of extra loader args packages
Applying: feat: Use subprocess.check_call and extract .tar obtained from .rpm

Changes applied before test

commit 5bacc3fe4e69a0268d8de1a965847a4431472375
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Nov 3 10:57:15 2022 +0530

    feat: Use subprocess.check_call and extract .tar obtained from .rpm

commit 5c745970be8da82e005c20ba841e70fdfc472e60
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Oct 27 15:49:23 2022 +0530

    fix expected format of extra loader args packages

commit 5f2ebc04ebabd0fce1541e08168e2f8bdbbe5c48
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Bare minimum implementation of RPM loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1014/ for more details.

Build is green

Patch application report for D8753 (id=31723)

Rebasing onto e6847f3616...

First, rewinding head to replay your work on top of it...
Applying: feat: Bare minimum implementation of RPM loader
Applying: fix expected format of extra loader args packages
Applying: feat: Use subprocess.check_call and extract .tar obtained from .rpm

Changes applied before test

commit 29efc4de50ef4f7a78931ab6cd7e69f6f0f5a4ff
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Nov 3 10:57:15 2022 +0530

    feat: Use subprocess.check_call and extract .tar obtained from .rpm

commit 2990666402966c7dc5a8ef5d9099cc226f094b43
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Oct 27 15:49:23 2022 +0530

    fix expected format of extra loader args packages

commit cefb9e46d99c60ad5c4a2e9740c6adcb734dd92a
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Bare minimum implementation of RPM loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1015/ for more details.

Build is green

Patch application report for D8753 (id=31774)

Rebasing onto bf2cb039d5...

First, rewinding head to replay your work on top of it...
Applying: feat: Bare minimum implementation of RPM loader
Using index info to reconstruct a base tree...
M setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 feat: Bare minimum implementation of RPM loader
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!

Could not rebase; Attempt merge onto bf2cb039d5...

Already up to date.

Changes applied before test

commit ca5e6d2094d167faacc5b32a85d38c8b5ce7ec5d
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Nov 6 00:34:34 2022 +0530

    feat: Make the lister incremental and use build_time as release date

commit 0bd2aabe0aa8723a29845d0fc2e82c81192dcbe0
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Nov 3 10:57:15 2022 +0530

    feat: Use subprocess.check_call and extract .tar obtained from .rpm

commit 779ba82f3ad56e25ec824fcf2212551674b8d57e
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Oct 27 15:49:23 2022 +0530

    fix expected format of extra loader args packages

commit b26268d12c6820051f0507c035e29db6dac824b4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Bare minimum implementation of RPM loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1017/ for more details.

Build is green

Patch application report for D8753 (id=31775)

Rebasing onto bf2cb039d5...

First, rewinding head to replay your work on top of it...
Applying: feat: Bare minimum implementation of RPM loader
Using index info to reconstruct a base tree...
M setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 feat: Bare minimum implementation of RPM loader
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!

Could not rebase; Attempt merge onto bf2cb039d5...

Already up to date.

Changes applied before test

commit eedd1308cb2c0797e184ef97e24f783a5e43ff5f
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Bare minimum implementation of RPM loader

    feat: Make the lister incremental and use build_time as release date

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1018/ for more details.

feat: Remove microseconds from buildTime metadata to match real values from fedora lister

Build is green

Patch application report for D8753 (id=31802)

Rebasing onto bf2cb039d5...

First, rewinding head to replay your work on top of it...
Applying: feat: Incremental RPM loader implementation
Using index info to reconstruct a base tree...
M setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 feat: Incremental RPM loader implementation
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!

Could not rebase; Attempt merge onto bf2cb039d5...

Already up to date.

Changes applied before test

commit 6562b90c15ca7a4706370db14bf7af24fe592804
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Incremental RPM loader implementation

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1019/ for more details.

@KShivendu, I added some inline comments to improve the loader output.

I also noticed that many rpm archives contain a spec file and some source tarballs in their content.

diff --git a/swh/loader/package/rpm/loader.py b/swh/loader/package/rpm/loader.py
index 2b93dc2..179d5aa 100644
--- a/swh/loader/package/rpm/loader.py
+++ b/swh/loader/package/rpm/loader.py
@@ -1,19 +1,24 @@
-# Copyright (C) 2019-2021 The Software Heritage developers
+# Copyright (C) 2022 The Software Heritage developers
 # See the AUTHORS file at the top-level directory of this distribution
 # License: GNU General Public License version 3, or any later version
 # See top-level LICENSE file for more information
 
+from __future__ import annotations
+
 import logging
-from os import path, remove
+from os import path, remove, walk
 import string
 import subprocess
+import tempfile
 from typing import Any, Dict, Iterator, List, Mapping, Optional, Sequence, Tuple
 
 import attr
+from packaging.version import parse as parse_version
 
 from swh.core.tarball import uncompress
 from swh.loader.package.loader import BasePackageInfo, PackageLoader
 from swh.loader.package.utils import EMPTY_AUTHOR
+from swh.model import from_disk
 from swh.model.model import ObjectType, Release, Sha1Git, TimestampWithTimezone
 from swh.storage.interface import StorageInterface
 
@@ -22,28 +27,25 @@ logger = logging.getLogger(__name__)
 
 @attr.s
 class RpmPackageInfo(BasePackageInfo):
-    raw_info = attr.ib(type=Dict[str, Any])
     name = attr.ib(type=str)
     build_time = attr.ib(type=str, default=None)
     """Build time of the package in iso format. (e.g. 2017-02-10T04:59:31+00:00)"""
 
     EXTID_TYPE = "rpm-sha256"
-    MANIFEST_FORMAT = string.Template("$version $url")
+    MANIFEST_FORMAT = string.Template("$name $version $build_time")
 
     @classmethod
-    def from_metadata(
-        cls, a_metadata: Dict[str, Any], origin: str, version: str
-    ) -> "RpmPackageInfo":
+    def from_metadata(cls, a_metadata: Dict[str, Any], version: str) -> RpmPackageInfo:
         filename = a_metadata["url"].split("/")[-1]
         assert filename.endswith(".rpm")
 
         return cls(
+            name=a_metadata["name"],  # nginx
             url=a_metadata["url"],  # url of the .rpm file
             filename=filename,  # nginx-1.18.0-5.fc34.src.rpm
-            version=version,  # 34/Everything/1.18.0
+            version=version,  # 1.18.0-5.fc34
             build_time=a_metadata["buildTime"],
-            raw_info=a_metadata,
-            name=a_metadata["name"],
+            checksums=a_metadata["checksums"],
         )
 
 
@@ -54,7 +56,7 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
         self,
         storage: StorageInterface,
         url: str,
-        packages: Mapping[str, Any],
+        packages: Dict[str, Dict[str, Any]],
         **kwargs: Any,
     ):
         """RPM Loader implementation.
@@ -64,7 +66,7 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
             packages: versioned packages and associated artifacts, example::
 
               {
-                '34/Everything/1.18.0': {
+                '1.18.0-5.fc34': {
                   'name': 'nginx',
                   'version': '1.18.0'
                   'release': 34,
@@ -78,17 +80,20 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
         super().__init__(storage=storage, url=url, **kwargs)
         self.url = url
         self.packages = packages
+        self.tarball_branches: Dict[bytes, Dict[str, Any]] = {}
 
     def get_versions(self) -> Sequence[str]:
-        """Returns the keys of the packages input (e.g. 34/Everything/1.18.0, etc...)"""
-        return list(self.packages)
+        """Returns the keys of the packages input (e.g. 1.18.0-5.fc34, etc...)"""
+        return list(sorted(self.packages, key=parse_version))
+
+    def get_default_version(self) -> str:
+        """Get the newest release version of a rpm package"""
+        return self.get_versions()[-1]
 
     def get_package_info(self, version: str) -> Iterator[Tuple[str, RpmPackageInfo]]:
         yield (
             version,
-            RpmPackageInfo.from_metadata(
-                self.packages[version], self.origin.url, version
-            ),
+            RpmPackageInfo.from_metadata(self.packages[version], version),
         )
 
     def uncompress(
@@ -100,13 +105,43 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
     def build_release(
         self, p_info: RpmPackageInfo, uncompressed_path: str, directory: Sha1Git
     ) -> Optional[Release]:
+
+        # extract tarballs that might be located in the root directory of the rpm
+        # package and adds a dedicated branch for it in the snapshot
+        root, _, files = next(walk(uncompressed_path))
+        for file in files:
+            file_path = path.join(root, file)
+            with tempfile.TemporaryDirectory() as tmpdir:
+                try:
+                    uncompress(file_path, tmpdir)
+                except Exception:
+                    # not a tarball
+                    continue
+
+                tarball_dir = from_disk.Directory.from_disk(
+                    path=tmpdir.encode("utf-8"),
+                    max_content_length=self.max_content_size,
+                )
+
+                contents, skipped_contents, directories = from_disk.iter_directory(
+                    tarball_dir
+                )
+                self.storage.skipped_content_add(skipped_contents)
+                self.storage.content_add(contents)
+                self.storage.directory_add(directories)
+
+                self.tarball_branches[file.encode()] = {
+                    "target_type": "directory",
+                    "target": tarball_dir.hash,
+                }
+
         msg = (
             f"Synthetic release for Rpm source package {p_info.name} "
             f"version {p_info.version}\n"
         )
 
         return Release(
-            name=p_info.name.encode(),
+            name=p_info.version.encode(),
             message=msg.encode(),
             author=EMPTY_AUTHOR,
             date=TimestampWithTimezone.from_iso8601(p_info.build_time),
@@ -115,6 +150,9 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
             synthetic=True,
         )
 
+    def extra_branches(self) -> Dict[bytes, Mapping[str, Any]]:
+        return self.tarball_branches
+
 
 def extract_rpm_package(rpm_path: str, dest: str) -> str:
     """Extracts an RPM package."""
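
The tarball handling proposed in the diff above (look only at the root of the extracted package, try to unpack each file, skip anything that is not a tarball, record one branch per tarball) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the loader's code: `tarfile` stands in for `swh.core.tarball.uncompress`, the result maps each branch name to an extraction path rather than to a directory hash stored through `self.storage`, and `find_root_tarballs` and `make_sample_tree` are hypothetical names introduced here.

```python
import os
import tarfile
import tempfile
from typing import Dict


def find_root_tarballs(extracted_dir: str) -> Dict[bytes, str]:
    """Map each top-level tarball name to a directory holding its contents.

    Hypothetical helper: only files directly under ``extracted_dir`` are
    inspected, like ``next(walk(uncompressed_path))`` in the diff.
    """
    branches: Dict[bytes, str] = {}
    root, _, files = next(os.walk(extracted_dir))
    for name in sorted(files):
        file_path = os.path.join(root, name)
        if not tarfile.is_tarfile(file_path):
            # e.g. the .spec file or a patch: not a tarball, skip it
            continue
        dest = tempfile.mkdtemp(prefix="rpm-tarball-")
        with tarfile.open(file_path) as archive:
            archive.extractall(dest)
        # the branch name is the tarball file name, encoded as bytes
        branches[name.encode()] = dest
    return branches


def make_sample_tree() -> str:
    """Build a fake extracted source package: one spec file, one tarball."""
    base = tempfile.mkdtemp(prefix="rpm-root-")
    with open(os.path.join(base, "nginx.spec"), "w") as spec:
        spec.write("Name: nginx\n")
    src = tempfile.mkdtemp(prefix="rpm-src-")
    with open(os.path.join(src, "README"), "w") as readme:
        readme.write("hello\n")
    tarball = os.path.join(base, "nginx-1.18.0.tar.gz")
    with tarfile.open(tarball, "w:gz") as archive:
        archive.add(src, arcname="nginx-1.18.0")
    return base


if __name__ == "__main__":
    for branch, location in find_root_tarballs(make_sample_tree()).items():
        print(branch, "->", location)
```

Checking with `tarfile.is_tarfile` before extracting plays the role of the `try`/`except Exception` around `uncompress` in the diff: files such as the spec file are skipped and produce no branch.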

Build is green

Patch application report for D8753 (id=31860)

Rebasing onto bf2cb039d5...

First, rewinding head to replay your work on top of it...
Applying: feat: Incremental RPM loader implementation
Using index info to reconstruct a base tree...
M setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 feat: Incremental RPM loader implementation
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!

Could not rebase; Attempt merge onto bf2cb039d5...

Already up to date.

Changes applied before test

commit 7ce76fe96751f35dd56fbb2aec9ce56bc7579d53
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Incremental RPM loader implementation

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1025/ for more details.

Build is green

Patch application report for D8753 (id=31862)

Rebasing onto bf2cb039d5...

First, rewinding head to replay your work on top of it...
Applying: feat: Incremental RPM loader implementation
Using index info to reconstruct a base tree...
M setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
CONFLICT (content): Merge conflict in setup.py
Patch failed at 0001 feat: Incremental RPM loader implementation
Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".
Rebase failed (ret=1)!

Could not rebase; Attempt merge onto bf2cb039d5...

Already up to date.

Changes applied before test

commit 70e2577b2b36d048738b6de47de73b808d042374
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Incremental RPM loader implementation

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1026/ for more details.

Build is green

Patch application report for D8753 (id=31889)

Rebasing onto 31ab1aa69e...

Current branch diff-target is up to date.

Changes applied before test

commit 44089c1677721ff720c1f43db696df8945df1140
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Incremental RPM loader implementation

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1030/ for more details.

@KShivendu, we are almost in a landable state for the RPM loader.

I gave a last round of tests to the fedora lister and RPM loader in docker and now that D8847 got landed for the webapp, I have a better understanding of what must be done to have results similar to the debian lister/loader.

I pushed D8848 to slightly update the fedora lister regarding the package versions we should provide to the generic RPM loader.

After these changes, I had to update the RPM loader the following way to have an output similar to the debian loader (diff generated from the current state of that Phab diff):

diff --git a/swh/loader/package/rpm/loader.py b/swh/loader/package/rpm/loader.py
index 8b22685..cf30028 100644
--- a/swh/loader/package/rpm/loader.py
+++ b/swh/loader/package/rpm/loader.py
@@ -17,7 +17,7 @@ from packaging.version import parse as parse_version
 
 from swh.core.tarball import uncompress
 from swh.loader.package.loader import BasePackageInfo, PackageLoader
-from swh.loader.package.utils import EMPTY_AUTHOR
+from swh.loader.package.utils import EMPTY_AUTHOR, release_name
 from swh.model import from_disk
 from swh.model.model import ObjectType, Release, Sha1Git, TimestampWithTimezone
 from swh.storage.interface import StorageInterface
@@ -28,11 +28,13 @@ logger = logging.getLogger(__name__)
 @attr.s
 class RpmPackageInfo(BasePackageInfo):
     name = attr.ib(type=str)
+    intrinsic_version = attr.ib(type=str)
+    """Intrinsic version of the package, independent from the distribution it was found"""
     build_time = attr.ib(type=str, default=None)
     """Build time of the package in iso format. (e.g. 2017-02-10T04:59:31+00:00)"""
 
     EXTID_TYPE = "rpm-sha256"
-    MANIFEST_FORMAT = string.Template("$name $version $build_time")
+    MANIFEST_FORMAT = string.Template("$name $intrinsic_version $build_time")
 
     @classmethod
     def from_metadata(cls, a_metadata: Dict[str, Any], version: str) -> RpmPackageInfo:
@@ -43,7 +45,8 @@ class RpmPackageInfo(BasePackageInfo):
             name=a_metadata["name"],  # nginx
             url=a_metadata["url"],  # url of the .rpm file
             filename=filename,  # nginx-1.18.0-5.fc34.src.rpm
-            version=version,  # 1.18.0-5.fc34
+            version=version,  # fedora34/everything/1.18.0-5
+            intrinsic_version=a_metadata["version"],  # 1.18.0-5
             build_time=a_metadata["buildTime"],
             checksums=a_metadata["checksums"],
         )
@@ -66,9 +69,9 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
             packages: versioned packages and associated artifacts, example::
 
               {
-                '1.18.0-5.fc34': {
+                'fedora34/everything/1.18.0-5': {
                   'name': 'nginx',
-                  'version': '1.18.0'
+                  'version': '1.18.0-5'
                   'release': 34,
                   'edition': 'Everything',
                   'buildTime': '2022-11-01T12:00:55+00:00',
@@ -87,7 +90,7 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
         self.tarball_branches: Dict[bytes, Mapping[str, Any]] = {}
 
     def get_versions(self) -> Sequence[str]:
-        """Returns the keys of the packages input (e.g. 1.18.0-5.fc34, etc...)"""
+        """Returns the keys of the packages input (e.g. fedora34/everything/1.18.0-5, etc...)"""
         return list(sorted(self.packages, key=parse_version))
 
     def get_default_version(self) -> str:
@@ -95,7 +98,10 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
         return self.get_versions()[-1]
 
     def get_package_info(self, version: str) -> Iterator[Tuple[str, RpmPackageInfo]]:
-        yield (version, RpmPackageInfo.from_metadata(self.packages[version], version))
+        yield (
+            release_name(version),
+            RpmPackageInfo.from_metadata(self.packages[version], version),
+        )
 
     def uncompress(
         self, dl_artifacts: List[Tuple[str, Mapping[str, Any]]], dest: str
@@ -141,7 +147,7 @@ class RpmLoader(PackageLoader[RpmPackageInfo]):
         )
 
         return Release(
-            name=p_info.version.encode(),
+            name=p_info.intrinsic_version.encode(),
             message=msg.encode(),
             author=EMPTY_AUTHOR,
             date=TimestampWithTimezone.from_iso8601(p_info.build_time),

Tests still need to be updated though.
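
The distinction made in the comment above, between the lister-provided key (`fedora34/everything/1.18.0-5`) and the package's intrinsic version (`1.18.0-5`), can be illustrated with a small standalone sketch. It reuses the `MANIFEST_FORMAT` template and the sample metadata values from the diff; `split_version_key` and `render_manifest` are hypothetical helpers introduced only for this example, and the three-part key layout is assumed from the single example shown in the docstring.

```python
import string
from typing import Any, Dict

# Same template as in the diff above: the manifest no longer depends on the
# distribution-specific key, only on name, intrinsic version and build time.
MANIFEST_FORMAT = string.Template("$name $intrinsic_version $build_time")


def split_version_key(key: str) -> Dict[str, str]:
    """Split a key such as 'fedora34/everything/1.18.0-5' (assumed layout)."""
    distribution, edition, intrinsic_version = key.split("/", 2)
    return {
        "distribution": distribution,
        "edition": edition,
        "intrinsic_version": intrinsic_version,
    }


def render_manifest(metadata: Dict[str, Any]) -> str:
    """Render the manifest string from one entry of the packages mapping."""
    return MANIFEST_FORMAT.substitute(
        name=metadata["name"],
        intrinsic_version=metadata["version"],
        build_time=metadata["buildTime"],
    )


# Sample input shaped like the docstring example in the diff.
packages = {
    "fedora34/everything/1.18.0-5": {
        "name": "nginx",
        "version": "1.18.0-5",
        "release": 34,
        "edition": "Everything",
        "buildTime": "2022-11-01T12:00:55+00:00",
    }
}

if __name__ == "__main__":
    for key, metadata in packages.items():
        print(split_version_key(key))
        print(render_manifest(metadata))
```

Because the manifest is built from the `version` field of the metadata rather than from the key, the same package build listed under another distribution key would render the same manifest string in this sketch.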

Build is green

Patch application report for D8753 (id=31894)

Rebasing onto 31ab1aa69e...

Current branch diff-target is up to date.

Changes applied before test

commit d071b137000a47dc53a057563362242c9bc77c39
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Incremental RPM loader implementation

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1031/ for more details.

@KShivendu, before landing this please fix the year in some license headers and update tests to match fedora lister output.

Build is green

Patch application report for D8753 (id=31898)

Rebasing onto 31ab1aa69e...

Current branch diff-target is up to date.

Changes applied before test

commit a196c85d5ab1ed3ce09db6800559641c309f51e9
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Oct 23 11:49:47 2022 +0530

    feat: Incremental RPM loader implementation

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1032/ for more details.