
Implement a base "package" loader for package managers
Closed, Migrated (edits locked)

Description

To extend the archive coverage, source packages provided by package managers (see extended list at https://libraries.io/) must be considered.

Currently, a PyPI loader (T419) is deployed in production, and work on npm ingestion is in progress (T1378).

Nevertheless, ingesting source code from package managers follows a broadly similar process for all of them, which notably includes:

  • querying an API (usually a RESTful one) to get relevant metadata about a package
  • retrieving the package source code (usually as a tarball) for ingestion into the archive

This calls for a common base implementation for loading content from package managers into the archive.

This task tracks progress on the subject.

Related Objects

Event Timeline

anlambert triaged this task as Wishlist priority. (Nov 27 2018, 12:23 PM)
anlambert created this task.
anlambert raised the priority of this task from Wishlist to Normal. (Feb 5 2019, 2:31 PM)

The table below summarizes how to list all packages and get their metadata from well-known package managers.

Package manager: packagist (PHP)
  Packages listing URL: https://packagist.org/packages/list.json
  Package metadata URL: https://repo.packagist.org/p/[vendor]/[package].json
    example: https://repo.packagist.org/p/monolog/monolog.json
  or: https://packagist.org/packages/[vendor]/[package].json
    example: https://packagist.org/packages/monolog/monolog.json
  Package source tarball URL: available for each package version in the metadata

Package manager: Go
  Packages listing URL: https://go-search.org/api?action=packages
  Package metadata URL: https://go-search.org/api?action=package&id=[package]
    example: https://go-search.org/api?action=package&id=github.com%2fdaviddengcn%2fgcse
  Package source tarball URL: not available in the metadata
  Note: Go packages are hosted on GitHub, so no need to write a dedicated loader here.

Package manager: npm (Javascript)
  Packages listing URL: https://replicate.npmjs.com/_all_docs?limit=100
  Package metadata URL: https://replicate.npmjs.com/[package]/
    example: https://replicate.npmjs.com/webpack/
  Package source tarball URL: available for each package version in the metadata

Package manager: Maven central (Java)
  Packages listing URL: http://central.maven.org/maven2/
  Package metadata URL: http://central.maven.org/maven2/[group]/[package]/maven-metadata.xml
    example: http://central.maven.org/maven2/junit/junit/maven-metadata.xml
  or: http://central.maven.org/maven2/[group]/[package]/[version]/[package]-[version].pom
    example: http://central.maven.org/maven2/junit/junit/4.9/junit-4.9.pom
  Package source tarball URL: http://central.maven.org/maven2/[group]/[package]/[version]/[package]-[version]-sources.jar
    example: http://central.maven.org/maven2/junit/junit/4.9/junit-4.9-sources.jar

Package manager: Rubygems
  Packages listing URL: no public API endpoint available; gem list -r --all can be used instead
  Package metadata URL: https://rubygems.org/api/v2/rubygems/[package]/versions/[version].json
    example: https://rubygems.org/api/v2/rubygems/coulda/versions/0.7.1.json
  Package source tarball URL: available in the metadata

Package manager: NuGet (.NET)
  Packages listing URL: https://api.nuget.org/v3/catalog0/index.json
  Package metadata URL: https://api.nuget.org/v3/catalog0/data/[catalog-commit-date]/[package].[version].json
    example: https://api.nuget.org/v3/catalog0/data/2015.02.01.11.18.40/windowsazure.storage.1.0.0.json
  or: https://api.nuget.org/v3-flatcontainer/[package]/index.json
    example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/index.json
  or: https://api.nuget.org/v3-flatcontainer/[package]/[version]/[package].nuspec
    example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/1.0.0/windowsazure.storage.nuspec
  Package source tarball URL: https://api.nuget.org/v3-flatcontainer/[package]/[version]/[package].[version].nupkg
    example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/1.0.0/windowsazure.storage.1.0.0.nupkg

Package manager: CRAN (R)
  Packages listing URL: https://cran.r-project.org/web/packages/available_packages_by_name.html
  Package metadata URL: https://cran.r-project.org/web/packages/[package]/index.html
  Package source tarball URL: tarball URL for the latest release available in the metadata;
    tarballs from previous releases available from https://cran.r-project.org/src/contrib/Archive/[package]

We've discussed a plausible plan for a "base package manager loader" with @ardumont and, to some extent, @anlambert.

The conversation was triggered by @nahimilega asking which loader would be used to load GNU origins (in part in D1482), and how that should be implemented.

This is a request for feedback from other members of the team (specifically ping @douardda @zack on the design considerations).

Pattern for a package manager "base loader"

There's a common pattern for loaders of package manager origins that we should be able to merge. This pattern applies to loading *one* package from a given package manager into a snapshot referencing all the available versions of that package.

0. List the package versions available, and the files that need to be fetched

There are two options here:

  • The loader lists the package versions at load time

When a simple API is provided, this is the cleanest option, as it minimizes the risk of mismatch between the versions available at list time and at load time. That's what we currently do in the PyPI and npm loaders.

In that case, the lister only generates recurrent tasks referencing the package name.

  • The lister provides the package versions as arguments to the loader task

That's what's currently implemented in the Debian loader, as the operation of "listing a Debian archive" gives you the full set of package metadata with all versions. This is also what will happen for the GNU lister: the metadata provided gives all version info at once in a large JSON file.

The lister will generate a one-shot task to load each package for the given repository, with the full information needed to do the data fetching.

In both cases, the "input data" will take the form of a mapping from version number (or snapshot branch name) to the set of files to be downloaded, possibly with a default version for generating the snapshot HEAD.
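
For illustration, the input data could look something like this (a sketch only; the key names are hypothetical, not an actual swh-loader-core structure):

    # Hypothetical "input data" handed to the loader for one package.
    package_info = {
        "name": "example-package",
        # used to generate the snapshot HEAD
        "default_version": "2.0.0",
        # version number (or branch name) -> files to download
        "versions": {
            "1.0.0": [
                {"url": "https://example.org/example-package-1.0.0.tar.gz",
                 "filename": "example-package-1.0.0.tar.gz"},
            ],
            "2.0.0": [
                {"url": "https://example.org/example-package-2.0.0.tar.gz",
                 "filename": "example-package-2.0.0.tar.gz"},
            ],
        },
    }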

[[ for each package version

1. Fetch the files for one package version

By default, this can be implemented as a simple HTTP request.
Loaders with more specific requirements can override this: the PyPI loader checks the integrity of the downloaded files; the Debian loader has to download and check several files for one package version.
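
A minimal sketch of the default fetching step, assuming plain HTTP and the requests library (the function name and signature are illustrative):

    import os

    import requests

    def fetch_artifact(url: str, dest_dir: str) -> str:
        """Default fetch: a single streamed HTTP GET to disk.

        Loaders with stricter requirements (checksum verification,
        several files per version) would override this.
        """
        local_path = os.path.join(dest_dir, os.path.basename(url))
        response = requests.get(url, stream=True)
        response.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        return local_path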

2. Extract the downloaded files

By default, this would be a universal archive/tarball extraction.

Loaders for specific formats can override this method (for instance, the Debian loader uses dpkg-source -x).
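
A sketch of both behaviors (names are illustrative):

    import shutil
    import subprocess

    def extract_artifact(archive_path: str, dest_dir: str) -> None:
        """Default: generic archive extraction (tar, zip, ...)."""
        shutil.unpack_archive(archive_path, extract_dir=dest_dir)

    def extract_debian_source(dsc_path: str, dest_dir: str) -> None:
        """Debian-specific override: unpack with dpkg-source."""
        subprocess.run(["dpkg-source", "-x", dsc_path, dest_dir], check=True)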

3. Convert the extracted directory to a set of Software Heritage objects

Using swh.model.from_disk.
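
Roughly as follows (exact signatures may differ across swh.model versions; the path is hypothetical):

    from swh.model.from_disk import Directory

    # Scan the unpacked source tree and build the corresponding
    # content/directory objects (paths are bytes in swh.model).
    directory = Directory.from_disk(path=b"/tmp/extracted/example-package-2.0.0")
    root_directory_id = directory.hash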

4. Extract the metadata from the unpacked directories

This would only be applicable to "smart" loaders like npm (parsing the package.json), PyPI (parsing the PKG-INFO file) or Debian (parsing debian/changelog and debian/control).

On "minimal-metadata" sources such as the GNU archive, the lister should provide the minimal set of metadata needed to populate the revision/release objects (authors, dates) as an argument to the task.

5. Generate the revision/release objects for the given version

From the data generated at steps 3 and 4.

end for each ]]

6. Generate and load the snapshot

Using the revisions/releases collected at step 5 and the branch information from step 0, generate a snapshot and load it into the Software Heritage archive.
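
Structurally, the snapshot assembly could look like this (a sketch following the swh data model; variable names are illustrative):

    # revisions_by_version: {version: revision identifier}, from step 5
    branches = {}
    for version, revision_id in revisions_by_version.items():
        branches[f"releases/{version}".encode()] = {
            "target": revision_id,
            "target_type": "revision",
        }
    # Point HEAD at the default version (step 0) through an alias branch.
    branches[b"HEAD"] = {
        "target": f"releases/{default_version}".encode(),
        "target_type": "alias",
    }
    snapshot = {"branches": branches}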

Implementation plan

This base loader should live as a new "pattern" in the swh-loader-core repository. We're not sure how we should implement the hook points for differing behavior by different loaders. I would use inheritance and method overriding but there might be a better way. I think we'll need to talk over some code (of the base class, and of at least one of the loaders) to be sure that we're going in the right direction.
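
To make the discussion concrete, here is one possible shape for such a base class, using inheritance and method overriding (all names are hypothetical; this is a sketch, not a committed design):

    class PackageLoader:
        """Template: load() drives the common steps; concrete
        loaders override the steps that differ."""

        def get_versions(self):
            """Step 0: map each version to the files to fetch."""
            raise NotImplementedError

        def fetch_artifacts(self, files):
            """Step 1: default plain HTTP download (overridable)."""
            ...

        def extract(self, archive_path, dest_dir):
            """Step 2: default archive extraction (overridable)."""
            ...

        def build_revision(self, version, directory, metadata):
            """Steps 4-5: revision/release generation (overridable)."""
            raise NotImplementedError

        def load(self):
            revisions = {}
            for version, files in self.get_versions().items():
                paths = self.fetch_artifacts(files)
                # ... extract the files, convert the result with
                # swh.model.from_disk, parse the metadata, then:
                revisions[version] = self.build_revision(version, ..., ...)
            # finally, assemble the snapshot from `revisions` and load it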

Refactoring plan

The discussion with @ardumont led to the conclusion that this base class should be able to completely replace swh-loader-tar and swh-loader-dir. With careful implementation of the lister (moving all the parsing complexity there), the so-called "GNU loader" could be implemented by calling directly into this base class as well, rather than overriding any of its methods.

The next obvious candidate for refactoring would be the deposit loader, as it's quite simple and it's the only user of the loader-dir/loader-tar scaffold.

Once these are settled, we can move the PyPI, npm and Debian loaders over to the new base implementation. We have the feeling that they already closely match this pattern and that porting the code over shouldn't be too hard (famous last words).

Thoughts?

Thanks @olasd, @ardumont, and @anlambert for this, it's a great plan and I like it a lot!

Just a few comments on the sidelines:

The lister will generate a one-shot task to load each package for the given repository, with the full information needed to do the data fetching.

This seemed clear from a different part of the description, but fwiw, here I'm assuming the plan is to only load the versions of the packages not already known/ingested in the past.

We're not sure how we should implement the hook points for differing behavior by different loaders. I would use inheritance and method overriding but there might be a better way.

This feels like a good candidate for a template method + method overriding, as you suggested. Everything that has a chance of being the same for all loaders will be pre-implemented in a base class, which also calls pre/post hooks around the main loading steps, so that derived loaders that are fine with the default behavior but need to do additional work can do so without having to override the steps themselves. The more peculiar cases (like the dpkg-source case you mentioned) will also (or only) do the overrides. (But of course ack on the fact that this will be better discussed in actual early implementations/code reviews.)
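
A minimal illustration of that pre/post hook scheme (hypothetical names, independent of any actual implementation):

    class BaseLoader:
        def pre_fetch(self, version):
            """Hook, no-op by default: a derived loader can attach
            extra work here without overriding the step itself."""

        def post_fetch(self, version, paths):
            """Hook, no-op by default."""

        def do_fetch(self, files):
            """The default fetching behavior (plain HTTP download)."""
            ...

        def fetch_step(self, version, files):
            self.pre_fetch(version)
            paths = self.do_fetch(files)
            self.post_fetch(version, paths)
            return paths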

Last comment, which is not a blocker but something that deserves highlighting: a risk of this approach is that of "non-monotonic snapshots". With VCS loaders, snapshots tend to be monotonic, i.e., future snapshots have more tags than previous ones, because old releases tend to stay around. Here we will depend on the list of versions returned by the package manager API to make it so. This will probably be true in many cases (e.g., PyPI, NPM) but it's already false in others (like Debian, I think, where it's not easy to list all versions of packages that have been moved from the main archive to archive.debian.org; from what we've been told by CRAN, that might be the case for R too). So our snapshots for (some) package managers will probably be "sliding windows" on the most recent N versions of a given package. Not a big deal, it's just the way it is, but we should keep this in mind, and maybe document it in the description of each loader.

In T1389#33215, @zack wrote:

Thanks @olasd, @ardumont, and @anlambert for this, it's a great plan and I like it a lot!

Just a few comments on the sidelines:

The lister will generate a one-shot task to load each package for the given repository, with the full information needed to do the data fetching.

This seemed clear from a different part of the description, but fwiw, here I'm assuming the plan is to only load the versions of the packages not already known/ingested in the past.

If we're able to determine that beforehand (e.g. by using the checksum of the original artifact provided by the API), then yes.

I have the feeling most package managers don't provide that information, and we'll need some other way to avoid re-processing identical packages. Maybe seeing if they support If-Modified-Since or some such.
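
For servers that do support it, a conditional GET is cheap (a sketch using requests; not tied to any existing loader code):

    from typing import Optional

    import requests

    def fetch_if_changed(url: str, last_seen: Optional[str]) -> Optional[bytes]:
        """Skip re-processing when the server reports the artifact
        unchanged since our previous visit."""
        headers = {}
        if last_seen:  # e.g. the Last-Modified value from the prior visit
            headers["If-Modified-Since"] = last_seen
        response = requests.get(url, headers=headers)
        if response.status_code == 304:  # Not Modified
            return None
        response.raise_for_status()
        return response.content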

Last comment, which is not a blocker but something that deserves highlighting: a risk of this approach is that of "non-monotonic snapshots". With VCS loaders, snapshots tend to be monotonic, i.e., future snapshots have more tags than previous ones, because old releases tend to stay around. Here we will depend on the list of versions returned by the package manager API to make it so. This will probably be true in many cases (e.g., PyPI, NPM) but it's already false in others (like Debian, I think, where it's not easy to list all versions of packages that have been moved from the main archive to archive.debian.org; from what we've been told by CRAN, that might be the case for R too). So our snapshots for (some) package managers will probably be "sliding windows" on the most recent N versions of a given package. Not a big deal, it's just the way it is, but we should keep this in mind, and maybe document it in the description of each loader.

Debian origins are indeed inherently non-monotonic, as only the versions reachable in a currently live suite are shown.

I think it might make sense to provide, at some point, "cumulative snapshots" presenting all versions we've ever archived for such origins, but that's more of a UI/UX issue than it is an issue loaders themselves should solve.

Extending on the plan by @olasd, here are some of my thoughts on the implementation of the base loader.

The format in which the lister returns the data is different for every lister. Some directly return the URL to visit to get the metadata and the package source code URL (like pypi), whereas some return the link to the package source code (like gnu).

Convert information passed by the lister into a standard format.

It is always easier to generalise when the data is in a known format, so this should be the first step of the loader task.
This step includes making requests to the metadata URL (if any) to get the package source code URL, and then converting the obtained data into a specific standard format that eases generalisation of the subsequent steps.
We could convert to a standard format like the following (the gnu lister already returns data in roughly this format):

    {
        "name": "<name of the package>",
        "tarballs": [
            {
                "url": "<tarball url>",
                "sha": "<checksum, if provided>",
                "time_last_modified": "<last modification time, if provided>",
                "description": "...",
                "type": "...",
                "author": "...",
                # ...
            },
            # ...
        ],
    }

Some fields, like the package name and the url, must be present; others are conditional, depending on whether the information is provided.

Define the method to generate an artifact identifier to check for identical packages.

Current candidates for avoiding re-processing identical packages:

  • Using provided metadata, like a SHA hash (e.g. pypi) or the last-modified time (e.g. gnu).

There needs to be a field in the provided data which can be used to avoid re-processing identical packages (like a SHA or a last-modified time).
This field is specific to the package manager, so it needs to be known beforehand for the rest of the execution to be identical. We can keep a variable for this purpose; for pypi, for instance:

self.generate_artifact_field = 'sha'
  • If-Modified-Since

For defining how to avoid reprocessing, we could have two different classes: one which uses If-Modified-Since and another which uses a field in the provided data, as sketched below.
(analogy) Like in the listers, where we have the simple lister and the indexing lister: a different class to serve each purpose.
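
A minimal sketch of that split (class and field names are hypothetical):

    class FieldBasedDedup:
        """Compare a metadata field against what was recorded
        at the previous visit."""
        generate_artifact_field = "sha"  # 'sha' for pypi, 'time_last_modified' for gnu

        def is_new(self, artifact, known_values):
            return artifact[self.generate_artifact_field] not in known_values

    class IfModifiedSinceDedup:
        """Rely on HTTP conditional requests instead of metadata."""

        def is_new(self, artifact, known_values):
            ...  # issue a conditional GET, as sketched earlier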

Once the information is converted into the standard format and the method of artifact generation is known, we can generalise the further steps. This is where the work of the base loader starts.

The work that would be done by the base loader after receiving the information in the standard format:

  • Generating the artifact identifier
  • Getting old artifacts and comparing them
  • Downloading and extracting
  • Making snapshots and storing them
  • Cleanup and flush

Methods to generalise these processes can be devised, eliminating the need to write code for these steps separately in every loader.
However, they can still be overridden if the package manager has any specific need.

(stating the obvious) We should try to make a method for every small task (generate_artifact, compare, download, ...) so that, if needed, any one of them can easily be overridden without disturbing the whole stack (like overriding the extraction process for the Debian loader).

Some more hook points we need to provide within this process, as some package managers may need them:

  • Filter the downloaded files.

Remove the unnecessary files and binaries that are downloaded but are not to be archived.

  • Gather more metadata from the downloaded files and append it to the standard format generated earlier.

Some package managers, like pypi, provide a metadata file which could give us useful information (author name, etc.), so this data needs to be appended to the standard format.

Refactoring

I suppose, for some tasks, we could take inspiration from current loaders: for downloading remote tarballs, for example, we could reuse code already present in the tarball loader. But it would be best if the base loader were built from scratch rather than refactored from any existing code.

PS: For listers like the gnu lister, this approach could implement their loader in as few as 10-20 lines of code.

Package manager: Rubygems
  Packages listing URL: no public API endpoint available
  Package metadata URL: https://rubygems.org/api/v2/rubygems/[package]/versions/[version].json
  Package source tarball URL: available in the metadata

We can use the search API:

http 'https://rubygems.org/api/v1/search.yaml?query=*&page=1'

and increase the page number until the result is empty.
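
Something like the following, assuming the JSON variant of the same endpoint behaves identically (a sketch):

    import requests

    page = 1
    gems = []
    while True:
        response = requests.get(
            "https://rubygems.org/api/v1/search.json",
            params={"query": "*", "page": page},
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:  # empty result: we have reached the last page
            break
        gems.extend(batch)
        page += 1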

@douardda We have a separate task (T1777) for the rubygems lister. I guess we can add a separate column to the table by @anlambert showing the status of the lister implementations and the tasks related to them.

ardumont renamed this task from Implement a base loader for package managers to Implement a base "package" loader for package managers. (Oct 1 2019, 1:19 PM)
ardumont changed the task status from Open to Work in Progress.
ardumont changed the status of subtask T2022: Re-implement npm loader with base loader from Open to Work in Progress.
ardumont changed the status of subtask T2021: Re-implement pypi loader with package loader from Open to Work in Progress.

Current work is in the swh-loader-core repository, within the package-loader branch.

ardumont claimed this task.
gitlab-migration changed the status of subtask T2022: Re-implement npm loader with base loader from Resolved to Migrated.
gitlab-migration changed the status of subtask T2023: Re-implement gnu loader with package loader from Resolved to Migrated.