Implement a base loader for package managers
Open, Normal, Public

Description

To extend the archive coverage, source packages provided by package managers (see extended list at https://libraries.io/) must be considered.

Currently, a PyPI loader (T419) is deployed in production, and work on npm ingestion is in progress (T1378).

Nevertheless, ingesting source code from package managers follows a broadly similar process for all of them. Notably, it includes:

  • the querying of an API (usually a RESTful one) to get relevant metadata about a package
  • the retrieval of the package source code (usually in tarball form) for ingestion into the archive

This calls for a common base implementation for loading content from package managers into the archive.

This task tracks progress on the subject.

Event Timeline

anlambert triaged this task as Wishlist priority.
anlambert raised the priority of this task from Wishlist to Normal. Feb 5 2019, 2:31 PM
anlambert added a comment. Edited Feb 7 2019, 4:32 PM

The listing below summarizes, for each well-known package manager, how to list all packages, how to get their metadata, and where to find the source tarballs.

packagist (PHP)
  • Packages listing URL: https://packagist.org/packages/list.json
  • Package metadata URLs:
      https://repo.packagist.org/p/[vendor]/[package].json
      example: https://repo.packagist.org/p/monolog/monolog.json
      https://packagist.org/packages/[vendor]/[package].json
      example: https://packagist.org/packages/monolog/monolog.json
  • Package source tarball URL: available for each package version in the metadata (for both metadata URLs)

Go
  • Packages listing URL: https://go-search.org/api?action=packages
  • Package metadata URL: https://go-search.org/api?action=package&id=[package]
      example: https://go-search.org/api?action=package&id=github.com%2fdaviddengcn%2fgcse
  • Package source tarball URL: not available in metadata. Go packages are hosted on GitHub, so no need to write a dedicated loader here.

npm (JavaScript)
  • Packages listing URL: https://replicate.npmjs.com/_all_docs?limit=100
  • Package metadata URL: https://replicate.npmjs.com/[package]/
      example: https://replicate.npmjs.com/webpack/
  • Package source tarball URL: available for each package version in the metadata

Maven central (Java)
  • Packages listing URL: http://central.maven.org/maven2/
  • Package metadata URLs:
      http://central.maven.org/maven2/[group]/[package]/maven-metadata.xml
      example: http://central.maven.org/maven2/junit/junit/maven-metadata.xml
      http://central.maven.org/maven2/[group]/[package]/[version]/[package]-[version].pom
      example: http://central.maven.org/maven2/junit/junit/4.9/junit-4.9.pom
  • Package source tarball URL: http://central.maven.org/maven2/[group]/[package]/[version]/[package]-[version]-sources.jar
      example: http://central.maven.org/maven2/junit/junit/4.9/junit-4.9-sources.jar

Rubygems
  • Packages listing URL: no public API endpoint available; gem list -r --all can be used instead
  • Package metadata URL: https://rubygems.org/api/v2/rubygems/[package]/versions/[version].json
      example: https://rubygems.org/api/v2/rubygems/coulda/versions/0.7.1.json
  • Package source tarball URL: available in the metadata

NuGet (.NET)
  • Packages listing URL: https://api.nuget.org/v3/catalog0/index.json
  • Package metadata URLs:
      https://api.nuget.org/v3/catalog0/data/[catalog-commit-date]/[package].[version].json
      example: https://api.nuget.org/v3/catalog0/data/2015.02.01.11.18.40/windowsazure.storage.1.0.0.json
      https://api.nuget.org/v3-flatcontainer/[package]/index.json
      example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/index.json
      https://api.nuget.org/v3-flatcontainer/[package]/[version]/[package].nuspec
      example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/1.0.0/windowsazure.storage.nuspec
  • Package source tarball URL: https://api.nuget.org/v3-flatcontainer/[package]/[version]/[package].[version].nupkg
      example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/1.0.0/windowsazure.storage.1.0.0.nupkg

CRAN (R)
  • Packages listing URL: https://cran.r-project.org/web/packages/available_packages_by_name.html
  • Package metadata URL: https://cran.r-project.org/web/packages/[package]/index.html
  • Package source tarball URL: tarball URL of the latest release available in the metadata; tarballs from previous releases available from https://cran.r-project.org/src/contrib/Archive/[package]
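
As a rough illustration of how a lister or loader might consume the URL patterns summarized above, the sketch below stores a few of the metadata URL patterns as templates and fills them in. The names `METADATA_URL_TEMPLATES` and `metadata_url` are hypothetical and not part of any existing Software Heritage module; only the URL patterns themselves come from the summary.

```python
# Hypothetical helper: the URL patterns are copied from the summary above,
# with [placeholder] rewritten as {placeholder} for str.format().
METADATA_URL_TEMPLATES = {
    "packagist": "https://repo.packagist.org/p/{vendor}/{package}.json",
    "npm": "https://replicate.npmjs.com/{package}/",
    "rubygems": "https://rubygems.org/api/v2/rubygems/{package}/versions/{version}.json",
    "nuget": "https://api.nuget.org/v3-flatcontainer/{package}/index.json",
}


def metadata_url(manager: str, **params: str) -> str:
    """Fill in the metadata URL template registered for `manager`."""
    try:
        template = METADATA_URL_TEMPLATES[manager]
    except KeyError:
        raise ValueError(f"unknown package manager: {manager}") from None
    return template.format(**params)


print(metadata_url("npm", package="webpack"))
print(metadata_url("packagist", vendor="monolog", package="monolog"))
```

Keeping the patterns in data rather than in code would let each package manager be described declaratively, though some (CRAN's HTML pages, for instance) would likely need dedicated parsing anyway.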

We've discussed a plausible plan for a "base package manager loader" with @ardumont and, to some extent, @anlambert.

The conversation was triggered by @nahimilega asking which loader would be used to load GNU origins (in part in D1482), and how that should be implemented.

This is a request for feedback from other members of the team (specifically ping @douardda @zack on the design considerations).

Pattern for a package manager "base loader"

There's a common pattern across loaders for package manager origins that we should be able to factor into shared code. This pattern applies to loading *one* package from a given package manager into a snapshot referencing all the available versions of the package.

0. List the package versions available, and the files that need to be fetched

There are two options here:

  • The loader lists the package versions at load time

When a simple API is provided, this is the cleanest option as it minimizes the risk of mismatch between the versions available at list time, and at load time. That's what we currently do in the PyPI and npm loaders.

In that case, the lister only generates recurrent tasks referencing the package name.

  • The lister provides the package versions as arguments to the loader task

That's what's currently implemented in the Debian loader, as the operation of "listing a Debian archive" gives you the full set of package metadata with all versions. This is also what will happen for the GNU lister: the metadata provided gives all version info at once in a large JSON file.

The lister will generate a one-shot task to load each package for the given repository, with the full information needed to do the data fetching.

In both cases, the "input data" will take the form of a mapping from version number or a snapshot branch name to a set of files to be downloaded, possibly with a default version for generating the snapshot HEAD.
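
To make the shape of this "input data" more concrete, here is a minimal sketch of one way it could be represented. The class and field names (`PackageFile`, `PackageInput`, `default_version`) are assumptions made for illustration, and the example.org URLs are placeholders; nothing here is an existing API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PackageFile:
    """One file to download for a package version (hypothetical structure)."""
    url: str
    filename: str


@dataclass
class PackageInput:
    """Mapping from version number / branch name to the files to fetch."""
    versions: Dict[str, List[PackageFile]] = field(default_factory=dict)
    # Optional version used to generate the snapshot HEAD.
    default_version: Optional[str] = None

    def validate(self) -> None:
        """Reject a default version that is not one of the listed versions."""
        if self.default_version is not None and self.default_version not in self.versions:
            raise ValueError(f"default version {self.default_version!r} is not listed")


package = PackageInput(
    versions={
        "1.0": [PackageFile("https://example.org/pkg-1.0.tar.gz", "pkg-1.0.tar.gz")],
        "1.1": [PackageFile("https://example.org/pkg-1.1.tar.gz", "pkg-1.1.tar.gz")],
    },
    default_version="1.1",
)
package.validate()
```

The same structure could be built by the loader itself (first option) or passed in by the lister as task arguments (second option), which is what would make the rest of the pipeline independent of where the version list comes from.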

[[ for each package version

1. Fetch the files for one package version

By default, this can be implemented as a simple HTTP request.
Loaders with more specific requirements can override this: the PyPI loader checks the integrity of the downloaded files; the Debian loader has to download and check several files for one package version.
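
A minimal sketch of what such a default fetch step could look like, using only the standard library. The function name `fetch_file` and the choice of SHA-256 for the optional integrity check are assumptions for illustration; the existing loaders may check different hashes or use a different HTTP client.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def fetch_file(url: str, dest: Path, expected_sha256: Optional[str] = None) -> Path:
    """Download `url` to `dest`, optionally verifying a SHA-256 checksum."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    if expected_sha256 is not None:
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        if digest != expected_sha256:
            # Do not leave a corrupted download behind.
            dest.unlink()
            raise ValueError(f"checksum mismatch for {url}: got {digest}")
    return dest
```

A loader that needs several files per version, as described for Debian, could call a helper like this once per file and then run its own cross-file checks.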

2. Extract the downloaded files

By default, this would be a universal archive/tarball extraction.

Loaders for specific formats can override this method (for instance, the Debian loader uses dpkg-source -x).
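
For the default case, a sketch along these lines might be enough. It relies on `shutil.unpack_archive` from the standard library, which picks the format from the file extension; the function name `extract_archive` is hypothetical.

```python
import shutil
from pathlib import Path


def extract_archive(archive: Path, dest: Path) -> Path:
    """Unpack a tarball or zip file into `dest` and return `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    # The format (tar, gztar, bztar, xztar, zip) is guessed from the extension.
    shutil.unpack_archive(str(archive), str(dest))
    return dest
```

Since the archives come from third parties, a real implementation would probably also need to guard against entries that escape the destination directory; recent Python versions provide extraction filters for tar archives for that purpose.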

3. Convert the extracted directory to a set of Software Heritage objects

Using swh.model.from_disk.

4. Extract the metadata from the unpacked directories

This would only be applicable for "smart" loaders like npm (parsing the package.json), PyPI (parsing the PKG-INFO file) or Debian (parsing debian/changelog and debian/control).

On "minimal-metadata" sources such as the GNU archive, the lister should provide the minimal set of metadata needed to populate the revision/release objects (authors, dates) as an argument to the task.
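
As a sketch of what the "smart" case could look like, the two helpers below read a package.json (plain JSON) and a PKG-INFO file (RFC 822 style headers, parsed here with the standard `email.parser` module). The function names and the selection of fields are assumptions for illustration; the real loaders may extract more, or different, fields.

```python
import json
from email.parser import HeaderParser
from pathlib import Path
from typing import Any, Dict


def npm_metadata(root: Path) -> Dict[str, Any]:
    """Read a few fields from the package.json at the top of `root`."""
    data = json.loads((root / "package.json").read_text(encoding="utf-8"))
    return {key: data.get(key) for key in ("name", "version", "author")}


def pypi_metadata(root: Path) -> Dict[str, Any]:
    """Read a few headers from the PKG-INFO file at the top of `root`."""
    headers = HeaderParser().parsestr((root / "PKG-INFO").read_text(encoding="utf-8"))
    # Missing headers come back as None.
    return {
        "name": headers["Name"],
        "version": headers["Version"],
        "author": headers["Author"],
    }
```

For "minimal-metadata" sources, the equivalent step would simply return whatever the lister passed in as task arguments.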

5. Generate the revision/release objects for the given version.

From the data generated at steps 3 and 4.

end for each ]]

6. Generate and load the snapshot

Using the revisions/releases collected at step 5 and the branch information from step 0, generate a snapshot and load it into the Software Heritage archive.
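
A simplified sketch of the snapshot generation, using plain dictionaries rather than the Software Heritage model classes. The `releases/<version>` branch naming, the dictionary layout, and the function name `build_snapshot` are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def build_snapshot(revisions: Dict[str, str], default_version: Optional[str] = None) -> Dict[str, Any]:
    """Build a snapshot-like dict: one branch per version, plus an optional HEAD alias."""
    branches: Dict[str, Dict[str, str]] = {
        f"releases/{version}": {"target_type": "revision", "target": revision_id}
        for version, revision_id in sorted(revisions.items())
    }
    if default_version is not None:
        if default_version not in revisions:
            raise KeyError(f"no revision for default version {default_version!r}")
        # HEAD points at the branch of the default version.
        branches["HEAD"] = {"target_type": "alias", "target": f"releases/{default_version}"}
    return {"branches": branches}
```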

Implementation plan

This base loader should live as a new "pattern" in the swh-loader-core repository. We're not sure how we should implement the hook points for differing behavior by different loaders. I would use inheritance and method overriding but there might be a better way. I think we'll need to talk over some code (of the base class, and of at least one of the loaders) to be sure that we're going in the right direction.
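
To give the discussion something concrete to start from, here is a rough skeleton of the inheritance-based approach, with one overridable method per step of the pattern. Every class and method name is hypothetical, and the snapshot is reduced to a plain dictionary; this is a sketch to talk over, not a proposal for the final interface.

```python
from typing import Any, Dict, List, Optional


class BasePackageLoader:
    """Hypothetical skeleton: subclasses override the per-step hooks."""

    def __init__(self, versions: Dict[str, List[str]], default_version: Optional[str] = None):
        self.versions = versions  # step 0: version/branch name -> file URLs
        self.default_version = default_version

    # Hook points, one per step of the pattern.
    def fetch(self, urls: List[str]) -> List[str]:  # step 1
        raise NotImplementedError

    def extract(self, paths: List[str]) -> str:  # step 2
        raise NotImplementedError

    def to_objects(self, directory: str) -> Any:  # step 3
        raise NotImplementedError

    def extract_metadata(self, directory: str) -> Dict[str, Any]:  # step 4
        return {}  # "minimal-metadata" sources have nothing to parse

    def build_revision(self, version: str, objects: Any, metadata: Dict[str, Any]) -> Any:  # step 5
        raise NotImplementedError

    def store_snapshot(self, snapshot: Dict[str, Any]) -> None:  # step 6
        raise NotImplementedError

    def load(self) -> Dict[str, Any]:
        """Run steps 1 to 5 for each version, then build and store the snapshot."""
        revisions = {}
        for version, urls in self.versions.items():
            paths = self.fetch(urls)
            directory = self.extract(paths)
            objects = self.to_objects(directory)
            metadata = self.extract_metadata(directory)
            revisions[version] = self.build_revision(version, objects, metadata)
        snapshot = {"branches": revisions, "head": self.default_version}
        self.store_snapshot(snapshot)
        return snapshot
```

An alternative to overriding methods would be to pass the hooks in as callables (composition), which might make simple cases such as the GNU loader a matter of configuration rather than subclassing.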

Refactoring plan

The discussion with @ardumont led to the conclusion that this base class should be able to completely replace swh-loader-tar and swh-loader-dir. With careful implementation of the lister (moving all the parsing complexity there), the so-called "GNU loader" could be implemented by calling directly into this base class as well, rather than overriding any of its methods.

The next obvious candidate for refactoring would be the deposit loader, as it's quite simple and it's the only user of the loader-dir/loader-tar scaffold.

Once these are settled, we can move the PyPI, npm and Debian loaders over to the new base implementation. We have the feeling that they already closely match this pattern and that porting the code over shouldn't be too hard (famous last words).

Thoughts?