
Implement a base "package" loader for package managers
Closed, Migrated (edits locked)

Description

To extend the archive coverage, source packages provided by package managers (see extended list at https://libraries.io/) must be considered.

Currently, a PyPI loader (T419) is deployed in production, and work on npm ingestion is in progress (T1378).

Nevertheless, ingesting source code from package managers follows a broadly similar process for all of them, which notably includes:

  • querying an API (usually a RESTful one) to get relevant metadata about a package
  • retrieving the package source code (usually as a tarball) for ingestion into the archive

This calls for a common base implementation for loading content from package managers into the archive.

This task tracks progress on the subject.

Related Objects

Event Timeline

anlambert triaged this task as Wishlist priority. (Nov 27 2018, 12:23 PM)
anlambert created this task.
anlambert raised the priority of this task from Wishlist to Normal. (Feb 5 2019, 2:31 PM)

The table below summarizes how to list all packages and get their metadata from well-known package managers.

Package manager: packagist (PHP)
  Packages listing URL: https://packagist.org/packages/list.json
  Package metadata URL: https://repo.packagist.org/p/[vendor]/[package].json
    example: https://repo.packagist.org/p/monolog/monolog.json
  or: https://packagist.org/packages/[vendor]/[package].json
    example: https://packagist.org/packages/monolog/monolog.json
  Package source tarball URL: available for each package version in the metadata

Package manager: Go
  Packages listing URL: https://go-search.org/api?action=packages
  Package metadata URL: https://go-search.org/api?action=package&id=[package]
    example: https://go-search.org/api?action=package&id=github.com%2fdaviddengcn%2fgcse
  Package source tarball URL: not available in the metadata
  Note: Go packages are hosted on GitHub, so no need to write a dedicated loader here.

Package manager: npm (Javascript)
  Packages listing URL: https://replicate.npmjs.com/_all_docs?limit=100
  Package metadata URL: https://replicate.npmjs.com/[package]/
    example: https://replicate.npmjs.com/webpack/
  Package source tarball URL: available for each package version in the metadata

Package manager: Maven central (Java)
  Packages listing URL: http://central.maven.org/maven2/
  Package metadata URL: http://central.maven.org/maven2/[group]/[package]/maven-metadata.xml
    example: http://central.maven.org/maven2/junit/junit/maven-metadata.xml
  or: http://central.maven.org/maven2/[group]/[package]/[version]/[package]-[version].pom
    example: http://central.maven.org/maven2/junit/junit/4.9/junit-4.9.pom
  Package source tarball URL: http://central.maven.org/maven2/[group]/[package]/[version]/[package]-[version]-sources.jar
    example: http://central.maven.org/maven2/junit/junit/4.9/junit-4.9-sources.jar

Package manager: Rubygems
  Packages listing URL: no public API endpoint available; gem list -r --all can be used instead
  Package metadata URL: https://rubygems.org/api/v2/rubygems/[package]/versions/[version].json
    example: https://rubygems.org/api/v2/rubygems/coulda/versions/0.7.1.json
  Package source tarball URL: available in the metadata

Package manager: NuGet (.NET)
  Packages listing URL: https://api.nuget.org/v3/catalog0/index.json
  Package metadata URL: https://api.nuget.org/v3/catalog0/data/[catalog-commit-date]/[package].[version].json
    example: https://api.nuget.org/v3/catalog0/data/2015.02.01.11.18.40/windowsazure.storage.1.0.0.json
  or: https://api.nuget.org/v3-flatcontainer/[package]/index.json
    example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/index.json
  or: https://api.nuget.org/v3-flatcontainer/[package]/[version]/[package].nuspec
    example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/1.0.0/windowsazure.storage.nuspec
  Package source tarball URL: https://api.nuget.org/v3-flatcontainer/[package]/[version]/[package].[version].nupkg
    example: https://api.nuget.org/v3-flatcontainer/windowsazure.storage/1.0.0/windowsazure.storage.1.0.0.nupkg

Package manager: CRAN (R)
  Packages listing URL: https://cran.r-project.org/web/packages/available_packages_by_name.html
  Package metadata URL: https://cran.r-project.org/web/packages/[package]/index.html
  Package source tarball URL: tarball URL for the latest release available in the metadata;
    tarballs from previous releases available from https://cran.r-project.org/src/contrib/Archive/[package]

We've discussed a plausible plan for a "base package manager loader" with @ardumont and, to some extent, @anlambert.

The conversation was triggered by @nahimilega asking which loader would be used to load GNU origins (in part in D1482), and how that should be implemented.

This is a request for feedback from other members of the team (specifically ping @douardda @zack on the design considerations).

Pattern for a package manager "base loader"

There's a common pattern for loaders of package manager origins that we should be able to merge. This pattern applies to loading *one* package from a given package manager into a snapshot referencing all the available versions of that package.

0. List the package versions available, and the files that need to be fetched

There are two options here:

  • The loader lists the package versions at load time

When a simple API is provided, this is the cleanest option, as it minimizes the risk of mismatch between the versions available at list time and at load time. That's what we currently do in the PyPI and npm loaders.

In that case, the lister only generates recurrent tasks referencing the package name.

  • The lister provides the package versions as arguments to the loader task

That's what's currently implemented in the Debian loader, as the operation of "listing a Debian archive" gives you the full set of package metadata with all versions. This is also what will happen for the GNU lister: the metadata provided gives all version info at once in a large JSON file.

The lister will generate a one-shot task to load each package for the given repository, with the full information needed to do the data fetching.

In both cases, the "input data" will take the form of a mapping from version number (or snapshot branch name) to the set of files to be downloaded, possibly with a default version for generating the snapshot HEAD.
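
For illustration, the input data could look something like this (a sketch only; the key names are hypothetical, not an actual swh-loader-core structure):

    # Hypothetical "input data" handed to the loader for one package.
    package_info = {
        "name": "example-package",
        # used to generate the snapshot HEAD
        "default_version": "2.0.0",
        # version number (or branch name) -> files to download
        "versions": {
            "1.0.0": [
                {"url": "https://example.org/example-package-1.0.0.tar.gz",
                 "filename": "example-package-1.0.0.tar.gz"},
            ],
            "2.0.0": [
                {"url": "https://example.org/example-package-2.0.0.tar.gz",
                 "filename": "example-package-2.0.0.tar.gz"},
            ],
        },
    }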

[[ for each package version

1. Fetch the files for one package version

By default, this can be implemented as a simple HTTP request.
Loaders with more specific requirements can override this: the PyPI loader checks the integrity of the downloaded files; the Debian loader has to download and check several files for one package version.
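
A minimal sketch of the default fetching step, assuming plain HTTP and the requests library (the function name and signature are illustrative):

    import os

    import requests

    def fetch_artifact(url: str, dest_dir: str) -> str:
        """Default fetch: a single streamed HTTP GET to disk.

        Loaders with stricter requirements (checksum verification,
        several files per version) would override this.
        """
        local_path = os.path.join(dest_dir, os.path.basename(url))
        response = requests.get(url, stream=True)
        response.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        return local_path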

2. Extract the downloaded files

By default, this would be a universal archive/tarball extraction.

Loaders for specific formats can override this method (for instance, the Debian loader uses dpkg-source -x).
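
A sketch of both behaviors (names are illustrative):

    import shutil
    import subprocess

    def extract_artifact(archive_path: str, dest_dir: str) -> None:
        """Default: generic archive extraction (tar, zip, ...)."""
        shutil.unpack_archive(archive_path, extract_dir=dest_dir)

    def extract_debian_source(dsc_path: str, dest_dir: str) -> None:
        """Debian-specific override: unpack with dpkg-source."""
        subprocess.run(["dpkg-source", "-x", dsc_path, dest_dir], check=True)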

3. Convert the extracted directory to a set of Software Heritage objects

Using swh.model.from_disk.
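
Roughly as follows (exact signatures may differ across swh.model versions; the path is hypothetical):

    from swh.model.from_disk import Directory

    # Scan the unpacked source tree and build the corresponding
    # content/directory objects (paths are bytes in swh.model).
    directory = Directory.from_disk(path=b"/tmp/extracted/example-package-2.0.0")
    root_directory_id = directory.hash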

4. Extract the metadata from the unpacked directories

This would only be applicable to "smart" loaders like npm (parsing the package.json), PyPI (parsing the PKG-INFO file) or Debian (parsing debian/changelog and debian/control).

On "minimal-metadata" sources such as the GNU archive, the lister should provide the minimal set of metadata needed to populate the revision/release objects (authors, dates) as an argument to the task.

5. Generate the revision/release objects for the given version

From the data generated at steps 3 and 4.

end for each ]]

6. Generate and load the snapshot

Using the revisions/releases collected at step 5 and the branch information from step 0, generate a snapshot and load it into the Software Heritage archive.
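
Structurally, the snapshot assembly could look like this (a sketch following the swh data model; variable names are illustrative):

    # revisions_by_version: {version: revision identifier}, from step 5
    branches = {}
    for version, revision_id in revisions_by_version.items():
        branches[f"releases/{version}".encode()] = {
            "target": revision_id,
            "target_type": "revision",
        }
    # Point HEAD at the default version (step 0) through an alias branch.
    branches[b"HEAD"] = {
        "target": f"releases/{default_version}".encode(),
        "target_type": "alias",
    }
    snapshot = {"branches": branches}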

Implementation plan

This base loader should live as a new "pattern" in the swh-loader-core repository. We're not sure how we should implement the hook points for differing behavior by different loaders. I would use inheritance and method overriding but there might be a better way. I think we'll need to talk over some code (of the base class, and of at least one of the loaders) to be sure that we're going in the right direction.
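
To make the discussion concrete, here is one possible shape for such a base class, using inheritance and method overriding (all names are hypothetical; this is a sketch, not a committed design):

    class PackageLoader:
        """Template: load() drives the common steps; concrete
        loaders override the steps that differ."""

        def get_versions(self):
            """Step 0: map each version to the files to fetch."""
            raise NotImplementedError

        def fetch_artifacts(self, files):
            """Step 1: default plain HTTP download (overridable)."""
            ...

        def extract(self, archive_path, dest_dir):
            """Step 2: default archive extraction (overridable)."""
            ...

        def build_revision(self, version, directory, metadata):
            """Steps 4-5: revision/release generation (overridable)."""
            raise NotImplementedError

        def load(self):
            revisions = {}
            for version, files in self.get_versions().items():
                paths = self.fetch_artifacts(files)
                # ... extract the files, convert the result with
                # swh.model.from_disk, parse the metadata, then:
                revisions[version] = self.build_revision(version, ..., ...)
            # finally, assemble the snapshot from `revisions` and load it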

Refactoring plan

The discussion with @ardumont led to the conclusion that this base class should be able to completely replace swh-loader-tar and swh-loader-dir. With careful implementation of the lister (moving all the parsing complexity there), the so-called "GNU loader" could be implemented by calling directly into this base class as well, rather than overriding any of its methods.

The next obvious candidate for refactoring would be the deposit loader, as it's quite simple and it's the only user of the loader-dir/loader-tar scaffold.

Once these are settled, we can move the PyPI, npm and Debian loaders over to the new base implementation. We have the feeling that they already closely match this pattern and that porting the code over shouldn't be too hard (famous last words).

Thoughts?

Thanks @olasd, @ardumont, and @anlambert for this, it's a great plan and I like it a lot!

Just a few comments on the sidelines:

The lister will generate a one-shot task to load each package for the given repository, with the full information needed to do the data fetching.

This seemed clear from a different part of the description, but fwiw, here I'm assuming the plan is to only load the versions of the packages not already known/ingested in the past.

We're not sure how we should implement the hook points for differing behavior by different loaders. I would use inheritance and method overriding but there might be a better way.

This feels like a good candidate for a template method + method overriding, as you suggested. Everything that has a chance of being the same for all loaders will be pre-implemented in a base class, which also calls pre/post hooks around the main loading steps, so that derived loaders that are fine with the default behavior but need to do additional work can do so without having to override the steps themselves. The more peculiar cases (like the dpkg-source case you mentioned) will also (or only) do the overrides. (But of course ack on the fact that this will be better discussed in actual early implementations/code reviews.)
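
A minimal illustration of that pre/post hook scheme (hypothetical names, independent of any actual implementation):

    class BaseLoader:
        def pre_fetch(self, version):
            """Hook, no-op by default: a derived loader can attach
            extra work here without overriding the step itself."""

        def post_fetch(self, version, paths):
            """Hook, no-op by default."""

        def do_fetch(self, files):
            """The default fetching behavior (plain HTTP download)."""
            ...

        def fetch_step(self, version, files):
            self.pre_fetch(version)
            paths = self.do_fetch(files)
            self.post_fetch(version, paths)
            return paths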

Last comment, which is not a blocker but something that deserves highlighting: a risk of this approach is that of "non-monotonic snapshots". With VCS loaders, snapshots tend to be monotonic, i.e., future snapshots have more tags than previous ones, because old releases tend to stay around. Here we will depend on the list of versions returned by the package manager API to make it so. This will probably be true in many cases (e.g., PyPI, NPM) but it's already false in others (like Debian, I think, where it's not easy to list all versions of packages that have been moved from the main archive to archive.debian.org; from what we've been told by CRAN, that might be the case for R too). So our snapshots for (some) package managers will probably be "sliding windows" on the most recent N versions of a given package. Not a big deal, it's just the way it is, but we should keep this in mind, and maybe document it in the description of each loader.

In T1389#33215, @zack wrote:

Thanks @olasd, @ardumont, and @anlambert for this, it's a great plan and I like it a lot!

Just a few comments on the sidelines:

The lister will generate a one-shot task to load each package for the given repository, with the full information needed to do the data fetching.

This seemed clear from a different part of the description, but fwiw, here I'm assuming the plan is to only load the versions of the packages not already known/ingested in the past.

If we're able to determine that beforehand (e.g. by using the checksum of the original artifact provided by the API), then yes.

I have the feeling most package managers don't provide that information, and we'll need some other way to avoid re-processing identical packages. Maybe seeing if they support If-Modified-Since or some such.
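
For servers that do support it, a conditional GET is cheap (a sketch using requests; not tied to any existing loader code):

    from typing import Optional

    import requests

    def fetch_if_changed(url: str, last_seen: Optional[str]) -> Optional[bytes]:
        """Skip re-processing when the server reports the artifact
        unchanged since our previous visit."""
        headers = {}
        if last_seen:  # e.g. the Last-Modified value from the prior visit
            headers["If-Modified-Since"] = last_seen
        response = requests.get(url, headers=headers)
        if response.status_code == 304:  # Not Modified
            return None
        response.raise_for_status()
        return response.content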

Last comment, which is not a blocker but something that deserves highlighting: a risk of this approach is that of "non-monotonic snapshots". With VCS loaders, snapshots tend to be monotonic, i.e., future snapshots have more tags than previous ones, because old releases tend to stay around. Here we will depend on the list of versions returned by the package manager API to make it so. This will probably be true in many cases (e.g., PyPI, NPM) but it's already false in others (like Debian, I think, where it's not easy to list all versions of packages that have been moved from the main archive to archive.debian.org; from what we've been told by CRAN, that might be the case for R too). So our snapshots for (some) package managers will probably be "sliding windows" on the most recent N versions of a given package. Not a big deal, it's just the way it is, but we should keep this in mind, and maybe document it in the description of each loader.

Debian origins are indeed inherently non-monotonic, as only the versions reachable in a currently live suite are shown.

I think it might make sense to provide, at some point, "cumulative snapshots" presenting all versions we've ever archived for such origins, but that's more of a UI/UX issue than it is an issue loaders themselves should solve.

Extending on the plan by @olasd, here are some of my thoughts on the implementation of the base loader.

The format in which the lister returns the data is different for every lister. Some directly return the URL to visit to get the metadata and the package source code URL (like pypi), whereas some return the link to the package source code (like gnu).

Convert information passed by the lister into a standard format.

It is always easier to generalise when the data is in a known format, so this should be the first step of the loader task.
This step includes making requests to the metadata URL (if any) to get the package source code URL, and then converting the obtained data into a specific standard format that eases generalisation of the subsequent steps.
We could convert to a standard format like the following (the gnu lister already returns data in roughly this format):

    {
        "name": "<name of the package>",
        "tarballs": [
            {
                "url": "<tarball url>",
                "sha": "<checksum, if provided>",
                "time_last_modified": "<last modification time, if provided>",
                "description": "...",
                "type": "...",
                "author": "...",
                # ...
            },
            # ...
        ],
    }

Some fields, like the package name and the url, must be present; others are conditional, depending on whether the information is provided.

Define the method to generate an artifact identifier to check for identical packages.

Current candidates for avoiding re-processing identical packages:

  • Using provided metadata, like a SHA hash (e.g. pypi) or the last-modified time (e.g. gnu).

There needs to be a field in the provided data which can be used to avoid re-processing identical packages (like a SHA or a last-modified time).
This field is specific to the package manager, so it needs to be known beforehand for the rest of the execution to be identical. We can keep a variable for this purpose; for pypi, for instance:

self.generate_artifact_field = 'sha'
  • If-Modified-Since

For defining how to avoid reprocessing, we could have two different classes: one which uses If-Modified-Since and another which uses a field in the provided data, as sketched below.
(analogy) Like in the listers, where we have the simple lister and the indexing lister: a different class to serve each purpose.
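
A minimal sketch of that split (class and field names are hypothetical):

    class FieldBasedDedup:
        """Compare a metadata field against what was recorded
        at the previous visit."""
        generate_artifact_field = "sha"  # 'sha' for pypi, 'time_last_modified' for gnu

        def is_new(self, artifact, known_values):
            return artifact[self.generate_artifact_field] not in known_values

    class IfModifiedSinceDedup:
        """Rely on HTTP conditional requests instead of metadata."""

        def is_new(self, artifact, known_values):
            ...  # issue a conditional GET, as sketched earlier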

Once the information is converted into the standard format and the method of artifact generation is known, we can generalise the further steps. This is where the work of the base loader starts.

The work that would be done by the base loader after receiving the information in the standard format:

  • Generating the artifact identifier
  • Getting old artifacts and comparing them
  • Downloading and extracting
  • Making snapshots and storing them
  • Cleanup and flush

Methods to generalise these processes can be devised, eliminating the need to write code for these steps separately in every loader.
However, they can still be overridden if the package manager has any specific need.

(stating the obvious) We should try to make a method for every small task (generate_artifact, compare, download, ...) so that, if needed, any one of them can easily be overridden without disturbing the whole stack (like overriding the extraction process for the Debian loader).

Some more hook points we need to provide within this process, as some package managers may need them:

  • Filter the downloaded files.

Remove the unnecessary files and binaries that are downloaded but are not to be archived.

  • Gather more metadata from the downloaded files and append it to the standard format generated earlier.

Some package managers, like pypi, provide a metadata file which could give us useful information (author name, etc.), so this data needs to be appended to the standard format.

Refactoring

I suppose, for some tasks, we could take inspiration from current loaders: for downloading remote tarballs, for example, we could reuse code already present in the tarball loader. But it would be best if the base loader were built from scratch rather than refactored from any existing code.

PS: For listers like the gnu lister, this approach could implement their loader in as few as 10-20 lines of code.

Package manager: Rubygems
  Packages listing URL: no public API endpoint available
  Package metadata URL: https://rubygems.org/api/v2/rubygems/[package]/versions/[version].json
  Package source tarball URL: available in the metadata

We can use the search API:

http 'https://rubygems.org/api/v1/search.yaml?query=*&page=1'

and increase the page number until the result is empty.
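
Something like the following, assuming the JSON variant of the same endpoint behaves identically (a sketch):

    import requests

    page = 1
    gems = []
    while True:
        response = requests.get(
            "https://rubygems.org/api/v1/search.json",
            params={"query": "*", "page": page},
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:  # empty result: we have reached the last page
            break
        gems.extend(batch)
        page += 1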

@douardda We have a separate task (T1777) for the rubygems lister. I guess we can add a separate column to the table by @anlambert showing the status of the lister implementations and the tasks related to them.

ardumont renamed this task from Implement a base loader for package managers to Implement a base "package" loader for package managers. (Oct 1 2019, 1:19 PM)
ardumont changed the task status from Open to Work in Progress.
ardumont changed the status of subtask T2022: Re-implement npm loader with base loader from Open to Work in Progress.
ardumont changed the status of subtask T2021: Re-implement pypi loader with package loader from Open to Work in Progress.

Current work is in the swh-loader-core repository, within the package-loader branch.

ardumont claimed this task.
gitlab-migration changed the status of subtask T2022: Re-implement npm loader with base loader from Resolved to Migrated.
gitlab-migration changed the status of subtask T2023: Re-implement gnu loader with package loader from Resolved to Migrated.