Implementation of Fedora Lister
Open, Normal, Public

Description

Fedora publishes its package metadata (as HTML) at https://packages.fedoraproject.org, while the package source code is hosted on a custom forge called Pagure.

https://packages.fedoraproject.org lists 67603 packages, but the corresponding Pagure instance hosts only 35709 (git) repos. The numbers differ because several packages can share a single repository.
For example, 4ti2/4ti2, 4ti2/4ti2-devel, and 4ti2/4ti2-libs have different names and metadata, but they all point to the same git repo on Pagure.

Here's the approach I think we should take to ingest Fedora packages:

  • Fetch the git repositories from Pagure's /projects endpoint. The default page size is 50 and can be raised to 100. Each response body includes a "next" URL pointing to the following page.
  • Extract the package name from each repository and visit https://packages.fedoraproject.org/pkgs/<pkg>/, where <pkg> is the extracted package name.
  • That page lists the names and URLs of all the packages associated with the repository.
  • Visit each of the package URLs and extract the metadata using BeautifulSoup.
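
The first step could be sketched roughly as below. This is only an illustration, not the lister itself: the base URL src.fedoraproject.org/api/0, the "projects" and "pagination" keys, and the per_page parameter name are assumptions about the Pagure API that are not stated in this task, and iter_projects, fetch_json and first_page_url are hypothetical helper names.

```python
import json
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the Pagure API (not given in the task description).
PAGURE_API = "https://src.fedoraproject.org/api/0"


def fetch_json(url: str) -> dict:
    """Download a URL and decode its JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def first_page_url(per_page: int = 100) -> str:
    """Build the URL of the first /projects page, using the largest page size."""
    return f"{PAGURE_API}/projects?{urlencode({'per_page': per_page})}"


def iter_projects(
    first_url: str,
    fetch: Callable[[str], dict] = fetch_json,
) -> Iterator[dict]:
    """Yield every project, following the "next" URL of each page."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        # Assumption: repositories are listed under a "projects" key.
        yield from page.get("projects", [])
        # Assumption: the "next" URL sits under "pagination" and is
        # null (None) on the last page, which ends the loop.
        url = (page.get("pagination") or {}).get("next")
```

Passing the fetch function as a parameter keeps the pagination logic testable without network access; a real run would be `for project in iter_projects(first_page_url()): ...`.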

This will list each Fedora package as a separate origin while allowing multiple origins to point to the same repo.
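
To make the origin model concrete, here is a minimal sketch of turning one /pkgs/<pkg>/ page into origins. It uses the standard library's html.parser instead of BeautifulSoup only to stay dependency-free. The link pattern /pkgs/<pkg>/<subpackage>/ is an assumption inferred from names such as 4ti2/4ti2-devel, and SubpackageLinks, list_origins and the shape of the origin dicts are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin

PACKAGES_URL = "https://packages.fedoraproject.org"


class SubpackageLinks(HTMLParser):
    """Collect links of the assumed form /pkgs/<pkg>/<subpackage>/."""

    def __init__(self, pkg: str) -> None:
        super().__init__()
        self.prefix = f"/pkgs/{pkg}/"
        self.links: Dict[str, str] = {}  # subpackage name -> absolute URL

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if not href.startswith(self.prefix):
            return
        # Keep only links exactly one level below the package page.
        rest = href[len(self.prefix):].strip("/")
        if rest and "/" not in rest:
            self.links[rest] = urljoin(PACKAGES_URL, href)


def list_origins(pkg: str, repo_url: str, html: str) -> List[dict]:
    """Return one origin per sub-package, all pointing to the same repo."""
    parser = SubpackageLinks(pkg)
    parser.feed(html)
    return [
        {"name": f"{pkg}/{sub}", "url": url, "repo": repo_url}
        for sub, url in parser.links.items()
    ]
```

Each returned entry is a separate origin with its own name and URL, while the "repo" field is shared, which mirrors the 4ti2 example above.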

Related Objects

Status  Assigned   Task
Open    KShivendu
Open    None