Implementation of Fedora Lister
Closed, Migrated | Edits Locked

Description

Fedora publishes its package metadata (as HTML) at https://packages.fedoraproject.org, while the source code is hosted on a custom forge called Pagure.

https://packages.fedoraproject.org lists 67603 packages, but the corresponding Pagure instance hosts only 35709 (git) repos. The difference comes from several packages sharing a single repository.
For example, 4ti2/4ti2, 4ti2/4ti2-devel, and 4ti2/4ti2-libs have different names and metadata, but they all point to the same git repo on Pagure.

Here's the approach I think we should take to ingest Fedora packages:

  • Fetch the git repositories from Pagure's /projects endpoint. The default page size is 50 but can be increased to 100, and each response body includes a "next" URL for the following page.
  • Extract the package name from each repository and visit https://packages.fedoraproject.org/pkgs/<pkg>/ (where <pkg> is the extracted package name).
  • That page lists the names and URLs of all the packages associated with the repository.
  • Visit each package URL and extract the metadata using BeautifulSoup.
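
The first step (paging through /projects by following the "next" URL) could look roughly like the sketch below. It is only an illustration: the base URL `https://src.fedoraproject.org/api/0/projects`, the `per_page` query parameter name, and the `projects` / `pagination.next` keys in the response are assumptions about the Pagure API that should be checked against the live instance. The helper names are hypothetical.

```python
import json
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed API location of Fedora's Pagure instance (not stated in this task).
PAGURE_API = "https://src.fedoraproject.org/api/0/projects"


def fetch_json(url: str) -> Dict:
    """Download a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def first_page_url(per_page: int = 100) -> str:
    """Build the URL of the first page, asking for the larger page size."""
    return f"{PAGURE_API}?{urlencode({'per_page': per_page})}"


def iter_projects(
    first_url: str,
    fetch: Callable[[str], Dict] = fetch_json,
) -> Iterator[Dict]:
    """Yield every project, following the "next" URL of each page."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield from page.get("projects", [])
        # Assumed layout: {"pagination": {"next": <url or null>}}
        url = (page.get("pagination") or {}).get("next")
```

Passing `fetch` as a parameter keeps the pagination loop testable without network access; a lister would call `iter_projects(first_page_url())` and read the package name from each yielded project.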

This will list each Fedora package as a separate origin while allowing multiple origins to point to the same repo.
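
To make the "one origin per package, shared repo" idea concrete, here is a small sketch of the resulting data. The `Origin` class and `build_origins` helper are hypothetical names, the `pkgs/<pkg>/<subpackage>/` URL pattern is inferred from the 4ti2 example above, and the repo URL in the usage example is a placeholder assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List

PACKAGES_BASE = "https://packages.fedoraproject.org/pkgs"


@dataclass(frozen=True)
class Origin:
    """One listed origin: a package page plus the git repo it is built from."""
    package_url: str
    repo_url: str


def build_origins(
    source_pkg: str, subpackages: Iterable[str], repo_url: str
) -> List[Origin]:
    """Create one origin per (sub)package, all sharing the same repo URL."""
    return [
        # Assumed URL pattern: <base>/<source package>/<subpackage>/
        Origin(f"{PACKAGES_BASE}/{source_pkg}/{sub}/", repo_url)
        for sub in subpackages
    ]


# Usage: three origins for 4ti2, all pointing at one (assumed) repo URL.
origins = build_origins(
    "4ti2",
    ["4ti2", "4ti2-devel", "4ti2-libs"],
    "https://src.fedoraproject.org/rpms/4ti2.git",
)
```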

Event Timeline

KShivendu triaged this task as Normal priority. (Aug 20 2022, 9:58 AM)
KShivendu created this task.
KShivendu created this object in space S1 Public.
bchauvet added a parent task: Unknown Object (Maniphest Task). (Sep 27 2022, 4:05 PM)
bchauvet mentioned this in Unknown Object (Maniphest Task).
bchauvet mentioned this in Unknown Object (Maniphest Task). (Oct 11 2022, 9:56 AM)
bchauvet edited parent tasks, added: Unknown Object (Maniphest Task); removed: Unknown Object (Maniphest Task). (Oct 21 2022, 9:55 AM)