Fedora publishes its package metadata (as HTML) on https://packages.fedoraproject.org, while the package source code is hosted on a custom forge called Pagure.
https://packages.fedoraproject.org hosts 67603 packages, but the corresponding Pagure instance hosts only 35709 (git) repos. The difference arises because several packages can share a single repository.
For example: 4ti2/4ti2, 4ti2/4ti2-devel, and 4ti2/4ti2-libs have different names and metadata but they point to the same git repo on Pagure.
Here's the approach I think we should take to ingest Fedora packages:
- Fetch the git repositories using Pagure's /projects endpoint. The default page size is 50 but can be increased to 100. Each page of the response also includes a "next" URL in its body, pointing to the following page.
- Extract the package name from each repository and visit https://packages.fedoraproject.org/pkgs/<pkg>/ (where <pkg> is the extracted package name).
- That page lists the names and URLs of all the packages associated with the repository.
- Visit each of the package URLs and extract the metadata using BeautifulSoup.
This will list each Fedora package as a separate origin while allowing multiple origins to point to the same repo.
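
To make the first three steps concrete, here is a rough Python sketch. Several things in it are my assumptions rather than verified facts: the Pagure base URL (https://src.fedoraproject.org/api/0/projects), the shape of the JSON response ("projects" entries with a "name", and a "pagination" object whose "next" is null on the last page), and the idea that package pages are linked directly below /pkgs/<pkg>/. The helper names (http_get, iter_repo_names, package_urls) are made up for illustration. To keep the sketch dependency-free it collects links with the standard library's html.parser instead of BeautifulSoup; the actual metadata extraction is left out because it depends on the page markup.

```python
import json
from html.parser import HTMLParser
from urllib.parse import quote, urlencode, urljoin
from urllib.request import urlopen

# Assumption: Fedora's Pagure API lives here (the URL is not given above).
PAGURE_PROJECTS_URL = "https://src.fedoraproject.org/api/0/projects"
PACKAGES_URL = "https://packages.fedoraproject.org/pkgs/"


def http_get(url):
    """Fetch a URL and return its body as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def iter_repo_names(fetch=http_get, per_page=100):
    """Yield repository names, following the "next" URL of each page."""
    url = PAGURE_PROJECTS_URL + "?" + urlencode({"per_page": per_page})
    while url:
        page = json.loads(fetch(url))
        for project in page.get("projects", []):
            yield project["name"]
        # Assumption: "next" sits under "pagination" and is null on the last page.
        url = (page.get("pagination") or {}).get("next")


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def package_urls(repo_name, fetch=http_get):
    """Return the package URLs linked from the repository's index page."""
    index_url = PACKAGES_URL + quote(repo_name) + "/"
    collector = _LinkCollector()
    collector.feed(fetch(index_url))
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(index_url, href)
        # Assumption: package pages live directly below the index URL.
        if (absolute.startswith(index_url) and absolute != index_url
                and absolute not in urls):
            urls.append(absolute)
    return urls
```

Both functions take a fetch callable so they can be exercised against canned responses before pointing them at the real servers. In a real loader each URL returned by package_urls(name) for name in iter_repo_names() would become one origin, with its metadata parsed in a later step.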