We want to be able to list all packages on PyPI and, more importantly, to list new (releases of) packages since the last state of PyPI that has been ingested into Software Heritage.
Description
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T419 ingest PyPI into the Software Heritage archive (meta task)
Migrated | gitlab-migration | T422 PyPI lister
Event Timeline
PyPI exposes multiple APIs (a minimal usage sketch follows this list):
- a basic JSON API [1], which allows requesting information on a per-project basis, with no listing (~> foresee the use of this one for the loader)
- a deprecated XML-RPC API [2], which does provide listing (~> that would be for the lister's use)
- an HTML page listing all packages
- an RSS feed of update events
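As a rough illustration of how the loader and lister could consume these endpoints, here is a minimal sketch, assuming the standard pypi.org URLs for the JSON and XML-RPC APIs and the list_packages method described in the warehouse documentation [2]:

```python
import json
import xmlrpc.client
from urllib.request import urlopen

PYPI_JSON_API = "https://pypi.org/pypi/{project}/json"  # per-project metadata [1] (loader side)
PYPI_XMLRPC_API = "https://pypi.org/pypi"                # listing endpoint [2] (lister side)


def project_metadata(project: str) -> dict:
    """Fetch per-project metadata from the JSON API (no listing available there)."""
    with urlopen(PYPI_JSON_API.format(project=project)) as response:
        return json.load(response)


def list_all_projects() -> list:
    """List every project name through the (deprecated) XML-RPC API."""
    client = xmlrpc.client.ServerProxy(PYPI_XMLRPC_API)
    return client.list_packages()


if __name__ == "__main__":
    projects = list_all_projects()
    print(len(projects), "projects listed")
    print(project_metadata("requests")["info"]["version"])
```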
As already mentioned in their FAQ, they push towards mirroring; quoting [3]:
> If your consumer is actually an organization or service that will be downloading a lot of packages from PyPI, consider using your own index mirror or cache.
That is not a sustainable approach: if we chose that path for all the forges we need to archive, it would be difficult in terms of infrastructure and maintenance.
But in light of T1030, there might be a legitimate reason for us to do so (the need for a local PyPI instance server to ease the CI tooling).
Maybe we could then use this SWH PyPI mirror both for this purpose and for T1030?
[1] https://warehouse.readthedocs.io/api-reference/
[2] https://warehouse.readthedocs.io/api-reference/xml-rpc/
[3] https://warehouse.readthedocs.io/api-reference/#rate-limiting
Agreed: we do not maintain actual mirrors of the other big "things" we archive (e.g., GitHub, GitLab.com, Debian, etc.), and for good reason. We really want to hook into the existing PyPI APIs to incrementally ingest the new releases that arrive there, without maintaining an actual mirror.
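One hedged sketch of such incremental hooking, assuming the changelog_last_serial / changelog_since_serial XML-RPC methods described in the warehouse documentation [2]; the exact layout of the returned event tuples is an assumption here, not something verified in this task:

```python
import xmlrpc.client

PYPI_XMLRPC_API = "https://pypi.org/pypi"  # same XML-RPC endpoint as [2]


def current_serial() -> int:
    """Return PyPI's current global changelog serial; the lister would store
    this value as its bookmark after a successful run."""
    client = xmlrpc.client.ServerProxy(PYPI_XMLRPC_API)
    return client.changelog_last_serial()


def new_events_since(last_serial: int):
    """Yield changelog events newer than last_serial.

    Each event is assumed to be a (name, version, timestamp, action, serial)
    tuple; "new release" actions are what the lister would turn into loading
    tasks, so only new (releases of) packages get ingested on each run.
    """
    client = xmlrpc.client.ServerProxy(PYPI_XMLRPC_API)
    for event in client.changelog_since_serial(last_serial):
        yield event
```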