We want to be able to list all packages on PyPI and, more importantly, to list new (releases of) packages since the last state of PyPI that has been ingested into Software Heritage.
|Migrated||gitlab-migration||T419 ingest PyPI into the Software Heritage archive (meta task)|
|Migrated||gitlab-migration||T422 PyPI lister|
- Mentioned Here
- T1030: Provide a pip-compatible index of python modules
PyPI exposes multiple APIs:
- a basic JSON one, which allows requesting information on a per-project basis but offers no listing (I foresee using this one for the loader)
- a deprecated XML-RPC one (this one can list packages, so it would suit the lister)
- an HTML page (listing all packages)
- an RSS feed (update events)
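To make the HTML option concrete, here is a minimal sketch of how a lister could extract package names from the listing page. It assumes the page is the simple index served at https://pypi.org/simple/, with one anchor per package. The names `SimpleIndexParser`, `list_packages` and `project_url` are hypothetical and not taken from swh.lister. Fetching the page is left out on purpose.

```python
from html.parser import HTMLParser


class SimpleIndexParser(HTMLParser):
    """Collect the text of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only keep non-empty text found inside an anchor.
        if self._in_anchor and data.strip():
            self.names.append(data.strip())


def list_packages(index_html):
    """Return package names in page order (assumes one anchor per package)."""
    parser = SimpleIndexParser()
    parser.feed(index_html)
    parser.close()
    return parser.names


def project_url(name):
    """Build the per-project page URL for a package name."""
    return f"https://pypi.org/project/{name}/"


sample = """
<html><body>
  <a href="/simple/requests/">requests</a>
  <a href="/simple/swh-lister/">swh.lister</a>
</body></html>
"""
for name in list_packages(sample):
    print(name, project_url(name))
```

On the real index the page is large, so a production lister would probably stream the response instead of holding it in memory; that part is not shown here.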
As already mentioned in their FAQ, they push towards mirroring. Quoting:
If your consumer is actually an organization or service that will be downloading a lot of packages from PyPI, consider using your own index mirror or cache.
That is not a sustainable approach. If we choose that path for all the forges we need to archive, it will be difficult in terms of infrastructure and maintenance.
But in light of T1030, there might be a legitimate reason for us to do so (a local PyPI server instance would ease the CI tooling).
Maybe we could then use this SWH PyPI mirror both for this task and for T1030?
Agreed: we do not maintain actual mirrors of the other big "things" we archive (e.g., GitHub, GitLab.com, Debian, etc.), and for a reason. We really want to hook into the existing PyPI APIs to incrementally ingest new content as it arrives there, without maintaining an actual mirror.
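As an illustration of incremental ingestion without a mirror, the sketch below compares the versions reported for one project with the versions already ingested. It assumes the per-project JSON document (served at https://pypi.org/pypi/<name>/json) has a top-level "releases" mapping keyed by version string. The function name `new_versions` and the way the "already seen" state is stored are hypothetical.

```python
def new_versions(project_json, seen_versions):
    """Return versions present in the project document but not yet ingested.

    project_json: decoded per-project JSON document; assumed to contain a
    "releases" mapping of version string -> list of release files.
    seen_versions: set of version strings ingested during earlier runs.
    """
    available = set(project_json.get("releases", {}))
    # Plain string sort: stable output, but not a semantic version ordering.
    return sorted(available - set(seen_versions))


# Hand-written payload standing in for a real API response.
payload = {"releases": {"1.0": [], "1.1": [], "2.0": []}}
print(new_versions(payload, {"1.0"}))
```

Only the versions returned by such a comparison would need to be scheduled for loading, so the cost of a run scales with what changed rather than with the size of PyPI.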