
PyPI lister
Closed, Migrated | Edits Locked

Description

We want to be able to list all packages on PyPI and, more importantly, to list new (releases of) packages since the last state of PyPI that has been ingested into Software Heritage.

Event Timeline

PyPI exposes multiple APIs:

  • a basic JSON API [1], which returns information on a per-project basis but offers no listing (this is the one I foresee using for the loader)
  • a deprecated XML-RPC API [2], which does provide listing (this would be the one for the lister)
  • an HTML page (listing all packages)
  • an RSS feed (update events)
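To make the split between lister and loader concrete, here is a minimal sketch, not a settled design. It assumes the HTML listing is the simple index at https://pypi.org/simple/ and that the per-project JSON endpoint follows the https://pypi.org/pypi/<name>/json pattern. The helper names (SimpleIndexParser, list_package_names, project_json_url) are hypothetical. Fetching the page is left out on purpose; the functions only work on HTML already in hand.

```python
from html.parser import HTMLParser

# Assumed endpoints; neither URL is spelled out in this task.
PYPI_SIMPLE_INDEX = "https://pypi.org/simple/"
PYPI_JSON_URL = "https://pypi.org/pypi/{name}/json"


class SimpleIndexParser(HTMLParser):
    """Collect the text of every <a> element found in an index page."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only keep non-empty text that sits inside an anchor.
        if self._in_anchor and data.strip():
            self.names.append(data.strip())


def list_package_names(index_html):
    """Lister side: return package names in the order they appear."""
    parser = SimpleIndexParser()
    parser.feed(index_html)
    parser.close()
    return parser.names


def project_json_url(name):
    """Loader side: build the per-project JSON URL (assumed pattern)."""
    return PYPI_JSON_URL.format(name=name)
```

The lister would feed the body of the index page to list_package_names, and the loader would then request project_json_url(name) for each name it is asked to ingest.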

As already mentioned in their FAQ, they push heavy consumers towards mirroring. Quoting [3]:

If your consumer is actually an organization or service that will be downloading a lot of packages from PyPI, consider using your own index mirror or cache.

That's not a sustainable way. If we choose that path for all the forges we need to archive... that will be difficult in terms of infrastructure and maintenance.

But in light of T1030, there might be a legitimate reason for us to do so (the need for a local PyPI server instance to ease the CI tooling).
Maybe we could then use this SWH PyPI mirror both for this task and for T1030?

[1] https://warehouse.readthedocs.io/api-reference/
[2] https://warehouse.readthedocs.io/api-reference/xml-rpc/
[3] https://warehouse.readthedocs.io/api-reference/#rate-limiting

If your consumer is actually an organization or service that will be downloading a lot of packages from PyPI, consider using your own index mirror or cache.

That's not a sustainable way. If we choose that path for all the forges we need to archive... that will be difficult in terms of infrastructure and maintenance.

Agreed: we do not maintain actual mirrors of the other big "things" we archive (e.g., GitHub, GitLab.com, Debian, etc.), and for good reason. We really want to hook into the existing PyPI APIs to incrementally ingest new content as it arrives there, without maintaining an actual mirror.
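As an illustration of incremental listing without a mirror, here is a rough sketch. It assumes the change events come from the XML-RPC method changelog_since_serial, which, as far as the Warehouse documentation describes it, returns (name, version, timestamp, action, serial) tuples. The function name packages_changed_since and the injected fetch_events callable are hypothetical, and the XML-RPC API is deprecated, so this only shows the shape of the idea.

```python
def packages_changed_since(last_serial, fetch_events):
    """Return (sorted changed package names, new last serial).

    fetch_events is any callable taking a serial and returning an
    iterable of (name, version, timestamp, action, serial) tuples.
    Assumption: it could be
    xmlrpc.client.ServerProxy("https://pypi.org/pypi").changelog_since_serial
    """
    changed = set()
    new_serial = last_serial
    for name, _version, _timestamp, _action, serial in fetch_events(last_serial):
        changed.add(name)
        # Remember the highest serial seen so the next run resumes from it.
        new_serial = max(new_serial, serial)
    return sorted(changed), new_serial
```

The lister would store the returned serial as its state and pass it back on the next run, so that only projects touched in between get scheduled for loading.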

ardumont changed the task status from Open to Work in Progress. Aug 1 2018, 3:10 PM
ardumont claimed this task.