Currently, the PyPI loader loads the entire response of the PyPI API (example: https://pypi.org/pypi/swh.core/json ) as snapshot metadata.
The NPM loader used to do that as well, but we changed it in D4142 so that it only writes metadata on revisions; this keeps the amount of metadata in storage from growing quadratically (N snapshots, each with 1 to N releases).
This is less of an issue with PyPI because projects usually have less metadata than NPM projects, but we should do the same.
However, it's a little trickier, because the part we are interested in saving is formatted like this:
    {
        // ...
        "releases": {
            "1.0.0": [
                {
                    "filename": "file1.whl",
                    "packagetype": "bdist_wheel",
                    // ... potentially interesting metadata here ...
                },
                {
                    "filename": "file2.whl",
                    "packagetype": "bdist_wheel",
                    // ... potentially interesting metadata here ...
                },
                {
                    "filename": "file3.tar.gz",
                    "packagetype": "sdist",
                    // ... potentially interesting metadata here ...
                },
            ],
            // ...
        }
        // ...
    }
We load only packages with type sdist. There is usually only one such file per release, so we could write the metadata of all of the release's files on the revision created from it. But this is not always the case: when a release has multiple sdist files, we create one branch for each, so there are multiple branches for the same version.
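To make the "one or several sdists per release" case concrete, here is a minimal sketch that groups the sdist entries of a PyPI API response by version. The function name `sdists_by_version` is hypothetical and this is not the loader's actual code; it only assumes the response layout shown above.

```python
from typing import Any, Dict, List


def sdists_by_version(api_response: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Map each version to the list of its sdist file entries.

    Hypothetical helper: versions without any sdist are skipped, and a
    version with several sdists keeps all of them (one branch each).
    """
    result: Dict[str, List[Dict[str, Any]]] = {}
    for version, files in api_response.get("releases", {}).items():
        # Keep only the files we actually load (packagetype == "sdist").
        sdists = [f for f in files if f.get("packagetype") == "sdist"]
        if sdists:
            result[version] = sdists
    return result
```

A version mapped to a list of length greater than one is exactly the edge case where a single "metadata of all files" record has no obvious place to go.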
Possible solutions:
- keep writing all this to the snapshot (properly solves the issue, doesn't require transformation on metadata, but takes a lot of space)
- write the metadata of all files on the revision/directory created from each sdist (works fine when there is only one sdist, which is the most common case; but is not very satisfying when there is more than one, and potentially creates lots of duplicates for edge-case packages)
- discard the metadata on non-sdist packages. (the metadata on wheels is not very interesting if we don't archive the wheels themselves, but it might be useful)
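The difference between the last two options can be sketched as follows. This is an illustration under assumptions, not the loader's implementation: `metadata_for_sdists` and its `keep_other_files` flag are hypothetical names, and the input is the list of file entries for one release as in the API response.

```python
from typing import Any, Dict, List


def metadata_for_sdists(
    files: List[Dict[str, Any]], keep_other_files: bool
) -> Dict[str, List[Dict[str, Any]]]:
    """Return, for each sdist filename, the file entries to store with it.

    keep_other_files=True:  every sdist carries the entries of all files
                            of the release (wheels included).
    keep_other_files=False: every sdist carries only its own entry, and
                            the entries of non-sdist files are discarded.
    """
    per_sdist: Dict[str, List[Dict[str, Any]]] = {}
    for entry in files:
        if entry.get("packagetype") != "sdist":
            continue
        per_sdist[entry["filename"]] = list(files) if keep_other_files else [entry]
    return per_sdist
```

With `keep_other_files=True`, a release with k sdists stores k copies of the full file list, which is the duplication mentioned above; with `keep_other_files=False` nothing is duplicated, but the wheel entries are lost.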