Decide what to do with PyPI snapshot metadata
Open, Normal, Public

Description

Currently, the PyPI loader loads the entire response of the PyPI API (example: https://pypi.org/pypi/swh.core/json ) as snapshot metadata.

The NPM loader used to do that as well, but we changed it in D4142 so that it only writes metadata on revisions. This keeps the amount of metadata in storage from growing quadratically (N snapshots, each with 1 to N releases).
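
To make the growth concrete, here is a rough count. It assumes, as a simplification not stated above, that one snapshot is taken after each new release, so the k-th snapshot carries metadata for k releases. The function names are hypothetical.

```python
def snapshot_level_entries(n_releases: int) -> int:
    """Release entries stored when every snapshot embeds the full API response.

    Assumption: snapshot k (1-based) contains metadata for k releases.
    """
    return sum(range(1, n_releases + 1))  # N * (N + 1) / 2


def revision_level_entries(n_releases: int) -> int:
    """Release entries stored when metadata is written once per revision."""
    return n_releases


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, snapshot_level_entries(n), revision_level_entries(n))
```

Under that assumption, the snapshot-level count grows as N(N+1)/2 while the revision-level count stays linear in N.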

This is less of an issue with PyPI because projects usually have less metadata than NPM projects, but we should do the same.
However, it's a little trickier, because the part we are interested in saving is formatted like this:

{
    // ...
    "releases": {
        "1.0.0": [
            {
                "filename": "file1.whl",
                "packagetype": "bdist_wheel",
                // ... potentially interesting metadata here ...
            },
            {
                "filename": "file2.whl",
                "packagetype": "bdist_wheel",
                // ... potentially interesting metadata here ...
            },
            {
                "filename": "file3.tar.gz",
                "packagetype": "sdist",
                // ... potentially interesting metadata here ...
            },
        ],
        // ...
    }
    // ...
}

We load only packages of type sdist. There is usually only one such file per release, so we could write the metadata of all of that release's files on it. But this is not always the case: when a release has multiple sdist files, we create one branch for each, so there are multiple branches for the same version.
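
A minimal sketch of that selection, assuming the API response has already been parsed into a dict shaped like the example above. `sdists_by_version` is a hypothetical helper for illustration, not the loader's actual code.

```python
from typing import Any, Dict, List


def sdists_by_version(api_response: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Map each version to the list of its sdist file entries.

    Versions without any sdist are omitted. A version with more than one
    entry is the edge case that yields several branches for one version.
    """
    result: Dict[str, List[Dict[str, Any]]] = {}
    for version, files in api_response.get("releases", {}).items():
        sdists = [f for f in files if f.get("packagetype") == "sdist"]
        if sdists:
            result[version] = sdists
    return result


if __name__ == "__main__":
    # Hand-written sample data, not a real API response.
    response = {
        "releases": {
            "1.0.0": [
                {"filename": "file1.whl", "packagetype": "bdist_wheel"},
                {"filename": "file3.tar.gz", "packagetype": "sdist"},
            ],
        }
    }
    print(sdists_by_version(response))
```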

Possible solutions:

  1. keep writing all this to the snapshot (properly solves the issue, doesn't require any transformation of the metadata, but takes a lot of space)
  2. write the metadata of all files on the revision/directory created from each sdist (works fine when there is only one sdist, which is the most common case; but is not very satisfying when there's more than one + potentially lots of duplicates for edge case packages)
  3. discard the metadata on non-sdist packages. (the metadata on wheels is not very interesting if we don't archive the wheels themselves, but they might be useful)
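
For comparison, the last two options could look roughly like this for the files of a single release. Both helpers are hypothetical, and reading the last option as "each sdist keeps only its own entry" is an assumption.

```python
from typing import Any, Dict, List

FileEntry = Dict[str, Any]


def metadata_all_files(files: List[FileEntry]) -> Dict[str, List[FileEntry]]:
    """Attach the metadata of every file in the release to each sdist.

    With several sdists, the same list is duplicated once per sdist.
    """
    return {
        f["filename"]: list(files)
        for f in files
        if f.get("packagetype") == "sdist"
    }


def metadata_sdist_only(files: List[FileEntry]) -> Dict[str, List[FileEntry]]:
    """Keep only each sdist's own entry; wheel metadata is discarded."""
    return {
        f["filename"]: [f]
        for f in files
        if f.get("packagetype") == "sdist"
    }


if __name__ == "__main__":
    # Hand-written sample mirroring the example release above.
    release_files = [
        {"filename": "file1.whl", "packagetype": "bdist_wheel"},
        {"filename": "file2.whl", "packagetype": "bdist_wheel"},
        {"filename": "file3.tar.gz", "packagetype": "sdist"},
    ]
    print(metadata_all_files(release_files))
    print(metadata_sdist_only(release_files))
```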

Event Timeline

vlorentz triaged this task as Normal priority. Tue, Oct 6, 10:19 AM
vlorentz created this task.
vlorentz updated the task description. Tue, Oct 13, 9:45 AM
olasd added a subscriber: olasd. Tue, Oct 13, 9:59 AM

In practice, are there many meaningful differences between the wheel metadata and the sdist metadata? If not, then I think option 3 would be the most sensible.

They are metadata on the file itself (file name, checksums, whether it has a signature, upload time, file-specific comment (often empty), yank status), so they have nothing in common.

So they're metadata specific to files that we don't archive at all because they're not source? That doesn't sound very useful to keep at all. We don't keep the binary indexes from Debian repositories, for instance.

> We don't keep the binary indexes from Debian repositories, for instance.

good point! thanks