
Strip first level directory when loading tarballs from PyPI?
Closed, Migrated

Description

The PyPI loader currently injects the sdist tarball into the SWH archive, including the single directory at its root.

For example, if you look at https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://pypi.org/project/swh.model/&release=6.4.0, you see:

Sadly, this does not lead to the best user experience when going from one version to the other:

Should the content of the directory be injected in the archive instead?

Event Timeline

lunar created this object in space S1 Public.
vlorentz triaged this task as Normal priority. Sep 8 2022, 12:01 PM

I don't think it would be appropriate to remove that directory; we try to reproduce tarballs faithfully. And there might be other entries at the root (e.g. when loading a .jar, there would typically be only two directories at the root).

But clearly it's not great. I wonder if we could do something about this in swh-web instead.

I agree that the UX of switching branches from one release to another on snapshots of PyPI origins is not good.

Generally, when we archive tarballs, we try to be as faithful as possible to the original layout. That is why the Python source distributions we've archived have this top-level directory: it is what we find in the .tar.gz served by PyPI.

Furthermore, the Python source distribution layout has only been standardized recently (https://packaging.python.org/en/latest/specifications/source-distribution-format/, blessed via https://peps.python.org/pep-0643/ in October 2020). There are (a very small, but non-zero, number of) source distributions which contain no top-level directory, or a top-level directory that does not match the package metadata. Finally, I'm not sure that PyPI enforces this sdist format for new uploads either, so new non-conforming ones may pop up in the future.
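
To illustrate the kind of check this implies, here is a rough sketch that lists the top-level entries of a tarball and compares them with the expected root directory. The helper names `top_level_names` and `has_expected_root` are made up for this example, and it assumes the expected root is exactly `{name}-{version}` with no name normalization, which is a simplification.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_names(tarball_path):
    """Return the set of distinct first path components found in a tarball."""
    names = set()
    with tarfile.open(tarball_path) as tar:
        for member in tar.getmembers():
            # PurePosixPath drops a leading "./"; also ignore a leading "/".
            parts = [p for p in PurePosixPath(member.name).parts if p != "/"]
            if parts:
                names.add(parts[0])
    return names


def has_expected_root(tarball_path, name, version):
    """True if the only top-level entry is named '{name}-{version}'.

    Assumption: no normalization of the project name is applied here.
    """
    return top_level_names(tarball_path) == {f"{name}-{version}"}
```

A tarball with files directly at its root, or with a root directory named differently from the metadata, would make `has_expected_root` return `False`, which is the case where stripping the first level would not be safe.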

Diverging from the layout of the original tarball may make efforts to keep the metadata needed to efficiently rebuild original tarballs (via disarchive) harder.

So, overall, I'm not sure that we should be changing the layout of the archived content.

However, the UX issue would be worth fixing in a generic way. Maybe it would make sense for "directory + path" Web UI views to skip over "trivial directories" (directories with a single entry that is itself a directory)? I'm not sure how such "skipped levels" should be represented so that they can be used generically when switching branches.
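
As a rough sketch of what "skipping trivial directories" could mean, the snippet below descends through a directory tree as long as the current directory has a single entry that is itself a directory. The tree representation (nested dicts, with `None` for files) and the function name `skip_trivial_directories` are assumptions made for this example; they are not how swh-web models directories.

```python
def skip_trivial_directories(tree):
    """Descend through directories whose only entry is another directory.

    `tree` maps entry names to a nested dict (directory) or None (file).
    Returns (skipped_names, first_non_trivial_directory).
    """
    skipped = []
    while len(tree) == 1:
        (name, child), = tree.items()
        if not isinstance(child, dict):
            break  # the single entry is a file: nothing more to skip
        skipped.append(name)
        tree = child
    return skipped, tree


# Two hypothetical releases whose root directory name differs.
release_a = {"pkg-1.0": {"setup.py": None, "src": {"mod.py": None}}}
release_b = {"pkg-1.1": {"setup.py": None, "src": {"mod.py": None}}}

skipped_a, root_a = skip_trivial_directories(release_a)  # skipped_a == ["pkg-1.0"]
skipped_b, root_b = skip_trivial_directories(release_b)  # skipped_b == ["pkg-1.1"]
```

Keeping the skipped prefix separate from the rest of the path means a path such as `src/mod.py` stays the same across both releases, while each branch carries its own prefix. Whether that is the right representation for switching branches is exactly the open question.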

But clearly it's not great. I wonder if we could do something about this in swh-web instead.

I agree the browsing experience for that kind of origin is not great. Maybe we could redirect to that directory's content if the root directory for an origin branch contains only a single directory?

In T4512#90697, @olasd wrote:

Diverging from the layout of the original tarball may make efforts to keep the metadata needed to efficiently rebuild original tarballs (via disarchive) harder.

As far as I can tell, Disarchive ignores file names and identifies files by their hashes. Changing the layout would increase the size of pristine-tar's diffs, though, but that is not something that should affect us in the foreseeable future.