
lookup ingested tarballs (or similar source code containers) by container checksum

Description

Package repositories (PyPI and Hackage, for instance) provide a checksum for each package. Unfortunately, this checksum is computed on the tarball itself, not on its content.
A direct consequence is that the checksum of a release downloaded from Software Heritage is not equal to the checksum of the same release exposed by the package repository (because SWH doesn't preserve file permissions, timestamps, and other container metadata).
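The distinction can be made concrete with a small Python sketch (a toy illustration, not SWH's actual hashing): two tarballs that pack the same file with different timestamps have different container checksums, yet an identical content-level digest.

```python
import hashlib
import io
import tarfile

def tarball_bytes(mtime: int) -> bytes:
    """Pack the same single file into a tar, varying only the mtime."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"print('hello')\n"
        info = tarfile.TarInfo(name="pkg/hello.py")
        info.size = len(data)
        info.mtime = mtime
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def content_digest(tar_bytes: bytes) -> str:
    """Hash only member names and file contents, ignoring tar metadata."""
    h = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            h.update(member.name.encode())
            if member.isfile():
                h.update(tar.extractfile(member).read())
    return h.hexdigest()

a = tarball_bytes(mtime=0)
b = tarball_bytes(mtime=1_000_000)
# Container checksums differ because of the embedded timestamps...
assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()
# ...but the content-level digest is identical.
assert content_digest(a) == content_digest(b)
```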

In the context of Nix, we often compare the checksum provided by package repositories to the checksum of the downloaded artifact. In this case, the Nix checksum verification fails if we download an artifact from SWH. It would actually be really nice if package repositories could expose a checksum of the content and not of the container (the tarball)!

Do you think it would be possible/pertinent to create a SWHID for tarballs?

I'm thinking of something like swh:1:tar:XXXX. To compute the hash, the file would first be unpacked, and the checksum would be computed on the content. To verify this hash, we would know to unpack the file before computing it.

Note there are corner cases that could be hard to manage, such as archives without any top level directory.

Event Timeline

lewo created this task. Jun 2 2020, 7:03 PM

Thanks for submitting this!

I have the feeling that upstreams that want to use SWH as fallback could store the SWHID of the directory of the decompressed contents of the archive (after trimming a potential top level directory), and that would capture most if not all of the needed information to substitute the archive.

The main concern would be for build systems depending on "quirks" of the archive format (e.g. sticky bits, different permissions, different owners, modification times...), but that shouldn't be a problem in the vast majority of cases.

As a stop-gap, we could provide a lookup service from the hash of original tarballs (which we store as metadata of the generated revisions) to SWHIDs of (decompressed) directories.

I wonder what @rdicosmo / @zack / @moranegg think about this, hence my ping.

An important issue indeed :-)

A tar (or zip, or ar, etc.) is a container that, once unpacked, yields just a directory, so the SWHID for directories (swh:1:dir:...) already provides everything necessary to identify its content (excluding the quirks mentioned by @olasd): we do not need anything new (whether a top-level directory is present or not is not a problem).

The real questions are:

  • where should the SWHID corresponding to the pure payload of a container be stored?
  • who is the source of trust for this correspondence?

There are two main possibilities we can consider.

Package distributor as source of trust

One solution would be for package managers to provide multiple checksums for their packages:

  • the checksum(s) of the container (as now done by most of them)
  • the swh:1:dir: of the unpacked payload, computed using swh-identify on the resulting directory

Providing multiple checksums is not uncommon (see https://packages.debian.org/buster/amd64/exim4-base/download for example), and adding a SWHID would be straightforward to implement in those package managers that build the .tar archives themselves (like Debian). More work would be needed for package managers that just refer to existing archives and do not pack/unpack them (like the Hackage packages referenced in Nix that @lewo mentioned in a separate conversation).

Software Heritage as source of trust

The alternative is to keep this information in Software Heritage. Technically, it's not a big deal: whenever we harvest a container, we just need to store the container checksum(s) in the extrinsic metadata table for the directory node corresponding to the ingested/unpacked payload. No need to change the SWHIDs; we would just need to add an API entry that provides the correspondence between container checksum(s) and swh:1:dir: identifiers (both ways!).
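As a rough illustration of such an API entry, here is a hypothetical in-memory sketch of the two-way correspondence (the names ContainerLookup, record, swhid_for, and checksums_for are invented for this example; a real service would be backed by SWH's metadata storage, not a Python dict):

```python
import hashlib
from typing import Optional, Set

class ContainerLookup:
    """Toy in-memory sketch of the proposed lookup service:
    container checksum -> directory SWHID, and back."""

    def __init__(self) -> None:
        self._by_checksum = {}  # sha256 hex -> swh:1:dir: SWHID
        self._by_swhid = {}     # SWHID -> set of known container checksums

    def record(self, tar_bytes: bytes, dir_swhid: str) -> str:
        """Store the correspondence at ingestion time; returns the checksum."""
        checksum = hashlib.sha256(tar_bytes).hexdigest()
        self._by_checksum[checksum] = dir_swhid
        self._by_swhid.setdefault(dir_swhid, set()).add(checksum)
        return checksum

    def swhid_for(self, checksum: str) -> Optional[str]:
        """Forward direction: container checksum -> directory SWHID."""
        return self._by_checksum.get(checksum)

    def checksums_for(self, dir_swhid: str) -> Set[str]:
        """Reverse direction: directory SWHID -> all known container checksums
        (several containers may unpack to the same directory)."""
        return self._by_swhid.get(dir_swhid, set())
```

A distro client would then resolve the tarball hash it already knows into a SWHID it can fetch from the archive.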

Discussion

Here are the key pros and cons I see for each of these approaches.

Having the package distributors compute the SWHID and keep the correspondence information has the advantage that the source of trust is the package distributor itself, so the community using that package manager has full control over the process and only needs to trust the same people they already trust for the integrity of the packages.

The disadvantage is that one needs to go around and convince all package distributors to do so, and everybody needs to recompute the SWHID even if the same content has been seen/used/distributed in other places.

Promoting Software Heritage as the blessed keeper of the correspondence information has the disadvantage of making Software Heritage the source of trust: it remains to be seen whether all package distributors would be fine with this. If we go this way, we may really need to go further down the blockchain path, and ensure anybody can verify that we do not alter the correspondence table (voluntarily or by mistake) at a later time.
The clear advantage is that the SWHID computation would be done only once and by us.

Way forward

Keeping the correspondence table is something we may want to do anyway (or maybe we do already? @olasd?) to avoid reingesting a container we have already seen; and we can do it all the way down (for example, if a .tar.gz file is ingested, one may learn the checksums of the .tar.gz and of the .tar: keeping them both in the table may save time later and adds to our knowledge base).

So we should certainly add this feature to the roadmap, plan to provide the functionality, and then let package distributor choose whether to trust us or not, without forcing anybody to use the correspondence table.

zack added a comment. Jun 14 2020, 8:56 AM

Making explicit a direct answer to one of @lewo's questions (hinted at by both @olasd and @rdicosmo): no, we do not want a new type of SWHID (swh:1:tar:...) for source code containers, which from our point of view are ephemeral.

But the need to lookup code by container checksums is real, and I agree with the overview of the options detailed by @rdicosmo.

Spreading the use of swh:1:dir:... so that it becomes commonplace among our upstreams would be great, but will take time.

Meanwhile, we can provide a lookup service from container checksums to dir SWHIDs, as suggested by @olasd. We already have the information stored (for the outer container at least); it's just a matter of adding an index of sorts for it.

zack renamed this task from "A swhid for archives" to "lookup ingested tarballs (or similar source code containers) by container checksum". Jun 14 2020, 8:56 AM
zack triaged this task as Low priority.

@rdicosmo The discussion of the "source of trust" is an important one, and it's interesting to see how we can address it going forward.

The proposal of a correspondence table, as I wrote on swh-devel, leaves open the question of today's and yesterday's software, assuming SWHIDs become the de facto standard tomorrow. How can I check the integrity of code fetched from SWH if all I have is its tarball's SHA256 from its release announcement? How can I check its authenticity if all I have is an OpenPGP signature computed over a tarball?


The issue may be less problematic than it seems. Let me offer a few considerations for legacy tarballs (those for which we cannot ask the developer to put out the corresponding SWHID today :-)):

  • if the only thing you have is a SHA256 from the release announcement for a tarball that has never been ingested into SWH and has since been lost, there is not much you can do
  • if you still have that tarball at hand, then it can be ingested into SWH, and we keep the correspondence between SWHID and SHA256; in principle, you need to trust us, but one can foresee external parties checking that the correspondence is real while the tarball is still available, and adding their observations to the chain of trust means you need to trust us less and less
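That external-verification step could look roughly like the Python sketch below; content_digest here is a toy stand-in for the real swh:1:dir: computation (which would use swh identify on the unpacked tree), and verify_correspondence is a hypothetical helper name:

```python
import hashlib
import io
import tarfile

def content_digest(tar_bytes: bytes) -> str:
    """Toy stand-in for the real swh:1:dir: computation: hashes only
    member names and file contents, ignoring container metadata."""
    h = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            h.update(member.name.encode())
            if member.isfile():
                h.update(tar.extractfile(member).read())
    return h.hexdigest()

def verify_correspondence(tar_bytes: bytes, claimed_sha256: str,
                          claimed_content_id: str) -> bool:
    """An external party holding the original tarball can check both sides
    of a (container checksum, content identifier) table entry."""
    return (hashlib.sha256(tar_bytes).hexdigest() == claimed_sha256
            and content_digest(tar_bytes) == claimed_content_id)
```

Each successful independent check attests that the table entry was honest while the tarball still existed.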
zimoun added a subscriber: zimoun. Jun 23 2020, 6:12 PM

By "we keep the correspondence between SWHID and SHA256", do you mean you on the SWH side?

In T2430#45767, @zimoun wrote:

By "we keep the correspondence between SWHID and SHA256", do you mean you on the SWH side?

Yes... we are willing to do this, and work out the details on how to ensure that this correspondence can be trusted :-)

Thanks for your feedback, @rdicosmo!


I believe the only way the correspondence can be trusted is if the original tarball is still available and everyone can unpack it and compute the SWHID on the unpacked tree.

I see SWH still archives raw tarballs from ftp.gnu.org, such as grep-3.4.tar.xz (Jan. 2020). Perhaps an option would be to keep doing that, at least in some cases, and at the same time run a campaign encouraging developers to move away from container hashes/signatures?

Migration away from tarballs is already happening as more and more software is distributed straight from content-addressed VCS repositories, though progress has been relatively slow since we first discussed it in 2016.

Hello!

Do I get it right that the primary reason why tarballs aren't systematically archived is that doing so would be too expensive storage-wise (no deduplication)?

If that's the case, here's an idea that may be worth exploring: storing container metadata (tar, gzip, etc.) instead of the container itself. The hypothesis, which would have to be confirmed, is that in many cases we should be able to reproduce bit-identical containers from that metadata plus the corresponding SWH directory.
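A toy version of that idea, in the spirit of pristine-tar: record each tar member's metadata alongside the archived content, then rebuild the container from the two. This sketch handles only uncompressed tarballs of regular files written in a known tar format; real-world reproduction also has to deal with the compression layer, member ordering, and format quirks, which is exactly what pristine-tar's deltas capture.

```python
import io
import tarfile

def split(tar_bytes: bytes):
    """Separate a tarball into (file contents, per-member metadata)."""
    contents, metadata = {}, []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for m in tar.getmembers():
            metadata.append({"name": m.name, "size": m.size, "mtime": m.mtime,
                             "mode": m.mode, "uid": m.uid, "gid": m.gid,
                             "uname": m.uname, "gname": m.gname})
            if m.isfile():
                contents[m.name] = tar.extractfile(m).read()
    return contents, metadata

def rebuild(contents, metadata) -> bytes:
    """Re-create the container from archived content plus stored metadata.
    Bit-identical output assumes the original used the same tar format."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for meta in metadata:
            info = tarfile.TarInfo(name=meta["name"])
            for field in ("size", "mtime", "mode", "uid", "gid",
                          "uname", "gname"):
                setattr(info, field, meta[field])
            tar.addfile(info, io.BytesIO(contents.get(meta["name"], b"")))
    return buf.getvalue()
```

The metadata is tiny compared to the content, and the content itself deduplicates against what the archive already stores.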

Thoughts?

zack added a comment. Jul 2 2020, 12:07 PM

@civodul I wanted to raise the topic of storing container metadata (in the style of what tools like pristine-tar do) here too, so thanks for giving me the chance :-)
I agree it might be a technical solution, *but*, I'm not sure I see the point.
Didn't you agree that having a "lookup service" from tarball/container checksums to SWHIDs (the Software Heritage identifiers, that can then be used to lookup stuff in the archive) would be enough to satisfy distro needs?
If yes, then "archiving container metadata" could be replaced by simply having a way to add entries to the lookup table. And allowing distros to do so is an option that we can explore. (Once the service exists, of course.)

Hi @zack,

In T2430#46040, @zack wrote:

@civodul I wanted to raise the topic of storing container metadata (in the style of what tools like pristine-tar do) here too, so thanks for giving me the chance :-)

I didn't know about pristine-tar, but it looks like precisely what I was looking for, thank you! :-)

I agree it might be a technical solution, *but*, I'm not sure I see the point.
Didn't you agree that having a "lookup service" from tarball/container checksums to SWHIDs (the Software Heritage identifiers, that can then be used to lookup stuff in the archive) would be enough to satisfy distro needs?

No; like I wrote above, I think the only way such a lookup service could be trusted is if the original tarball is actually available somewhere so that people/distros can check by themselves that the lookup service is right—which defeats the point of a lookup service.

Trusting an unverifiable correspondence table is not an option from a security standpoint. However, if pristine-tar "deltas" were available for each entry in the lookup service, then the correspondence table would be verifiable. Perhaps that's an option to consider.

If yes, then "archiving container metadata" could be replaced by simply having a way to add entries to the lookup table. And allowing distros to do so is an option that we can explore. (Once the service exists, of course.)

I think there are two problems to solve: that of today's distros, which refer to tarballs, and that of tomorrow's distros.

Today's distros want to check the hash of tarballs; there's no way around it without compromising on security and reproducibility.

When we had this discussion in 2016, we thought tarballs would rather quickly disappear, but that's taking more time than we thought.

For packages still distributed as tarballs, we should probably change distro tools and practices to store content hashes (like Git tree hashes) instead of tarball hashes. Though again, that raises the question of authentication as long as upstream signs tarballs, not content hashes. I'm concerned about archived code that people will not be able to authenticate.
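For reference, Git's content addressing works exactly this way: the object id is a hash over a small header plus the file bytes, independent of any container. A minimal sketch of the blob case:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute a Git blob id: SHA-1 over a 'blob <size>\\0' header plus the
    content. The id depends only on the file bytes, never on packaging."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The well-known id of the empty blob:
assert git_blob_id(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Tree ids are built the same way over (mode, name, child id) entries, which is why two checkouts with identical content always share a tree hash regardless of how they were transported.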

I'll ponder this some more and we'll see what we can do in Guix for now and for later. Thanks for listening! :-)