lookup ingested tarballs (or similar source code containers) by container checksum
Open, Low, Public

Description

Package repositories (PyPI and Hackage for instance) provide a checksum for each of their packages. Unfortunately, this checksum is computed on the tarball itself, and not on its content.
A direct consequence is that the checksum of a release downloaded from Software Heritage is not equal to the checksum of the same release exposed by the package repositories (because SWH doesn't preserve file permissions, timestamps,...).

In the context of Nix, we often compare the checksum provided by package repositories to the checksum of the downloaded artifact. In this case, the Nix checksum verification fails if we download an artifact from SWH. It would actually be really nice if package repositories could expose a checksum on the content and not on the container (the tarball)!

Do you think it would be possible/pertinent to create a swhid for tarballs?

I'm thinking of something such as swh:1:tar:XXXX. To compute the hash, the file would first be unpacked and the checksum would be computed on the content. To verify this hash, we would know we have to unpack the file before computing the hash.

Note there are corner cases that could be hard to manage, such as archives without any top level directory.

Event Timeline

lewo created this task. Jun 2 2020, 7:03 PM
lewo created this object in space S1 Public.

Thanks for submitting this!

I have the feeling that upstreams that want to use SWH as fallback could store the SWHID of the directory of the decompressed contents of the archive (after trimming a potential top level directory), and that would capture most if not all of the needed information to substitute the archive.

The main concern would be for build systems depending on "quirks" of the archive format (e.g. sticky bits, different permissions, different owners, modification times...), but that shouldn't be a problem in the vast majority of cases.

As a stop-gap, we could provide a lookup service from the hash of original tarballs (which we store as metadata of the generated revisions) to SWHIDs of (decompressed) directories.

I wonder what @rdicosmo / @zack / @moranegg think about this, hence my ping.

An important issue indeed :-)

A tar (or zip, or ar... etc.) is a container that once unpacked yields just a directory, so the SWHID for directories (swh:1:dir: ...) already provides everything necessary to identify its content (excluding the quirks mentioned by @olasd): we do not need anything new (whether a top level directory is present or not is not a problem).
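To make the point above concrete, here is a minimal Python sketch of a directory identifier that depends only on the unpacked payload. It assumes the identifier is a git-style tree hash (SHA-1 over sorted entries, keeping only file names, the executable bit, and symlink targets); the function names are made up for illustration, and results should be checked against the official Software Heritage tooling before being relied on.

```python
import hashlib
import os
import stat


def _object_id(kind: str, payload: bytes) -> bytes:
    """SHA-1 over a git-style header ("<kind> <length>" + NUL) and the payload."""
    header = f"{kind} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).digest()


def directory_id(path: str) -> bytes:
    """Return the 20-byte tree hash of the directory at `path` (illustrative helper)."""
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        mode = os.lstat(full).st_mode
        raw = os.fsencode(name)
        if stat.S_ISLNK(mode):
            # A symlink is hashed as a blob holding its target path.
            target = os.fsencode(os.readlink(full))
            entries.append((raw, b"120000", raw, _object_id("blob", target)))
        elif stat.S_ISDIR(mode):
            # Directories sort as if their name ended with "/".
            entries.append((raw + b"/", b"40000", raw, directory_id(full)))
        elif stat.S_ISREG(mode):
            # Only the owner-executable bit survives; owners and mtimes are ignored.
            perms = b"100755" if mode & 0o100 else b"100644"
            with open(full, "rb") as handle:
                entries.append((raw, perms, raw, _object_id("blob", handle.read())))
        # Sockets, FIFOs and devices are skipped in this sketch.
    payload = b"".join(
        perms + b" " + raw + b"\x00" + oid for _, perms, raw, oid in sorted(entries)
    )
    return _object_id("tree", payload)


def directory_swhid(path: str) -> str:
    """Format the tree hash as a swh:1:dir identifier."""
    return "swh:1:dir:" + directory_id(path).hex()
```

Because timestamps and ownership never enter the hash, two tarballs that unpack to the same tree yield the same identifier even when the containers differ byte for byte.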

The real questions are:

  • where should the SWHID corresponding to the pure payload of a container be stored?
  • who is the source of trust for this correspondence?

There are two main possibilities we can consider.

Package distributor as source of trust

One solution would be for package managers to provide multiple checksums for their packages:

  • the checksum(s) of the container (as now done by most of them)
  • the swh:1:dir: of the unpacked payload, computed using swh-identify on the resulting directory

Providing multiple checksums is not uncommon (see https://packages.debian.org/buster/amd64/exim4-base/download for example), and adding a SWHID identifier would be straightforward to implement in those package managers that build the .tar archives themselves (like Debian). More work would be needed for package managers that just refer to existing archives and do not pack/unpack them (like the case of Hackage packages referenced in Nix that @lewo mentioned in a separate conversation).

Software Heritage as source of trust

The alternative is to keep this information in Software Heritage. Technically, it's not a big deal: whenever we harvest a container, we just need to store the container checksum(s) in the extrinsic metadata table for the directory node corresponding to the ingested/unpacked payload. No need to change the SWHIDs, we would just need to add an API entry that provides the correspondence between container checksum(s) and swh:dir identifiers (both ways!).
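As a rough illustration of the two-way correspondence, the sketch below keeps an in-memory table keyed by container checksum and by directory SWHID. The class and method names are hypothetical and do not describe any existing Software Heritage API or storage schema; SHA-256 is assumed as the container checksum.

```python
import hashlib
from collections import defaultdict


class ContainerIndex:
    """Hypothetical in-memory stand-in for the correspondence table."""

    def __init__(self):
        self._by_checksum = {}                # container checksum -> dir SWHID
        self._by_swhid = defaultdict(set)     # dir SWHID -> container checksums

    def record(self, container: bytes, dir_swhid: str) -> str:
        """Register a harvested container and the directory it unpacked to."""
        checksum = "sha256:" + hashlib.sha256(container).hexdigest()
        self._by_checksum[checksum] = dir_swhid
        self._by_swhid[dir_swhid].add(checksum)
        return checksum

    def directory_for(self, checksum: str):
        """Container checksum -> directory SWHID (None if never seen)."""
        return self._by_checksum.get(checksum)

    def containers_for(self, dir_swhid: str):
        """Directory SWHID -> every container checksum known to unpack to it."""
        return sorted(self._by_swhid.get(dir_swhid, ()))
```

Several containers can map to one directory (for example a .tar.gz and a .tar.xz of the same release), which is why the reverse lookup returns a list.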

Discussion

Here are the key pros and cons I see for each of these approaches

Having the package distributors compute the SWHID and keep the correspondence information has the advantage that the source of trust is the package distributor itself, so the community using that package manager has full control over the process and only needs to trust the same people that they already trust for the integrity of the packages.

The disadvantage is that one needs to go around and convince all package distributors to do so, and everybody needs to recompute the SWHID even if the same content has been seen/used/distributed in other places.

Promoting Software Heritage as the blessed keeper of the correspondence information has the disadvantage of making Software Heritage the source of trust: it remains to be seen whether all package distributors would be fine with this. If we go this way, we may really need to go further on the blockchain path, and ensure anybody can verify that we do not alter the correspondence table (voluntarily or by mistake) at a later time.
The clear advantage is that the SWHID computation would be done only once and by us.

Way forward

Keeping the correspondence table is something we may want to do anyway (or maybe we do already? @olasd?) to avoid reingesting a container we have already seen; and we can do it all the way down (for example, if a .tar.gz file is ingested, one may learn the checksums of the .tar.gz and of the .tar: keeping them both in the table may save time later and adds to our knowledge base).
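A small sketch of the "all the way down" idea: given the bytes of a .tar.gz, record a checksum for each layer. SHA-256 and the helper name are assumptions made for illustration.

```python
import gzip
import hashlib


def layer_checksums(tar_gz_bytes: bytes) -> dict:
    """Return SHA-256 checksums for the compressed file and for the inner tar."""
    tar_bytes = gzip.decompress(tar_gz_bytes)
    return {
        "tar.gz": hashlib.sha256(tar_gz_bytes).hexdigest(),
        "tar": hashlib.sha256(tar_bytes).hexdigest(),
    }
```

Two .tar.gz files that differ only in their gzip header (timestamp, compression level) share the same inner .tar checksum, so keeping both entries lets a later lookup succeed on either one.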

So we should certainly add this feature to the roadmap, plan to provide the functionality, and then let package distributors choose whether to trust us or not, without forcing anybody to use the correspondence table.

zack added a comment. Jun 14 2020, 8:56 AM

Making explicit a direct answer to one of @lewo's questions (hinted at by both @olasd and @rdicosmo): no, we do not want a new type of SWHID (swh:1:tar:...) for source code containers, which from our point of view are ephemeral.

But the need to look up code by container checksums is real, and I agree with the overview of the options detailed by @rdicosmo.

Spreading the use of swh:1:dir:... so that it becomes commonplace among our upstreams would be great, but will take time.

Meanwhile, we can provide a lookup service from container checksums to dir SWHIDs, as suggested by @olasd. We already have the information stored (for the outer container at least); it's just a matter of adding an index of sorts for it.

zack renamed this task from A swhid for archives to lookup ingested tarballs (or similar source code containers) by container checksum. Jun 14 2020, 8:56 AM
zack triaged this task as Low priority.

@rdicosmo The discussion of the "source of trust" is an important one, and it's interesting to see how we can address it going forward.

The proposal of a correspondence table, as I wrote on swh-devel, leaves open the question of today's and yesterday's software, assuming SWHIDs become the de facto standard tomorrow. How can I check the integrity of code fetched from SWH if all I have is its tarball's SHA256 from its release announcement? How can I check its authenticity if all I have is an OpenPGP signature computed over a tarball?

The issue may be less problematic than it seems. Let me offer a few considerations for legacy tarballs (those for which we cannot ask the developer to put out the corresponding SWHID today :-)) :

  • if the only thing that you have is a SHA256 from the release announcement for a tarball that has never been ingested in SWH, and has been lost, there is not much you can do
  • if you still have that tarball at hand, then it can be ingested in SWH, and we keep the correspondence between SWHID and SHA256; in principle, you need to trust us, but one can foresee having external parties checking that the correspondence is real while the tarball is still there, and adding their observation to the chain of trust means you need to trust us less and less
zimoun added a subscriber: zimoun. Jun 23 2020, 6:12 PM

By we keep the correspondence between SWHID and SHA256 you mean you on the SWH side?

Yes... we are willing to do this, and work out the details on how to ensure that this correspondence can be trusted :-)

Thanks for your feedback, @rdicosmo!

I believe the only way the correspondence can be trusted is if the original tarball is still available and everyone can unpack it and compute the SWHID on the unpacked tree.

I see SWH still archives raw tarballs from ftp.gnu.org, such as grep-3.4.tar.xz (Jan. 2020). Perhaps an option would be to keep doing that, at least in some cases, and at the same time run a campaign encouraging developers to move away from container hashes/signatures?

Migration away from tarballs is already happening as more and more software is distributed straight from content-addressed VCS repositories, though progress has been relatively slow since we first discussed it in 2016.

Hello!

Do I get it right that the primary reason why tarballs aren't systematically archived is that doing so would be too expensive storage-wise (no deduplication)?

If that's the case, here's an idea that may be worth exploring: storing container metadata (tar, gzip, etc.) instead of the container itself. The hypothesis, which would have to be confirmed, is that in many cases we should be able to reproduce bit-identical containers from that metadata plus the corresponding SWH directory.

Thoughts?

zack added a comment. Jul 2 2020, 12:07 PM

@civodul I wanted to raise the topic of storing container metadata (in the style of what tools like pristine-tar do) here too, so thanks for giving me the chance :-)
I agree it might be a technical solution, *but*, I'm not sure I see the point.
Didn't you agree that having a "lookup service" from tarball/container checksums to SWHIDs (the Software Heritage identifiers, that can then be used to lookup stuff in the archive) would be enough to satisfy distro needs?
If yes, then "archiving container metadata" could be replaced by simply having a way to add entries to the lookup table. And allowing distros to do so is an option that we can explore. (Once the service exists, of course.)

Hi @zack,

In T2430#46040, @zack wrote:

@civodul I wanted to raise the topic of storing container metadata (in the style of what tools like pristine-tar do) here too, so thanks for giving me the chance :-)

I didn't know about pristine-tar, but it looks like precisely what I was looking for, thank you! :-)

I agree it might be a technical solution, *but*, I'm not sure I see the point.
Didn't you agree that having a "lookup service" from tarball/container checksums to SWHIDs (the Software Heritage identifiers, that can then be used to lookup stuff in the archive) would be enough to satisfy distro needs?

No; like I wrote above, I think the only way such a lookup service could be trusted is if the original tarball is actually available somewhere so that people/distros can check by themselves that the lookup service is right—which defeats the point of a lookup service.

Trusting an unverifiable correspondence table is not an option from a security standpoint. However, if pristine-tar "deltas" were available for each entry in the lookup service, then the correspondence table would be verifiable. Perhaps that's an option to consider.

If yes, then "archiving container metadata" could be replaced by simply having a way to add entries to the lookup table. And allowing distros to do so is an option that we can explore. (Once the service exists, of course.)

I think there are two problems to solve: that of today's distros, which refer to tarballs, and that of tomorrow's distros.

Today's distros want to check the hash of tarballs; there's no way around it without compromising on security and reproducibility.

When we had this discussion in 2016, we thought tarballs would rather quickly disappear, but that's taking more time than we thought.

For packages still distributed as tarballs, probably we should change distro tools and practices to store content hashes (like Git tree hashes) instead of tarball hashes. Though again, that raises the question of authentication as long as upstream signs tarballs, not content hashes. I'm concerned about archived code that people will not be able to authenticate.

I'll ponder this some more and we'll see what we can do in Guix for now and for later. Thanks for listening! :-)

vlorentz added a subscriber: vlorentz. Edited Aug 6 2020, 4:19 PM

I started looking into this, using https://nix-community.github.io/nixpkgs-swh/sources-unstable.json as a source of archive files.

pristine-tar works pretty well after round-tripping through SWH, although it fails sometimes with .bz2. The delta is between 1 and 5% of the size of the original file for almost all the archives I tried.

I hacked together "pristine-zip", a clone of pristine-tar for .zip files; half the time the delta is less than 1% of the original file, but for the other half, storing the delta is almost as bad as storing the original file, because I can't guess what the original compressor was. And I didn't even start testing how stable pristine-zip is if you transfer the delta between different systems.

However, after a quick glance at Nix's sources, it seems like all the .zip files contain generated fonts, which is outside the scope of SWH anyway (even though we currently archive them as a side-effect of archiving source code).

So in short, I think we could store these delta files for .tar.{gz,bz2,xz}, but we must keep in mind it's an imperfect solution (doesn't work all the time + will break eventually as compression software is updated)

Hi all,

I’m a Guix contributor, and I’ve built some software to address this issue: Disarchive. The basic concept was dreamed up by @civodul, and I have been implementing it and working out the kinks. It’s on its way to becoming a tool that Guix can use to recover source code archives. Ideally it could be useful to the broader software archival community as well, which is why I’m presenting it here.

The purpose of the software is to disassemble a source code archive (e.g., a Gzip’d tarball) into its data and metadata. The data is what is normally the purview of Software Heritage: the files and directories themselves. The metadata is all the tarball usernames, Gzip timestamps, etc. Later, it can use this metadata along with the files themselves to reconstruct the original source code archive.

This is similar to what pristine-tar does, but our goal is to make the metadata transparent and readable. Where pristine-tar relies on binary deltas to avoid having to interpret the metadata, we strive to interpret the metadata and store it in a structured way. This makes the resulting database a lot nicer, since it clearly describes the parts of the source code archive not captured by Software Heritage. Of course, this comes at a price, which is development effort. So far it is coming along nicely, but we only support Gzip and tarballs. Judging by the pristine-tar source code, adding bzip2 and XZ support should be pretty easy. (I know less about ZIP, but @vlorentz’s results suggest it could be a bit tricky.)

Now for an example. Let’s say we want to disassemble the GMP-ECM 7.0.4 tarball (which is hosted on InriaForge, and will disappear in a few months). First we tell Disarchive where to store metadata:

export DISARCHIVE_DB=/tmp/disarchive-db

Now we can run Disarchive:

disarchive save ecm-7.0.4.tar.gz

This produces three files in the database directory: one for the Gzip layer, one for the tarball layer, and one for the directory reference. Each of these files is human-readable structured data. For instance, the Gzip file looks like this:

(gzip-member
  (version 0)
  (name "ecm-7.0.4.tar.gz")
  (input
    (sha256
      "ee3a58443d65a0ad7e61e5b7f14468796a06af15608cb8cc1aaaa55958bce60d"))
  (header
    (mtime 1476178165)
    (extra-flags 2)
    (os 3))
  (footer
    (crc 1359967944)
    (isize 6010880))
  (compressor gnu-best)
  (digest
    (sha256
      "0cf7b3eee8462cc6f98b418b47630e1eb6b3f4f8c3fc1fb005b08e2a1811ba43")))

The “input” field is a reference to the tarball, and “digest” is the hash of the original Gzip’d file. The “header” and “footer” fields contain the Gzip metadata, and “compressor” is an opaque reference to the compression algorithm used (in this case, GNU Gzip with the “--best” flag).
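For readers unfamiliar with the Gzip container, the "header" and "footer" values above live in fixed positions defined by RFC 1952: a 10-byte header at the start and an 8-byte trailer at the end. The Python sketch below reads them directly; it is not Disarchive code, the function name is made up, and it assumes a file with a single Gzip member.

```python
import struct


def gzip_metadata(raw: bytes) -> dict:
    """Extract the fixed header and trailer fields of a single-member gzip file."""
    # Header: magic (2 bytes), method, flags, mtime (4 bytes), extra flags, OS.
    magic, method, flags, mtime, extra_flags, os_code = struct.unpack(
        "<HBBIBB", raw[:10]
    )
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a deflate-compressed gzip member")
    # Trailer: CRC-32 of the uncompressed data, then its size modulo 2**32.
    crc, isize = struct.unpack("<II", raw[-8:])
    return {
        "flags": flags,
        "mtime": mtime,
        "extra-flags": extra_flags,
        "os": os_code,
        "crc": crc,
        "isize": isize,
    }
```

The compressed stream itself carries no record of which tool or settings produced it, which is presumably why the "compressor" field has to be stored as a separate hint.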

The tarball file is similar, but contains tarball metadata. The directory reference file stores a list of references to the original directory data. Currently this means a Software Heritage directory ID, but in the future it could include other content-addressed archives.

To reassemble the archive, we simply run:

disarchive load 0cf7b3eee8462cc6f98b418b47630e1eb6b3f4f8c3fc1fb005b08e2a1811ba43 ecm-7.0.4.tar.gz

This will fetch the directory from the SWH archive, rebuild the tarball, compress it, and write the output to “ecm-7.0.4.tar.gz”. The output will be bit-for-bit identical to the original. In this case it pulls all the metadata from the local database, but it can also get it over the Web. This means that if the database was mirrored on GitHub, it could be accessed directly from the Web interface (with the added bonus of being automatically archived by SWH).

Our current plan is to do the disassembling on our CI infrastructure, and then aggregate the results and store them in Git. The recovery process would happen on the user’s computer, accessing the Disarchive database over the Web.

Nothing about the tool is Guix-specific, and my hope is that it could be useful to other projects as well. To that end, your comments and suggestions are most welcome!

Thank you @samplet for sharing this great work: I am really looking forward to seeing you and @vlorentz compare notes and see whether we can archive the three extra files you produce as extrinsic metadata in the SWH archive!

A quick suggestion about the swh entry in the directory reference file: please use the full SWHID, which is future-proof, instead of just the hash. It would also be better to use swhid as the field name. In this case, this would give

(swhid "swh:1:dir:7803753618040f30f97daa729006d1a34da0e8df"))

Pinging also @lewo who might also be quite interested.

@samplet

I tried this as well, before trying pristine-tar. An issue I ran into is that storing the value of fields isn't enough, because there are multiple ways to represent them in tar (eg. numbers are usually null-terminated strings, so \x00\x00\x00\x00, 0\x00\x00\x00, 0000, and \x00123 represent the same value).

I ended up dumping the entire tar headers to deal with this issue, but it produces metadata files much bigger than pristine-tar.

I see in your tarball metadata file that you're only storing values; how do you deal with restoring the right serialization?

@rdicosmo the full ID is a better choice. Thanks!

@vlorentz I wrote a custom tarball reader that keeps track of all of the formatting and such. To cut down on the size, the format allows for specifying defaults for an entire tarball. Most tarballs are self-consistent, so they tend to use the same formatting throughout. This means I only have to store it once. I made a careful effort to make sure the tarball processing is lossless (there are a few bugs left to be squashed, though).

For an example, take a look at the metadata for “bitcoin-0.19.1.tar”. At the beginning of the file there is a “default-header” field that specifies:

(default-header
  ...
  (devmajor-format (width 0))
  (devminor-format (width 0)))

This means that the “devmajor” and “devminor” fields represent zero by an empty string rather than the expected "0000000".
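The serialization issue discussed in this exchange can be sketched in a few lines of Python: several byte patterns decode to the same number, so the value alone cannot restore the original bytes, but the value plus a recorded digit width can. This is an illustration only and not Disarchive's actual format; the helper names are hypothetical, and it only handles NUL-padded octal fields (not space-terminated or base-256 ones).

```python
def decode_octal(field: bytes) -> int:
    """Decode a tar numeric field, treating an all-NUL field as zero."""
    digits = field.strip(b"\x00 ")
    return int(digits.decode("ascii"), 8) if digits else 0


def digit_width(field: bytes) -> int:
    """Number of digit characters actually written (0 for an empty field)."""
    return len(field.rstrip(b"\x00"))


def encode_octal(value: int, width: int, size: int = 8) -> bytes:
    """Re-create the field bytes from the value and the recorded width."""
    digits = format(value, "o").zfill(width).encode("ascii") if width else b""
    return digits.ljust(size, b"\x00")


empty = b"\x00" * 8          # zero written as an empty string
padded = b"0000000\x00"      # zero written as seven digits plus a terminator

# Same value, different bytes: the width is what tells them apart.
assert decode_octal(empty) == decode_octal(padded) == 0
assert encode_octal(0, digit_width(empty)) == empty
assert encode_octal(0, digit_width(padded)) == padded
```

Storing the width once as a tarball-wide default, as described above, keeps the per-entry metadata down to the values themselves in the common case where every header uses the same formatting.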

At the moment, a database for around 3,912 archives is 295M. That works out to about 77K per archive. It gets way smaller if compressed, because the field names are repetitive. We could do better by removing the “size” and “checksum” fields in cases where they are not surprising. Then we would only need to store the filename and timestamp for the majority of tarball entries. I haven’t done any experiments with pristine-tar, so I don’t know how it compares. Generally, the Disarchive metadata is smaller than the original tarball overhead.

lewo added a comment. Aug 27 2020, 10:28 PM

@samplet wow! that's pretty cool! Thank you ;)

I really want to try it out and start to run it on our CI to see how it behaves.
It would be nice to "disarchive" archives in the nixguix loader and reference these extra files in the release metadata;)

@lewo Great! The code is still “technical preview” quality, so there are some rough edges. Also, I’m willing to make architectural changes to better suit SWH. For instance, the whole “database” model might not be necessary if we want to store the metadata along with the files themselves. Feel free to ask for whatever help or changes you need! In the meantime, I will work on fixing some bugs and cleaning things up.

@vlorentz Just a follow-up about the size. I ran 3,884 archives from my set through pristine-tar (the other 28 resulted in errors). Here’s a table comparing the sizes. Each row only considers the 3,884 archives that made it through pristine-tar. The pristine-tar format is Gzip’d, which is why I checked the Gzip’d size for Disarchive. Note this is raw bytes rather than size on disk.

Format             Total bytes   Bytes/archive   KiB/archive
pristine-tar        42,982,551          11,067          10.8
Disarchive (raw)   237,297,973          61,096          59.7
Disarchive (Gzip)   22,698,724           5,844           5.7
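The per-archive columns follow directly from the totals and the 3,884 archives considered. A short Python check of that arithmetic (the variable and function names are only for illustration):

```python
ARCHIVES = 3884  # archives that made it through pristine-tar

TOTAL_BYTES = {
    "pristine-tar": 42_982_551,
    "Disarchive (raw)": 237_297_973,
    "Disarchive (Gzip)": 22_698_724,
}


def per_archive(total_bytes: int, archives: int = ARCHIVES):
    """Return (bytes per archive, KiB per archive), rounded as in the table."""
    bytes_each = total_bytes / archives
    return round(bytes_each), round(bytes_each / 1024, 1)


for name, total in TOTAL_BYTES.items():
    print(name, per_archive(total))
```

On these numbers, the Gzip'd Disarchive metadata comes to roughly half the size of the pristine-tar deltas.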

@samplet so it's an improvement over pristine-tar, great!

I'll try out Disarchive, to see if it can reproduce all the tarballs SWH has seen.

Update: I just tried out my script to use Disarchive as an alternative to pristine-tar. It was pretty easy since they both work similarly (from the outside, ofc).

And in addition to being more space-efficient, disarchive is also faster than pristine-tar :)