Paste P537

about asking gnu sysadmins to add checksum to their listing manifest

Authored by ardumont on Oct 1 2019, 12:10 PM.
Hello,
It's the time of the year when I ask you (again!) for your help to better archive GNU source code in the Software Heritage archive.
Would it be possible to change the format of the GNU file listing [1] to also include SHA256 checksums?
[1]: https://ftp.gnu.org/tree.json.gz
Doing so would (1) help us (Software Heritage) avoid re-downloading files (hence also reducing load on your end) and (2) help your users detect corruption of downloaded files.
Cheers,
--
tony / Antoine R. Dumont (@ardumont)
-----------------------------------------------------------------
gpg fingerprint BF00 203D 741A C9D5 46A8 BE07 52E2 E984 0D10 C3B8

Reply:

Ian Kelling via RT <sysadmin@gnu.org>
Tue, Oct 1, 7:33 PM

to ardumont

On Tue Oct 01 07:03:16 2019, ardumont@softwareheritage.org wrote:
> Hello,
> 
> It's the time of the year when I ask you (again!) for your help to
> better archive GNU source code in the Software Heritage archive.
> 
> Would it be possible to change the format of the GNU file listing [1] to
> also include SHA256 checksums?
> 
> [1] https://ftp.gnu.org/tree.json.gz
> 
> Doing so would (1) help us (Software Heritage) avoid re-downloading
> files (hence also reducing load on your end) and (2) help your users
> detect corruption of downloaded files.

The timestamp in tree.json.gz is equivalent to sha256 sums for avoiding
redownloading.

For corruption, since mid 2003, files come with a corresponding gpg .sig
file which is enforced by automation, and for files before that we have
https://ftp.gnu.org/before-2003-08-01.md5sums.asc. The .sig files are
how users should detect file corruption, which is better than sha256
because it provides authenticity at the same time. I think we would
rather not publish sha256 sums because users will end up using that
instead of .sig files and we don't want them to use a less secure
method. The GPG authors agree with using .sig files for downloading
software https://gnupg.org/download/integrity_check.html. I encourage
you to download the .sig files and put them in the metadata about
the software.

Another way to avoid redownloading is running wget -N FILE or equivalent,
which will return 304 Not Modified.

Cheers.

I'm happy someone replied, and I'll thank him in due time.
Not that I intend to send the following as the actual reply to the email;
I'm just throwing my thoughts down here.

> The timestamp in tree.json.gz is equivalent to sha256 sums for avoiding
> redownloading.

Yes, that's our fallback policy.
Somehow, the checksums feel more "secure" though.
But that may be a biased opinion (since we are swimming in checksums for our DAG model).
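
To make the comparison concrete, here is a minimal sketch of the two skip heuristics side by side, assuming tree.json.gz entries carry name, time, and size fields plus a hypothetical sha256 field; the field names and the known-artifacts shape are illustrative, not our actual code:

    import gzip
    import json

    def load_listing(path):
        # Parse the gzipped GNU file listing; the exact entry fields
        # ("name", "time", "size") are an assumption about tree.json.gz.
        with gzip.open(path, "rt") as f:
            return json.load(f)

    def should_download(entry, known):
        # known maps a file name to the attributes seen at the last visit.
        seen = known.get(entry["name"])
        if seen is None:
            return True  # never seen before: download
        if "sha256" in entry and "sha256" in seen:
            # Checksum heuristic: content-addressed, like our DAG model.
            return entry["sha256"] != seen["sha256"]
        # Fallback heuristic: mtime/size fingerprint, as Ian suggests.
        return (entry["time"], entry.get("size")) != (seen["time"], seen.get("size"))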

> For corruption, ... (paraphrased) use the .sig files and integrate them into the metadata; it's better, stronger, safer.

That means we would need to integrate gpg (and its keyring state) into all our workers...
So no, intuitively, that would complicate the code too much: too many extra cogs to deal with.
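
For scale, here is roughly what each worker would have to do, sketched with an ephemeral GNUPGHOME so the workers can stay stateless; the helper name and the keyring_export argument are hypothetical:

    import subprocess
    import tempfile

    def verify_sig(tarball, sig, keyring_export):
        # Every stateless worker needs the signing keys shipped to it
        # (keyring_export: an exported .asc key file) -- this is exactly
        # the state-synchronization burden mentioned above.
        with tempfile.TemporaryDirectory() as gnupghome:
            subprocess.run(
                ["gpg", "--homedir", gnupghome, "--import", keyring_export],
                check=True, capture_output=True)
            # gpg --verify exits non-zero on a bad or unknown signature.
            result = subprocess.run(
                ["gpg", "--homedir", gnupghome, "--verify", sig, tarball],
                capture_output=True)
            return result.returncode == 0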

Also, what's troubling is that using a hash here is considered less secure... yet in the end, verifying a .sig file itself relies on a digest internally (historically SHA-1)...
¯\_(ツ)_/¯

It's also a divergence of opinion about user interfaces... I, as a user, would rather use one tool, not two, to check for corruption.
Forcing gpg on users (which is not that hard, but not that easy either) means one more tool for them to learn...
And it's still not enough, as we need a sha1sum-style tool anyway to check the associated .sig file.

> Another way to avoid redownloading is running wget -N FILE or equivalent,
> which will return 304 Not Modified.

man wget:

-N
--timestamping
    Turn on time-stamping.

But reading the man page, that assumes we already have the file locally, which we don't, because our workers are distributed. So no.
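
What maps better to our setup is the underlying HTTP mechanism itself: send If-Modified-Since from the timestamp we stored at the previous visit instead of from a local file. A minimal sketch, assuming the server honors conditional GET requests (names are illustrative):

    import requests

    def fetch_if_changed(url, last_modified):
        # last_modified: the HTTP date string stored at the previous
        # visit, e.g. "Tue, 01 Oct 2019 07:03:16 GMT" (illustrative).
        headers = {"If-Modified-Since": last_modified} if last_modified else {}
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:
            return None  # unchanged since last visit: skip re-download
        resp.raise_for_status()
        return resp.content, resp.headers.get("Last-Modified")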

Hello,

Thanks for your insight on this ;)

On Tue, Oct 1, 2019 at 7:33 PM Ian Kelling via RT <sysadmin@gnu.org> wrote:
> On Tue Oct 01 07:03:16 2019, ardumont@softwareheritage.org wrote:
>
>> Hello,
>>
>> It's the time of the year when I ask you (again!) for your help to
>> better archive GNU source code in the Software Heritage archive.
>>
>> Would it be possible to change the format of the GNU file listing [1] to
>> also include SHA256 checksums?
>>
>> [1] https://ftp.gnu.org/tree.json.gz
>>
>> Doing so would (1) help us (Software Heritage) avoid re-downloading
>> files (hence also reducing load on your end) and (2) help your users
>> detect corruption of downloaded files.

> The timestamp in tree.json.gz is equivalent to sha256 sums for avoiding
> redownloading.

Well, yes, though only in the context of also relying on the .sig files.

> For corruption, since mid 2003, files come with a corresponding gpg .sig
> file which is enforced by automation,

We saw the .sig files, the before-2003 md5sums file, and the manual you mention.
I did not know about the automated enforcement policy though; interesting.

> and for files before that we have
> https://ftp.gnu.org/before-2003-08-01.md5sums.asc. The .sig files are
> how users should detect file corruption, which is better than sha256
> because it provides authenticity at the same time. I think we would
> rather not publish sha256 sums because users will end up using that
> instead of .sig files and we don't want them to use a less secure
> method.

Indeed.

> The GPG authors agree with using .sig files for downloading
> software https://gnupg.org/download/integrity_check.html. I encourage
> you to download the .sig files and put them in the metadata about
> the software.

That's a good suggestion.
We entertained the idea before but were not so sure
about the impacts (-> blackboard session needed :)
gpg drags along some state that must be kept in sync (keys, for example),
and we use distributed stateless workers...
The checksums sounded like a middle ground.
As I said, we need to think more about this.

In any case, thanks for your good points.

> Another way to avoid redownloading is running wget -N FILE or equivalent,
> which will return 304 Not Modified.

At that moment in time (another "visit" got triggered), we don't have the artifacts on disk.
We manipulate hashes as artifact identifiers and check whether we have already ingested them or not.
As you said, another way would be to use the associated last-modified timestamp, which we identified
but saw as not that good a heuristic (in the context of not using gpg, that is).
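
Concretely, the check on our side looks roughly like this (a simplified sketch, not our actual code; the seen_hashes store is illustrative):

    import hashlib

    def already_ingested(content: bytes, seen_hashes: set) -> bool:
        # We identify an artifact by its own content, not by where or
        # when we fetched it -- published checksums would let us skip
        # the download entirely instead of hashing after the fact.
        digest = hashlib.sha256(content).hexdigest()
        return digest in seen_hashes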

Thanks for the heads up though ;)