Event Timeline
Reply:
Ian Kelling via RT <sysadmin@gnu.org>
Tue, Oct 1, 7:33 PM (10 hours ago)
to ardumont

On Tue Oct 01 07:03:16 2019, ardumont@softwareheritage.org wrote:
> Hello,
>
> It's the time of the year where I ask you (again!) for your help to
> better archive GNU source code in the Software Heritage archive.
>
> Would it be possible to change the format of the GNU file listing [1] to
> also include SHA256 checksums?
>
> [1] https://ftp.gnu.org/tree.json.gz
>
> Doing so would (1) help us (Software Heritage) avoid re-downloading
> files (hence also reducing load on your end) and (2) help your users
> detect corruption of downloaded files.

The timestamp in tree.json.gz is equivalent to sha256 sums for avoiding
redownloading.

For corruption, since mid 2003, files come with a corresponding gpg .sig
file which is enforced by automation, and for files before that we have
https://ftp.gnu.org/before-2003-08-01.md5sums.asc. The .sig files are
how users should detect file corruption, which is better than sha256
because it provides authenticity at the same time. I think we would
rather not publish sha256 sums because users will end up using that
instead of .sig files and we don't want them to use a less secure
method. The GPG authors agree with using .sig files for downloading
software https://gnupg.org/download/integrity_check.html. I encourage
you to download the .sig files and put them in the metadata about the
software.

Another way to avoid redownloading is running wget -N FILE or equivalent
will return 304 Not Modified.

Cheers.
I'm happy someone replied, and I'll thank him in due time.
Not that I intend to send the following as a reply to the email;
I'm just dumping my thoughts here.
The timestamp in tree.json.gz is equivalent to sha256 sums for avoiding
redownloading.
Yes, that's our fallback policy.
Somehow, the checksums feel more "secure" though.
But that may be a biased opinion (since we are swimming in checksums for our DAG model).
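What we had in mind with the checksums can be sketched like this (a minimal illustration, not Software Heritage's actual loader code; `known_hashes` is a hypothetical stand-in for the archive's object index):

```python
import hashlib

def sha256_hex(content: bytes) -> str:
    """Hex SHA256 digest of raw file contents."""
    return hashlib.sha256(content).hexdigest()

def needs_download(published_sha256: str, known_hashes: set) -> bool:
    # If tree.json.gz carried SHA256 sums, a loader could skip any
    # tarball whose checksum is already in the archive's index,
    # without fetching the tarball at all.
    return published_sha256 not in known_hashes
```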
For corruption,... (paraphrased): use the .sig file and integrate it into the metadata; it's better, stronger, safer.
That means we would need to integrate gpg (and its key-store state) into all our workers...
So no; intuitively, that would complicate the code too much, with too many extra cogs to deal with.
Also, what's troubling is that using a hash here is considered less secure... yet in the end, verifying the .sig file itself relies on a hash (SHA-1)...
¯\_(ツ)_/¯
It's also a divergence of opinion about user interfaces... As a user, I would rather use one tool, not two, to check for corruption.
Forcing gpg on users (it's not that hard, but not so easy either) means one more tool for them to learn...
And it's not even enough, as we need the sha1sum tool anyway to check the associated .sig file.
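For reference, the .sig workflow the email recommends boils down to one external gpg invocation per artifact; a hypothetical sketch (assumes gpg is installed and the signer's key is already in the worker's keyring, which is exactly the state we'd rather not manage):

```python
import subprocess

def gpg_verify_command(tarball_path: str) -> list:
    # gpg checks the detached .sig against the tarball, giving both
    # integrity and authenticity in one step -- but it needs a
    # keyring, i.e. state, on every worker.
    return ["gpg", "--verify", tarball_path + ".sig", tarball_path]

# Actual use (not run here):
# subprocess.run(gpg_verify_command("hello-2.12.tar.gz"), check=True)
```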
Another way to avoid redownloading is running wget -N FILE or equivalent
will return 304 Not Modified.
man wget:
-N --timestamping Turn on time-stamping.
But reading the man page, that assumes we already have the file locally, which we don't (our workers are distributed), so no.
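Under the hood, wget -N is just a conditional HTTP request; a stdlib-only sketch of the equivalent (no Software Heritage specifics, and note it presupposes we kept a timestamp from a previous visit):

```python
import urllib.request
from email.utils import formatdate

def conditional_request(url: str, last_seen_ts: float) -> urllib.request.Request:
    # Equivalent of wget -N: the server answers 304 Not Modified when
    # the file has not changed since last_seen_ts (a Unix timestamp).
    return urllib.request.Request(
        url,
        headers={"If-Modified-Since": formatdate(last_seen_ts, usegmt=True)},
    )
```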
Hello,
Thanks for your insight on this ;)
On Tue, Oct 1, 2019 at 7:33 PM Ian Kelling via RT <sysadmin@gnu.org> wrote:
On Tue Oct 01 07:03:16 2019, ardumont@softwareheritage.org wrote:
Hello,
It's the time of the year where I ask you (again!) for your help to
better archive GNU source code in the Software Heritage archive.
Would it be possible to change the format of the GNU file listing [1] to
also include SHA256 checksums?
[1] https://ftp.gnu.org/tree.json.gz
Doing so would (1) help us (Software Heritage) avoid re-downloading
files (hence also reducing load on your end) and (2) help your users
detect corruption of downloaded files.
The timestamp in tree.json.gz is equivalent to sha256 sums for avoiding
redownloading.
Well, yes, though solely in the context of also using the .sig files.
For corruption, since mid 2003, files come with a corresponding gpg .sig
file which is enforced by automation,
We saw the .sig files, the before-2003 md5sums file, and the manual you mention.
I did not know about the enforcement automation, though; interesting.
and for files before that we have
https://ftp.gnu.org/before-2003-08-01.md5sums.asc. The .sig files are
how users should detect file corruption, which is better than sha256
because it provides authenticity at the same time. I think we would
rather not publish sha256 sums because users will end up using that
instead of .sig files and we don't want them to use a less secure
method.
Indeed.
The GPG authors agree with using .sig files for downloading
software https://gnupg.org/download/integrity_check.html. I encourage
you to download the .sig files and put them in the metadata about
the software.
That's a good suggestion.
We entertained the idea before but were not so sure
about the impacts (-> blackboard session needed :)
GPG drags along some state to keep in sync (keys, for example).
Plus we use distributed stateless workers...
The checksums sounded like a middle-ground.
Like i said, we need to think more about this.
In any case, thanks for your good points.
Another way to avoid redownloading is running wget -N FILE or equivalent
will return 304 Not Modified.
At that moment in time (another "visit" got triggered), we don't have the artifacts on disk.
We manipulate hashes as artifact identifiers, and check whether we have already ingested an artifact or not.
As you said, another way would be to use the associated last-modified timestamp, which we identified
but saw as not that good a heuristic (in the context of not using gpg, that is).
Thanks for the heads up though ;)
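(Side note, not part of the reply.) The timestamp heuristic could look like this; it assumes, hypothetically, that each tree.json entry carries a "time" field with the last-modification Unix timestamp (the real field name would need checking):

```python
def needs_revisit(entry: dict, last_ingested_ts: int) -> bool:
    # Weaker than a checksum: a touched-but-identical file triggers a
    # useless re-download, and a modified file republished with an old
    # timestamp would be missed entirely.
    # "time" is an assumed field name, not verified against tree.json.gz.
    return int(entry.get("time", 0)) > last_ingested_ts
```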