
(periodically) ingest GNU package releases
Open, NormalPublic

Description

We have done only a one-off ingestion of GNU package releases (from https://ftp.gnu.org/) back in 2015.
We should periodically ingest new GNU package releases, automating the listing process.

ftp.gnu is now available only via HTTP (and no longer via FTP), but an up-to-date directory listing is available at https://ftp.gnu.org/tree.json.gz (thanks Ludovic Courtès for the heads up on this).

Event Timeline

zack created this task.Nov 16 2018, 12:08 PM
zack triaged this task as Normal priority.
zack added a project: Archive coverage.
zack renamed this task from periodically ingest GNU package releases to (periodically) ingest GNU package releases.

This should probably be split into 2 tasks:

  • implement a lister to create GNU origins in the scheduler (we most probably have all the necessary code to do that in the swh-lister repository).
  • adapt the loader-tar to be able to retrieve remote tarballs (it currently works on local tarballs); it would also be the occasion to refactor that loader ;)
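
To make the second item concrete, here is a minimal sketch of fetching a remote tarball before handing it to code that expects a local file. It does not reflect the actual loader-tar code: the function names fetch_tarball and list_tarball_members are hypothetical and only the Python standard library is used.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def fetch_tarball(url: str, dest_dir: str) -> Path:
    """Download the tarball at `url` into `dest_dir` and return its local path.

    Hypothetical helper: the local file name is simply the last URL segment.
    """
    filename = url.rstrip("/").rsplit("/", 1)[-1]
    local_path = Path(dest_dir) / filename
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        # Stream to disk instead of holding the whole archive in memory.
        shutil.copyfileobj(response, out)
    return local_path


def list_tarball_members(path: Path) -> list:
    """Return the member names of a local tarball (compression is auto-detected)."""
    with tarfile.open(path) as archive:
        return archive.getnames()
```

Once the tarball is on disk, the existing local-tarball code path could in principle be reused unchanged, which is why the download step is kept separate here.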
ardumont added a comment.EditedMar 12 2019, 6:53 PM

@pombreda on #swh-devel suggested using rsync -r, which seems to
provide what we want!

18:36 <pombreda> ardumont, stupid suggestion wrt gnu code: have considered using rsync -r rsync://ftp.gnu.org/gnu/ ?
18:37 <pombreda> ardumont: to get a directory listing

Sample:

$ rsync -r rsync://ftp.gnu.org/gnu/ > full-listing-gnu.txt
$ tail full-listing-gnu.txt
-rw-r--r--      1,259,220 2012/02/09 00:49:44 zile/zile-2.4.5.tar.gz
-rw-r--r--            190 2012/02/09 00:49:44 zile/zile-2.4.5.tar.gz.sig
-rw-r--r--      1,257,698 2012/02/18 16:34:58 zile/zile-2.4.6.tar.gz
-rw-r--r--            190 2012/02/18 16:34:59 zile/zile-2.4.6.tar.gz.sig
-rw-r--r--      1,254,385 2012/03/20 21:19:44 zile/zile-2.4.7.tar.gz
-rw-r--r--            190 2012/03/20 21:19:45 zile/zile-2.4.7.tar.gz.sig
-rw-r--r--      1,184,855 2012/07/13 13:15:48 zile/zile-2.4.8.tar.gz
-rw-r--r--            190 2012/07/13 13:15:49 zile/zile-2.4.8.tar.gz.sig
-rw-r--r--      1,192,776 2012/10/01 23:08:02 zile/zile-2.4.9.tar.gz
-rw-r--r--            190 2012/10/01 23:08:03 zile/zile-2.4.9.tar.gz.sig
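
A listing in this format can be turned into structured entries with a few lines of Python. This is only a sketch based on the sample lines above: the RsyncEntry type and parse_rsync_line function are hypothetical names, symlink entries are not handled specially, and the timestamps are kept as naive datetimes because the listing carries no timezone.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class RsyncEntry(NamedTuple):
    is_dir: bool
    size: int
    mtime: datetime
    path: str


def parse_rsync_line(line: str) -> Optional[RsyncEntry]:
    """Parse one line of `rsync -r` listing output, or return None if it does not match."""
    # Fields: permissions, size (with thousands separators), date, time, path.
    parts = line.rstrip("\n").split(None, 4)
    if len(parts) != 5:
        return None
    perms, size, date, time, path = parts
    try:
        mtime = datetime.strptime(f"{date} {time}", "%Y/%m/%d %H:%M:%S")
        size_bytes = int(size.replace(",", ""))
    except ValueError:
        return None
    return RsyncEntry(perms.startswith("d"), size_bytes, mtime, path)
```

For example, feeding it each line of full-listing-gnu.txt and dropping the None results would give the lister a list of (size, mtime, path) records to work from.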

Related T735#13352

ardumont added a comment.EditedApr 3 2019, 11:26 PM

Heads up, there is now a compressed JSON file describing the GNU mirror's directory tree.
It is updated daily at.
It's served at [1]

Excerpt:

  {"type":"directory","name":"/home/ftp","contents":[
    ...
    {"type":"file","name":"find.txt.gz","size":260849,"time":"1554301217"},
    {"type":"directory","name":"gnu","size":12288,"time":"1548425579","contents":[
        {"type":"file","name":"3DLDF-1.1.3-1.1.4.diff.gz","size":1745877,"time":"1071136751"},
        {"type":"file","name":"3DLDF-1.1.3-1.1.4.diff.gz.sig","size":65,"time":"1071136759"},
        {"type":"file","name":"3DLDF-1.1.3.tar.gz","size":3170889,"time":"1071002600"},
        {"type":"file","name":"3DLDF-1.1.3.tar.gz.sig","size":65,"time":"1071002621"},
        {"type":"file","name":"3DLDF-1.1.4-1.1.5.1.diff.gz","size":7735709,"time":"1074284224"},
        {"type":"file","name":"3DLDF-1.1.4-1.1.5.1.diff.gz.sig","size":65,"time":"1074284255"},
        {"type":"file","name":"3DLDF-1.1.4-1.1.5.diff.gz","size":8502211,"time":"1074279888"},
        {"type":"file","name":"3DLDF-1.1.4-1.1.5.diff.gz.sig","size":65,"time":"1074279893"},
        {"type":"file","name":"3DLDF-1.1.4.tar.gz","size":3325761,"time":"1071078759"},
   ...1
}

The discussion took place by cross-posting on mailing-list [2]

Thanks to the cool dude at gnu.org for the heads up ;)

[1] https://ftp.gnu.org/tree.json.gz

[2] https://sympa.inria.fr/sympa/arc/swh-devel/2019-03/msg00003.html
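
For illustration, the structure in the excerpt can be flattened into file paths with a small recursive walk. This is a sketch under assumptions: it takes the top-level value to be the directory object shown in the excerpt (if the real file wraps it in a list, the caller would need to pick the right element), and the names load_tree and iter_files are hypothetical.

```python
import gzip
import json
from typing import Iterator, Tuple


def load_tree(path: str) -> dict:
    """Load a locally downloaded copy of tree.json.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def iter_files(node: dict, prefix: str = "") -> Iterator[Tuple[str, int, int]]:
    """Yield (path, size, mtime) for every file below `node`.

    Paths are relative to `node`; its own name (e.g. "/home/ftp") is not included.
    """
    for child in node.get("contents", []):
        path = f"{prefix}/{child['name']}" if prefix else child["name"]
        if child.get("type") == "directory":
            yield from iter_files(child, path)
        elif child.get("type") == "file":
            # "time" is a string of epoch seconds in the excerpt, hence int().
            yield path, int(child["size"]), int(child["time"])
```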

iank added a subscriber: iank.Apr 4 2019, 6:37 PM

It is updated daily at.

This is not the case. It is updated every time the ftp directory changes,
so you can use the timestamp of the file to see if there have been
any changes.

This is not the case. It is updated every time the ftp directory changes,
so you can use the timestamp of the file to see if there have been
any changes.

Thanks for the heads up, I had not grasped that you replaced the daily update with an update on every change!

Cheers,
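
One way a lister could use that timestamp is sketched below. It assumes, without confirmation from this thread, that the server exposes the file's timestamp through the HTTP Last-Modified header; the function names remote_last_modified and listing_changed are hypothetical.

```python
import urllib.request
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def remote_last_modified(url: str) -> Optional[datetime]:
    """Return the Last-Modified time reported for `url`, or None if absent.

    Assumption: a HEAD request is allowed and the header reflects the file timestamp.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        header = response.headers.get("Last-Modified")
    return parsedate_to_datetime(header) if header else None


def listing_changed(current: Optional[datetime], last_seen: Optional[datetime]) -> bool:
    """Decide whether the listing should be fetched and processed again."""
    if current is None or last_seen is None:
        # Without a timestamp on either side, err on the side of re-listing.
        return True
    return current > last_seen
```

The lister would store the last value it saw and skip a run when listing_changed returns False.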

ardumont updated the task description.Thu, May 16, 1:44 PM

As suggested by @olasd, here is what was done in 2015 to ingest packages:

  1. Create origins for all the folders indiscriminately
  2. Only import things that look like tarballs (i.e. that end with .tar.something)

So I guess the best approach to make a lister and loader to ingest GNU packages would be to follow the same approach.
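
The two steps can be sketched as follows, taking paths relative to the gnu/ directory (as in the rsync listing earlier in this task). This is not the 2015 code: the origin URL layout, the BASE_URL constant and the group_tarballs function are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# "Looks like a tarball": ends with .tar.<something>, so .tar.gz.sig does not match.
TARBALL_RE = re.compile(r"\.tar\.[A-Za-z0-9]+$")
# Assumed origin URL prefix, not confirmed by this task.
BASE_URL = "https://ftp.gnu.org/gnu/"


def group_tarballs(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map one origin URL per top-level folder to the tarball URLs found under it."""
    origins: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        folder, sep, _ = path.partition("/")
        if not sep:
            # File sitting directly at the top level: no folder, so no origin.
            continue
        # Step 1: every folder becomes an origin, even if it holds no tarball.
        tarballs = origins[BASE_URL + folder + "/"]
        # Step 2: only keep things that look like tarballs.
        if TARBALL_RE.search(path):
            tarballs.append(BASE_URL + path)
    return dict(origins)
```

Note that files sitting directly under gnu/ (such as the 3DLDF tarballs in the JSON excerpt) fall outside this folder-based scheme and would need a separate decision.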