(periodically) ingest GNU package releases
Closed, Migrated, Edits Locked

Description

We have done only a one-off ingestion of GNU package releases (from https://ftp.gnu.org/) back in 2015.
We should periodically ingest new GNU package releases, automating the listing process.

ftp.gnu.org is now available only via HTTP (and no longer via FTP), but an up-to-date directory listing is available at https://ftp.gnu.org/tree.json.gz (thanks to Ludovic Courtès for the heads up on this).

Event Timeline

zack renamed this task from periodically ingest GNU package releases to (periodically) ingest GNU package releases. Nov 16 2018, 12:08 PM
zack triaged this task as Normal priority.
zack created this task.
zack added a project: Archive coverage.

This should probably be split into 2 tasks:

  • implement a lister to create GNU origins in the scheduler (we most probably have all the necessary code to do that in the swh-lister repository).
  • adapt the loader-tar so it can retrieve remote tarballs (it currently works only on local tarballs) (~> it would be an occasion to refactor that loader as well ;)

@pombreda on #swh-devel suggested using rsync -r, which seems to
provide what we want!

18:36 <pombreda> ardumont, stupid suggestion wrt gnu code: have considered using rsync -r rsync://ftp.gnu.org/gnu/ ?
18:37 <pombreda> ardumont: to get a directory listing

Sample:

$ rsync -r rsync://ftp.gnu.org/gnu/ > full-listing-gnu.txt
$ tail full-listing-gnu.txt
-rw-r--r--      1,259,220 2012/02/09 00:49:44 zile/zile-2.4.5.tar.gz
-rw-r--r--            190 2012/02/09 00:49:44 zile/zile-2.4.5.tar.gz.sig
-rw-r--r--      1,257,698 2012/02/18 16:34:58 zile/zile-2.4.6.tar.gz
-rw-r--r--            190 2012/02/18 16:34:59 zile/zile-2.4.6.tar.gz.sig
-rw-r--r--      1,254,385 2012/03/20 21:19:44 zile/zile-2.4.7.tar.gz
-rw-r--r--            190 2012/03/20 21:19:45 zile/zile-2.4.7.tar.gz.sig
-rw-r--r--      1,184,855 2012/07/13 13:15:48 zile/zile-2.4.8.tar.gz
-rw-r--r--            190 2012/07/13 13:15:49 zile/zile-2.4.8.tar.gz.sig
-rw-r--r--      1,192,776 2012/10/01 23:08:02 zile/zile-2.4.9.tar.gz
-rw-r--r--            190 2012/10/01 23:08:03 zile/zile-2.4.9.tar.gz.sig
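
A listing in this format can be turned into structured records fairly easily. The sketch below is only an illustration, assuming the column layout shown in the sample above (permissions, comma-grouped size, date, time, path). `Entry` and `parse_line` are hypothetical names, not part of swh-lister.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One file or directory line of the rsync listing (hypothetical type)."""
    mode: str
    size: int
    mtime: datetime
    path: str


# Assumed layout: 10-character mode, size with thousands separators,
# "YYYY/MM/DD HH:MM:SS" timestamp, then the path relative to gnu/.
LINE_RE = re.compile(
    r"^(?P<mode>\S{10})\s+(?P<size>[\d,]+)\s+"
    r"(?P<date>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+(?P<path>.+)$"
)


def parse_line(line: str) -> Optional[Entry]:
    """Parse one listing line; return None for lines that do not match
    (for instance a server banner)."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    return Entry(
        mode=m["mode"],
        size=int(m["size"].replace(",", "")),
        mtime=datetime.strptime(m["date"], "%Y/%m/%d %H:%M:%S"),
        path=m["path"],
    )
```

Feeding each line of full-listing-gnu.txt through `parse_line` and dropping the `None` results would give the list of files along with their sizes and modification times.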

Related T735#13352

Heads up, there is now a (compressed) JSON file describing the GNU mirror's directory tree.
It is updated daily at.
It's served at [1]

Excerpt:

  {"type":"directory","name":"/home/ftp","contents":[
    ...
    {"type":"file","name":"find.txt.gz","size":260849,"time":"1554301217"},
    {"type":"directory","name":"gnu","size":12288,"time":"1548425579","contents":[
        {"type":"file","name":"3DLDF-1.1.3-1.1.4.diff.gz","size":1745877,"time":"1071136751"},
        {"type":"file","name":"3DLDF-1.1.3-1.1.4.diff.gz.sig","size":65,"time":"1071136759"},
        {"type":"file","name":"3DLDF-1.1.3.tar.gz","size":3170889,"time":"1071002600"},
        {"type":"file","name":"3DLDF-1.1.3.tar.gz.sig","size":65,"time":"1071002621"},
        {"type":"file","name":"3DLDF-1.1.4-1.1.5.1.diff.gz","size":7735709,"time":"1074284224"},
        {"type":"file","name":"3DLDF-1.1.4-1.1.5.1.diff.gz.sig","size":65,"time":"1074284255"},
        {"type":"file","name":"3DLDF-1.1.4-1.1.5.diff.gz","size":8502211,"time":"1074279888"},
        {"type":"file","name":"3DLDF-1.1.4-1.1.5.diff.gz.sig","size":65,"time":"1074279893"},
        {"type":"file","name":"3DLDF-1.1.4.tar.gz","size":3325761,"time":"1071078759"},
    ...
}

The discussion took place through cross-posting on the mailing list [2].

Thanks to the cool dude at gnu.org for the heads up ;)

[1] https://ftp.gnu.org/tree.json.gz

[2] https://sympa.inria.fr/sympa/arc/swh-devel/2019-03/msg00003.html
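
To make the structure of that file concrete, here is a minimal sketch that flattens it into (path, size, timestamp) tuples. It assumes the layout shown in the excerpt (nodes with "type", "name", "size", "time" as a string of epoch seconds, and "contents" for directories); the full file may differ, for example by wrapping the root in a list, which the sketch tolerates. `load_tree` and `iter_files` are hypothetical helper names.

```python
import gzip
import json
from typing import Any, Iterator, Tuple


def load_tree(raw: bytes) -> Any:
    """Decompress and decode the bytes of tree.json.gz."""
    return json.loads(gzip.decompress(raw))


def iter_files(node: Any, parent: str = "") -> Iterator[Tuple[str, int, int]]:
    """Yield (path, size, mtime) for every file node below `node`.

    Assumption: the node layout matches the excerpt; nodes of any other
    type are silently skipped.
    """
    if isinstance(node, list):
        for item in node:
            yield from iter_files(item, parent)
        return
    name = node.get("name", "")
    path = f"{parent.rstrip('/')}/{name}" if parent else name
    kind = node.get("type")
    if kind == "directory":
        for child in node.get("contents", []):
            yield from iter_files(child, path)
    elif kind == "file":
        # "time" is serialized as a string of epoch seconds in the excerpt.
        yield path, node["size"], int(node["time"])
```

The raw bytes could come from any HTTP client pointed at [1]; the functions themselves only work on data already in memory.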

It is updated daily at.

This is not the case. It is updated every time the ftp directory changes,
so you can use the timestamp of the file to see if there have been
any changes.

Thanks for the heads up, I had not grasped that you replaced the daily update with an update on every change!

Cheers,
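
The timestamp check described above could be sketched as follows. This is a rough illustration, assuming the server exposes the file's modification time through the standard HTTP `Last-Modified` header (not verified here); `fetch_last_modified` and `has_changed` are hypothetical names.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen

TREE_URL = "https://ftp.gnu.org/tree.json.gz"


def fetch_last_modified(url: str = TREE_URL) -> Optional[str]:
    """Issue a HEAD request and return the raw Last-Modified header, if any."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=30) as resp:
        return resp.headers.get("Last-Modified")


def has_changed(last_modified: Optional[str],
                last_seen: Optional[datetime]) -> bool:
    """Return True when the listing should be fetched again.

    Without a header or a previous visit we cannot tell, so we
    conservatively report a change.
    """
    if last_modified is None or last_seen is None:
        return True
    return parsedate_to_datetime(last_modified) > last_seen
```

A periodic lister run could store the timestamp of its last successful visit (as a timezone-aware datetime) and skip the download whenever `has_changed` returns False.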

As suggested by @olasd, here is what was done in 2015 to ingest packages:

  1. Create origins for all the folders indiscriminately
  2. Only import things that look like tarballs (i.e. that end with .tar.something)

So I guess the best approach for a lister and loader that ingest GNU packages would be to follow the same steps.
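
The two steps above can be sketched as a small grouping function. This is only an illustration under stated assumptions: paths are relative to the gnu/ directory, one origin URL is derived per top-level folder (the URL scheme here is a guess, not necessarily what the 2015 import used), and `TARBALL_RE`, `looks_like_tarball` and `group_by_origin` are hypothetical names.

```python
import re
from typing import Dict, Iterable, List

# "Ends with .tar.something": a single alphanumeric suffix after ".tar.".
TARBALL_RE = re.compile(r"\.tar\.[A-Za-z0-9]+$")


def looks_like_tarball(name: str) -> bool:
    return TARBALL_RE.search(name) is not None


def group_by_origin(
    paths: Iterable[str],
    base_url: str = "https://ftp.gnu.org/gnu/",  # assumed origin prefix
) -> Dict[str, List[str]]:
    """Map each top-level folder to the tarball URLs found beneath it."""
    origins: Dict[str, List[str]] = {}
    for path in paths:
        folder, sep, rest = path.partition("/")
        if not sep:
            # File sitting directly under gnu/: it has no folder,
            # so this sketch ignores it.
            continue
        # Step 1: every folder becomes an origin, even without tarballs.
        tarballs = origins.setdefault(f"{base_url}{folder}/", [])
        # Step 2: only keep what looks like a tarball.
        if looks_like_tarball(rest):
            tarballs.append(base_url + path)
    return origins
```

Note that signature files such as zile-2.4.5.tar.gz.sig are excluded by the pattern, since their final suffix follows ".tar.gz" rather than ".tar.".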

ardumont claimed this task.
ardumont closed subtask T1723: GNU Loader as Resolved.

Given this is done, where can one see the timeline of visits for a given origin coming from GNU?

I took a random example and it still contains only the 2015 visit. Maybe I'm looking in the wrong place though. (And maybe we should clean up those visits then.)

gitlab-migration changed the status of subtask T1722: GNU Lister from Resolved to Migrated. Jan 8 2023, 9:59 PM
gitlab-migration changed the status of subtask T1723: GNU Loader from Resolved to Migrated.