rubygems: Use gems database dump to improve listing output
Closed, Public

Authored by anlambert on Oct 7 2022, 11:54 AM.

Details

Summary

Instead of using an undocumented rubygems HTTP endpoint that only
gives us the names of the gems, use the daily PostgreSQL
dump of the rubygems.org database.

This makes it possible to list not only all gems but also all versions of
each gem and their release artifacts. For each release artifact, the
following information is extracted: version, download URL, sha256 checksum
and release date, plus a couple of extra metadata fields.

The lister now sets the list of artifacts and the list of metadata as extra
loader arguments when sending a listed origin to the scheduler database.
A last_update date is also computed, which should ensure that loading tasks
for rubygems are scheduled only when new releases have become available
since the last load.
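
As a rough illustration of the paragraph above, here is a minimal sketch of how the extra loader arguments and last_update could be derived from the releases of one gem. The `Release` type, the `build_loader_arguments` helper and the dictionary keys are hypothetical names chosen for this example; they are not taken from the diff.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, NamedTuple, Optional, Tuple


class Release(NamedTuple):
    """One release artifact of a gem (hypothetical shape for this sketch)."""

    version: str
    url: str
    sha256: str
    date: datetime


def build_loader_arguments(
    releases: List[Release],
) -> Tuple[Dict[str, Any], Optional[datetime]]:
    """Return (extra loader arguments, last_update) for one gem."""
    # One artifact entry per release: what to download and how to check it.
    artifacts = [
        {"version": r.version, "url": r.url, "checksums": {"sha256": r.sha256}}
        for r in releases
    ]
    # One metadata entry per release, matched to its artifact by version.
    metadata = [{"version": r.version, "date": r.date.isoformat()} for r in releases]
    # last_update is the most recent release date, or None without releases.
    last_update = max((r.date for r in releases), default=None)
    return {"artifacts": artifacts, "metadata": metadata}, last_update


# Example input with made-up values.
example_releases = [
    Release("0.1.0", "https://example.org/foo-0.1.0.gem", "aa",
            datetime(2022, 1, 1, tzinfo=timezone.utc)),
    Release("0.2.0", "https://example.org/foo-0.2.0.gem", "bb",
            datetime(2022, 6, 1, tzinfo=timezone.utc)),
]
loader_arguments, last_update = build_loader_arguments(example_releases)
```

Because last_update only moves forward when a newer release appears, a scheduler comparing it with the date of the last visit can skip gems that have not changed.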

Note that the lister spawns a temporary postgres instance, so the initdb
executable from a postgres server installation must be available in the
execution environment.
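
A deployment could check for this requirement up front. The sketch below is only an assumption about how such a check might look: `find_initdb` is a hypothetical helper, not part of the lister, and it relies solely on `shutil.which` from the standard library.

```python
import shutil
from typing import Optional


def find_initdb(search_path: Optional[str] = None) -> str:
    """Return the path of the initdb executable or raise RuntimeError.

    search_path is an os.pathsep-separated list of directories; when it is
    None, shutil.which falls back to the PATH environment variable.
    """
    initdb = shutil.which("initdb", path=search_path)
    if initdb is None:
        raise RuntimeError(
            "initdb not found: a postgres server installation is required "
            "to spawn the temporary postgres instance"
        )
    return initdb
```

On some distributions the postgres server binaries are installed outside of PATH, so passing an explicit search_path may be needed.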

Related to T1777

This implements the proposal of @nahimilega in T1777#33490.

This is what I obtained when testing the lister in docker: 186,993 origins listed and processed in about 26 minutes.

docker-swh-lister-1  | [2022-10-06 20:40:11,169: INFO/ForkPoolWorker-1] Task swh.lister.rubygems.tasks.RubyGemsListerTask[21911097-b9e7-48f8-a47c-ada105f4725a] succeeded in 1549.4076449069835s: {'pages': 186993, 'origins': 186993}

Diff Detail

Repository
rDLS Listers
Branch
rubygems-lister-improvements
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 32156
Build 50354: Phabricator diff pipeline on jenkins
Build 50353: arc lint + arc unit

Event Timeline

anlambert added a subscriber: nahimilega.

Build is green

Patch application report for D8639 (id=31201)

Rebasing onto 5a53243bd3...

Current branch diff-target is up to date.
Changes applied before test
commit a3f0a05008db1f1b176f4b2d01fb483303884099
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 6 17:51:27 2022 +0200

    rubygems: Use gems database dump to improve listing output
    

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/771/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/lister/rubygems/lister.py
141–171

unpacking makes it more readable IMO

(also, this avoids interpolating values into the SQL query, even if it shouldn't be an issue)
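
To illustrate the two points of this comment (unpacking rows and passing values as query parameters rather than formatting them into the SQL string), here is a self-contained sketch. It uses the standard library sqlite3 module as a stand-in for PostgreSQL, and the table layout and the `list_versions` helper are assumptions made for the example, not the code under review. Note that sqlite3 uses `?` placeholders, whereas PostgreSQL drivers such as psycopg2 use `%s`.

```python
import sqlite3
from typing import Dict, List

# In-memory database with a hypothetical, simplified versions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE versions (rubygem_id INTEGER, number TEXT, sha256 TEXT)")
conn.executemany(
    "INSERT INTO versions VALUES (?, ?, ?)",
    [(1, "0.1.0", "aa"), (1, "0.2.0", "bb"), (2, "1.0.0", "cc")],
)


def list_versions(connection: sqlite3.Connection, gem_id: int) -> List[Dict[str, str]]:
    """List the versions of one gem, passing gem_id as a bound parameter."""
    cursor = connection.execute(
        "SELECT number, sha256 FROM versions WHERE rubygem_id = ? ORDER BY number",
        (gem_id,),  # bound by the driver, never formatted into the query text
    )
    # Unpack each row into named variables instead of indexing row[0], row[1].
    return [{"version": number, "sha256": sha256} for number, sha256 in cursor]
```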

swh/lister/rubygems/tests/data/small_rubygems_dump.sh
10

this script should be reproducible, use a commit instead of master

This revision is now accepted and ready to land. Oct 7 2022, 2:06 PM
swh/lister/rubygems/lister.py
141–171

Better indeed, thanks!

swh/lister/rubygems/tests/data/small_rubygems_dump.sh
10

Ah right, I should use a permalink here.

Update:

  • rebase
  • address @vlorentz comments
  • add an extrinsic metadata URL for each gem version to the metadata sent along with the artifacts to the rubygems loader

Build is green

Patch application report for D8639 (id=31213)

Rebasing onto c22f41a6d7...

Current branch diff-target is up to date.
Changes applied before test
commit 108816f232d2397590240ffc369d5f4c4da32aca
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Oct 6 17:51:27 2022 +0200

    rubygems: Use gems database dump to improve listing output
    

See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/773/ for more details.