Page MenuHomeSoftware Heritage

Add rubygems loader
ClosedPublic

Authored by Alphare on Sep 28 2022, 6:22 PM.

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
rubygems-D8569
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 31955
Build 50016: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 50015: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D8569 (id=30919)

Rebasing onto 1facea3cd2...

Current branch diff-target is up to date.
Changes applied before test
commit fc60a338c728cae7e62b52bccd2adba972b67185
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Sep 28 18:21:57 2022 +0200

    Add rubygems loader

Link to build: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/906/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/906/console

Harbormaster returned this revision to the author for changes because remote builds failed.Sep 28 2022, 6:25 PM
Harbormaster failed remote builds in B31892: Diff 30919!
This comment was removed by Alphare.

Build is green

Patch application report for D8569 (id=30920)

Rebasing onto 1facea3cd2...

Current branch diff-target is up to date.
Changes applied before test
commit 0e5bb384551bac47c099a120457144dab2cdee83
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Sep 28 18:21:57 2022 +0200

    Add rubygems loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/907/ for more details.

This is the initial plan that I was going for with regards to what to load inside the .gem.

When discussing this over IRC, it became apparent that a more high-level policy decision is needed to know exactly what to do with RubyGems (and .gem files in general). Questions arose like:

Should we load from data.tar.gz contained inside even though it's technically not source code?
What to do about the binary artifacts
What is the RubyGems policy about code-redistribution license issues in binary-only gems?

I still think that the vast majority of the code will stay the same regardless of the policy (unless we decide not to load anything, of course), so this patch and all the previous exploratory work are not useless.

anlambert added inline comments.
swh/loader/package/rubygems/loader.py
112

I think we should use the value from the built_at field instead for the release date.

For instance if you look at all the rails versions in its HTML page,
you can see the built_at date is used for each release date (see data from REST API).

Use the built_at date instead of publication date

Build is green

Patch application report for D8569 (id=30985)

Rebasing onto 6299c091ec...

First, rewinding head to replay your work on top of it...
Applying: Add rubygems loader
Using index info to reconstruct a base tree...
M	docs/package-loader-specifications.rst
M	setup.py
Falling back to patching base and 3-way merge...
Auto-merging setup.py
Auto-merging docs/package-loader-specifications.rst
Changes applied before test
commit f108a1cdc185150ba3bb033985e1ae30378240c8
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Wed Sep 28 18:21:57 2022 +0200

    Add rubygems loader

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/921/ for more details.

@Alphare, fyi I improved the rubygems lister in that commit in order to gather all artifacts related to a gem and send these info to the loader as extra arguments.
Below is an example of the lister output for a gem:

ListedOrigin(
    url="https://rubygems.org/gems/haar_joke",
    visit_type="rubygems",
    last_update=iso8601.parse_date("2016-11-05T00:00:00+00:00"),
    extra_loader_arguments={
        "artifacts": [
            {
                "url": "https://rubygems.org/downloads/haar_joke-0.0.2.gem",
                "length": 8704,
                "version": "0.0.2",
                "filename": "haar_joke-0.0.2.gem",
                "checksums": {
                    "sha256": "85a8cf5f41890e9605265eeebfe9e99aa0350a01a3c799f9f55a0615a31a2f5f"
                },
            },
            {
                "url": "https://rubygems.org/downloads/haar_joke-0.0.1.gem",
                "length": 8704,
                "version": "0.0.1",
                "filename": "haar_joke-0.0.1.gem",
                "checksums": {
                    "sha256": "a2ee7052fb8ffcfc4ec0fdb77fae9a36e473f859af196a36870a0f386b5ab55e"
                },
            },
        ],
        "rubygem_metadata": [
            {
                "date": "2016-11-05T00:00:00+00:00",
                "authors": "Gemma Gotch",
                "version": "0.0.2",
                "extrinsic_metadata_url": "https://rubygems.org/api/v2/rubygems/haar_joke/versions/0.0.2.json",
            },
            {
                "date": "2016-07-23T00:00:00+00:00",
                "authors": "Gemma Gotch",
                "version": "0.0.1",
                "extrinsic_metadata_url": "https://rubygems.org/api/v2/rubygems/haar_joke/versions/0.0.1.json",
            },
        ],
    },
}

It enables to improve the scheduling of loading tasks for Ruby gems (by providing last_update value to ListedOrigin)
and it will save you of couple of calls to RubyGems Web API in the loader to fetch the list of versions for a gem.
So loader implementation must be adapted to use these new arguments.

This revision now requires changes to proceed.Oct 20 2022, 3:25 PM

Rubygems: Improve loader to make use of artifacts and rubygems_metadata provided by the lister extra_loader_arguments

Use artifacts and rubygems_metadata to get list of versions, artifacts checksums and extrinsic metadata url
Add an EXTID manifest
Add metadata from extrinsic metadata

Build is green

Patch application report for D8569 (id=31678)

Could not rebase; Attempt merge onto e6847f3616...

Updating e6847f3..081df1d
Fast-forward
 docs/package-loader-specifications.rst             |   9 ++
 setup.py                                           |   1 +
 swh/loader/package/rubygems/__init__.py            |  17 +++
 swh/loader/package/rubygems/loader.py              | 170 +++++++++++++++++++++
 swh/loader/package/rubygems/tasks.py               |  14 ++
 .../package/rubygems/tests/data/expected.json      |  76 +++++++++
 .../api_v2_rubygems_haar_joke_versions_0.0.1.json  |   1 +
 .../api_v2_rubygems_haar_joke_versions_0.0.2.json  |   1 +
 .../downloads_haar_joke-0.0.1.gem                  | Bin 0 -> 8704 bytes
 .../downloads_haar_joke-0.0.2.gem                  | Bin 0 -> 8704 bytes
 swh/loader/package/rubygems/tests/test_rubygems.py |  62 ++++++++
 swh/loader/package/rubygems/tests/test_tasks.py    |  61 ++++++++
 12 files changed, 412 insertions(+)
 create mode 100644 swh/loader/package/rubygems/__init__.py
 create mode 100644 swh/loader/package/rubygems/loader.py
 create mode 100644 swh/loader/package/rubygems/tasks.py
 create mode 100644 swh/loader/package/rubygems/tests/data/expected.json
 create mode 100644 swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v2_rubygems_haar_joke_versions_0.0.1.json
 create mode 100644 swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v2_rubygems_haar_joke_versions_0.0.2.json
 create mode 100644 swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_haar_joke-0.0.1.gem
 create mode 100644 swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_haar_joke-0.0.2.gem
 create mode 100644 swh/loader/package/rubygems/tests/test_rubygems.py
 create mode 100644 swh/loader/package/rubygems/tests/test_tasks.py
Changes applied before test
commit 081df1dfec49e66007b230364d0ae31ea92a8094
Author: Franck Bret <franck.bret@octobus.net>
Date:   Thu Oct 27 18:50:19 2022 +0200

    Rubygems: Improve lister to make use of artifacts and rubygems_metadata
    provided by the lister extra_loader_arguments
    
    Use artifacts and rubygems_metadata to get list of versions, artifacts checksums and extrinsic metadata url
    Add an EXTID manifest
    Set metadata from extrinsic metadata

commit 0022bb50abd156cca30472249a440234a626169b
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Oct 27 10:37:54 2022 +0200

    Add rubygems loader
    
    Reviewers: #reviewers, anlambert
    
    Subscribers: anlambert
    
    Maniphest Tasks: T4581
    
    Differential Revision: https://forge.softwareheritage.org/D8569

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1010/ for more details.

Rubygems: Improve loader to make use of artifacts and rubygems_metadata provided by the lister extra_loader_arguments

Use artifacts and rubygems_metadata to get list of versions, artifacts checksums and extrinsic metadata url
Add an EXTID manifest
Add metadata from extrinsic metadata

@anlambert Please note I used 'rubygems_metadata' instead of 'rubygem_metadata' as in the lister. Maybe I'm wrong but I think the lister should rename to rubygems_metadata?

Rubygems: Improve loader to make use of artifacts and rubygems_metadata provided by the lister extra_loader_arguments

Use artifacts and rubygems_metadata to get list of versions, artifacts checksums and extrinsic metadata url
Add an EXTID manifest
Add metadata from extrinsic metadata

@anlambert Please note I used 'rubygems_metadata' instead of 'rubygem_metadata' as in the lister. Maybe I'm wrong but I think the lister should rename to rubygems_metadata?

@franckbret, I did not use plural because we are processing a single gem in the loader (with multiple versions but those are metadata for a single gem).
So I do not think we should modify the lister output.

Could you also test that raw extrinsic metadata have been correctly stored in the storage and can be retrieved ?
You can inspire from the cpan loader test to implement it.

I also added some inline comments to handle.

swh/loader/package/rubygems/loader.py
133

there is no need to parse the received JSON here

146

Use rubygem-release-json instead as format name (cpan loader use cpan-release-json).

147

metadata=extrinsic_metadata_raw

swh/loader/package/rubygems/tests/test_rubygems.py
15–18

It would be easier to read if you inline the loader input data in that file imho, plus you only use the first element of the expected list and the second one is a copy of the first one.

This revision now requires changes to proceed.Oct 28 2022, 11:54 AM

Rubygems: Improve loader to make use of artifacts and rubygems_metadata provided by the lister extra_loader_arguments

Use artifacts and rubygems_metadata to get list of versions, artifacts checksums and extrinsic metadata url
Add an EXTID manifest
Add metadata from extrinsic metadata

@anlambert Please note I used 'rubygems_metadata' instead of 'rubygem_metadata' as in the lister. Maybe I'm wrong but I think the lister should rename to rubygems_metadata?

@franckbret, I did not use plural because we are processing a single gem in the loader (with multiple versions but those are metadata for a single gem).
So I do not think we should modify the lister output.

Okay, will change that

Do not json.loads already deserialized json data

Build is green

Patch application report for D8569 (id=31716)

Could not rebase; Attempt merge onto e6847f3616...

Updating e6847f3..8e34a6d
Fast-forward
 docs/package-loader-specifications.rst             |   9 +
 setup.py                                           |   1 +
 swh/loader/package/rubygems/__init__.py            |  17 ++
 swh/loader/package/rubygems/loader.py              | 168 ++++++++++++++++++
 swh/loader/package/rubygems/tasks.py               |  14 ++
 .../api_v2_rubygems_haar_joke_versions_0.0.1.json  |   1 +
 .../api_v2_rubygems_haar_joke_versions_0.0.2.json  |   1 +
 .../downloads_haar_joke-0.0.1.gem                  | Bin 0 -> 8704 bytes
 .../downloads_haar_joke-0.0.2.gem                  | Bin 0 -> 8704 bytes
 swh/loader/package/rubygems/tests/test_rubygems.py | 196 +++++++++++++++++++++
 swh/loader/package/rubygems/tests/test_tasks.py    |  61 +++++++
 11 files changed, 468 insertions(+)
 create mode 100644 swh/loader/package/rubygems/__init__.py
 create mode 100644 swh/loader/package/rubygems/loader.py
 create mode 100644 swh/loader/package/rubygems/tasks.py
 create mode 100644 swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v2_rubygems_haar_joke_versions_0.0.1.json
 create mode 100644 swh/loader/package/rubygems/tests/data/https_rubygems.org/api_v2_rubygems_haar_joke_versions_0.0.2.json
 create mode 100644 swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_haar_joke-0.0.1.gem
 create mode 100644 swh/loader/package/rubygems/tests/data/https_rubygems.org/downloads_haar_joke-0.0.2.gem
 create mode 100644 swh/loader/package/rubygems/tests/test_rubygems.py
 create mode 100644 swh/loader/package/rubygems/tests/test_tasks.py
Changes applied before test
commit 8e34a6d7799660ef89beb683fc5577887e0366a3
Author: Franck Bret <franck.bret@octobus.net>
Date:   Thu Oct 27 18:50:19 2022 +0200

    Rubygems: Improve lister to make use of artifacts and rubygems_metadata
    provided by the lister extra_loader_arguments
    
    Use artifacts and rubygems_metadata to get list of versions, artifacts checksums and extrinsic metadata url
    Add an EXTID manifest
    Set metadata from extrinsic metadata

commit 0022bb50abd156cca30472249a440234a626169b
Author: Raphaël Gomès <rgomes@octobus.net>
Date:   Thu Oct 27 10:37:54 2022 +0200

    Add rubygems loader
    
    Reviewers: #reviewers, anlambert
    
    Subscribers: anlambert
    
    Maniphest Tasks: T4581
    
    Differential Revision: https://forge.softwareheritage.org/D8569

See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/1011/ for more details.

Looks good to me, thanks !

This revision is now accepted and ready to land.Nov 2 2022, 4:58 PM