
Rubygems Lister
Closed, Migrated. Edits Locked.

Description

To build the RubyGems lister, we need the following:
The list of all the packages.
The source code URL and metadata for each package.

To get the list of all the packages
There is no public API endpoint that lists all the packages. However, the built-in gem command can list every package together with all the versions available for it.

$ gem list -r --all

This lists every package along with all of its available versions. Here is a sample of the output: https://forge.softwareheritage.org/P413
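To turn that listing into something a lister could iterate over, a small parser is enough. The sketch below assumes each line has the shape "name (version, version, ...)", optionally with platform names after a version, and that banner lines can be skipped. The function name and the sample text are made up for illustration; the real output in P413 may differ.

```python
import re

# Assumed line shape: "<name> (<version>[ <platform> ...], <version>, ...)"
LINE_RE = re.compile(r"^(?P<name>\S+) \((?P<versions>.*)\)$")


def parse_gem_list(output):
    """Map each package name to the list of its version strings."""
    packages = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # blank lines and banners do not match the pattern
        versions = [
            entry.split()[0]  # keep the version, drop any platform names
            for entry in match.group("versions").split(", ")
            if entry.strip()
        ]
        packages[match.group("name")] = versions
    return packages


# Made-up sample in the assumed format, not real registry data.
sample = """*** REMOTE GEMS ***

example_gem (0.2.0, 0.1.0)
other_gem (1.0.1 ruby java, 1.0.0)
"""

print(parse_gem_list(sample))
```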

To get source code URL and metadata for a particular package

The API provided by rubygems.org can be used for this.
Here is the URL pattern used to call the API:

https://rubygems.org/api/v2/rubygems/[package]/versions/[version].json

Here is the documentation for the API: https://guides.rubygems.org/rubygems-org-api-v2/
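As a rough illustration, the URL pattern above can be filled in and queried with the standard library alone. The helper names are hypothetical, and the shape of the returned JSON is not assumed here; see the API documentation for the actual fields.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_ROOT = "https://rubygems.org/api/v2/rubygems"


def version_api_url(package, version):
    """Fill the [package] and [version] slots of the v2 URL pattern."""
    return (
        f"{API_ROOT}/{quote(package, safe='')}"
        f"/versions/{quote(version, safe='')}.json"
    )


def fetch_version_metadata(package, version):
    """Fetch and decode the JSON document for one package version.

    Performs a network request, so it is defined here but not called.
    """
    with urlopen(version_api_url(package, version)) as response:
        return json.load(response)


print(version_api_url("rails", "3.2.1"))
```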

Event Timeline

nahimilega triaged this task as Normal priority. Jun 2 2019, 7:23 PM
nahimilega created this task.
nahimilega created this object in space S1 Public.

On further investigation, I found that rubygems.org provides data dumps:
https://rubygems.org/pages/data
These could be used to get the list of all the packages.

I looked into the data dumps provided on rubygems.org.
RubyGems provides a bash script (link to the script) that downloads the most recent weekly dump listed on https://rubygems.org/pages/data and loads it into a PostgreSQL database.

Here is the list of tables present in the database:

                        List of relations
 Schema |     Name      | Type  |  Owner   |  Size  | Description 
--------+---------------+-------+----------+--------+-------------
 public | dependencies  | table | postgres | 454 MB | 
 public | gem_downloads | table | postgres | 62 MB  | 
 public | linksets      | table | postgres | 17 MB  | 
 public | rubygems      | table | postgres | 10 MB  | 
 public | versions      | table | postgres | 436 MB | 
(5 rows)

The gem_downloads and dependencies tables are of no use for the lister.

The rubygems table maps each package name to its id and records when the package was created and last updated.

  id   |      name      |         created_at         |         updated_at         |      slug      
--------+----------------+----------------------------+----------------------------+----------------
  31113 | markdownie     | 2010-07-26 03:01:27.850005 | 2017-03-27 03:43:31.555649 | 
  15102 | hpriocot       | 2009-07-25 17:46:41        | 2009-07-25 17:46:41        | hpriocot
  15679 | textgraph      | 2009-07-25 17:49:53        | 2017-03-27 03:38:19.148762 | textgraph
  15696 | test_gem       | 2009-07-25 17:49:54        | 2017-03-27 03:38:19.468454 | test_gem
  15838 | svnbranch      | 2009-07-25 17:50:32        | 2017-03-27 03:38:22.532766 | svnbranch

In the linksets table we can get the link to the code and the last update time for a given rubygem id, but no VCS information.

 id   | rubygem_id |                      home                      | wiki | docs | mail |                code                | bugs |     created_at      |         updated_at         
-------+------------+------------------------------------------------+------+------+------+------------------------------------+------+---------------------+----------------------------
 14646 |      14978 | http://jay.mcgavren.com/zyps/                  |      |      |      | http://github.com/jaymcgavren/zyps |      | 2009-07-25 17:46:22 | 2009-10-15 11:25:22.863545
 14647 |      14980 | http://github.com/austinrfnd/zvent-gem/        |      |      |      |                                    |      | 2009-07-25 17:46:24 | 2009-07-25 17:46:24
 14649 |      14985 |                                                |      |      |      |                                    |      | 2009-07-25 17:46:24 | 2009-07-25 17:46:24
 14650 |      14987 | http://ruby-zoom.rubyforge.org                 |      |      |      |                                    |      | 2009-07-25 17:46:24 | 2009-07-25 17:46:24
 14651 |      14988 | http://github.com/technicalpickles/zomgjeweler |      |      |      |                                    |      | 2009-07-25 17:46:24 | 2009-07-25 17:46:24

Here is the schema of the versions table:

                                                Table "public.versions"
          Column           |            Type             | Collation | Nullable |               Default                
---------------------------+-----------------------------+-----------+----------+--------------------------------------
 id                        | integer                     |           | not null | nextval('versions_id_seq'::regclass)
 authors                   | text                        |           |          | 
 description               | text                        |           |          | 
 number                    | character varying(255)      |           |          | 
 rubygem_id                | integer                     |           |          | 
 built_at                  | timestamp without time zone |           |          | 
 updated_at                | timestamp without time zone |           |          | 
 summary                   | text                        |           |          | 
 platform                  | character varying(255)      |           |          | 
 created_at                | timestamp without time zone |           |          | 
 indexed                   | boolean                     |           |          | true
 prerelease                | boolean                     |           |          | 
 position                  | integer                     |           |          | 
 latest                    | boolean                     |           |          | 
 full_name                 | character varying(255)      |           |          | 
 licenses                  | character varying(255)      |           |          | 
 size                      | integer                     |           |          | 
 requirements              | text                        |           |          | 
 required_ruby_version     | character varying(255)      |           |          | 
 sha256                    | character varying(255)      |           |          | 
 metadata                  | hstore                      |           | not null | ''::hstore
 required_rubygems_version | character varying           |           |          | 
 yanked_at                 | timestamp without time zone |           |          | 
 info_checksum             | character varying           |           |          | 
 yanked_info_checksum      | character varying           |           |          |

And here is a sample row

   id   |   authors    | description | number | rubygem_id |      built_at       |         updated_at         |          summary           | platform |         created_at         | indexed | prerelease | position | latest |    full_name    | licenses | size | requirements | required_ruby_version |                    sha256                    | metadata | required_rubygems_version | yanked_at |          info_checksum           | yanked_info_checksum 
--------+--------------+-------------+--------+------------+---------------------+----------------------------+----------------------------+----------+----------------------------+---------+------------+----------+--------+-----------------+----------+------+--------------+-----------------------+----------------------------------------------+----------+---------------------------+-----------+----------------------------------+----------------------
 175034 | ACM,PBG,LEGO | simple      | 0.1    |      30742 | 2010-12-15 06:00:00 | 2016-06-27 08:24:10.068616 | a simple message framework | ruby     | 2010-12-15 13:26:11.600363 | t       | f          |        2 | f      | masstransit-0.1 |          | 3584 |              |                       | UqTmQNpTs3QIsSFkKxFFp+s5692BX78UoxsGxOTZ6XU= |          | >= 0                      |           | 6bf50f3b108ecbb0d8e06d5f50d3f09d | 
(1 row)

I investigated the data dumps a bit, and they seem to serve the purpose well.
To get a package release, we mainly need the name and version of the package. The link can therefore be generated as follows:

Syntax of the link to the gem package source release 
http://rubygems.org/gems/<name>-<version>.gem
Example
http://rubygems.org/gems/rails-3.2.1.gem
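The pattern is simple enough to express as a one-line helper. This is only a sketch with a hypothetical function name. It covers the plain name-version case shown above; the full_name column of the sample versions row (masstransit-0.1) has the same shape.

```python
def gem_download_url(name, version):
    """Build the link to a gem release from its name and version."""
    return f"http://rubygems.org/gems/{name}-{version}.gem"


# The example from the text, then the sample row from the versions table.
print(gem_download_url("rails", "3.2.1"))
print(gem_download_url("masstransit", "0.1"))
```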

The blueprint for making the lister:

  1. Download and load the latest data dump using the script mentioned in the comment above.
  2. Get the name and last update time of every package from the rubygems table.
  3. Get all the versions associated with each package, with their respective metadata, from the versions table.
  4. From this information, generate the link to the gem package source release as described above, then create the loading task.
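Steps 2 to 4 amount to a join between the rubygems and versions tables. Here is a minimal sketch of that join. An in-memory SQLite database stands in for the PostgreSQL dump so the example runs on its own, and only a few columns are reproduced. The function name, the task dictionary layout, and the rubygems row (including its updated_at value) are placeholders invented for illustration; only the versions values come from the sample row above. Against the real dump a PostgreSQL driver would be needed, and its calls may differ slightly.

```python
import sqlite3


def iter_loading_tasks(conn):
    """Yield one task dict per (package, version) pair in the dump."""
    query = """
        SELECT r.name, r.updated_at, v.number, v.sha256
        FROM rubygems AS r
        JOIN versions AS v ON v.rubygem_id = r.id
        ORDER BY r.name, v.number
    """
    for name, updated_at, number, sha256 in conn.execute(query):
        yield {
            "name": name,
            "version": number,
            "last_update": updated_at,
            "sha256": sha256,
            # Same link pattern as described earlier in this task.
            "url": f"http://rubygems.org/gems/{name}-{number}.gem",
        }


# SQLite stand-in with a reduced schema; the rubygems row is a placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rubygems (id INTEGER, name TEXT, updated_at TEXT);
    CREATE TABLE versions (rubygem_id INTEGER, number TEXT, sha256 TEXT);
    INSERT INTO rubygems VALUES (30742, 'masstransit', '2019-01-01 00:00:00');
    INSERT INTO versions VALUES
        (30742, '0.1', 'UqTmQNpTs3QIsSFkKxFFp+s5692BX78UoxsGxOTZ6XU=');
""")

tasks = list(iter_loading_tasks(conn))
for task in tasks:
    print(task["url"])
```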

Some statistics -

No. of packages listed through gem list -r --all : 151373
No. of packages present in data dump: 162339

Some extra info
The package release that will be downloaded has a structure similar to the following:

/[package_name]               # 1
        |__ /bin              # 2
        |__ /lib              # 3
        |__ /test             # 4
        |__ README            # 5
        |__ Rakefile          # 6
        |__ [name].gemspec    # 7

[package_name]:
The main root directory of the Gem package.
/bin:
Location of the executable binaries if the package has any.
/lib:
Directory containing the main Ruby application code (inc. modules).
/test:
Location of test files.
Rakefile:
The Rake-file for libraries which use Rake for builds.
[name].gemspec:
The *.gemspec file, named after the main directory, contains all the package metadata, e.g. name, version, directories, etc.

(Source: https://www.digitalocean.com/community/tutorials/how-to-package-and-distribute-ruby-applications-as-a-gem-using-rubygems)

We mainly archive source code, but here I feel we should also ingest all the files in a package release (except the /bin folder), because the source code of a package is useless without files like the *.gemspec file and the Rakefile. The package cannot be regenerated from the source code without these files, so I feel they are also essential to ingest.
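The selection proposed here could look roughly like the following. It is only a sketch: the function name is hypothetical, the paths are assumed to be relative to the package root, and the file list is made up to mirror the layout shown above.

```python
from pathlib import PurePosixPath


def files_to_ingest(paths):
    """Keep every path except those under the top-level bin/ directory."""
    kept = []
    for path in paths:
        parts = PurePosixPath(path).parts
        if parts and parts[0] == "bin":
            continue  # skip executables shipped in /bin
        kept.append(path)
    return kept


# Made-up file list following the layout sketched above.
release_files = [
    "bin/tool",
    "lib/example.rb",
    "lib/bin/helper.rb",  # nested bin directory, not the top-level one
    "test/test_example.rb",
    "README",
    "Rakefile",
    "example.gemspec",
]

print(files_to_ingest(release_files))
```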

Regarding the source code link in the linksets table:
Only 17306 (~10%) of the 162339 packages had links, and those came without VCS information. So, I guess they are useless.

bchauvet added a parent task: Unknown Object (Maniphest Task). Sep 2 2022, 10:59 AM