
cran lister: Align lister to output list of tarballs per origin
Closed, Migrated, Edits Locked

Description

The lister currently lists one (versioned) tarball at a time (filtered on some fields).
It must be possible to aggregate and group those tarballs (output by the R script) by project/origin.

That way, we could share some common behavior between the gnu/cran loaders (1 origin with list of tarballs).
As hinted at in D1728#39980.

The output of the lister should also become "oneshot" scheduler tasks.
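The grouping asked for above could look roughly like the following. This is only a sketch: the record shape `(package, version, url)` and the name `group_by_origin` are assumptions for illustration, not what the R script or the lister actually emit.

```python
from collections import defaultdict


def group_by_origin(tarballs):
    """Group (package, version, url) records into one entry per package.

    The record shape is hypothetical; adapt it to the real lister output.
    """
    grouped = defaultdict(list)
    for package, version, url in tarballs:
        grouped[package].append({"version": version, "url": url})
    return dict(grouped)


# Hypothetical records, one per versioned tarball (the current behavior)
records = [
    ("stringr", "1.3.1",
     "https://cran.r-project.org/src/contrib/Archive/stringr/stringr_1.3.1.tar.gz"),
    ("stringr", "1.4.0",
     "https://cran.r-project.org/src/contrib/stringr_1.4.0.tar.gz"),
    ("jsonlite", "1.6",
     "https://cran.r-project.org/src/contrib/jsonlite_1.6.tar.gz"),
]

# One origin per package, each carrying its list of tarballs
origins = group_by_origin(records)
```

Each key of `origins` would then map to a single origin (and a single scheduler task) instead of one per tarball.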

Event Timeline

ardumont renamed this task from cran lister: Align lister output to gnu one to cran lister: Align lister to output list of tarballs per origin (if possible). Oct 2 2019, 6:47 AM
ardumont triaged this task as Normal priority.
ardumont created this task.

To list all available versions of an R package hosted on CRAN, the versions package could be used.

Its implementation is based on scraping the CRAN Archive pages.

Experiments

Some experiments with the R versions package:

$ for rp in stringr versions jsonlite; do time ./bin/list-versions.R $rp; echo; done
Loading required package: versions
$stringr
   version       date available
1    1.4.0 2019-02-10      TRUE
2    1.3.1 2018-05-10      TRUE
3    1.3.0 2018-02-19      TRUE
4    1.2.0 2017-02-18      TRUE
5    1.1.0 2016-08-19      TRUE
6    1.0.0 2015-04-30      TRUE
7    0.6.2 2012-12-06      TRUE
8    0.6.1 2012-07-25     FALSE
9      0.6 2011-12-08     FALSE
10     0.5 2011-06-30     FALSE
11     0.4 2010-08-24     FALSE
12     0.3 2010-02-15     FALSE
13     0.2 2009-11-16     FALSE
14  0.1.10 2009-11-09     FALSE

./bin/list-versions.R $rp  0.65s user 0.22s system 2% cpu 32.153 total

Loading required package: versions
$versions
  version       date available
1     0.3 2016-09-01      TRUE
2     0.2 2016-02-17      TRUE
3     0.1 2015-09-18      TRUE

./bin/list-versions.R $rp  0.62s user 0.23s system 2% cpu 35.146 total

Loading required package: versions
$jsonlite
   version       date available
1      1.6 2018-12-07      TRUE
2      1.5 2017-06-01      TRUE
3      1.4 2017-04-08      TRUE
4      1.3 2017-02-28      TRUE
5      1.2 2016-12-30      TRUE
6      1.1 2016-09-14      TRUE
7      1.0 2016-07-01      TRUE
8   0.9.22 2016-06-15      TRUE
9   0.9.21 2016-06-04      TRUE
10  0.9.20 2016-05-10      TRUE
11  0.9.19 2015-11-28      TRUE
12  0.9.18 2015-11-25      TRUE
13  0.9.17 2015-09-06      TRUE
14  0.9.16 2015-04-10      TRUE
15  0.9.15 2015-03-25      TRUE
16  0.9.14 2014-12-01      TRUE
17  0.9.12 2014-10-21      TRUE
18  0.9.13 2014-10-21      TRUE
19  0.9.11 2014-09-04      TRUE
20  0.9.10 2014-08-03     FALSE
21   0.9.9 2014-07-22     FALSE
22   0.9.8 2014-06-02     FALSE
23   0.9.7 2014-04-18     FALSE
24   0.9.6 2014-04-05     FALSE
25   0.9.5 2014-03-27     FALSE
26   0.9.4 2014-03-01     FALSE
27   0.9.3 2014-01-02     FALSE
28   0.9.1 2013-12-12     FALSE
29   0.9.0 2013-12-03     FALSE

./bin/list-versions.R $rp  0.68s user 0.17s system 2% cpu 32.769 total

source: P540
Note:
pkgLoad('versions') installs the versions package if it is not already installed.
That was already done outside the loop, so it adds no extra cost between iterations.

It works, but slowly (as expected, since it does scraping).
The output is not ideal though. We still need to build artifact urls (which I expect to be normalized).
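Building the artifact urls from a (package, version) pair could be sketched as below. This assumes CRAN's usual `src/contrib` layout, where the latest tarball sits at the top level and older ones under `Archive/<package>/`; MRAN snapshot urls would need their own pattern. The helper name `artifact_url` is hypothetical.

```python
# Assumption: CRAN's standard source layout, not MRAN's snapshot layout
CRAN_BASE = "https://cran.r-project.org/src/contrib"


def artifact_url(package, version, is_latest):
    """Build the tarball url for one version of a package."""
    filename = f"{package}_{version}.tar.gz"
    if is_latest:
        # The current release is served from the top-level directory
        return f"{CRAN_BASE}/{filename}"
    # Older releases are moved under Archive/<package>/
    return f"{CRAN_BASE}/Archive/{package}/{filename}"


latest = artifact_url("stringr", "1.4.0", is_latest=True)
older = artifact_url("stringr", "1.3.1", is_latest=False)
```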

My understanding of the versions package is that it uses the MRAN mirror [1] (Microsoft R Application Network) to check for archived artifacts.
MRAN takes daily snapshots of CRAN [2] (Comprehensive R Archive Network).

Status

I'm unsure. I see multiple possibilities:

  1. Continue as we do: the lister outputs 1 artifact (origin), which the loader ingests.

We need to add another lister instance to list MRAN's artifacts.
(We could close this task then).

Remark:
The origins become MRAN ones instead of the original CRAN ones... Factually though, we would have retrieved those from MRAN and not CRAN, so it's still true.
So that may not be a problem.

  2. As we do for some package managers (pypi, npm), let the lister output packages.

That means more logic loader-side: computing the versioned artifact URIs and retrieving them.

Pros/cons

|----------+-------------------------------------+---------------------------------+----------------------------------------------------------------------------------------|
| solution | pros                                | cons                            | description                                                                            |
|----------+-------------------------------------+---------------------------------+----------------------------------------------------------------------------------------|
|       1. | factual                             |                                 | After all, we would indeed retrieve artifacts from CRAN and MRAN separately            |
|          | less work                           |                                 | Add a new lister instance for MRAN (possibly some code adaptations regarding urls)     |
|          |                                     | 1 origin per versioned artifact |                                                                                        |
|----------+-------------------------------------+---------------------------------+----------------------------------------------------------------------------------------|
|       2. |                                     | more work                       | Simplify lister but increase logic cogs loader side                                    |
|          | more factual?                       | less factual?                   | Implementation is aware that CRAN/MRAN are 2 sides of the same coin                    |
|          |                                     | computations will be slower     | As demonstrated by the R 'versions' use before (independently on how we implement it)  |
|          |                                     |                                 | (Implementation-wise, if we do not want R dependencies loader-side as in the lister,   |
|          |                                     |                                 | that means reimplementing the scraping in Python)                                      |
|          | 1 origin per project with artifacts |                                 | Realistic                                                                              |
|----------+-------------------------------------+---------------------------------+----------------------------------------------------------------------------------------|

Conclusion

I would nonetheless tend towards 2., as it solves the initial problem this task wants to solve (with a different approach).

  • initial problem: at the moment 1 origin is 1 artifact (which I see as not that good). We did that for the first gnu ingestion in 2015 (but no longer do).
  • solution: 1 origin is actually a list of versioned artifacts.

Hopefully, others can challenge this and propose better.

[1] https://mran.microsoft.com/timemachine

[2] https://cran.r-project.org/

[3] https://www.quora.com/What-is-MRAN-how-does-it-differ-from-R-CRAN-and-why-would-I-care-about

Heads up.

Proper algorithm to implement:

Lister
  • drop the R cran script
  • parse the listing page instead [1]
  • for each package found there, send the origin url [2] to the loader (as recurring task)
Loader

Improve the loader so it scrapes that origin url [2] page.
It then determines by itself which artifact urls it needs to ingest.
In the [2] page, there is an archive link (or something) which lists the old associated artifacts (so a priori, no more need for the mran mirror).

[1] https://cran.r-project.org/web/packages/available_packages_by_date.html

[2] https://cran.r-project.org/package=<package-name>
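The lister and loader steps above could be sketched as follows, using only the Python standard library. The HTML samples are made up, and the markup they assume (package links of the form `web/packages/<name>/index.html` on the listing page, relative `.tar.gz` links on the archive page) is an assumption to check against the real pages. The function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_hrefs(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.hrefs


# Assumed shape of package links on the listing page
PACKAGE_HREF = re.compile(r"web/packages/([^/]+)/index\.html$")


def list_origin_urls(listing_html):
    """Lister side: one origin url per package found on the listing page."""
    origins = []
    for href in collect_hrefs(listing_html):
        match = PACKAGE_HREF.search(href)
        if match:
            origins.append(f"https://cran.r-project.org/package={match.group(1)}")
    return origins


def list_archived_tarballs(archive_html, package):
    """Loader side: artifact urls found on a package's archive page."""
    base = f"https://cran.r-project.org/src/contrib/Archive/{package}/"
    return [base + href for href in collect_hrefs(archive_html)
            if href.endswith(".tar.gz")]


# Made-up samples standing in for the real pages
LISTING = """
<table>
<tr><td>2019-02-10</td><td><a href="../../web/packages/stringr/index.html">stringr</a></td></tr>
<tr><td>2018-12-07</td><td><a href="../../web/packages/jsonlite/index.html">jsonlite</a></td></tr>
</table>
"""
ARCHIVE = """
<a href="/src/contrib/Archive/">Parent Directory</a>
<a href="stringr_0.1.10.tar.gz">stringr_0.1.10.tar.gz</a>
<a href="stringr_0.2.tar.gz">stringr_0.2.tar.gz</a>
"""

origin_urls = list_origin_urls(LISTING)
archived = list_archived_tarballs(ARCHIVE, "stringr")
```

A regex over the hrefs keeps the lister independent of the table layout, at the cost of relying on the link pattern staying stable.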

I wonder if we could find someone in the R community who would be able to ask for this page to be available in a universally machine readable format (e.g. yaml or json)? This feels like it wouldn't be a very big change to any existing generation script.

> In the [2] page, there is an archive link (or something) which lists the old associated artifacts (so a priori, no more need for the mran mirror).

The correct name for archive is 'Old sources' btw.

> I wonder if we could find someone in the R community who would be able to ask for this page to be available in a universally machine readable format (e.g. yaml or json)? This feels like it wouldn't be a very big change to any existing generation script.

That'd be neat.

maybe @zack knows someone ;)

ardumont renamed this task from cran lister: Align lister to output list of tarballs per origin (if possible) to cran lister: Align lister to output list of tarballs per origin. Jan 17 2020, 12:25 PM

The original description of this task was adapted by D2531 D2532 D2524.
A sample of what's been listed and ingested can be seen through the staging webapp instance [1].

Before triggering the new lister and loader, we need to clean up the first visits [2].

For now, it's only 1 artifact per origin though as T2029#40500 is not yet implemented (that will go in another task).

[1] https://webapp.internal.staging.swh.network/browse/search/?q=cran.r-project.org&with_visit

[2] P585

ardumont claimed this task.

Deployed.