cran lister: Align lister to output list of tarballs per origin (if possible)
Open, Normal, Public

Description

The lister currently lists one (versioned) tarball at a time (filtered on some fields).
It must be possible to aggregate and group those tarballs (output by the R script) by project/origin.

That way, we could share some common behavior between the gnu/cran loaders (1 origin with a list of tarballs), as hinted at in D1728#39980.

The output of the lister should also become "oneshot" scheduler tasks.
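As a rough illustration of the grouping described above, here is a minimal Python sketch. The record fields, the example URLs, the `load-cran` task type name and the task dict layout are all assumptions made for the example; they are not the lister's or the scheduler's actual format.

```python
from collections import defaultdict

# Hypothetical flat lister output: one record per versioned tarball.
# The URLs are illustrative only.
TARBALLS = [
    {"name": "stringr", "version": "1.4.0",
     "url": "https://cran.r-project.org/src/contrib/stringr_1.4.0.tar.gz"},
    {"name": "stringr", "version": "1.3.1",
     "url": "https://cran.r-project.org/src/contrib/Archive/stringr/stringr_1.3.1.tar.gz"},
    {"name": "jsonlite", "version": "1.6",
     "url": "https://cran.r-project.org/src/contrib/jsonlite_1.6.tar.gz"},
]


def group_by_origin(tarballs):
    """Group flat tarball records into one list of artifacts per package."""
    grouped = defaultdict(list)
    for record in tarballs:
        grouped[record["name"]].append(
            {"version": record["version"], "url": record["url"]}
        )
    return dict(grouped)


def to_oneshot_task(name, artifacts):
    """Describe one 'oneshot' task per origin (dict layout is an assumption)."""
    return {
        "type": "load-cran",  # hypothetical task type name
        "policy": "oneshot",
        "arguments": {"args": [], "kwargs": {"name": name, "artifacts": artifacts}},
    }


tasks = [
    to_oneshot_task(name, artifacts)
    for name, artifacts in group_by_origin(TARBALLS).items()
]
```

With this shape, each origin carries its full list of versioned tarballs, which is the "1 origin with a list of tarballs" layout discussed for the gnu/cran loaders.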

Event Timeline

ardumont renamed this task from cran lister: Align lister output to gnu one to cran lister: Align lister to output list of tarballs per origin (if possible). Wed, Oct 2, 6:47 AM
ardumont triaged this task as Normal priority.
ardumont created this task.

In order to list all available versions of an R package hosted on CRAN, the versions package could be used.

Its implementation is based on scraping the CRAN Archive pages.
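For comparison, the same kind of scraping can be sketched in Python with the standard library only. This is a minimal sketch, assuming an archive page is a plain directory index whose links are named `<package>_<version>.tar.gz`; the HTML fragment below is made up for the example and was not fetched from CRAN, and `TarballLinkParser` and `list_versions` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Illustrative directory-index fragment; NOT fetched from CRAN.
SAMPLE_INDEX = """
<a href="?C=N;O=D">Name</a>
<a href="stringr_0.6.2.tar.gz">stringr_0.6.2.tar.gz</a>
<a href="stringr_1.3.1.tar.gz">stringr_1.3.1.tar.gz</a>
"""


class TarballLinkParser(HTMLParser):
    """Collect versions from links named <package>_<version>.tar.gz."""

    def __init__(self, package):
        super().__init__()
        self.pattern = re.compile(
            r"^%s_(?P<version>.+)\.tar\.gz$" % re.escape(package)
        )
        self.versions = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = self.pattern.match(href)
        if match:
            self.versions.append(match.group("version"))


def list_versions(package, index_html):
    """Return the versions found in an index page, in page order."""
    parser = TarballLinkParser(package)
    parser.feed(index_html)
    return parser.versions


versions = list_versions("stringr", SAMPLE_INDEX)
```

In a real run the index HTML would come from an HTTP request per package, which is where most of the time would be spent.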

ardumont updated the task description. Sun, Oct 6, 10:09 AM
ardumont added a comment. Edited. Sun, Oct 6, 12:08 PM

Experiments

Some experiments with the R versions package:

$ for rp in stringr versions jsonlite; do time ./bin/list-versions.R $rp; echo; done
Loading required package: versions
$stringr
   version       date available
1    1.4.0 2019-02-10      TRUE
2    1.3.1 2018-05-10      TRUE
3    1.3.0 2018-02-19      TRUE
4    1.2.0 2017-02-18      TRUE
5    1.1.0 2016-08-19      TRUE
6    1.0.0 2015-04-30      TRUE
7    0.6.2 2012-12-06      TRUE
8    0.6.1 2012-07-25     FALSE
9      0.6 2011-12-08     FALSE
10     0.5 2011-06-30     FALSE
11     0.4 2010-08-24     FALSE
12     0.3 2010-02-15     FALSE
13     0.2 2009-11-16     FALSE
14  0.1.10 2009-11-09     FALSE

./bin/list-versions.R $rp  0.65s user 0.22s system 2% cpu 32.153 total

Loading required package: versions
$versions
  version       date available
1     0.3 2016-09-01      TRUE
2     0.2 2016-02-17      TRUE
3     0.1 2015-09-18      TRUE

./bin/list-versions.R $rp  0.62s user 0.23s system 2% cpu 35.146 total

Loading required package: versions
$jsonlite
   version       date available
1      1.6 2018-12-07      TRUE
2      1.5 2017-06-01      TRUE
3      1.4 2017-04-08      TRUE
4      1.3 2017-02-28      TRUE
5      1.2 2016-12-30      TRUE
6      1.1 2016-09-14      TRUE
7      1.0 2016-07-01      TRUE
8   0.9.22 2016-06-15      TRUE
9   0.9.21 2016-06-04      TRUE
10  0.9.20 2016-05-10      TRUE
11  0.9.19 2015-11-28      TRUE
12  0.9.18 2015-11-25      TRUE
13  0.9.17 2015-09-06      TRUE
14  0.9.16 2015-04-10      TRUE
15  0.9.15 2015-03-25      TRUE
16  0.9.14 2014-12-01      TRUE
17  0.9.12 2014-10-21      TRUE
18  0.9.13 2014-10-21      TRUE
19  0.9.11 2014-09-04      TRUE
20  0.9.10 2014-08-03     FALSE
21   0.9.9 2014-07-22     FALSE
22   0.9.8 2014-06-02     FALSE
23   0.9.7 2014-04-18     FALSE
24   0.9.6 2014-04-05     FALSE
25   0.9.5 2014-03-27     FALSE
26   0.9.4 2014-03-01     FALSE
27   0.9.3 2014-01-02     FALSE
28   0.9.1 2013-12-12     FALSE
29   0.9.0 2013-12-03     FALSE

./bin/list-versions.R $rp  0.68s user 0.17s system 2% cpu 32.769 total

source: P540
Note:
pkgLoad('versions') installs the versions package if it is not already installed.
That was already done before the loop, so it adds no extra cost between iterations.

It works, but slowly (as expected, since it scrapes pages): each call above took more than 30 seconds of wall-clock time.
The output is not ideal though: we still need to build the artifact URLs (which I expect to follow a normalized pattern).
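Building the URLs from that output could look like the following minimal sketch. It assumes the usual CRAN layout, with the current release under `src/contrib/` and older releases under `src/contrib/Archive/<package>/`, and it assumes rows are ordered newest first as in the output above. The helper names are hypothetical, and the meaning of the `available` column is left uninterpreted here.

```python
import re

# Rows copied from the stringr output above (subset).
SAMPLE_OUTPUT = """\
   version       date available
1    1.4.0 2019-02-10      TRUE
2    1.3.1 2018-05-10      TRUE
8    0.6.1 2012-07-25     FALSE
"""

# One data row: index, version, date, availability flag.
ROW = re.compile(
    r"^\s*\d+\s+(?P<version>\S+)\s+(?P<date>\d{4}-\d{2}-\d{2})"
    r"\s+(?P<available>TRUE|FALSE)\s*$"
)

# Assumed base URL for source tarballs on CRAN.
CRAN = "https://cran.r-project.org/src/contrib"


def parse_rows(text):
    """Turn the printed table into a list of dicts, skipping the header."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append({
                "version": match["version"],
                "date": match["date"],
                "available": match["available"] == "TRUE",
            })
    return rows


def artifact_url(package, version, is_latest):
    """Build a tarball URL under the assumed CRAN layout."""
    filename = f"{package}_{version}.tar.gz"
    if is_latest:
        return f"{CRAN}/{filename}"
    return f"{CRAN}/Archive/{package}/{filename}"


def artifact_urls(package, rows):
    """Rows are assumed to be ordered newest first, as in the output above."""
    return [
        artifact_url(package, row["version"], index == 0)
        for index, row in enumerate(rows)
    ]


rows = parse_rows(SAMPLE_OUTPUT)
urls = artifact_urls("stringr", rows)
```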

My understanding of the versions package is that it uses the MRAN mirror [1] (Microsoft R Application Network) to check for archived artifacts.
MRAN takes daily snapshots of CRAN [2] (Comprehensive R Archive Network).

Status

I'm unsure. I see multiple possibilities:

  1. Continue as we do now: the lister outputs 1 artifact (origin), which the loader ingests.

We would need to add another lister instance to list MRAN's artifacts.
(We could then close this task.)

Remark:
The origins become MRAN ones instead of the original CRAN ones. Factually though, we would have retrieved those from MRAN and not CRAN, so it is still accurate.
So that may not be a problem.

  2. As we do for some package managers (pypi, npm), let the lister output packages.

That means more logic loader-side: computing the versioned artifact URIs and retrieving the artifacts.

Pros/cons

|----------+-------------------------------------+---------------------------------+----------------------------------------------------------------------------------------|
| solution | pros                                | cons                            | description                                                                            |
|----------+-------------------------------------+---------------------------------+----------------------------------------------------------------------------------------|
|       1. | factual                             |                                 | After all, we would indeed retrieve artifacts from CRAN and MRAN separately            |
|          | less work                           |                                 | Add a new lister instance for MRAN (possibly some code adaptations regarding urls)     |
|          |                                     | 1 origin per versioned artifact |                                                                                        |
|----------+-------------------------------------+---------------------------------+----------------------------------------------------------------------------------------|
|       2. |                                     | more work                       | Simplifies the lister but adds logic loader-side                                       |
|          | more factual?                       | less factual?                   | Implementation is aware that CRAN/MRAN are 2 sides of the same coin                    |
|          |                                     | computations will be slower     | As shown by the R 'versions' runs above (independently of how we implement it)         |
|          |                                     |                                 | Implementation-wise, if we do not want R dependencies loader-side (as the lister has), |
|          |                                     |                                 | that means reimplementing the scraping in Python                                       |
|          | 1 origin per project with artifacts |                                 | Realistic                                                                              |
|----------+-------------------------------------+---------------------------------+----------------------------------------------------------------------------------------|

Conclusion

I would tend towards 2. nonetheless, as it solves the initial problem this task targets (with a different approach).

  • initial problem: at the moment 1 origin is 1 artifact (which I do not see as good). We did that for the first gnu ingestion in 2015 (but no longer do).
  • solution: 1 origin is actually a list of versioned artifacts.

Hopefully, others can challenge this and propose something better.

[1] https://mran.microsoft.com/timemachine

[2] https://cran.r-project.org/

[3] https://www.quora.com/What-is-MRAN-how-does-it-differ-from-R-CRAN-and-why-would-I-care-about