
implement an R-cran lister
Closed, Migrated

Description

As discussed on IRC, the currently known approach is to use the https://cran.r-project.org/web/packages/available_packages_by_name.html URL to list all the packages, but the response returned by this page is HTML. Hence, to make a more reliable lister, we need to find some other source for listing all the packages.


Event Timeline

nahimilega triaged this task as Normal priority. May 13 2019, 1:48 PM
nahimilega created this task.
nahimilega created this object in space S1 Public.

@faux on IRC mentioned that there is a public DB dump (https://cran.r-project.org/web/dbs) which might be helpful for this purpose.
This DB dump contains files with the .rds extension, a serialization format used by the R language. Here are a couple of rows from that DB dump: https://forge.softwareheritage.org/P396
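
For reference, such a dump can also be inspected from Python with the third-party pyreadr library. This is only a hedged sketch: the filename below is illustrative, and pyreadr's librdata backend may not handle every R type used in these dumps.

import pyreadr

# Read one of the .rds files downloaded from the dbs/ dump.
result = pyreadr.read_r('packages.rds')
df = result[None]  # an .rds file holds a single, unnamed object
print(df.shape)
print(list(df.columns))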

zack renamed this task from Implementation of R-cran lister to implement an R-cran lister. May 13 2019, 1:59 PM

Here is an implementation plan for the R-CRAN lister.
I have taken inspiration from the PyPI lister.
To make lister.py for R-CRAN, we need to inherit from the SimpleLister class and override the ingest_data() function, changing its first line (where safely_issue_request() is called) to call a function which runs an R script and returns a JSON response.
After that it is much like handling any normal response; we just need to implement the following functions: list_packages, compute url, get_model_from_repo, task_dict and transport_response_simplified.
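
As a rough sketch of this plan: SimpleLister, ingest_data() and safely_issue_request() are the names discussed in this thread, but the module path, the method signatures, the model field names and the helper script filename (list_all_packages.R) below are all assumptions for illustration, not the final implementation.

import json
import subprocess

from swh.lister.core.simple_lister import SimpleLister


class CRANLister(SimpleLister):

    def run_r_script(self):
        # Replaces the safely_issue_request() step: run the R helper
        # and parse its JSON output, a list of per-package records.
        raw = subprocess.check_output(['Rscript', 'list_all_packages.R'])
        return json.loads(raw)

    def ingest_data(self, identifier, checks=None):
        # Same shape as SimpleLister.ingest_data(), with the first line
        # swapped out as described above.
        response = self.run_r_script()
        models = self.transport_response_simplified(response)
        # ... the remaining steps (DB injection, task scheduling) would
        # stay as in SimpleLister ...
        return response, models

    def list_packages(self, response):
        # The parsed JSON already is the complete package list.
        return response

    def get_model_from_repo(self, repo):
        # Map one package record onto the lister's model fields
        # (field names here are illustrative).
        return {
            'uid': repo['Package'],
            'name': repo['Package'],
            'version': repo['Version'],
            'origin_url': 'https://cran.r-project.org/package=%s'
                          % repo['Package'],
        }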

This looks fine, thanks.

The first step would be to parse the data that CRAN provides, using the built-in R APIs, to see what's available. We're interested in:

  • listing all packages
  • for each package, listing all the published versions

Once we have both pieces of information, we can generate the relevant loading tasks.

@olasd I do not have any familiarity with the R language. Learning some basics and writing this script would take me around a week. I was wondering whether someone at Software Heritage who has some experience with R could write this script, as it would be a matter of minutes for someone who knows R.
Is it possible to do so?

@nahimilega it is probably a two-line script: install R and call readRDS(), and you will get a data.frame object, which is just like a table with columns, from which you can extract what you want. Cheers :). BTW, when I ran readRDS() it retrieved a lot of links; I don't know that much about the lister, but you can pick up from there.

Expanding on what Dirk Eddelbuettel posted on IRC when we talked about that, a minimal R script to fetch the current package information would be:

#!/usr/bin/Rscript

# Fetch the metadata of all current CRAN packages as a data.frame,
# then serialize it to JSON (one object per package).
db <- tools::CRAN_package_db();
dbjson <- jsonlite::toJSON(db);

print(dbjson);

(Debian dependencies: r-base-core, r-cran-jsonlite).
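
To sanity-check its output from the Python side, something like the following should work. This is a sketch: the script filename is an assumption, and it assumes the script's stdout is the bare JSON document (jsonlite::toJSON() on a data.frame yields one JSON object per package, carrying the DESCRIPTION fields such as Package and Version).

import json
import subprocess

# Run the R helper above and parse its JSON output.
raw = subprocess.check_output(['Rscript', 'list_packages.R'])
packages = json.loads(raw)

print('%d packages currently on CRAN' % len(packages))
print(packages[0]['Package'], packages[0]['Version'])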

Now the open question is whether there is an API to find *all* the available versions of a given package, rather than just the latest one like the CRAN_package_db() function provides.

Nicholas: Sadly, one can't. I kinda/sorta have that implicitly as I have been running CRANberries since 2007 or so.

One can approximate. Each package has an archive/ folder with its prior versions. You can use that for the versions, and the file date (I know ...) as a date approximation.
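
A rough sketch of that approximation: CRAN serves prior source tarballs under src/contrib/Archive/<package>/ as a plain directory index; the regular expression over that index below is an assumption and may need adjusting. The same index also shows a last-modified date per tarball, which could be captured for the date approximation mentioned above.

import re
import requests

ARCHIVE_URL = 'https://cran.r-project.org/src/contrib/Archive/%s/'

def archived_versions(package):
    resp = requests.get(ARCHIVE_URL % package)
    if resp.status_code == 404:
        return []  # package has no archived (prior) versions
    resp.raise_for_status()
    # Tarballs are named <package>_<version>.tar.gz in the index.
    pattern = re.escape(package) + r'_([0-9][^"]*?)\.tar\.gz'
    return sorted(set(re.findall(pattern, resp.text)))

print(archived_versions('Rcpp'))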

This misses packages that have been removed from the current index. That might be surmountable by an additional full crawl of CRAN. Not sure.

But other approximations exist. E.g., for some services he runs (partially with R Consortium funding), Gabor Csardi mirrors each CRAN package into github.com/cran/ with one repo per package. We can read the commit history there for the history.
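
A sketch of reading that mirror's history via the standard GitHub REST API (unauthenticated calls are rate-limited, and pagination is omitted here, so a real lister would need a token and paging):

import requests

def mirror_history(package):
    url = 'https://api.github.com/repos/cran/%s/commits' % package
    resp = requests.get(url)
    resp.raise_for_status()
    # On the mirror, each commit corresponds to one released version;
    # the commit date approximates the release date.
    return [(c['commit']['message'].strip(),
             c['commit']['committer']['date'])
            for c in resp.json()]

for message, date in mirror_history('Rcpp'):
    print(date, message)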

The more uplifting news is that once the lister is set up, on a _go-forward_ basis it will get everything, and do it right. That is not a small feat.

@eddelbuettel yeah, if there isn't a standard way to go all the way back in time, it's OK to only ingest what's currently returned as available. In the medium/long term it will converge to having archived everything (w.r.t. the considered time frame) anyway. And we can always retrofit later stuff that is archived elsewhere. But I wouldn't want to make this a blocker to start archiving what's (easily) listable now.

Oh, and thanks a lot for your feedback! :-)

Yes, for history I do not believe we have an easy answer for SWH (and the world at large) to consume. We may have approximations; I'll check with Gabor and others.

Worst case, a one-off 'fill in as best as we can from current time' will be very close and is doable. So depending on how the GSoC summer goes, there may be time. (And that crawling can of course be done in Python too.) But I'll try to think about whether or not we have something easier.