
lister/loader: Ingest archived artifacts from cran mirror
Open, Normal, Public

Description

Older artifact versions are archived on the cran mirror; as of now, we only ingest the main (current) one.

Two sides of that coin (they can be done independently and in any order):

Lister

algo:

  • drop the R cran script
  • parse the listing page instead (as in simple_lister; see how the cgit lister does it) [1]
  • for each package found there, send the origin url [2] to the loader (as recurring task)
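
A minimal sketch of the listing step above, not the actual lister code. It assumes the package links in the listing page [1] have an href ending in `packages/<package-name>/index.html`; the names `PackageLinkParser` and `list_origin_urls` are hypothetical, and fetching the page is left out so the parsing can be tried on any HTML string.

```python
import re
from html.parser import HTMLParser

# Origin url template, as given in [2].
ORIGIN_URL_PATTERN = "https://cran.r-project.org/package={name}"

# Assumption: package links look like ".../packages/<package-name>/index.html".
PACKAGE_HREF = re.compile(r"(?:^|/)packages/([A-Za-z0-9.]+)/index\.html$")


class PackageLinkParser(HTMLParser):
    """Collect package names from the <a href=...> tags of the listing page."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = PACKAGE_HREF.search(href)
        # Keep the first occurrence only, preserving the page order.
        if match and match.group(1) not in self.names:
            self.names.append(match.group(1))


def list_origin_urls(listing_html):
    """Return one origin url per package found in the listing page HTML."""
    parser = PackageLinkParser()
    parser.feed(listing_html)
    return [ORIGIN_URL_PATTERN.format(name=name) for name in parser.names]
```

Each returned url would then be sent to the loader as a recurring task.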

schema adaptations:

  • make the tasks output by the lister recurring (currently oneshot)
  • Adapt uid field to be the origin_url's value
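
To make the two adaptations above concrete, here is a sketch of what a listed package could turn into. The field names (`uid`, `type`, `policy`, `arguments`) and the task type `load-cran` are assumptions for illustration, not the actual lister or scheduler schema.

```python
def to_lister_record(origin_url, task_type="load-cran"):
    """Build a hypothetical lister record plus its scheduler task.

    Assumptions: the record layout and the "load-cran" task type are
    illustrative only; the real schema may differ.
    """
    return {
        # uid is the origin url itself, so one origin maps to one row.
        "uid": origin_url,
        "task": {
            "type": task_type,
            # recurring instead of oneshot: the origin gets revisited,
            # so newly archived versions are picked up later.
            "policy": "recurring",
            "arguments": {"args": [], "kwargs": {"url": origin_url}},
        },
    }
```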

migration plan:

  • truncate cran_repo table
  • trigger a full listing again

Loader

algo:

  • Improve the loader so it scrapes that origin url [2] page.
  • It then determines by itself which artifact urls it needs to ingest.
  • In the [2] page, there is an archive link ("Old sources") which lists the previous artifact versions.
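
The loader steps above could look roughly like the sketch below. It rests on assumptions: artifacts are links ending in `.tar.gz`, the archive link contains `/src/contrib/Archive/`, and `package_url` is the final url of the [2] page after any redirect (so relative links resolve correctly). The helper names and the `fetch` callable (url in, HTML text out) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_links(html, base_url):
    """Return the absolute urls of all links in html, resolved against base_url."""
    collector = HrefCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def artifact_urls(package_html, package_url, fetch):
    """List the current artifact url first, then the archived ones.

    fetch is a caller-provided function (hypothetical) returning the HTML
    of a url; it is injected so the logic can be tried without network access.
    """
    links = collect_links(package_html, package_url)
    artifacts = [url for url in links if url.endswith(".tar.gz")]
    for archive_url in [url for url in links if "/src/contrib/Archive/" in url]:
        # Treat the archive url as a directory so relative links resolve inside it.
        base = archive_url if archive_url.endswith("/") else archive_url + "/"
        archived = collect_links(fetch(archive_url), base)
        artifacts += [url for url in archived if url.endswith(".tar.gz")]
    return artifacts
```

Passing `fetch` in keeps the scraping logic separate from the HTTP client the loader ends up using.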

[1] https://cran.r-project.org/web/packages/available_packages_by_date.html
This can be subject to discussion with the cran community to ask for a better api endpoint (if it's not too much hassle for them to adapt and provide ;)

[2] https://cran.r-project.org/package=<package-name>

Related to T2029#40500

Event Timeline

ardumont triaged this task as Normal priority. Jan 21 2020, 11:52 AM
ardumont created this task.
ardumont updated the task description.
ardumont renamed this task from lister/loader: Ingest all known artifacts from cran mirror to lister/loader: Ingest archived artifacts from cran mirror. Jan 22 2020, 10:46 AM
ardumont updated the task description.