lister/loader: Ingest archived artifacts from cran mirror
Closed, Migrated

Assigned To

Authored By

	ardumont
	Jan 21 2020, 11:52 AM

Description

The keyword here is "archived": as of now, we only ingest the main (current) artifact of each package.

There are 2 sides to that coin (they can be done independently and in any order we choose):

Lister

algo:

drop the R cran script
parse the listing page [1] instead (as in simple_lister; check how the cgit lister does it)
for each package found there, send the origin url [2] to the loader (as recurring task)
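
The lister steps above could be sketched roughly as follows. This is a minimal illustration, not the actual lister code: it assumes the listing page [1] links each package with an href ending in `web/packages/<name>/index.html`, and the names `PackageLinkParser` and `list_origin_urls` are hypothetical. Fetching the page and scheduling the tasks are left out.

```python
import re
from html.parser import HTMLParser

# Origin url pattern [2] from the task description.
ORIGIN_URL_PATTERN = "https://cran.r-project.org/package={name}"
# Assumption: package links in the listing page end like this.
PACKAGE_HREF = re.compile(r"web/packages/([^/]+)/index\.html$")


class PackageLinkParser(HTMLParser):
    """Collect package names from the <a href=...> links of the listing page."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = PACKAGE_HREF.search(href)
        # Keep first-seen order and skip duplicates.
        if match and match.group(1) not in self.names:
            self.names.append(match.group(1))


def list_origin_urls(listing_html):
    """Return one origin url per package found in the listing page."""
    parser = PackageLinkParser()
    parser.feed(listing_html)
    return [ORIGIN_URL_PATTERN.format(name=name) for name in parser.names]
```

Each returned url would then be sent to the loader as a recurring task.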

schema adaptations:

make the tasks output by the lister recurring (currently oneshot)
adapt the uid field to be the origin_url's value
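
To make the two adaptations concrete, here is a small sketch of what a listed package and its task could look like afterwards. The dict layouts, the `load-cran` task type name and the helper names are assumptions for illustration only, not the real lister model or scheduler api.

```python
def build_model(name, origin_url):
    """Row for the lister table: the uid is now the origin url itself."""
    return {"uid": origin_url, "name": name, "origin_url": origin_url}


def build_task(origin_url, task_type="load-cran"):
    """Loader task for one origin, scheduled as recurring instead of oneshot.

    Assumption: the task type name and the argument layout are hypothetical.
    """
    return {
        "type": task_type,
        "policy": "recurring",
        "arguments": {"args": [], "kwargs": {"url": origin_url}},
    }
```

Using the origin url as uid keeps one row (and one recurring task) per package, whatever the number of versions it has.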

migration plan:

truncate cran_repo table
trigger a full listing again

Loader

algo:

Improve the loader so it scrapes the origin url [2] page.
It then determines by itself which artifact urls it needs to ingest.
In the [2] page, there is an archive link, "Old sources", which lists the previous artifact versions.
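
A rough sketch of that scraping step, under stated assumptions: the current source tarball is linked with an href under `/src/contrib/` ending in `.tar.gz`, and the "Old sources" link points under `/src/contrib/Archive/`. The names `LinkCollector` and `find_artifact_links` are hypothetical, and fetching the pages (including the archive listing itself) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect every href found in the <a> tags of a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_artifact_links(page_html, page_url):
    """Split the links of a package page into current tarballs and archive listings.

    page_url is the url the page was actually served from; relative hrefs
    are resolved against it.
    """
    collector = LinkCollector()
    collector.feed(page_html)
    current, archive_listings = [], []
    for href in collector.hrefs:
        url = urljoin(page_url, href)
        if "/src/contrib/Archive/" in url:
            # Assumption: this is the "Old sources" listing of previous versions.
            archive_listings.append(url)
        elif "/src/contrib/" in url and url.endswith(".tar.gz"):
            current.append(url)
    return {"current": current, "archive_listings": archive_listings}
```

The loader would then fetch each archive listing the same way and add the tarballs found there to the artifacts to ingest.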

[1] https://cran.r-project.org/web/packages/available_packages_by_date.html
This can be discussed with the cran community, to ask for a better api endpoint (if it's not too much hassle for them to adapt and provide ;)

[2] https://cran.r-project.org/package=<package-name>

Related to T2029#40500

Related Objects

Mentioned Here: T2029: cran lister: Align lister to output list of tarballs per origin

Event Timeline

ardumont triaged this task as Normal priority. Jan 21 2020, 11:52 AM

ardumont created this task.

ardumont updated the task description.

This should take care of [1]

[1] https://sentry.softwareheritage.org/share/issue/c8a2d4918c7c43318459507804de8767/

ardumont renamed this task from "lister/loader: Ingest all known artifacts from cran mirror" to "lister/loader: Ingest archived artifacts from cran mirror". Jan 22 2020, 10:46 AM