keyword is archived; as of now, we only ingest the main one.
Two sides of that coin (they can be done independently, in any order):
Lister
algo:
- drop the R cran script
- parse the listing page [1] instead (as a simple_lister; check how the cgit lister does it)
- for each package found there, send the origin url [2] to the loader (as recurring task)
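A minimal sketch of the lister steps above, assuming the listing page [1] links each package as .../web/packages/<package-name>/index.html (that link shape, the ListingParser and list_origin_urls names, and the sample rows are illustrative assumptions, not the actual swh lister code):

```python
from html.parser import HTMLParser
import re

# Origin url pattern [2] from this task.
CRAN_ORIGIN_URL = "https://cran.r-project.org/package={name}"

# Assumption: package links on the listing page end with
# web/packages/<package-name>/index.html
PACKAGE_HREF = re.compile(r"web/packages/([^/]+)/index\.html$")


class ListingParser(HTMLParser):
    """Collect package names from the anchors of the listing page."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = PACKAGE_HREF.search(href)
        if match:
            self.names.append(match.group(1))


def list_origin_urls(listing_html):
    """Return one origin url per package, in page order, without duplicates."""
    parser = ListingParser()
    parser.feed(listing_html)
    unique_names = dict.fromkeys(parser.names)  # keeps first-seen order
    return [CRAN_ORIGIN_URL.format(name=name) for name in unique_names]


# Made-up excerpt of the listing page, for illustration only.
SAMPLE_LISTING = """
<table>
<tr><td>2020-01-10</td><td><a href="../../web/packages/A3/index.html">A3</a></td></tr>
<tr><td>2020-01-09</td><td><a href="../../web/packages/zoo/index.html">zoo</a></td></tr>
</table>
"""

if __name__ == "__main__":
    for url in list_origin_urls(SAMPLE_LISTING):
        print(url)
```

Each returned url is what the lister would hand over to the loader task. Fetching the page itself is left out so the sketch stays runnable offline.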
schema adaptations:
- make the tasks output by the lister recurring (currently oneshot)
- adapt the uid field to hold the origin_url value
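To illustrate the two schema adaptations above, here is a rough sketch of what a listed record and its task could look like. The helper names, the dict layouts and the "load-cran" task type are hypothetical, chosen only for this example; the real lister model and scheduler API may differ.

```python
# Origin url pattern [2] from this task.
CRAN_ORIGIN_URL = "https://cran.r-project.org/package={name}"


def make_lister_record(name):
    """Build a listed-package record whose uid is the origin url."""
    origin_url = CRAN_ORIGIN_URL.format(name=name)
    return {
        "uid": origin_url,  # uid now holds the origin_url value
        "name": name,
        "origin_url": origin_url,
    }


def make_loader_task(record, task_type="load-cran"):
    """Build a recurring loader task (hypothetical layout) for a record."""
    return {
        "type": task_type,      # hypothetical task type name
        "policy": "recurring",  # previously "oneshot"
        "arguments": {"args": [], "kwargs": {"url": record["origin_url"]}},
    }


if __name__ == "__main__":
    record = make_lister_record("A3")
    print(record)
    print(make_loader_task(record))
```

Using the origin url as uid keeps one record per package regardless of its version, which fits a loader that discovers the versions by itself.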
migration plan:
- truncate cran_repo table
- trigger a full listing again
Loader
algo:
- improve the loader so that it scrapes the origin url [2] page
- the loader then determines by itself which artifact urls it needs to ingest
- the [2] page has an archive link, "Old source", which lists the previous artifact versions
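A possible sketch of the loader side described above: collect the links of the [2] page, then keep the current source tarball and the archive listing. It assumes that tarballs live under /src/contrib/ and that the archive link points under /src/contrib/Archive/; those path conventions, the function names and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect every href found in the anchors of a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_artifact_links(page_html, base_url):
    """Return (current tarball urls, archive listing urls) for a package page.

    base_url is the final url of the package page; relative links are
    resolved against it.
    """
    collector = LinkCollector()
    collector.feed(page_html)
    links = [urljoin(base_url, href) for href in collector.hrefs]
    # Assumption: source tarballs are served under /src/contrib/
    current = [
        u for u in links if "/src/contrib/" in u and u.endswith(".tar.gz")
    ]
    # Assumption: the archive link points to /src/contrib/Archive/<name>
    archives = [
        u for u in links
        if "/src/contrib/Archive/" in u and not u.endswith(".tar.gz")
    ]
    return current, archives


# Made-up excerpt of a package page, for illustration only.
SAMPLE_PAGE = """
<a href="A3.pdf">Reference manual</a>
<a href="../../../src/contrib/A3_1.0.0.tar.gz">A3_1.0.0.tar.gz</a>
<a href="https://cran.r-project.org/src/contrib/Archive/A3">A3 archive</a>
"""

if __name__ == "__main__":
    base = "https://cran.r-project.org/web/packages/A3/index.html"
    print(find_artifact_links(SAMPLE_PAGE, base))
```

The archive listing url would then be scraped the same way to enumerate the previous versions' tarballs.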
[1] https://cran.r-project.org/web/packages/available_packages_by_date.html
This could be discussed with the CRAN community, to ask for a better API endpoint (if it's not too much hassle for them to adapt and provide one ;)
[2] https://cran.r-project.org/package=<package-name>
Related to T2029#40500