Page MenuHomeSoftware Heritage

cran.lister: Use CRAN's canonical url as origin url
ClosedPublic

Authored by ardumont on Jan 14 2020, 2:42 PM.

Details

Summary

Also make sure the uid is versioned (so as to allow to actually see new package
versions).

The canonical urls were seen at the available packages listing [1]

(This means an adaptation loader CRAN's side is needed)

[1] https://cran.r-project.org/web/packages/available_packages_by_date.html

Test Plan

tox

Diff Detail

Repository
rDLS Listers
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10095
Build 14949: tox-on-jenkinsJenkins
Build 14948: arc lint + arc unit

Event Timeline

swh/lister/cran/tests/test_lister.py
50

build failed

my bad, it's a task row not a lister row.

Only dedicate the diff to the origin url changes

Another one will be created for the uid versioning part (~> D2531)

swh/lister/cran/lister.py
46–47

Heads up, i will also drop that version which has no part in there.
It's the artifact's version...

Drop version as it's really the artifact's version (already defined there)

vlorentz added a subscriber: vlorentz.

The canonical urls were seen at the available packages listing [1]

Do you mean that you consider the .../index.html links to be canonical? These pages contain the following text:

Please use the canonical form https://CRAN.R-project.org/package=<NameOfPackage> to link to this page.

Shouldn't we use that instead if we want them to be canonical?

And I don't like using an URL to an HTML file as an origin, anyway...

This revision now requires changes to proceed.Jan 15 2020, 4:15 PM
swh/lister/cran/lister.py
46

this html_url seems a bit strange here, and there is no mention of it in the example given in the docstring...

Also it looks like this argument is in fact required, so I don't understand why it's retrieved with a kwargs.get()

Do you mean that you consider the .../index.html links to be canonical? These pages contain the following text:

In the sense we can follow a link which describes the origin, yes.

Please use the canonical form https://CRAN.R-project.org/package=<NameOfPackage> to link to this page.

nicer indeed.

And I don't like using an URL to an HTML file as an origin, anyway...

Well, yes. Currently, it's a link to a tarball so... that was better with the html page, i think.

;)

swh/lister/cran/lister.py
46

Also it looks like this argument is in fact required, so I don't understand why it's retrieved with a kwargs.get()

it is.

i avoided the task_dict method change as i'm unsure we can do it.

swh/lister/cran/lister.py
46

also that's already the existing variable version's case... (and policy) so... i just aligned that code...

task_dict takes the model transformed earlier as parameter...
so it's unfortunately tightly bound with that... (and i'm not starting yet another refactoring...)

html_url is actually, in this lister's context, an artifact_url, so you do not see html_url, you see artifact_url which is present in the docstring.

Adapt according to review:

  • use the right canonical url
  • be more explicit on task_dict's parameter names

Shouldn't we use that instead if we want them to be canonical?

Totally btw, i've adapted the code accordingly.

I completely missed it, thanks!

swh/lister/cran/lister.py
117–118

Now i'm wondering whether to change this in cran's model... (to name it appropriately...)
.oO(*sigh*)

as discussed IRL, let's go forward with this.

Ultimately, the change needed is to:

  • drop the R cran script
  • parse the listing page instead [1]
  • for each package found there, send the origin url [2] to the loader (as recurring task)
  • rewrite the loader cran so it scrapes that origin url [2] page. It then determines itself what the artifact urls it needs to ingest (taking care of T2029 even more properly than described there, task updated btw).

[1] https://cran.r-project.org/web/packages/available_packages_by_date.html

[2] https://cran.r-project.org/package=<package-name>

This revision is now accepted and ready to land.Jan 16 2020, 1:42 PM

Rebase and plug to master branch