Page MenuHomeSoftware Heritage

cran.lister: Use CRAN's canonical url as origin url
ClosedPublic

Authored by ardumont on Tue, Jan 14, 2:42 PM.

Details

Summary

Also make sure the uid is versioned (so as to allow to actually see new package
versions).

The canonical urls were seen at the available packages listing [1]

(This means an adaptation loader CRAN's side is needed)

[1] https://cran.r-project.org/web/packages/available_packages_by_date.html

Test Plan

tox

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Tue, Jan 14, 2:42 PM
ardumont added inline comments.Tue, Jan 14, 2:52 PM
swh/lister/cran/tests/test_lister.py
51

build failed

my bad, it's a task row not a lister row.

ardumont planned changes to this revision.Tue, Jan 14, 2:52 PM
ardumont updated this revision to Diff 8995.Tue, Jan 14, 4:20 PM

Fix test

ardumont edited the summary of this revision. (Show Details)Tue, Jan 14, 5:04 PM
ardumont updated this revision to Diff 9000.EditedWed, Jan 15, 10:51 AM

Only dedicate the diff to the origin url changes

Another one will be created for the uid versioning part (~> D2531)

ardumont added inline comments.Wed, Jan 15, 1:30 PM
swh/lister/cran/lister.py
45–47

Heads up, i will also drop that version which has no part in there.
It's the artifact's version...

ardumont updated this revision to Diff 9009.Wed, Jan 15, 1:54 PM

Drop version as it's really the artifact's version (already defined there)

vlorentz requested changes to this revision.Wed, Jan 15, 4:15 PM
vlorentz added a subscriber: vlorentz.

The canonical urls were seen at the available packages listing [1]

Do you mean that you consider the .../index.html links to be canonical? These pages contain the following text:

Please use the canonical form https://CRAN.R-project.org/package=<NameOfPackage> to link to this page.

Shouldn't we use that instead if we want them to be canonical?

And I don't like using an URL to an HTML file as an origin, anyway...

This revision now requires changes to proceed.Wed, Jan 15, 4:15 PM
douardda added inline comments.Wed, Jan 15, 4:17 PM
swh/lister/cran/lister.py
46

this html_url seems a bit strange here, and there is no mention of it in the example given in the docstring...

Also it looks like this argument is in fact required, so I don't understand why it's retrieved with a kwargs.get()

Do you mean that you consider the .../index.html links to be canonical? These pages contain the following text:

In the sense we can follow a link which describes the origin, yes.

Please use the canonical form https://CRAN.R-project.org/package=<NameOfPackage> to link to this page.

nicer indeed.

And I don't like using an URL to an HTML file as an origin, anyway...

Well, yes. Currently, it's a link to a tarball so... that was better with the html page, i think.

;)

swh/lister/cran/lister.py
46

Also it looks like this argument is in fact required, so I don't understand why it's retrieved with a kwargs.get()

it is.

i avoided the task_dict method change as i'm unsure we can do it.

ardumont added inline comments.Wed, Jan 15, 4:33 PM
swh/lister/cran/lister.py
46

also that's already the existing variable version's case... (and policy) so... i just aligned that code...

task_dict takes the model transformed earlier as parameter...
so it's unfortunately tightly bound with that... (and i'm not starting yet another refactoring...)

html_url is actually, in this lister's context, an artifact_url, so you do not see html_url, you see artifact_url which is present in the docstring.

ardumont updated this revision to Diff 9036.Wed, Jan 15, 4:35 PM

Adapt according to review:

  • use the right canonical url
  • be more explicit on task_dict's parameter names
ardumont added a comment.EditedWed, Jan 15, 5:40 PM

Shouldn't we use that instead if we want them to be canonical?

Totally btw, i've adapted the code accordingly.

I completely missed it, thanks!

ardumont added inline comments.Wed, Jan 15, 5:51 PM
swh/lister/cran/lister.py
117–118

Now i'm wondering whether to change this in cran's model... (to name it appropriately...)
.oO(*sigh*)

ardumont added a comment.EditedThu, Jan 16, 11:24 AM

as discussed IRL, let's go forward with this.

Ultimately, the change needed is to:

  • drop the R cran script
  • parse the listing page instead [1]
  • for each package found there, send the origin url [2] to the loader (as recurring task)
  • rewrite the loader cran so it scrapes that origin url [2] page. It then determines itself what the artifact urls it needs to ingest (taking care of T2029 even more properly than described there, task updated btw).

[1] https://cran.r-project.org/web/packages/available_packages_by_date.html

[2] https://cran.r-project.org/package=<package-name>

vlorentz accepted this revision.Thu, Jan 16, 1:42 PM
This revision is now accepted and ready to land.Thu, Jan 16, 1:42 PM
ardumont updated this revision to Diff 9040.Thu, Jan 16, 1:49 PM

Rebase and plug to master branch

This revision was automatically updated to reflect the committed changes.