Details

Reviewers

douardda

Group Reviewers

Reviewers

Maniphest Tasks

T1709: implement an R-cran lister

Commits

rDLSa9a37a85bf9e: swh.lister.cran

Summary

Implemented CRAN lister which uses build in API in R to list the packages in JSON format.
That JSON string is passed to the lister.py which then extract all the information to create an origin form the JSON passed by the R script.

Here is the short snippet of the complete response of R script https://forge.softwareheritage.org/P407
Closes T1709

Diff Detail

Repository

rDLS Listers

Branch

rcran (branched from master)

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 5930
Build 8125: tox-on-jenkins	Jenkins
Build 8124: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

anlambert added a subscriber: anlambert.May 21 2019, 4:02 PM

anlambert added inline comments.

swh/lister/rcran/lister.py
28	FYI, we are currently renaming the scheduler task names to something more rational (see T1508). This should be changed to: _type = 'load-%s' % origin_type

Added rcran lister in readme and cli.py

nahimilega added inline comments.May 21 2019, 5:58 PM

swh/lister/rcran/lister.py
28	Thanks for updating me about the change @anlambert. I made the recommended changes.

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/130/ for more details.

Harbormaster completed remote builds in B5864: Diff 4923.May 21 2019, 5:59 PM

Added functions necessary in to be present because of @abc.abstractmethod

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/132/ for more details.

Harbormaster completed remote builds in B5866: Diff 4925.May 21 2019, 7:45 PM

Sorry to insist: commit messages are better, but still lacks at least the imperative form. Also do keep in mind the why-not-how when writing the body of the commit message.
Thanks.

This revision now requires changes to proceed.May 22 2019, 9:26 AM

Improve commit messages by using imperative form

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/133/ for more details.

Harbormaster completed remote builds in B5872: Diff 4932.May 22 2019, 12:35 PM

@douardda Thanks for helping me out to improve commit messages.
Although I was wondering before landing the diff we usually squash all the commits to one single one, so what is the need to follow strict guidelines for commit messages in the process of improving the diff. I mean at the end they all are going to be squashed to one single commit. :)

Updated doc string

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/137/ for more details.

Harbormaster completed remote builds in B5907: Diff 4962.May 24 2019, 6:40 AM

In D1492#33485, @nahimilega wrote:

@douardda Thanks for helping me out to improve commit messages.
Although I was wondering before landing the diff we usually squash all the commits to one single one, so what is the need to follow strict guidelines for commit messages in the process of improving the diff. I mean at the end they all are going to be squashed to one single commit. :)

No, commits MUST remains as you wrote them. Do NOT use arcanist to push your commits, use 'git push' directly, as described in https://wiki.softwareheritage.org/wiki/Code_review_in_Phabricator#Landing_your_change_onto_master

Although I was wondering before landing the diff we usually squash all the commits to one single one,

You can but you are not forced to.
You could keep logical modifications that belong together into multiple commits.
I'd say the only requisite is that those commits have their unit tests green.

so what is the need to follow strict guidelines for commit messages in the process of improving the diff.

It's not necessarily only to improve the diff.

That can also help you clarify what you wanted to do in the first place.

I think it's always an helpful exercise and everyone can benefit from it.

Cheers,

Thanks @ardumont @douardda, I will remember your advice.

douardda requested changes to this revision.May 24 2019, 10:28 AM

douardda added inline comments.

README.md
180–191	This part of the doc (hence your addition) is pretty much deprecated. So keep this for now, but be aware it can probably be removed any time soon.
swh/lister/rcran/list_all_the_packages.R
2 ↗	(On Diff #4962)	please add a small comment that explains (one sentence should be enough) what this script does (even if it is quite simple).
swh/lister/rcran/lister.py
18–19	why implement the overloaded init if it does nothing more than calling super()?
30–35	as I said in D1497 I don't like that kwargs.get is used to extract mandatory values. Also don't see the need for those `project_xxx` variables. This method should be a one statement only.
45	This will not work. You assume you are running the command from the package directory. And it is a rather bad idea to call the script directly (rather than calling the `Rscript` tool with the path to the script as argument). Yo need to figure out where this script will be shipped once the python package is installed and ensure you get it's proper path at execution time. I'd also rather use check_output (without a shell) here. No need for a shell in the path.
52	?
54–61	why do you need a method for this one liner? Plus I don't like having URLs hardwritten in the middle of the code: it should be at least a constant of the class or configurable parameter which defaults to this URL. Also, use named-based formatting here: return 'https://cran.r-project.org/src/contrib/%(Package)s_%(Version)s.tar.gz' % repo
118–125	weird. why are these (NOP-like) methods needed?

This revision now requires changes to proceed.May 24 2019, 10:28 AM

Replace R script with a single line command.

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/139/ for more details.

Harbormaster completed remote builds in B5929: Diff 4979.May 26 2019, 7:04 PM

You got your wish by inlining the R call inside of Python.

That is still if somewhat superfluous gain as it will make it harder to add any R processing, should you need or want it, later.
For starters, it is somewhat suboptimal to use the N x 65 returns (where N == number of CRAN packages, currently at 14000 and still growing by several thousand each year) when you only need package and version. So maybe

R> obj1 <- jsonlite::toJSON( tools::CRAN_package_db() )
R> obj2 <- jsonlite::toJSON( tools::CRAN_package_db()[, c("Package", "Version") ] )
R> dang::lsos()  # little helper to post-process object.size()                                                                                                                                                     
     Type     Size Rows Columns
obj1 json 17174872    1      NA
obj2 json   579312    1      NA
R>

Optimize R Script to make it scalable

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/140/ for more details.

Harbormaster completed remote builds in B5930: Diff 4980.May 26 2019, 7:43 PM

nahimilega edited the summary of this revision. (Show Details)May 26 2019, 7:45 PM

@eddelbuettel As you recommended, I did some preprocessing only to get the useful data.
Also, I changed from a separate file to a single line R script because it is difficult to determine the location of the R script to execute it when the module will be shipped as a package or run in production.

Also, I changed from a separate file to a single line R script because it is difficult to determine the location of the R script to execute it when the module will be shipped as a package or run in production.

Let's just agree to disgree here. Some of us know how to do this with packages, and have done for years if not decades. But my role here is not to impose or suggest policy (but rather to help with R details as needed) so I will leave this to others as they see fit.

nahimilega added inline comments.May 26 2019, 7:55 PM

swh/lister/rcran/lister.py
118–125	SimpleLister class is inherited, and these methods are @abc.abstractmethod hence, they are needed to be present or else code will throw an error. However, the response is not achieved from conventional methods. Thus, there is no need for these functions.

Let's just agree to disgree here. Some of us know how to do this with packages, and have done for years if not decades.

@eddelbuettel I didn't had an opportunity to work on a project with two languages and combine them, it is my first time. I would love to learn the way it is done and made ready for production.

douardda added inline comments.May 27 2019, 9:49 AM

swh/lister/rcran/lister.py
118–125	Yeah right... this code (I mean the base classes provided for Lister implementation) really need some serious refactoring...

In D1492#33863, @nahimilega wrote:

Let's just agree to disgree here. Some of us know how to do this with packages, and have done for years if not decades.

@eddelbuettel I didn't had an opportunity to work on a project with two languages and combine them, it is my first time. I would love to learn the way it is done and made ready for production.

Note that I did not mean you have to embed the R code as a string in your python code when I made my comment on this part of your code. There are ways to handle such a non-python file properly in a python package. Please read https://packaging.python.org/ and https://setuptools.readthedocs.io/en/latest/setuptools.html (and especially https://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files ).

Change rcran name to cran
Convert oneliner R command to a R script.

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/147/ for more details.

Harbormaster completed remote builds in B5944: Diff 4995.May 28 2019, 1:37 PM

nahimilega retitled this revision from Implemented R-CRAN Lister to Implemented CRAN Lister.May 28 2019, 1:37 PM

nahimilega edited the summary of this revision. (Show Details)

Add instance parameter according to unified credentials structure between listers.

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/156/ for more details.

Harbormaster completed remote builds in B5975: Diff 5024.May 29 2019, 6:43 PM

nahimilega marked an inline comment as done.Jun 2 2019, 12:00 PM

nahimilega retitled this revision from Implemented CRAN Lister to CRAN Lister.Jun 3 2019, 4:10 PM

vlorentz added a subscriber: vlorentz.Jun 3 2019, 6:02 PM

vlorentz added inline comments.

swh/lister/core/tests/conftest.py
14	Should be in alphabetical order
swh/lister/cran/lister.py
38–56 ↗	(On Diff #5024)	Should be wrapped in triple backquotes. And it does not return JSON, it returns a list of dictionaries.

Change docstring and made conftest.py in core ordered alphabetically

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/163/ for more details.

Harbormaster completed remote builds in B5997: Diff 5047.Jun 3 2019, 7:18 PM

nahimilega marked 2 inline comments as done.Jun 4 2019, 12:40 PM

The diffs looks fine, but I'd like to have confirmation the code works properly in real conditions (in a docker session) before we merge this.

This revision is now accepted and ready to land.Jun 6 2019, 2:51 PM

squashed git commits
rebased the branch on master
removed methods that were not in use but we written because of abstractattribute
Ran the lister in docker

Harbormaster completed remote builds in B6118: Diff 5160.Jun 8 2019, 9:19 PM

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/175/ for more details.

I ran the lister in docker to test it, it ran without any problem
(the content of tcran.py is same 4 lines that are present in README of lister

(swh) archit@work-pc:~/swh-environment/swh-lister$ python tcran.py 
DEBUG:swh.lister.core.lister_base:Loading config from lister_cran
INFO:swh.core.config:Loading config file /home/archit/.config/swh/lister_cran.yml
DEBUG:swh.lister.core.lister_base:<swh.lister.cran.lister.CRANLister object at 0x7f03ea55c518> CONFIG={'content_size_limit': 104857600, 'log_db': 'dbname=softwareheritage-log', 'storage': {'cls': 'remote', 'args': {'url': 'http://localhost:5002/'}}, 'scheduler': {'cls': 'remote', 'args': {'url': 'http://localhost:5008/'}}, 'lister': {'cls': 'local', 'args': {'db': 'postgresql:///lister-cran'}}, 'credentials': [], 'cache_responses': True, 'cache_dir': '/home/archit/.cache/swh/lister/cran/'}
DEBUG:root:models: 10000
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5002
DEBUG:urllib3.connectionpool:http://localhost:5002 "POST /origin/add_multi HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5008
DEBUG:urllib3.connectionpool:http://localhost:5008 "POST /create_tasks HTTP/1.1" 200 1
DEBUG:root:models: 4372
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool:http://localhost:5002 "POST /origin/add_multi HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool:http://localhost:5008 "POST /create_tasks HTTP/1.1" 200 1

Here is one of the task that was created by the lister
https://forge.softwareheritage.org/P424

@douardda Can you please take a look at it and give me confirmation to merge it.

rebased the branch on master

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/178/ for more details.

Harbormaster completed remote builds in B6148: Diff 5180.Jun 11 2019, 6:01 PM

Closed by commit rDLSa9a37a85bf9e: swh.lister.cran (authored by nahimilega). · Explain WhyJun 12 2019, 4:05 PM

This revision was automatically updated to reflect the committed changes.

ardumont added inline comments.Jul 25 2019, 6:35 PM

swh/lister/cran/lister.py
101 ↗	(On Diff #5200)	As mentioned in irc, for the sake of history. Instead of overriding the `ingest_data` method by almost duplicating the original, this should have been best to override the method `safely_issue_request` (dropping the use of the identifier parameter). That implementation would have been the current `r_script_request`. This can still be improved at some point ;) Cheers,

ardumont mentioned this in D2062: cran.lister: Refactor and fix cran lister.Oct 2 2019, 11:02 AM

ardumont mentioned this in rDLSd30d574dbe9a: cran.lister: Refactor and fix cran lister.Oct 2 2019, 11:47 AM

CRAN Lister
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 4980

README.md

swh/lister/cli.py

swh/lister/core/tests/conftest.py

swh/lister/rcran/init.py

swh/lister/rcran/lister.py

swh/lister/rcran/models.py

swh/lister/rcran/tasks.py

swh/lister/rcran/tests/init.py

swh/lister/rcran/tests/conftest.py

swh/lister/rcran/tests/test_tasks.py

CRAN ListerClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 4980

README.md

swh/lister/cli.py

swh/lister/core/tests/conftest.py

swh/lister/rcran/__init__.py

swh/lister/rcran/lister.py

swh/lister/rcran/models.py

swh/lister/rcran/tasks.py

swh/lister/rcran/tests/__init__.py

swh/lister/rcran/tests/conftest.py

swh/lister/rcran/tests/test_tasks.py

CRAN Lister
ClosedPublic
Actions

Revision Contents
Changeset List

swh/lister/rcran/init.py

swh/lister/rcran/tests/init.py