Page MenuHomeSoftware Heritage

CRAN Lister
ClosedPublic

Authored by nahimilega on May 20 2019, 3:59 PM.

Details

Summary

Implemented CRAN lister which uses build in API in R to list the packages in JSON format.
That JSON string is passed to the lister.py which then extract all the information to create an origin form the JSON passed by the R script.

Here is the short snippet of the complete response of R script https://forge.softwareheritage.org/P407
Closes T1709

Diff Detail

Repository
rDLS Listers
Branch
rcran (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5930
Build 8125: tox-on-jenkinsJenkins
Build 8124: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
anlambert added inline comments.
swh/lister/rcran/lister.py
28

FYI, we are currently renaming the scheduler task names to something more rational (see T1508).
This should be changed to:

_type = 'load-%s' % origin_type
nahimilega marked 2 inline comments as done.
  • Added rcran lister in readme and cli.py
swh/lister/rcran/lister.py
28

Thanks for updating me about the change @anlambert. I made the recommended changes.

  • Added functions necessary in to be present because of @abc.abstractmethod

Sorry to insist: commit messages are better, but still lacks at least the imperative form. Also do keep in mind the why-not-how when writing the body of the commit message.
Thanks.

This revision now requires changes to proceed.May 22 2019, 9:26 AM
  • Improve commit messages by using imperative form

@douardda Thanks for helping me out to improve commit messages.
Although I was wondering before landing the diff we usually squash all the commits to one single one, so what is the need to follow strict guidelines for commit messages in the process of improving the diff. I mean at the end they all are going to be squashed to one single commit. :)

@douardda Thanks for helping me out to improve commit messages.
Although I was wondering before landing the diff we usually squash all the commits to one single one, so what is the need to follow strict guidelines for commit messages in the process of improving the diff. I mean at the end they all are going to be squashed to one single commit. :)

No, commits MUST remains as you wrote them. Do NOT use arcanist to push your commits, use 'git push' directly, as described in https://wiki.softwareheritage.org/wiki/Code_review_in_Phabricator#Landing_your_change_onto_master

Although I was wondering before landing the diff we usually squash all the commits to one single one,

You can but you are not forced to.
You could keep logical modifications that belong together into multiple commits.
I'd say the only requisite is that those commits have their unit tests green.

so what is the need to follow strict guidelines for commit messages in the process of improving the diff.

It's not necessarily only to improve the diff.

That can also help you clarify what you wanted to do in the first place.

I think it's always an helpful exercise and everyone can benefit from it.

Cheers,

douardda added inline comments.
README.md
180–191

This part of the doc (hence your addition) is pretty much deprecated. So keep this for now, but be aware it can probably be removed any time soon.

swh/lister/rcran/list_all_the_packages.R
2 ↗(On Diff #4962)

please add a small comment that explains (one sentence should be enough) what this script does (even if it is quite simple).

swh/lister/rcran/lister.py
18–19

why implement the overloaded init if it does nothing more than calling super()?

30–35

as I said in D1497 I don't like that kwargs.get is used to extract mandatory values.

Also don't see the need for those project_xxx variables. This method should be a one statement only.

45

This will not work. You assume you are running the command from the package directory. And it is a rather bad idea to call the script directly (rather than calling the Rscript tool with the path to the script as argument).

Yo need to figure out where this script will be shipped once the python package is installed and ensure you get it's proper path at execution time.

I'd also rather use check_output (without a shell) here. No need for a shell in the path.

52

?

54–61

why do you need a method for this one liner?
Plus I don't like having URLs hardwritten in the middle of the code: it should be at least a constant of the class or configurable parameter which defaults to this URL.

Also, use named-based formatting here:

return 'https://cran.r-project.org/src/contrib/%(Package)s_%(Version)s.tar.gz' % repo
118–125

weird. why are these (NOP-like) methods needed?

This revision now requires changes to proceed.May 24 2019, 10:28 AM
nahimilega marked 6 inline comments as done.
  • Replace R script with a single line command.

You got your wish by inlining the R call inside of Python.

That is still if somewhat superfluous gain as it will make it harder to add any R processing, should you need or want it, later.
For starters, it is somewhat suboptimal to use the N x 65 returns (where N == number of CRAN packages, currently at 14000 and still growing by several thousand each year) when you only need package and version. So maybe

R> obj1 <- jsonlite::toJSON( tools::CRAN_package_db() )
R> obj2 <- jsonlite::toJSON( tools::CRAN_package_db()[, c("Package", "Version") ] )
R> dang::lsos()  # little helper to post-process object.size()                                                                                                                                                     
     Type     Size Rows Columns
obj1 json 17174872    1      NA
obj2 json   579312    1      NA
R>
  • Optimize R Script to make it scalable

@eddelbuettel As you recommended, I did some preprocessing only to get the useful data.
Also, I changed from a separate file to a single line R script because it is difficult to determine the location of the R script to execute it when the module will be shipped as a package or run in production.

Also, I changed from a separate file to a single line R script because it is difficult to determine the location of the R script to execute it when the module will be shipped as a package or run in production.

Let's just agree to disgree here. Some of us know how to do this with packages, and have done for years if not decades. But my role here is not to impose or suggest policy (but rather to help with R details as needed) so I will leave this to others as they see fit.

swh/lister/rcran/lister.py
118–125

SimpleLister class is inherited, and these methods are @abc.abstractmethod hence, they are needed to be present or else code will throw an error. However, the response is not achieved from conventional methods. Thus, there is no need for these functions.

Let's just agree to disgree here. Some of us know how to do this with packages, and have done for years if not decades.

@eddelbuettel I didn't had an opportunity to work on a project with two languages and combine them, it is my first time. I would love to learn the way it is done and made ready for production.

swh/lister/rcran/lister.py
118–125

Yeah right... this code (I mean the base classes provided for Lister implementation) really need some serious refactoring...

Let's just agree to disgree here. Some of us know how to do this with packages, and have done for years if not decades.

@eddelbuettel I didn't had an opportunity to work on a project with two languages and combine them, it is my first time. I would love to learn the way it is done and made ready for production.

Note that I did not mean you have to embed the R code as a string in your python code when I made my comment on this part of your code. There are ways to handle such a non-python file properly in a python package. Please read https://packaging.python.org/ and https://setuptools.readthedocs.io/en/latest/setuptools.html (and especially https://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files ).

  • Change rcran name to cran
  • Convert oneliner R command to a R script.
nahimilega retitled this revision from Implemented R-CRAN Lister to Implemented CRAN Lister.May 28 2019, 1:37 PM
nahimilega edited the summary of this revision. (Show Details)
  • Add instance parameter according to unified credentials structure between listers.
nahimilega retitled this revision from Implemented CRAN Lister to CRAN Lister.Jun 3 2019, 4:10 PM
vlorentz added inline comments.
swh/lister/core/tests/conftest.py
14

Should be in alphabetical order

swh/lister/cran/lister.py
38–56 ↗(On Diff #5024)

Should be wrapped in triple backquotes. And it does not return JSON, it returns a list of dictionaries.

  • Change docstring and made conftest.py in core ordered alphabetically

The diffs looks fine, but I'd like to have confirmation the code works properly in real conditions (in a docker session) before we merge this.

This revision is now accepted and ready to land.Jun 6 2019, 2:51 PM
  • squashed git commits
  • rebased the branch on master
  • removed methods that were not in use but we written because of abstractattribute
  • Ran the lister in docker

I ran the lister in docker to test it, it ran without any problem
(the content of tcran.py is same 4 lines that are present in README of lister

(swh) archit@work-pc:~/swh-environment/swh-lister$ python tcran.py 
DEBUG:swh.lister.core.lister_base:Loading config from lister_cran
INFO:swh.core.config:Loading config file /home/archit/.config/swh/lister_cran.yml
DEBUG:swh.lister.core.lister_base:<swh.lister.cran.lister.CRANLister object at 0x7f03ea55c518> CONFIG={'content_size_limit': 104857600, 'log_db': 'dbname=softwareheritage-log', 'storage': {'cls': 'remote', 'args': {'url': 'http://localhost:5002/'}}, 'scheduler': {'cls': 'remote', 'args': {'url': 'http://localhost:5008/'}}, 'lister': {'cls': 'local', 'args': {'db': 'postgresql:///lister-cran'}}, 'credentials': [], 'cache_responses': True, 'cache_dir': '/home/archit/.cache/swh/lister/cran/'}
DEBUG:root:models: 10000
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5002
DEBUG:urllib3.connectionpool:http://localhost:5002 "POST /origin/add_multi HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5008
DEBUG:urllib3.connectionpool:http://localhost:5008 "POST /create_tasks HTTP/1.1" 200 1
DEBUG:root:models: 4372
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool:http://localhost:5002 "POST /origin/add_multi HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool:http://localhost:5008 "POST /create_tasks HTTP/1.1" 200 1

Here is one of the task that was created by the lister
https://forge.softwareheritage.org/P424

@douardda Can you please take a look at it and give me confirmation to merge it.

  • rebased the branch on master
This revision was automatically updated to reflect the committed changes.
swh/lister/cran/lister.py
101 ↗(On Diff #5200)

As mentioned in irc, for the sake of history.

Instead of overriding the ingest_data method by almost duplicating the original, this should have been best to override the method safely_issue_request (dropping the use of the identifier parameter).

That implementation would have been the current r_script_request.

This can still be improved at some point ;)

Cheers,