- User Since
- Mar 10 2019, 8:07 PM (10 w, 4 d)
- Updated doc string
Wed, May 22
Comment by @olasd
- the repository format is "Maven"; "Maven Central" is only one instance of a Maven repository. There are many more public Maven repositories that would be useful to index, for instance Clojars or the Google Android Maven repo: https://www.deps.co/guides/public-maven-repositories/. You'll need to rename the lister to "maven", and modify the code to avoid hard-coding the Maven repository root, making it an argument to the task instead (as we will want to list projects for several instances).
- you went for a scraping approach, which is fine as a last resort. However, a quick search for "maven central index" brought up https://maven.apache.org/repository/central-index.html. Looks like these indexes are available to allow importing the full
It looks like these indexes are available at least for the following maven repositories :
- Maven Central : https://repo.maven.apache.org/maven2/.index/
- Clojars : https://repo.clojars.org/.index/
- JBoss : https://repository.jboss.org/nexus/content/repositories/releases/.index/nexus-maven-repository-index.gz (there's no file index in the .index directory, but the expected files are there)
The index also provides an incremental version (referenced in a properties file) which would allow for incremental updates without having to re-download the full index.
The Google repo also has an index https://developer.android.com/studio/build/dependencies.html#google-maven but it looks very different from the other maven repos I've found. However, it's fairly small compared to the others, so it shouldn't be too hard to sort it out as well.
Please investigate the format of these repository indexes, and the data they provide, and see whether that would be suitable for use as the data source for the lister.
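A minimal sketch of the "repository root as an argument" idea from the review above, assuming every repository exposes its full index under the same relative path. The file name is taken from the JBoss URL quoted above; whether the other repositories use exactly the same name is an assumption, and `index_url` is a hypothetical helper, not part of any existing lister.

```python
from urllib.parse import urljoin

# Relative path of the full index. The file name comes from the JBoss URL
# quoted in the review; it is assumed to be the same for other repositories.
INDEX_FILE = ".index/nexus-maven-repository-index.gz"


def index_url(repository_root: str) -> str:
    """Return the full-index URL for a given Maven repository root."""
    # urljoin drops the last path segment unless the base ends with "/".
    if not repository_root.endswith("/"):
        repository_root += "/"
    return urljoin(repository_root, INDEX_FILE)


# Repository roots listed in the review comment.
REPOSITORY_ROOTS = [
    "https://repo.maven.apache.org/maven2/",
    "https://repo.clojars.org/",
    "https://repository.jboss.org/nexus/content/repositories/releases/",
]

for root in REPOSITORY_ROOTS:
    print(index_url(root))
```

Passing the root in this way would let one lister task be scheduled once per Maven instance instead of hard-coding Maven Central.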
Thanks for the heads-up, I didn't know about this. I will go through the repository indexes and the data they provide, and report back to you as soon as possible.
For the listing task, this API can do the work.
- Change variable names according to convention.
- Add GNU lister in README and cli.py
- Add functions required by the abstract methods
@douardda Thanks for helping me improve my commit messages.
One thing I was wondering, though: before landing the diff, we usually squash all the commits into a single one. So why do we need to follow strict guidelines for commit messages while the diff is still being improved? In the end, they are all going to be squashed into one commit.
- Improve commit messages by using imperative form
Tue, May 21
- Added the functions that must be present because of @abc.abstractmethod
- Added Maven Central lister in README.md and cli.py
- Added rcran lister in readme and cli.py
- Changed print to stdout in R script and improved commit messages
Removed useless commit messages
Mon, May 20
Browser - Chromium
OS - Ubuntu 18.04
Zoom level in chromium - 100%
Screen Resolution - 1920 * 1080
Sun, May 19
Fri, May 17
And these will be passed to loaders
Not loaders, only 1 loader, the gnu one.
I sense a communication gap; I need to state more clearly what I am thinking of doing so that you can help me more effectively. Please correct me if I am wrong somewhere or if there is a better method.
I think it would be great if loader-tar could be moved to the core.
Thu, May 16
As suggested by @olasd, what was done in 2015 to ingest packages -
- Create origins for all the folders indiscriminately
- Only import things that look like tarballs (i.e. that end with .tar.something)
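The "looks like a tarball" rule above can be sketched as a simple name filter. This is only an illustration of the `.tar.something` criterion; the function names are hypothetical and do not come from the 2015 code.

```python
import re

# Matches names ending in ".tar.<something>", e.g. "foo-1.0.tar.gz".
# A bare ".tar" without a further extension is deliberately not matched,
# following the ".tar.something" wording above.
TARBALL_RE = re.compile(r"\.tar\.[A-Za-z0-9]+$")


def looks_like_tarball(filename: str) -> bool:
    """Return True if the file name ends with .tar.<extension>."""
    return TARBALL_RE.search(filename) is not None


def select_tarballs(filenames):
    """Keep only the file names that look like tarballs, preserving order."""
    return [name for name in filenames if looks_like_tarball(name)]


print(select_tarballs(["a-1.0.tar.gz", "README", "b.zip", "c.tar.xz"]))
```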
@olasd recommended trying the listing approach we discussed for the NuGet lister (fetching the repository key from the API response). As recommended, I tried the approach on a small dataset of 1412 repositories, all of them quite recent. I found 0 repository URLs in them, and in 900 of them the repository key was empty (i.e. a blank string). I think we need to change our approach.
As discussed on IRC, the source code link is present in very few of the repositories, and the version control system used by the repositories is not mentioned in the API response.
One option: the repository URL and the repository type fields are present in the .nuspec file of each project, so we could download that file for each project and get the source URL. The problem is that downloading all binary packages for a small chance of finding a link to a source repository sounds like a lot of work, bandwidth and computing power for not much gain. It would also only cover one of the ways package maintainers can set the source code information; the aforementioned blog post listed at least four
Thanks, @anlambert, for your help and guidance. As it was my first lister, I would never have been able to complete it without your help. Your review helped me make this lister more robust and also helped me understand the basics of listers.
Once again, thanks for your patience and guidance.
Wed, May 15
@anlambert As you recommended in your previous comment, I have added the function filter_before_inject() to the lister to remove None from the list.
made all the changes recommended
Removed None from the final list
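A minimal sketch of the filtering step mentioned above, written as a standalone function. The real `filter_before_inject()` is a lister method; its exact signature is not shown in this thread, so the single `models_list` parameter here is an assumption.

```python
def filter_before_inject(models_list):
    """Drop None entries so that only usable models are injected.

    Standalone illustration; the parameter name is an assumption.
    """
    return [model for model in models_list if model is not None]


models = [{"uid": "1"}, None, {"uid": "2"}]
print(filter_before_inject(models))
```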
API Documentation -
@olasd I am not familiar with the R language. Learning the basics and writing this script would take me around a week. I was wondering whether someone at Software Heritage who has some experience with R could write this script, as it would be a matter of minutes for a person who knows R.
Is it possible to do so?
Tue, May 14
- Updated README according to new standard
Mon, May 13
Here is an implementation plan for making R-CRAN lister.
I have taken inspiration from the pypi lister.
To make lister.py for R-CRAN, we need to inherit from the SimpleLister class, override the ingest_data() function, and change its first line (where safely_issue_request() is called) to call a function that runs the R script and returns a JSON response.
After that, it is handled like any normal response; we just need to implement the following functions: list_packages, compute_url, get_model_from_repo, task_dict and transport_response_simplified.
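The "run an R script and read its JSON output" step of the plan could look roughly like the sketch below. It does not use the SimpleLister class; `run_json_command` and the script name in the comment are hypothetical, and the `Package`/`Version` keys and the tarball URL pattern are assumptions about what the R script would print and how CRAN lays out source packages. The demo uses the Python interpreter as a stand-in for `Rscript` so that it runs without R installed.

```python
import json
import subprocess
import sys


def run_json_command(command):
    """Run an external command and parse its standard output as JSON."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def compute_url(package):
    """Build a source tarball URL from a package record.

    Assumes records carry "Package" and "Version" keys and that CRAN
    serves current sources as <name>_<version>.tar.gz under src/contrib.
    """
    return "https://cran.r-project.org/src/contrib/{}_{}.tar.gz".format(
        package["Package"], package["Version"]
    )


# With R installed, the command would be something like
# ["Rscript", "list_all_packages.R"] (hypothetical script name).
# The stand-in below prints a JSON list the same way such a script might.
FAKE_SCRIPT = 'import json; print(json.dumps([{"Package": "abc", "Version": "1.0"}]))'
DEMO_COMMAND = [sys.executable, "-c", FAKE_SCRIPT]

for package in run_json_command(DEMO_COMMAND):
    print(package["Package"], compute_url(package))
```

Keeping the command as a parameter makes the JSON-parsing path testable without depending on an R installation.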
- Updated testcase in phabricator lister
@faux on IRC mentioned that there is a public DB dump (https://cran.r-project.org/web/dbs) which might be helpful for the purpose.
This DB dump contains files with the .rds extension, which is used by the R language. Here are a couple of rows from that DB dump: https://forge.softwareheritage.org/P396
- Fixed a typo in phabricator lister
- Made phabricator lister robust
Sun, May 12
@anlambert As you recommended I tested the lister on multiple forges. Some of the repos where it failed to fetch URL are -
- Updated cli and made phabricator more robust
Fri, May 10
@anlambert I will test my lister on various Phabricator instances and fix the bug. I will also update the README according to your recommendation. Thanks for your feedback.
- Fixed a typo
- Updated cli and readme for phabricator lister
Thu, May 9
- Made test cases for priority url selector
Tue, May 7
- Made improvements in code quality
Thanks, @vlorentz and @anlambert. I will implement these changes and submit the diff ASAP.
I have one more question. I do not have much experience in writing test cases. Could you please advise me or recommend a source I can refer to (maybe a lister which is already implemented) to get an idea of how to write test cases that validate the repository URL extraction approach?
- Added priority url selector in phabricator lister
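A rough sketch of what a priority URL selector could look like, assuming the preference is expressed as an ordered list of URL schemes. The order used here (https, http, git, ssh) is purely an illustrative assumption, not the order chosen in the lister, and `select_url` is a hypothetical name.

```python
from urllib.parse import urlparse

# Assumed preference order, most preferred first. Illustrative only.
SCHEME_PRIORITY = ["https", "http", "git", "ssh"]


def select_url(urls):
    """Return the URL whose scheme ranks highest, or None if there is none."""

    def rank(url):
        scheme = urlparse(url).scheme
        if scheme in SCHEME_PRIORITY:
            return SCHEME_PRIORITY.index(scheme)
        # Unknown or missing schemes are ranked after all known ones.
        return len(SCHEME_PRIORITY)

    candidates = [url for url in urls if url]
    return min(candidates, key=rank) if candidates else None


print(select_url([
    "ssh://git@example.org/diffusion/X/x.git",
    "https://example.org/diffusion/X/x.git",
]))
```

A selector shaped like this is easy to cover with test cases: each case is a list of candidate URLs and the one expected to win.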
Sat, May 4
@anlambert I have made all the recommended changes in the code. Could you please review it once?
- Made phabricator lister robust
Thanks, @anlambert, for your reply. I am facing one more issue: how should I decide the priority order of the URIs? Is there any important point I should take into consideration while deciding the priority order?
@anlambert I am not able to understand the difference between raw URI, display URI, effective URI and normalized URI. Could you please help me with this and explain the difference between these 4 types of URI?
Fri, May 3
Thank you, @anlambert, for your in-depth review and for testing the code. I will update my code according to your review as soon as possible. Thanks again for such detailed comments; they help me a lot in solving the problem.
Thu, May 2
@anlambert Can you please review the code and suggest improvements?
- Fixed test task of phabricator lister
Sun, Apr 28
- Changed indexable value in phabricator lister
Apr 23 2019
- fixed conversion of datatype
- Fixed null exception in shortName
- Changed indexable and uid type to string
- Changed bad response for phabricator lister