@douardda We have a separate task (T1777) for the rubygem lister. I guess we can add a separate column to the table by @anlambert showing the status of the lister implementation and the tasks related to it.
Sep 13 2019
Sep 12 2019
Here is an interesting update on the issue of listing Maven Central. Great people at the FASTEN EU project are analyzing software dependencies and for that they are working on a tool to download projects from various sources, including Maven.
The tool is here: https://github.com/fasten-project/source-populate
It appears to be more about downloading a known project source than listing the content of a repository, but we could try and share efforts in this space.
In T1389#27957, @anlambert wrote:
Package manager: Rubygems
Packages listing url: No public api endpoint available
Package metadata url: https://rubygems.org/api/v2/rubygems/[package]/versions/[version].json
Package source tarball url: available in the metadata
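A minimal sketch of how the Rubygems metadata URL pattern quoted above could be filled in for a given package and version. The function name and the percent-encoding of the two placeholders are assumptions made for illustration; only the URL pattern itself comes from the table.

```python
from urllib.parse import quote

# URL pattern taken from the table; [package] and [version] become placeholders.
METADATA_URL_TEMPLATE = (
    "https://rubygems.org/api/v2/rubygems/{package}/versions/{version}.json"
)


def gem_metadata_url(package: str, version: str) -> str:
    """Build the metadata URL for one release of a gem (hypothetical helper)."""
    return METADATA_URL_TEMPLATE.format(
        # Percent-encode both parts so unusual characters cannot break the path.
        package=quote(package, safe=""),
        version=quote(version, safe=""),
    )
```

For example, `gem_metadata_url("rails", "5.2.3")` yields the metadata URL for that release; the source tarball URL would then be read from the returned metadata, as the table notes.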
Sep 11 2019
how many are left?
Sep 10 2019
I think we're done with this now and everything is running in production.
This is now running in production.
I've rescheduled the tasks for the repositories that had not been loaded. We'll need to follow up separately on failing tasks.
Sep 3 2019
The "standard" listing (output recurring tasks with no priority) ran:
Sep 2 2019
add it to our crawler rotation
Aug 29 2019
A first round has been done:
admin contact
Do we have an admin contact there, to make sure that cloning all their repos at once will not kill their infra?
Aug 27 2019
$ curl --head https://gitlab.ow2.org/api/v4/projects
HTTP/2 200
server: nginx
date: Tue, 27 Aug 2019 10:21:30 GMT
content-type: application/json
content-length: 19658
vary: Accept-Encoding
cache-control: no-cache
link: <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=2&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next", <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=1&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="first", <https://gitlab.ow2.org/api/v4/projects?membership=false&order_by=created_at&owned=false&page=51&per_page=20&simple=false&sort=desc&starred=false&statistics=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="last"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-next-page: 2
x-page: 1
x-per-page: 20
x-prev-page:
x-request-id: a8aqxWitUT6
x-runtime: 1.180268
x-total: 1003
x-total-pages: 51
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
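The link header in the response above is what a lister would follow to walk all 51 pages. A minimal sketch of that pagination loop, assuming a caller-supplied `fetch` callable (hypothetical, not part of any existing lister) that returns the decoded items of a page together with the raw link header value:

```python
import re
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Matches one `<url>; rel="name"` entry of an HTTP Link header.
_LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_link_header(value: str) -> Dict[str, str]:
    """Map each rel name (next, first, last, ...) to its URL."""
    return {rel: url for url, rel in _LINK_RE.findall(value)}


def iter_projects(
    first_url: str,
    fetch: Callable[[str], Tuple[List[dict], Optional[str]]],
) -> Iterator[dict]:
    """Yield every item, following rel="next" until it is absent."""
    url: Optional[str] = first_url
    while url:
        items, link_header = fetch(url)
        yield from items
        url = parse_link_header(link_header or "").get("next")
```

Keeping the HTTP call behind `fetch` leaves rate limiting and retries to the caller, which matters given the concern about not overloading the instance.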
Aug 21 2019
The mercurial loader and the bitbucket lister have been running all summer.
Given the recent announcement by bitbucket about dropping mercurial support, the priority of this task has just increased.
Jul 19 2019
Jul 16 2019
Jul 9 2019
General functionality of the lister
The general functionality of the LaunchpadGitLister, as we defined it, is a hybrid approach combining:
1- a bare API approach consisting of sending an HTTP GET request to the Launchpad API to retrieve responses containing indexed JSON collections of all git-based launchpad projects
2- a delegation by the lister to a proxy (launchpadlib) to retrieve the corresponding software origins (i.e. git repositories) associated with the retrieved JSON collections of launchpad git-based projects. The delegation consists of invoking python methods defined for the launchpadlib python library to directly retrieve the git repos as python objects, map them accordingly to the data model of SWH, and delegate the planning of the corresponding loading tasks to the scheduler.
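The second step described above (mapping retrieved repositories to loading tasks) could look roughly like the sketch below. It does not call launchpadlib: `RepoRecord` is a hypothetical stand-in for the objects the library would return, and the attribute name `git_https_url`, the task type `load-git`, and the task dictionary layout are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class RepoRecord:
    """Hypothetical stand-in for a git repository object from the proxy."""
    git_https_url: Optional[str]  # assumed attribute name


def to_load_task(repo: RepoRecord) -> Dict[str, object]:
    """Map one repository to a loading task description (assumed layout)."""
    return {"type": "load-git", "arguments": {"url": repo.git_https_url}}


def plan_tasks(repos: Iterable[RepoRecord]) -> List[Dict[str, object]]:
    """Build one task per repository that exposes a clone URL."""
    return [to_load_task(r) for r in repos if r.git_https_url]
```

The resulting list is what would be handed to the scheduler; repositories without a clone URL are skipped rather than scheduled.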
Jul 8 2019
P.S. If you want us to provide an English translation of the diagrams and their corresponding explanation, please feel free to notify us.
Jul 7 2019
Jul 3 2019
Jul 2 2019
Thank you for the reply. Before submitting the code, the design we propose for this issue is represented in the diagrams below :
Extending the plan by @olasd, here are some of my thoughts on the implementation of the base loader.
Jul 1 2019
ingestion in progress: 374/1021 done.
Listing is done.
Jun 30 2019
Heads up, listing status for all instances (including the newly added ones):
Jun 29 2019
Jun 28 2019
Same as gnu-savannah, instance listed.
Now waiting for scheduling to kick in...
Listing is done.
So now let's wait for the scheduling of those new origins to kick in.
By the way, listing tor's git repositories instance is within the scope of T1835.
got it, holding off. i'll let you handle this from here on! keep in mind that tor might switch to gitlab in the future, so might have to redo that process eventually.
Jun 17 2019
In T1389#33215, @zack wrote:
Thanks @olasd, @ardumont, and @anlambert for this, it's a great plan and I like it a lot !
Just a few comments on the sidelines:
The lister will generate a one-shot task to load each package for the given repository, with the full information needed to do the data fetching.
This seemed clear from a different part of the description, but fwiw here I'm assuming the plan is to only load the version of the packages not already known/ingested in the past.
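A minimal sketch of the behaviour assumed in the comment above: the lister emits a one-shot task only for package versions that were not ingested before. The function name, the task type `load-package`, and the task layout are hypothetical; how the set of known versions is obtained is left out.

```python
from typing import Dict, Iterable, List


def one_shot_tasks(
    package: str,
    available_versions: Iterable[str],
    known_versions: Iterable[str],
) -> List[Dict[str, object]]:
    """Return one task per version that has not been ingested yet."""
    known = set(known_versions)
    return [
        {
            "type": "load-package",  # assumed task type name
            "policy": "oneshot",
            "arguments": {"name": package, "version": version},
        }
        for version in available_versions
        if version not in known
    ]
```

Running the lister twice with an updated `known_versions` then produces tasks only for releases published in between.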
Jun 15 2019
I did a bit of investigation on the data dumps, and it seems they can serve the purpose well.
To get a package release, we mainly need the name and version of the package. The link can therefore be generated as -
Jun 14 2019
Jun 12 2019
got it, holding off. i'll let you handle this from here on! keep in mind that tor might switch to gitlab in the future, so might have to redo that process eventually.
btw, the list is ~400 repos for now
@anarcat please hold off from using "save code now" for the time being. As we're planning to have a proper cgit lister, we can just add your instance to our rotation once that's done (unless this is super urgent, that is). That will have the additional advantage that we will automatically notice when new repos show up.
so the clone URLs are not exactly the same as the "gitweb" (AKA cgit) repo, so this requires further hacking... i tried this:
Thanks @olasd, @ardumont, and @anlambert for this, it's a great plan and I like it a lot !
I looked into the data dumps provided on rubygems.org.
A bash script (link to the script) is provided by rubygems.org; it downloads the most recent weekly dump listed on https://rubygems.org/pages/data and loads it into a PostgreSQL database.
Jun 11 2019
Jun 6 2019
In T1776#32932, @nahimilega wrote:
OK. Out of curiosity, would you be able to look at the location of dists instead? Just to get a sense of how much overlap there is with archives from GitHub.
I made a short script to analyse the location of dists for packages; here is the result.
Packages iterated - 15336 (~ 6% of total packages)
Number of packages whose dists were not hosted on GitHub, Bitbucket, or GitLab: 2
VCS for dists: zip, git
Jun 5 2019
On further investigation, I found out there are data dumps provided on rubygems.org
https://rubygems.org/pages/data
This could be used to get the list of all the packages.
Jun 4 2019
OK. Out of curiosity, would you be able to look at the location of dists instead? Just to get a sense of how much overlap there is with archives from GitHub.
In T1776#32914, @nahimilega wrote:
In T1776#32892, @olasd wrote:
In T1776#32739, @nahimilega wrote:
There is a total of 220570 packages.
I made a short script to analyse the VCS and code hosting platform for packages, here is the result.
Packages iterated - 15126 (~ 6% of total packages)
VCS found - git, hg
Number of packages that were not hosted on GitHub, Bitbucket, or GitLab: 51
There are expected to be around 744 packages that are not hosted on GitHub, Bitbucket, or GitLab if we extrapolate from the above data using the unitary method.
In your analysis, did you look at the source key (which seems to represent the upstream version control repository) or the dist key (which points at the tarball/zipfile that's actually downloaded by the package manager when installing that version of the package)?
I looked at the source key in my analysis.
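The estimate of around 744 packages quoted above is a proportional (unitary method) extrapolation from the sample to the full package count. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def extrapolate(sample_hits: int, sample_size: int, population: int) -> int:
    """Scale a count observed in a sample up to the whole population."""
    return round(sample_hits / sample_size * population)


# 51 of the 15126 sampled packages were hosted elsewhere; 220570 packages exist.
estimate = extrapolate(51, 15126, 220570)
```

This assumes the iterated packages are representative of the whole index, which the thread does not establish.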
We have provisions in the scheduler API to deduplicate tasks, so that creating a new git loading task for an existing origin wouldn't duplicate the task but rather just reference the existing one.
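To illustrate the deduplication described above, here is a toy in-memory model. It is not the scheduler API: the class, the method name, and the choice of (task type, origin URL) as the deduplication key are assumptions for the sake of the example.

```python
import itertools
from typing import Dict, Tuple


class InMemoryScheduler:
    """Toy model: creating a task twice returns the existing task id."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self._by_key: Dict[Tuple[str, str], int] = {}

    def create_task(self, task_type: str, origin_url: str) -> int:
        key = (task_type, origin_url)  # assumed deduplication key
        if key not in self._by_key:
            self._by_key[key] = next(self._next_id)
        return self._by_key[key]
```

With this behaviour a package lister can submit a git loading task for an origin already known from another lister without creating a second task.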
Once we have the (released) dists figured out, we can also consider -as a second step- having the packagist lister submit tasks for the upstream repositories mentioned in the source key as well, as an extra data source.
I think there's value in making the Maven lister support more Maven repositories than just Maven Central, even if we focus on Maven Central as the first proof of concept.
makes sense. IMHO, it's good to have one issue per repository, since every repository will have its specific topics
Jun 3 2019
One of the biggest challenges in implementing this lister is deduplicating the links it lists: most of the packages are hosted on GitHub, Bitbucket, or GitLab, and a large portion of those sites is already listed and ingested.
Hence, while listing all the packages in Packagist, we need to remove the packages that are already listed by another lister.
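One rough way to approach this is to split the listed repository URLs by hosting platform, so that only origins outside the already covered forges are handled by the new lister. The sketch below is an illustration under stated assumptions: the host names in `KNOWN_FORGES` and both function names are hypothetical, and being hosted on a covered forge does not by itself prove that a given repository has been ingested.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse

# Assumed host names for the forges mentioned in the comment.
KNOWN_FORGES = {"github.com", "bitbucket.org", "gitlab.com"}


def is_on_known_forge(url: str) -> bool:
    """True if the URL's host is a known forge or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == forge or host.endswith("." + forge) for forge in KNOWN_FORGES)


def split_by_forge(urls: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (urls on known forges, remaining urls), dropping exact repeats."""
    covered: List[str] = []
    remaining: List[str] = []
    seen = set()
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        (covered if is_on_known_forge(url) else remaining).append(url)
    return covered, remaining
```

Only the `remaining` list would need new loading tasks from this lister; the `covered` list could be cross-checked against existing origins.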
In T1734#32762, @anonbnr wrote:
Hello, we are a group of M1 computer science students of the University of Montpellier, France.
Thanks a lot to @hboutemy for your valuable insights on sources in the Maven central repository, and for the pointer to Reproducible Builds on the JVM.
Jun 2 2019
if you search for source-release or src, yes, you'll find archivable sources
the only issue is that you'll not find much content...
Hello, we are a group of M1 computer science students of the University of Montpellier, France.
I renamed the issue title to "Maven Central repository Lister" if the intent is to focus on this repository https://maven.apache.org/repository/index.html