
Create a Lister for launchpad.net
Open, Normal, Public

Description

(Note: a branch on Launchpad plays the role that a fork plays on GitHub.)
Launchpad hosts two version control systems, Git and Bazaar; CVS and Subversion repositories can be imported. Git repositories can be fed directly to the git loader, but Bazaar repositories need a separate loader.

A Bazaar repository can be downloaded with a bzr command of this form:

bzr branch lp:<projectname>

(reference http://blog.launchpad.net/general/the-great-source-code-supermarket)

and a Git repository with a command of this form:

git clone https://git.launchpad.net/<projectname>

In Launchpad, every project has one main branch, called trunk, which is fetched with:

bzr branch lp:<projectname>

The rest are its branches, which are fetched with:

bzr branch lp:~<author.name>/<project.name>/<name>

To ingest all the code, we need to list all the branches of all the projects.
Launchpad provides an API which can be used to list all the projects and branches.

What should be the output of the lister?
For git projects, the output of the lister should be in this format: https://git.launchpad.net/<projectname>
For Bazaar repos, the output should be lp:x, where x is either <projectname> or ~<author.name>/<project.name>/<name>, depending on whether it is a project or a branch.
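To make the two output formats concrete, here is a minimal sketch of helpers that build them. The function names `git_origin` and `bzr_origin` are made up for this example; they are not part of any existing SWH or Launchpad code.

```python
def git_origin(project_name):
    """Clone URL for a git-hosted Launchpad project."""
    return f"https://git.launchpad.net/{project_name}"


def bzr_origin(project_name, author=None, branch_name=None):
    """Bazaar identifier: `lp:<project>` for trunk, or
    `lp:~<author>/<project>/<name>` for any other branch."""
    if author is None and branch_name is None:
        return f"lp:{project_name}"
    if not author or not branch_name:
        # A non-trunk branch needs both its owner and its name.
        raise ValueError("author and branch_name must be given together")
    return f"lp:~{author}/{project_name}/{branch_name}"
```

For example, `bzr_origin("foo")` gives `lp:foo`, and `bzr_origin("foo", "alice", "fix")` gives `lp:~alice/foo/fix`.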

Plan for the lister:
We can either use the API directly to list all the projects, or use the Python library launchpadlib, which lets you treat the HTTP resources published by Launchpad's web service as Python objects responding to a standard set of commands. Both can do the job.

To list all the branches of a project, we can use launchpadlib, as done in the first answer here: https://askubuntu.com/questions/262485/is-there-a-bzr-command-to-see-all-branches-of-a-project-on-launchpad.
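A minimal sketch of the launchpadlib route, assuming launchpadlib is installed and that anonymous, read-only access is sufficient. The helper names are hypothetical; `login_anonymously`, `projects[...]`, `getBranches()` and the `bzr_identity` attribute are used as I understand them from the Launchpad API, so please check them against the API documentation before relying on this.

```python
def connect_anonymously(consumer_name="swh-lister-sketch"):
    """Open a read-only session (needs launchpadlib and network access)."""
    # Imported lazily so the rest of the sketch works without launchpadlib.
    from launchpadlib.launchpad import Launchpad

    return Launchpad.login_anonymously(consumer_name, "production", version="devel")


def branch_identities(launchpad, project_name):
    """Return the `lp:...` identity of every Bazaar branch of a project."""
    project = launchpad.projects[project_name]
    return [branch.bzr_identity for branch in project.getBranches()]
```

Usage would be along the lines of `branch_identities(connect_anonymously(), "<projectname>")`.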
Alternatively, we could query the bare API directly:

https://api.launchpad.net/1.0/<project_name>?ws.op=getBranches
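A small sketch of how that URL could be built in code. The helper name `get_branches_url` is hypothetical, and the paging parameters `ws.start` and `ws.size` are an assumption about how the Launchpad web service pages its collections; they should be verified against the API documentation.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.launchpad.net/1.0"


def get_branches_url(project_name, start=0, size=None):
    """Build the getBranches URL for a project, optionally for a given page."""
    params = {"ws.op": "getBranches"}
    if start:
        params["ws.start"] = start  # assumed paging offset parameter
    if size is not None:
        params["ws.size"] = size  # assumed page size parameter
    return f"{API_ROOT}/{project_name}?{urlencode(params)}"
```

With no paging arguments, `get_branches_url("foo")` yields the same form as the URL above.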

Event Timeline

nahimilega triaged this task as Normal priority. May 22 2019, 7:57 PM
nahimilega created this task.
nahimilega created this object in space S1 Public.
nahimilega added a comment. Edited May 28 2019, 8:35 PM

Launchpadlib
Pros
The library is packaged in Debian stretch.
It is easier and faster to get all the branches of a project, as the library returns them in one go, whereas the bare API returns them page by page (indexed).

Cons
Error handling would be cumbersome.
incremental_lister would be somewhat harder to implement.

Bare Launchpad API
Pros
The IndexingHttpLister base class fits this work well, so most of the code already exists and the lister would be easier to implement.
Error handling is already provided by the base class.
Test cases are easy to write.
It would be quite similar to other listers such as GitHub, which keeps the code uniform.
It does not require any auth credentials.

Cons
It returns the branches of a project page by page (indexed).

As far as speed is concerned, I tried both. I did not time them, but both took almost the same time; the bare Launchpad API may be slightly faster because it returns five repos at a time, whereas the library returns only one.


In my view, we can combine the best of both options to make the lister.
We can use the bare API to list the projects and then use launchpadlib to get all the branches of each project.
This way, we get the indexing of the bare API and the simplicity of launchpadlib.

anonbnr added a subscriber: anonbnr. Jun 2 2019, 8:29 PM

Hello, we are a group of M1 computer science students of the University of Montpellier, France.

We designed a proper Launchpad Lister, but only for Git-based projects, since the majority use Bazaar, and a Bazaar loader isn't yet implemented for SWH. We're currently at the last stage of development (testing).
We followed the SWH documentation on implementing unit tests, and attempted to configure the testing environment properly by using mkvirtualenv and installing tox and pytest to automate the testing process.
However, while loading the testing version of the SWH packages, we keep getting the same error:

ERROR: swh-archiver[testing] should either be a path to a local project or a VCS url beginning with svn+, git+, hg+, or bzr+.

We looked at the swh-environment git log, and we saw that swh-archiver has been removed from the environment, as stated by the following commit message:

".mrconfig: Remove swh-archiver from swh-environment".

So now we're unable to actually run our unit tests.

On the other hand, I think we might have errors related to the configuration of PostgreSQL for inserting nodes into the database, as we end up with permission-related errors that we're unable to solve...

Finally, other errors related to our design might exist, but we didn't reach this stage yet.

While browsing for a solution to our problem, we stumbled upon this thread. We were very happy to notice a similar approach to the problem. So we contacted our project supervisor who advised us to get in contact with SWH and see if we can collaborate on the issue. We'd be happy to collaborate with you.

Would you like to take a look at our code?

olasd added a comment. Jun 3 2019, 6:34 PM

> Hello, we are a group of M1 computer science students of the University of Montpellier, France.

Hi and welcome to Software Heritage!

> We designed a proper Launchpad Lister, but only for Git-based projects, since the majority use Bazaar, and a Bazaar loader isn't yet implemented for SWH. We're currently at the last stage of development (testing).

Awesome!

> We followed the SWH documentation on implementing unit tests, and attempted to configure the testing environment properly by using mkvirtualenv and installing tox and pytest to automate the testing process.
> However, while loading the testing version of the SWH packages, we keep getting the same error:
>
> ERROR: swh-archiver[testing] should either be a path to a local project or a VCS url beginning with svn+, git+, hg+, or bzr+.
>
> We looked at the swh-environment git log, and we saw that swh-archiver has been removed from the environment, as stated by the following commit message:
>
> ".mrconfig: Remove swh-archiver from swh-environment".
>
> So now we're unable to actually run our unit tests.

Looks like you'll need to remove the swh-archiver directory from swh-environment, so that it doesn't get installed any more. This will probably fix that issue.

> On the other hand, I think we might have errors related to the configuration of PostgreSQL for inserting nodes into the database, as we end up with permission-related errors that we're unable to solve...
> Finally, other errors related to our design might exist, but we didn't reach this stage yet.
> While browsing for a solution to our problem, we stumbled upon this thread. We were very happy to notice a similar approach to the problem. So we contacted our project supervisor who advised us to get in contact with SWH and see if we can collaborate on the issue. We'd be happy to collaborate with you.
> Would you like to take a look at our code?

@nahimilega is one of our Google Summer of Code interns, and one of the things he had planned to work on was the Launchpad lister; it's perfectly fine that you've started work on this, we're of course happy to take all (constructive!) contributions.

When contributing to a software project it's usually a good idea to work out the design with the original authors before jumping right into coding. This gives you a better chance of getting your code accepted, and avoids potentially painful review round-trips due to design disagreements.

I suggest that you now submit the code you have written as a Phabricator diff (https://wiki.softwareheritage.org/wiki/Code_review_in_Phabricator), and follow up on this task with the design that you've chosen to implement the Launchpad lister. We can discuss whether the approach looks good or not, and then work on testing it.

For "developer support" questions like the swh-archiver issue or the PostgreSQL stuff, you can also join our IRC channel to get more interactive help (works better during European office hours). You'll want to submit full log traces of your issues (containing what command you've run and the full output), using a Paste.

Thank you for the reply. Before we submit the code, here is the design we propose for this issue, shown in the diagrams below:

This one describes the general behavior of the model.

This one highlights the position of the proposed lister in the SWH lister class hierarchy.

To explain the components introduced in the lister class hierarchy, here's a quick description of every added element:

LaunchpadProxy
  • wraps the methods provided by launchpadlib in the proxy;
  • anonymous login and indexed retrieval of projects;
  • selection of the projects managed with Git;
  • extraction of the Git repositories associated with the selected projects;
  • export/import of the repositories to/from JSON files.
ProxiedLister
  • abstract class designating a generalization of LaunchpadProxy;
  • to be extended by any Lister that uses a proxy when interacting with a platform's API.
WebApiProxy
  • interface designating a generalization of some of the features implemented by LaunchpadProxy: login, extraction of Git repositories, and their export/import;
  • to be implemented by any class modeling a proxy object between a Lister and a platform's API.

This one exhibits the specialization of the abstract attributes of the generic database model of a repository by our concrete lister database model.

The last one shows the mapping between the SWH database model of our lister and a git repository model as it is provided by the Launchpad API.

P.S. If you want us to provide an English translation of the diagrams and their corresponding explanation, please feel free to notify us.
We'll be waiting to hear your feedback. Thank you very much.

> P.S. If you want us to provide an English translation of the diagrams and their corresponding explanation, please feel free to notify us.

It would be really helpful if you could provide an English translation of the diagrams and their corresponding explanation.

General functionality of the lister


The general functionality of the LaunchpadGitLister, as we defined it, is a hybrid approach combining:
1- a bare API approach: sending HTTP GET requests to the Launchpad API to retrieve responses containing indexed JSON collections of all git-based Launchpad projects;
2- a delegation by the lister to a proxy (launchpadlib) to retrieve the software origins (i.e. git repositories) associated with the retrieved collections of git-based projects. The delegation consists of invoking methods of the launchpadlib Python library to retrieve the git repos directly as Python objects, map them to the SWH data model, and delegate the planning of the corresponding loading tasks to the scheduler.

The LaunchpadGitLister is therefore an SWHIndexingHTTPLister (since it uses an indexing scheme and HTTP as a transport protocol). This is realized by the bare API approach.
On the other hand, since Launchpad has an official open-source client for its API (launchpadlib), we can delegate the extraction process to it, so that an instance of this client behaves as a proxy between the LaunchpadGitLister and the Launchpad API. This is realized by the second step above.
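As a rough sketch of these two steps (not our actual code): the first function walks the indexed JSON pages, and the second keeps the git-based projects and hands each one to the proxy. The names `iter_entries`, `list_git_origins`, `fetch_json` and `repositories_for` are hypothetical, and the `entries`, `next_collection_link` and `vcs` fields are assumptions about the shape of the Launchpad API responses.

```python
def iter_entries(first_url, fetch_json):
    """Yield every entry of a paged collection, following the next-page links.

    `fetch_json` is any callable that maps a URL to the decoded JSON page.
    """
    url = first_url
    while url:
        page = fetch_json(url)
        yield from page.get("entries", [])
        url = page.get("next_collection_link")  # absent on the last page


def list_git_origins(first_url, fetch_json, repositories_for):
    """Step 1: page through projects; step 2: delegate to the proxy.

    `repositories_for` stands in for the launchpadlib-backed proxy and
    returns the git origins of one project, given its name.
    """
    for project in iter_entries(first_url, fetch_json):
        if project.get("vcs") != "Git":
            continue  # only git-based projects are handled for now
        yield from repositories_for(project["name"])
```

Passing the fetcher and the proxy as arguments keeps the sketch testable without network access.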

To generalize this design and make it extensible, we decided to introduce the notion of a "ProxiedLister" that uses a "WebApiProxy" to which it delegates specific listing sub-tasks. LaunchpadGitLister is therefore a ProxiedLister as well, and the launchpadlib proxy, encapsulated in the LaunchpadProxy class, is a WebApiProxy.


So to recap:

WebApiProxy

  1. an interface containing all the methods that any class modeling a Web Api Proxy, in the context of SWH, must implement.
  2. methods include :
    • logging in to the API;
    • selecting the git-based launchpad projects;
    • extracting the software origins (Git repos) associated with the selected git-based launchpad projects;
    • exporting/importing the retrieved origins as JSON files/objects.
  3. note: this interface could be further extended and refactored later on when the Bazaar loader is up and running, and upon finding a new platform that allows us to use this approach.

LaunchpadProxy

  1. a wrapper class that encapsulates the launchpadlib proxy and implements WebApiProxy.
  2. it implements the interface methods listed above using methods defined by launchpadlib.

ProxiedLister

  1. an abstract class designating any lister that delegates specific listing sub-tasks to a WebApiProxy.
  2. note: this class is currently empty, but could easily be used to refactor code and provide default implementations of the listing methods used by proxied listers.
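To illustrate the recap, here is one possible shape for these classes, written with Python's `abc` module. It is only a sketch under assumptions: the method names are invented for this example, export/import get a default JSON implementation purely to keep it short, and the `list_origins` method on ProxiedLister is a guess at what a default implementation could look like (the class is currently empty in our design).

```python
import json
from abc import ABC, abstractmethod


class WebApiProxy(ABC):
    """Operations a lister may delegate to a platform API client."""

    @abstractmethod
    def login(self):
        """Open a session with the platform API."""

    @abstractmethod
    def select_git_projects(self, projects):
        """Return only the projects managed with Git."""

    @abstractmethod
    def extract_origins(self, projects):
        """Return the Git repositories (origins) of the given projects."""

    def export_origins(self, origins, path):
        """Write the origins to a JSON file."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(list(origins), f)

    def import_origins(self, path):
        """Read origins back from a JSON file."""
        with open(path, encoding="utf-8") as f:
            return json.load(f)


class ProxiedLister(ABC):
    """A lister that delegates listing sub-tasks to a WebApiProxy."""

    def __init__(self, proxy):
        self.proxy = proxy

    def list_origins(self, projects):
        self.proxy.login()
        selected = self.proxy.select_git_projects(projects)
        return self.proxy.extract_origins(selected)
```

A LaunchpadProxy would then subclass WebApiProxy and implement the three abstract methods with launchpadlib calls.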

Data model of the listed software origins


The data model for the software origins of Launchpad git-based projects, LaunchpadGitModel, specializes the abstract attributes of the IndexingModelBase database table (given that Launchpad provides an indexing scheme for the responses returned by its API) in the following manner:

  1. the uid field data type is specialized as a string designating the "unique_name" property introduced by Launchpad to identify the origin.
  2. the indexable field data type is specialized as a string also designating unique_name. This specialization, however, is up for discussion, since we think that using unique_name for incremental and/or range listing would not work well.
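To show why we have doubts about unique_name as the indexable field, here is a small sketch. `LaunchpadGitOrigin` is a plain dataclass standing in for the real database model (it is not the actual LaunchpadGitModel), `listed_after` is a hypothetical range filter, and the two repositories are invented data. The `unique_name` and `git_https_url` keys are assumed to match the fields of a git repository as returned by the Launchpad API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchpadGitOrigin:
    uid: str
    indexable: str
    origin_url: str

    @classmethod
    def from_api(cls, repo):
        """Map one API repository entry; uid and indexable are both unique_name."""
        return cls(
            uid=repo["unique_name"],
            indexable=repo["unique_name"],
            origin_url=repo["git_https_url"],
        )


def listed_after(origins, last_indexable):
    """Incremental listing: keep origins whose indexable sorts after the last one seen."""
    return [o for o in origins if o.indexable > last_indexable]


# Invented data: a repository created after the previous run, whose name
# happens to sort before the last indexable value that run recorded.
last_seen = "~bob/proj/+git/proj"
created_later = LaunchpadGitOrigin.from_api({
    "unique_name": "~alice/tool/+git/tool",
    "git_https_url": "https://git.launchpad.net/~alice/tool/+git/tool",
})
missed = listed_after([created_later], last_seen) == []
```

Here `missed` is true: names do not sort in creation order, so an incremental run keyed on unique_name would skip the new repository.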

The actual mapping between the SWH data model of the origin and the Launchpad model of the origin is depicted in the figure below.