
Create a Lister for launchpad.net
Open, NormalPublic

Description

(Note: as fork is to Github, branch is to Launchpad)
Launchpad hosts two types of version control system (Git and Bazaar); CVS and Subversion repositories can also be imported. Git repositories can be fed directly to the git loader, but Bazaar repositories need a separate loader.

A Bazaar repository can be downloaded with a bzr command of this form:

bzr branch lp:<projectname>

(reference http://blog.launchpad.net/general/the-great-source-code-supermarket)

and for git it is of the format

git clone https://git.launchpad.net/<projectname>

In Launchpad, every project has one main branch, called trunk, which is fetched with

bzr branch lp:<projectname>

The rest are its branches, which are fetched with

bzr branch lp:~<author.name>/<project.name>/<name>

To ingest all the code, we need to list all the branches of all the projects.
Launchpad provides an API which can be used to list all the projects and branches.

What should be the output of the lister?
For git projects, the output of the lister should be in this format: https://git.launchpad.net/<projectname>
For Bazaar repos, the output should be lp:x, where x is either ~<author.name>/<project.name>/<name> or <projectname>, depending on whether it is a branch or a project.
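
As a rough illustration of the output formats described above, the following Python sketch builds the origin string for each case. The function name lister_output and its parameters are hypothetical, chosen only for this example; they are not taken from any existing lister.

```python
from typing import Optional


def lister_output(vcs: str, project: str,
                  author: Optional[str] = None,
                  branch: Optional[str] = None) -> str:
    """Build the origin string the lister would emit (illustrative only)."""
    if vcs == "git":
        return f"https://git.launchpad.net/{project}"
    if vcs == "bzr":
        if author and branch:
            # A named branch of the project, owned by some author.
            return f"lp:~{author}/{project}/{branch}"
        # The project's main branch (trunk).
        return f"lp:{project}"
    raise ValueError(f"unsupported VCS: {vcs!r}")
```

For example, lister_output("bzr", "foo", "alice", "fix-1") gives lp:~alice/foo/fix-1, while lister_output("bzr", "foo") gives lp:foo.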

Plan to execute the lister:
To list all the projects, we can either use the API directly or use the Python library launchpadlib, which lets you treat the HTTP resources published by Launchpad's web service as Python objects responding to a standard set of commands. Both do the job well.

To list all the branches of a project, we can use launchpadlib, as done in the first answer here: https://askubuntu.com/questions/262485/is-there-a-bzr-command-to-see-all-branches-of-a-project-on-launchpad
Or we could use the bare API:

https://api.launchpad.net/1.0/<project_name>?ws.op=getBranches
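
A minimal sketch of the bare-API variant, assuming the response is a JSON collection with an entries list and an optional next_collection_link pointing at the next page (this is how I understand Launchpad's web service to represent batched collections, but please check against a real response). The helper names and the injected fetch callable are hypothetical; fetch stands in for an HTTP GET that returns the decoded JSON body.

```python
from typing import Any, Callable, Dict, Iterator, Optional
from urllib.parse import urlencode

API_ROOT = "https://api.launchpad.net/1.0"


def branches_url(project_name: str) -> str:
    """URL of the getBranches operation for one project."""
    return f"{API_ROOT}/{project_name}?{urlencode({'ws.op': 'getBranches'})}"


def iter_entries(fetch: Callable[[str], Dict[str, Any]],
                 url: str) -> Iterator[Dict[str, Any]]:
    """Yield every entry, following next_collection_link page by page."""
    next_url: Optional[str] = url
    while next_url:
        page = fetch(next_url)
        yield from page.get("entries", [])
        # Assumed field name; absent on the last page.
        next_url = page.get("next_collection_link")
```

Passing fetch in as a parameter keeps the paging logic testable without network access; in real use it could wrap any HTTP client that returns parsed JSON.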

Event Timeline

nahimilega triaged this task as Normal priority. May 22 2019, 7:57 PM
nahimilega created this task.
nahimilega created this object in space S1 Public.
nahimilega updated the task description. May 22 2019, 10:26 PM
This comment was removed by nahimilega.
nahimilega updated the task description. May 24 2019, 11:48 AM
nahimilega updated the task description.
nahimilega updated the task description. May 24 2019, 12:09 PM
nahimilega updated the task description. May 28 2019, 8:05 PM
nahimilega added a comment. Edited May 28 2019, 8:35 PM

Launchpadlib
Pros
The library is packaged in Debian stretch.
Getting all the branches of a project is easier and faster, as it returns them in one go, whereas the bare API returns them page by page (indexed).

Cons
Error handling would be cumbersome.
incremental_lister would be somewhat harder to implement.

Bare Launchpad API
Pros
The IndexingHttpLister base class is a good fit for this work, so most of the code is already present and the lister would be easier to implement.
Error handling is already provided by the base class.
Test cases can easily be written.
It would be quite similar to other listers, such as the GitHub one, which keeps the code uniform.
It does not require any auth credentials.

Cons
It returns the branches of a project page by page (indexed).

As far as speed is concerned, I tried both. I didn't time the responses, but both took almost the same time; the bare Launchpad API may be slightly faster because it returns five repos at a time, whereas the library returns only one.

nahimilega updated the task description. May 28 2019, 8:36 PM

In my view, we can combine the best of both options to make the lister.
We can use the bare API to list the projects and then use launchpadlib to get all the branches of each project.
This way, we get the indexing of the bare API and the simplicity of launchpadlib.

anonbnr added a subscriber: anonbnr. Jun 2 2019, 8:29 PM

Hello, we are a group of M1 computer science students of the University of Montpellier, France.

We designed a proper Launchpad lister, but only for Git-based projects: the majority of projects use Bazaar, and a Bazaar loader isn't yet implemented for SWH. We're currently at the last stage of development (testing).
We followed the SWH documentation concerning the implementation of unit tests, and attempted to configure the testing environment properly by using mkvirtualenv and installing tox and pytest to automate the testing process.
However, while loading the testing version of the SWH packages, we keep getting the same error:

ERROR: swh-archiver[testing] should either be a path to a local project or a VCS url beginning with svn+, git+, hg+, or bzr+.

We looked at the swh-environment git log and saw that swh-archiver has been removed from the environment, as stated in the following commit message:

".mrconfig: Remove swh-archiver from swh-environment".

So now we're unable to run our unit tests.

On the other hand, I think we might have errors related to the PostgreSQL configuration needed to insert nodes into the database, as we end up with permission-related errors that we're unable to solve...

Finally, other errors related to our design might exist, but we didn't reach this stage yet.

While browsing for a solution to our problem, we stumbled upon this thread. We were very happy to notice a similar approach to the problem. So we contacted our project supervisor who advised us to get in contact with SWH and see if we can collaborate on the issue. We'd be happy to collaborate with you.

Would you like to take a look at our code?

olasd added a comment. Jun 3 2019, 6:34 PM

> Hello, we are a group of M1 computer science students of the University of Montpellier, France.

Hi and welcome to Software Heritage!

> We designed a proper Launchpad lister, but only for Git-based projects: the majority of projects use Bazaar, and a Bazaar loader isn't yet implemented for SWH. We're currently at the last stage of development (testing).

Awesome!

> We followed the SWH documentation concerning the implementation of unit tests, and attempted to configure the testing environment properly by using mkvirtualenv and installing tox and pytest to automate the testing process.
> However, while loading the testing version of the SWH packages, we keep getting the same error:
>
> ERROR: swh-archiver[testing] should either be a path to a local project or a VCS url beginning with svn+, git+, hg+, or bzr+.
>
> We looked at the swh-environment git log and saw that swh-archiver has been removed from the environment, as stated in the following commit message:
>
> ".mrconfig: Remove swh-archiver from swh-environment".
>
> So now we're unable to run our unit tests.

Looks like you'll need to remove the swh-archiver directory from swh-environment, so that it doesn't get installed any more. This will probably fix that issue.

> On the other hand, I think we might have errors related to the PostgreSQL configuration needed to insert nodes into the database, as we end up with permission-related errors that we're unable to solve...
>
> Finally, other errors related to our design might exist, but we didn't reach this stage yet.
>
> While browsing for a solution to our problem, we stumbled upon this thread. We were very happy to notice a similar approach to the problem. So we contacted our project supervisor who advised us to get in contact with SWH and see if we can collaborate on the issue. We'd be happy to collaborate with you.
>
> Would you like to take a look at our code?

@nahimilega is one of our Google Summer of Code interns, and one of the things he had planned to work on was the Launchpad lister; it's perfectly fine that you've started work on this, we're of course happy to take all (constructive!) contributions.

When contributing to a software project it's usually a good idea to work out the design with the original authors before jumping right into coding. This gives you a better chance of getting your code accepted, and avoids potentially painful review round-trips due to design disagreements.

I suggest that you now submit the code you have written as a Phabricator diff (https://wiki.softwareheritage.org/wiki/Code_review_in_Phabricator), and follow up on this task with the design that you've chosen to implement the Launchpad lister. We can discuss whether the approach looks good or not, and then work on testing it.

For "developer support" questions like the swh-archiver issue or the PostgreSQL stuff, you can also join our IRC channel to get more interactive help (works better during European office hours). You'll want to submit full log traces of your issues (containing what command you've run and the full output), using a Paste.

Thank you for the reply. Before submitting the code, the design we propose for this issue is represented in the diagrams below :

This one describes the general behavior of the model.

This one highlights the position of the proposed lister in the SWH lister class hierarchy.

To explain the components introduced in the lister class hierarchy, here's a quick description of every added element :

LaunchpadProxy
  • wraps the methods provided by launchpadlib in the proxy;
  • anonymous login and indexed retrieval of projects;
  • selection of the projects managed with Git;
  • extraction of the Git repositories associated with the selected projects;
  • export/import of the repositories to/from JSON files.
ProxiedLister
  • abstract class denoting a generalization of LaunchpadProxy;
  • to be extended by any Lister that uses a proxy when interacting with a platform's API.
WebApiProxy
  • interface denoting a generalization of some of the functionality implemented by LaunchpadProxy: login, extraction of Git repositories, and their export/import;
  • to be implemented by any class modeling a proxy object between a Lister and a platform's API.

This one exhibits the specialization of the abstract attributes of the generic database model of a repository by our concrete lister database model.

The last one shows the mapping between the SWH database model of our lister and a git repository model as it is provided by the Launchpad API.

P.S. If you want us to provide an English translation of the diagrams and their corresponding explanation, please feel free to notify us.
We'll be waiting to hear your feedback. Thank you very much.

> P.S. If you want us to provide an English translation of the diagrams and their corresponding explanation, please feel free to notify us.

It would be really helpful if you can provide an English translation of the diagrams and their corresponding explanation.

General functionality of the lister


The general functionality of the LaunchpadGitLister, as we defined it, is a hybrid approach combining:
1- a bare API approach, which sends HTTP GET requests to the Launchpad API and retrieves responses containing indexed JSON collections of all git-based Launchpad projects;
2- a delegation by the lister to a proxy (launchpadlib) to retrieve the software origins (i.e. git repositories) associated with the retrieved JSON collections of git-based Launchpad projects. The delegation consists of invoking methods of the launchpadlib Python library to retrieve the git repos directly as Python objects, map them to the SWH data model, and delegate the planning of the corresponding loading tasks to the scheduler.

The LaunchpadGitLister is therefore an SWHIndexingHTTPLister (since it uses an indexing scheme and HTTP as a transport protocol). This corresponds to the bare API approach in step 1.
On the other hand, given that Launchpad has an official open-source client for its API (launchpadlib), we can delegate the extraction process to it, so that an instance of this client behaves as a proxy between the LaunchpadGitLister and the Launchpad API. This corresponds to step 2.

In order to generalize this design and make it extensible, we decided to introduce the notion of a "ProxiedLister" that uses a "WebApiProxy" to which it delegates specific listing sub-tasks. LaunchpadGitLister is therefore a ProxiedLister as well, and the launchpadlib proxy, encapsulated in the LaunchpadProxy class, is a WebApiProxy.


So to recap:

WebApiProxy

  1. an interface containing all the methods that any class modeling a Web Api Proxy, in the context of SWH, must implement.
  2. methods include :
    • logging in to the API;
    • selecting the git-based launchpad projects;
    • extracting the software origins (Git repos) associated with the selected git-based launchpad projects;
    • exporting/importing the retrieved origins as JSON files/objects.
  3. note: this interface could be further extended and refactored later on when the Bazaar loader is up and running, and upon finding a new platform that allows us to use this approach.

LaunchpadProxy

  1. a wrapper class that encapsulates the launchpadlib proxy and implements WebApiProxy.
  2. it provides an implementation of the aforementioned methods of the interface using methods defined by the launchpadlib.

ProxiedLister

  1. an abstract class designating any lister that delegates specific listing sub-tasks to a WebApiProxy.
  2. note : currently this class is empty, but could easily be used to refactor code and provide default implementations of the listing methods as seen by proxied listers.
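
To make the recap more concrete, here is a small Python sketch of how the two abstractions could fit together. It is not our actual code: the method names (login, select_git_projects, extract_origins, export_origins, import_origins, list_origins) are hypothetical, and the default list_origins implementation is only one example of what ProxiedLister could provide once it is no longer empty.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class WebApiProxy(ABC):
    """Interface for a proxy between a lister and a platform's API."""

    @abstractmethod
    def login(self) -> None:
        """Log in to the API."""

    @abstractmethod
    def select_git_projects(self) -> List[Any]:
        """Return the git-based projects."""

    @abstractmethod
    def extract_origins(self, projects: List[Any]) -> List[Dict[str, Any]]:
        """Return the software origins (git repos) of the given projects."""

    def export_origins(self, origins: List[Dict[str, Any]]) -> str:
        """Serialize origins as a JSON string."""
        return json.dumps(origins)

    def import_origins(self, payload: str) -> List[Dict[str, Any]]:
        """Load origins back from a JSON string."""
        return json.loads(payload)


class ProxiedLister(ABC):
    """Any lister that delegates listing sub-tasks to a WebApiProxy."""

    def __init__(self, proxy: WebApiProxy) -> None:
        self.proxy = proxy

    def list_origins(self) -> List[Dict[str, Any]]:
        # Possible default implementation: log in, select, extract.
        self.proxy.login()
        return self.proxy.extract_origins(self.proxy.select_git_projects())
```

A LaunchpadProxy would then subclass WebApiProxy and implement the three abstract methods with launchpadlib calls, and LaunchpadGitLister would subclass ProxiedLister.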

Data model of the listed software origins


The data model for the software origins of Launchpad git-based projects, LaunchpadGitModel, specializes the abstract attributes of the IndexingModelBase database table (given that Launchpad's API provides an indexing scheme for its responses) as follows:

  1. the uid field is specialized as a string holding the "unique_name" property that Launchpad uses to identify the origin.
  2. the indexable field is specialized as a string that also holds unique_name. This choice, however, is up for discussion, since we think that unique_name would not work well for incremental and/or range listing.
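
The concern about unique_name as the indexable field can be illustrated with a small Python sketch. The dataclass below is a simplified stand-in for the real database model, the newer_than helper is hypothetical, and the sample names are made up. It assumes that an incremental run only asks for entries whose index sorts after the last index it has seen.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LaunchpadGitModel:
    uid: str        # Launchpad's unique_name
    indexable: str  # also unique_name, as in the proposed design

    @classmethod
    def from_repo(cls, repo: Dict[str, str]) -> "LaunchpadGitModel":
        return cls(uid=repo["unique_name"], indexable=repo["unique_name"])


def newer_than(models: List[LaunchpadGitModel],
               last_index: str) -> List[LaunchpadGitModel]:
    """Entries an incremental run would pick up after last_index."""
    return [m for m in models if m.indexable > last_index]


# Made-up sample data: one repo already listed, one created afterwards.
seen = LaunchpadGitModel.from_repo({"unique_name": "~zed/proj/+git/proj"})
created_later = LaunchpadGitModel.from_repo({"unique_name": "~amy/proj/+git/proj"})

# The new repository sorts before the last index seen, so a
# string-ordered incremental pass would never reach it.
missed = newer_than([created_later], seen.indexable) == []
```

In other words, string order of unique_name says nothing about when a repository was created or modified, which is why a time-based index looks like a better fit for incremental listing.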

The actual mapping between the SWH data model of the origin and the launchpad model of the origin is depicted in the figure below

legau added a subscriber: legau. Feb 13 2020, 2:34 PM

What is the current status of this task ?

Hi,

I'm one of the developers on the Launchpad team. A user identified as "leni" spoke to us about this on IRC last week; it so happened that the Launchpad team were in the middle of an in-person sprint at the time, so we were able to discuss the problem fairly quickly and put together a plan to improve our API. I implemented those improvements shortly afterwards. They aren't quite deployed on production yet, but they should be very soon. Unfortunately I don't have any contact details for leni unless they happen to join IRC, so I'm posting a summary of the discussion and my improvements here, which is probably a useful thing to do anyway.

We're very happy to support what Software Heritage is doing. However, we would prefer that you not take the approach discussed earlier here of iterating over all projects and then iterating over branches/repositories within those. There are a couple of reasons for this.

Firstly, it isn't a very accurate representation of the code available on Launchpad. Unlike e.g. GitLab, where a "project" is basically a repository with some extra stuff attached to it, in the Launchpad data model a project is an abstract container representing some kind of software project as a whole, and may contain many branches/repositories; but it isn't the only kind of container that a branch/repository might be attached to. For instance, one of Launchpad's main functions is to host the source code for Ubuntu, and the repositories for that are attached to "distribution source packages", not to projects. If you iterate over projects, you'll miss all this.

Secondly, it's pretty inefficient for both you and us. We have over a million public Bazaar branches on our production instance, and over 17000 public Git repositories. We would really rather that you didn't iterate over all of those and poll them for changes. Instead, we'd like to provide you with a feed of all public repositories (and similarly branches) ordered by their modification time. On your side, you can then catch up with changes by asking Launchpad to give you this feed starting from the most recent time you've previously caught up to, and fetching each of the changed repositories.

There are some subtleties to this due to the way in which the batched collection mechanisms in the Launchpad API interact with repositories being modified during iteration. We hope to improve this in future, but for now we recommend that you use launchpadlib in a slightly unconventional way, bypassing the normal way in which launchpadlib iterates over collections, but still making use of its decoding logic. To iterate over a batch of repositories, you'd do something like this:

batch = lp.git_repositories.getRepositories(
    order_by='most neglected first', modified_since_date=threshold)
for repository in batch[:len(batch.entries)]:
    ...  # pull repository

The first time you ever do this, threshold can be set to None.

The batch[:len(batch.entries)] business ensures that you only iterate over the repositories in a single batch, without causing launchpadlib to follow the next_link in the collection's representation (which, due to the aforementioned subtleties, could lead to you missing changes). You should then set threshold to the date_last_modified of the last repository in the batch, minus a short fudge factor (15 seconds should do) to account for events appearing to arrive out of order due to overlapping transactions. (You'll probably also want to make sure that threshold ends up being strictly greater than the date_last_modified of the first repository in the batch, to avoid an infinite loop if many repositories are modified within a short period of time.) Then repeat the code above with the new threshold, and continue until you get an empty batch, at which point you've caught up. Save the value of threshold and start from that the next time your script runs.
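
Put together, the catch-up loop described above might look roughly like the following Python sketch. It is one possible reading of the description, not tested against the production API: the function names fetch_batch and catch_up are hypothetical, and bumping the threshold by one microsecond past the first repository is an assumption about how to make it "strictly greater". The lp argument is expected to be a logged-in launchpadlib Launchpad object.

```python
from datetime import datetime, timedelta
from typing import Any, Callable, Optional, Sequence

FUDGE = timedelta(seconds=15)


def fetch_batch(lp: Any, threshold: Optional[datetime]) -> Sequence[Any]:
    """One batch only; the slice stops launchpadlib following next_link."""
    batch = lp.git_repositories.getRepositories(
        order_by="most neglected first", modified_since_date=threshold)
    return batch[:len(batch.entries)]


def catch_up(fetch: Callable[[Optional[datetime]], Sequence[Any]],
             pull: Callable[[Any], None],
             threshold: Optional[datetime] = None) -> Optional[datetime]:
    """Pull batches until an empty one comes back; return the threshold to save."""
    while True:
        batch = fetch(threshold)
        if len(batch) == 0:
            return threshold
        for repository in batch:
            pull(repository)
        first = batch[0].date_last_modified
        threshold = batch[-1].date_last_modified - FUDGE
        if threshold <= first:
            # Assumed way of guaranteeing progress when many repositories
            # were modified within the fudge window.
            threshold = first + timedelta(microseconds=1)
```

A run would then be something like catch_up(lambda t: fetch_batch(lp, t), pull, saved_threshold), where pull and saved_threshold are supplied by the caller. Note that a repository can be pulled more than once within one run, so pull should be idempotent.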

This is a little complicated, but it should be enough to let you reliably harvest the contents of all public repositories on Launchpad, and you shouldn't need to mix bare API requests with uses of launchpadlib the way people were talking about doing earlier. There'll be a similar interface available for Bazaar branches (lp.branches.getBranches), for whenever you're ready to deal with those.

I'll post another update here once this new interface is available on our production instance. I'm also happy to discuss further API adjustments that might be helpful, and/or clarify aspects of Launchpad's data model as needed.

zack added a subscriber: zack. Edited Mar 12 2020, 1:35 PM

Hi Colin (@cjwatson), nice to meet you here !

I'm confident IRC's leni is @legau here.
I should also clarify that he is an external contributor to Software Heritage who took an interest in expanding our archival coverage to Launchpad, and we're very happy about that as it's been something we wanted to do for a while now :-)

We found out only recently that you discussed Launchpad API changes with @legau to make archival easier, and that's great!

Rest assured that we do not want to put into production anything that is inconvenient for you. Also, the approach you propose (feed + change metadata) is what we have been using recently almost everywhere, and it is definitely what we want to have in general (no matter what previous, unmerged attempts mentioned in this ticket say).

Looks like you and @legau are already on the right track for a proper solution that will make both Launchpad and Software Heritage happy, so I won't comment more than this on the technical details. But feel free to ping me, or anyone else from Software Heritage staff, if you want to coordinate the (future) deployment of this.

Yep, I'm not annoyed, just being emphatic about what we want to see. :-)

The changes I described above are available on our production instance now.

legau added a comment. Mar 24 2020, 3:50 PM

Hi, I proposed a first version with the new changes (D2799).
@cjwatson it should be consistent with your snippet.
If anyone is used to working with listers, I would be glad to hear remarks on how I implemented it.