Page MenuHomeSoftware Heritage

Implemented a lister for phabricator instance
ClosedPublic

Authored by nahimilega on Apr 7 2019, 2:37 PM.

Details

Summary

This is the first draft of phabricator lister (T808)
This is the implementation of a lister which can be used to list the links of all the repositories present in a particular phabricator lister.

Test Plan

As I am new to this community and this is my first implementation of listers, I am facing trouble in writing tests. I need some help and guidance to do the same

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
anlambert requested changes to this revision.May 10 2019, 11:43 AM

Still a couple of nitpicks to handle.

I also noticed that you should update the file swh/lister/cli.py in order to ease the database tables creation for the Phabricator lister
(that file starts to look really ugly by the way and should be refactored in a near future).

swh/lister/phabricator/lister.py
15

Use forge_url instead of FORGEURL for code style consistency

24

you can remove that line

58–59

You should change that doc to:

Return the index of the first repository hosted on the Phabricator instance
68–69
Return url for a hosted repository from its uris attachments according to the following priority lists:

   * protocol: https > http
   * identifier: shortname > callsign > id
This revision now requires changes to proceed.May 10 2019, 11:43 AM
nahimilega updated this revision to Diff 4776.May 10 2019, 2:05 PM
  • Updated cli and readme for phabricator lister
nahimilega updated this revision to Diff 4778.May 10 2019, 2:50 PM
nahimilega marked an inline comment as done.
  • Fixed a typo
nahimilega marked 2 inline comments as done.May 10 2019, 3:51 PM
nahimilega marked an inline comment as done.
anlambert requested changes to this revision.May 10 2019, 6:01 PM

This still needs some work as the lister is not robust enough yet. Testing it on https://developer.blender.org raise errors as some repositories correspond to subversion ones
and the repo url extraction process fails in that case, see sample repo data below.

{'attachments': {'uris': {'uris': [{'fields': {'builtin': {'identifier': None,
                                                           'protocol': None},
                                               'credentialPHID': None,
                                               'dateCreated': '1467894515',
                                               'dateModified': '1468574079',
                                               'disabled': False,
                                               'display': {'default': 'never',
                                                           'effective': 'always',
                                                           'raw': 'always'},
                                               'io': {'default': 'none',
                                                      'effective': 'observe',
                                                      'raw': 'observe'},
                                               'repositoryPHID': 'PHID-REPO-ge2icigfu5ijk2whqfbl',
                                               'uri': {'display': 'https://svn.blender.org/svnroot/bf-blender/',
                                                       'effective': 'https://svn.blender.org/svnroot/bf-blender/',
                                                       'normalized': 'svn.blender.org/svnroot/bf-blender',
                                                       'raw': 'https://svn.blender.org/svnroot/bf-blender/'}},
                                    'id': '70',
                                    'phid': 'PHID-RURI-h7zdbkud6why4xrb2s2e',
                                    'type': 'RURI'}]}},
 'fields': {'almanacServicePHID': None,
            'callsign': 'BL',
            'dateCreated': 1385564674,
            'dateModified': 1468574079,
            'isImporting': False,
            'name': 'Blender Libraries',
            'policy': {'diffusion.push': 'PHID-PROJ-hclk7tvd6pmpjmqastjl',
                       'edit': 'admin',
                       'view': 'public'},
            'shortName': None,
            'spacePHID': None,
            'status': 'active',
            'vcs': 'svn'},
 'id': 8,
 'phid': 'PHID-REPO-ge2icigfu5ijk2whqfbl',
 'type': 'REPO'}

One important thing, you should test you changes on real world examples. You can get a list of Phabricator instances used by open source projects on the Phabricator wikipedia page.
I encourage you to create accounts on multiple forges and test your lister implementation on them.

Let's continue that work next week.

README.md
211

The config file should be named lister_phabricator.yml.

My bad here I should have updated the README in rDLSdac7777cd6630a4b39eeb884b09badfcd732312d

I will submit a diff to fix outdated stuffs in the README

212–216

That configuration is also totally outdated, it should rather be:

storage:
  cls: 'remote'
  args:
    url: 'http://localhost:5002/'

scheduler:
  cls: 'remote'
  args:
    url: 'http://localhost:5008/'
            
lister:
  cls: 'local'
  args:
    db: 'postgresql:///lister-phabricator'

I also do not think that we need to cache responses, considering the number of hosted repositories on a Phabricator instance
is quite low compared to GitHub or GitLab.

I will update the other outdated listers configuration in the README in the diff I will submit.

225

would look better if the two instructions were on separate lines here

swh/lister/cli.py
12–13

you need to add 'phabricator' to that list

swh/lister/phabricator/lister.py
14

This constructor must be moved after the class variable (those in upper cases) definition

18

we should check if forge_url contains a leading slash here and remove it if it is the case as requests to
urls like https://developer.blender.org//api/diffusion.repository.search will fail otherwise

This revision now requires changes to proceed.May 10 2019, 6:01 PM

@anlambert I will test my lister on various phabricator instance and fix the bug. Also, I update the readme according to your recommendation. Thanks for your feedback.

nahimilega updated this revision to Diff 4785.May 12 2019, 12:43 PM
nahimilega marked 6 inline comments as done.
  • Updated cli and made phabricator more robust

@anlambert As you recommended I tested the lister on multiple forges. Some of the repos where it failed to fetch URL are -

{
  "id": 1501,
  "type": "REPO",
  "phid": "PHID-REPO-zw3uo22a5pngavq2wesi",
  "fields": {
    "name": "Subversion - Mysql",
    "vcs": "svn",
    "callsign": "SVNM",
    "shortName": null,
    "status": "inactive",
    "isImporting": false,
    "almanacServicePHID": null,
    "spacePHID": null,
    "dateCreated": 1434555075,
    "dateModified": 1513096478,
    "policy": {
      "view": "public",
      "edit": "PHID-PROJ-rzvwdtume4to5fnuh3rj",
      "diffusion.push": "PHID-PROJ-rzvwdtume4to5fnuh3rj"
    }
  },
  "attachments": {
    "uris": {
      "uris": [
        {
          "id": "12194",
          "type": "RURI",
          "phid": "PHID-RURI-osaarm4jrioboxoertd3",
          "fields": {
            "repositoryPHID": "PHID-REPO-zw3uo22a5pngavq2wesi",
            "uri": {
              "raw": "ssh://id",
              "display": "svn+ssh://vcs@git-ssh.wikimedia.org/diffusion/1501/",
              "effective": "svn+ssh://vcs@git-ssh.wikimedia.org/diffusion/1501/",
              "normalized": "git-ssh.wikimedia.org/diffusion/1501"
            },
            "io": {
              "raw": "read",
              "default": "readwrite",
              "effective": "read"
            },
            "display": {
              "raw": "default",
              "default": "never",
              "effective": "never"
            },
            "credentialPHID": null,
            "disabled": false,
            "builtin": {
              "protocol": "ssh",
              "identifier": "id"
            },
            "dateCreated": "1463620787",
            "dateModified": "1463620787"
          }
        },
        {
          "id": "12192",
          "type": "RURI",
          "phid": "PHID-RURI-cowcdep3l7d73ustyava",
          "fields": {
            "repositoryPHID": "PHID-REPO-zw3uo22a5pngavq2wesi",
            "uri": {
              "raw": "ssh://callsign",
              "display": "svn+ssh://vcs@git-ssh.wikimedia.org/diffusion/SVNM/",
              "effective": "svn+ssh://vcs@git-ssh.wikimedia.org/diffusion/SVNM/",
              "normalized": "git-ssh.wikimedia.org/diffusion/SVNM"
            },
            "io": {
              "raw": "read",
              "default": "readwrite",
              "effective": "read"
            },
            "display": {
              "raw": "default",
              "default": "always",
              "effective": "always"
            },
            "credentialPHID": null,
            "disabled": false,
            "builtin": {
              "protocol": "ssh",
              "identifier": "callsign"
            },
            "dateCreated": "1463620787",
            "dateModified": "1463620787"
          }
        }
      ]
    }
  }
}

As in this example, there is only ssh protocol available for this repos.

{'attachments': {'uris': {'uris': [{'fields': {'builtin': {'identifier': None,
                                                           'protocol': None},
                                               'credentialPHID': None,
                                               'dateCreated': '1467894515',
                                               'dateModified': '1468574079',
                                               'disabled': False,
                                               'display': {'default': 'never',
                                                           'effective': 'always',
                                                           'raw': 'always'},
                                               'io': {'default': 'none',
                                                      'effective': 'observe',
                                                      'raw': 'observe'},
                                               'repositoryPHID': 'PHID-REPO-ge2icigfu5ijk2whqfbl',
                                               'uri': {'display': 'https://svn.blender.org/svnroot/bf-blender/',
                                                       'effective': 'https://svn.blender.org/svnroot/bf-blender/',
                                                       'normalized': 'svn.blender.org/svnroot/bf-blender',
                                                       'raw': 'https://svn.blender.org/svnroot/bf-blender/'}},
                                    'id': '70',
                                    'phid': 'PHID-RURI-h7zdbkud6why4xrb2s2e',
                                    'type': 'RURI'}]}},
 'fields': {'almanacServicePHID': None,
            'callsign': 'BL',
            'dateCreated': 1385564674,
            'dateModified': 1468574079,
            'isImporting': False,
            'name': 'Blender Libraries',
            'policy': {'diffusion.push': 'PHID-PROJ-hclk7tvd6pmpjmqastjl',
                       'edit': 'admin',
                       'view': 'public'},
            'shortName': None,
            'spacePHID': None,
            'status': 'active',
            'vcs': 'svn'},
 'id': 8,
 'phid': 'PHID-REPO-ge2icigfu5ijk2whqfbl',
 'type': 'REPO'}

In this repo, although there is an https URL stated, but the protocol section of api response it is recorded as null, and when you open this url stated (https://svn.blender.org/svnroot/bf-blender/) it is different from what should ideally appear.

I don't know how to deal with these cases.

For now, I have made a change in lister so it will return None when no url is found hence it not even consider the presence of these repos whose url are not according to the requirement, therefore will not create some error in further execution of the process.
(ie changed line 30 lin lister.py from return {} to return None)

@nahimilega , when the protocol field is equal to None (like for some subversion repositories), you should test is the effective uri starts with https or http and returns it if it is the case.
For instance, executing $ svn checkout https://svn.blender.org/svnroot/bf-blender/ works so this means we should be able to load that software origin into the archive.

nahimilega updated this revision to Diff 4788.May 13 2019, 12:01 PM
  • Made phabricator lister robust
nahimilega updated this revision to Diff 4789.May 13 2019, 12:07 PM
  • Fixed a typo in phabricator lister
anlambert requested changes to this revision.May 13 2019, 3:54 PM

@nahimilega , I added some inline comments regarding code style and naming.

Could you add a new test (or update the test_get_repo_url one) for the subversion repository url extraction ?

Also, I think you should rename the test file swh/lister/phabricator/tests/test_pb_lister.py
to swh/lister/phabricator/tests/test_lister.py. Our naming convention for the test files
is test_<module_name>.py.

swh/lister/phabricator/lister.py
83–86

I would rather write here:

for protocol in ('https', 'http'):
    if url.startswith(protocol):
       processed_urls[protocol]['undefined'] = url
       break
88

Replace None by undefined here

This revision now requires changes to proceed.May 13 2019, 3:54 PM
nahimilega updated this revision to Diff 4793.May 13 2019, 6:18 PM
  • Updated testcase in phabricator lister
nahimilega marked 2 inline comments as done.May 13 2019, 6:20 PM
nahimilega updated this revision to Diff 4796.May 14 2019, 2:16 PM
  • Updated README according to new standard
anlambert requested changes to this revision.May 14 2019, 2:53 PM

@nahimilega, while doing more tests, I noticed that the first repository data did not end up in the lister database and no scheduler task was created for it in order to load it into the archive.

Also thinking it back, the lister bootstrapping when no min_bound value is provided to the run method should be handled in the lister implementation
and not in the tasks ones.

I ended up with the following implementation fixing those, I copied the git diff here for commodity:

diff --git a/swh/lister/phabricator/lister.py b/swh/lister/phabricator/lister.py
index 8b481bf..8824ceb 100644
--- a/swh/lister/phabricator/lister.py
+++ b/swh/lister/phabricator/lister.py
@@ -54,14 +54,45 @@ class PhabricatorLister(SWHIndexingHttpLister):
         repos = repos['result']['data']
         return [self.get_model_from_repo(repo) for repo in repos]
 
-    def first_repository_index(self):
+    def _bootstrap_repositories_listing(self):
         """
-            Return the index of the first repository hosted
-            on the Phabricator instance
+        Method called when no min_bound value has been provided
+        to the lister. Its purpose is to:
+
+            1. get the first repository data hosted on the Phabricator
+               instance
+
+            2. inject them into the lister database
+
+            3. return the first repository index to start the listing
+               after that value
+
+        Returns:
+            int: The first repository index
         """
         params = '&order=oldest&limit=1'
         response = self.safely_issue_request(params)
-        return response.json()['result']['data'][0]['id']
+        models_list = self.transport_response_simplified(response)
+        self.max_index = models_list[0]['indexable']
+        models_list = self.filter_before_inject(models_list)
+        injected = self.inject_repo_data_into_db(models_list)
+        self.create_missing_origins_and_tasks(models_list, injected)
+        return self.max_index
+
+    def run(self, min_bound=None, max_bound=None):
+        """
+        (Override) Run the lister on the specified Phabricator instance
+
+        Args:
+            min_bound (int): Optional repository index to start the listing
+                after it
+            max_bound (int): Optional repository index to stop the listing
+                after it
+        """
+        # initial call to the lister, we need to bootstrap it in that case
+        if min_bound is None:
+            min_bound = self._bootstrap_repositories_listing()
+        super().run(min_bound, max_bound)
 
 
 def get_repo_url(attachments):
diff --git a/swh/lister/phabricator/tasks.py b/swh/lister/phabricator/tasks.py
index 5a86dbe..83ecd67 100644
--- a/swh/lister/phabricator/tasks.py
+++ b/swh/lister/phabricator/tasks.py
@@ -14,19 +14,13 @@ def new_lister(
 @app.task(name=__name__ + '.IncrementalPhabricatorLister')
 def incremental_phabricator_lister(**lister_args):
     lister = new_lister(**lister_args)
-    last_index = lister.db_last_index()
-    if last_index is None:
-        last_index = lister.first_repository_index()
-        lister.max_index = last_index
-        lister.ingest_data(last_index)
-    lister.run(min_bound=last_index, max_bound=None)
+    lister.run(min_bound=lister.db_last_index())
 
 
 @app.task(name=__name__ + '.FullPhabricatorLister')
 def full_phabricator_lister(**lister_args):
     lister = new_lister(**lister_args)
-    first_index = lister.first_repository_index()
-    lister.run(min_bound=first_index, max_bound=None)
+    lister.run()
 
 
 @app.task(name=__name__ + '.ping')
diff --git a/swh/lister/phabricator/tests/test_tasks.py b/swh/lister/phabricator/tests/test_tasks.py
index e4eb906..160efcc 100644
--- a/swh/lister/phabricator/tests/test_tasks.py
+++ b/swh/lister/phabricator/tests/test_tasks.py
@@ -26,4 +26,4 @@ def test_incremental(lister, swh_app, celery_session_worker):
     lister.assert_called_once_with(
         api_token='', forge_url='https://forge.softwareheritage.org')
     lister.db_last_index.assert_called_once_with()
-    lister.run.assert_called_once_with(min_bound=42, max_bound=None)
+    lister.run.assert_called_once_with(min_bound=42)

Once you have integrated those changes, you should rebase your feature branch to origin/master then squash all your commits into a single one before updating that diff.
For the commit message, you could write:

swh.lister.phabricator: Add a lister of all hosted repositories on a Phabricator instance

Closes T808

The landing is coming !

swh/lister/phabricator/lister.py
19

This is more readable

if forge_url.endswith('/'):
70–73

No need to indent the doc here

This revision now requires changes to proceed.May 14 2019, 2:53 PM
nahimilega updated this revision to Diff 4805.May 14 2019, 8:40 PM

squash commits

nahimilega marked 3 inline comments as done.May 14 2019, 8:49 PM
nahimilega updated this revision to Diff 4806.May 14 2019, 8:49 PM

squash commits

@nahimilega, I found another issue when listing https://phabricator.wikimedia.org.

As we can not extract url for some repositories, the None value ends up in the models list
so it must be filtered before injection to remove None entries.

The diff below is sufficient to fix the issue:

diff --git a/swh/lister/phabricator/lister.py b/swh/lister/phabricator/lister.py
index b5ed609..6cfdf2c 100644
--- a/swh/lister/phabricator/lister.py
+++ b/swh/lister/phabricator/lister.py
@@ -54,6 +54,14 @@ class PhabricatorLister(SWHIndexingHttpLister):
         repos = repos['result']['data']
         return [self.get_model_from_repo(repo) for repo in repos]
 
+    def filter_before_inject(self, models_list):
+        """Overrides SWHIndexingLister.filter_before_inject
+                                                                                                                                                                                                                                                                              
+        Bounds query results by this Lister's set max_index.                                                                                                                                                                                                                  
+        """                                                                                                                                                                                                                                                                   
+        models_list = [m for m in models_list if m is not None]                                                                                                                                                                                                               
+        return super().filter_before_inject(models_list)                                                                                                                                                                                                                      
+                                                                                                                                                                                                                                                                              
     def _bootstrap_repositories_listing(self):                                                                                                                                                                                                                                
         """                                                                                                                                                                                                                                                                   
         Method called when no min_bound value has been provided
nahimilega updated this revision to Diff 4818.May 15 2019, 2:48 PM

Removed None from the final list

@nahimilega , can you address my last comments and I will accept that diff. Also, you have to rebase on origin/master as I found a bug in the base lister while testing your last changes.

README.md
177

You should write api_token='XXXX' in order to specify that parameter is mandatory

swh/lister/phabricator/lister.py
22–23

Just a nitpick but for a long literal string, the following syntax is recommended in Python:

api_endpoint = ('api/diffusion.repository.'
                'search?api.token=%s') % api_token
42

Could you add the following here:

def request_headers(self):
        """(Override) Set requests headers to send when querying the Phabricator API

        """
        return {'User-Agent': 'Software Heritage phabricator lister',
                      'Accept': 'application/json'}

This enables to specify that Software Heritage is the API consumer.

108–111

You should unindent this docstring for more consistency with the rest of the code

nahimilega updated this revision to Diff 4831.May 15 2019, 4:27 PM
nahimilega marked 4 inline comments as done.

made all the changes recommended

nahimilega added a comment.EditedMay 15 2019, 4:30 PM

@anlambert As you mentioned in your previous comment, to remove None from the list I have added the function filter_before_inject() in the lister as you recommended to do.
And I have rebased the branch on origin/master

anlambert accepted this revision.May 15 2019, 4:35 PM

Looks good to me ! Let's land this !

Please follow the Landing your changes onto master section in that wiki page.
To summarize, use git to merge you feature branch into the master one then push to origin/master (please do not use arc land as it messes with the commit message).

This revision is now accepted and ready to land.May 15 2019, 4:35 PM
This revision was automatically updated to reflect the committed changes.

Thanks, @anlambert, for your help and guidance. As it was my first lister, I would have never been able to complete it without your help. You review assisted me in making this lister more robust and also helping me understand the basics of Lister.
Once again, thanks for your patience and guidance.