This is the first draft of phabricator lister (T808)
This is the implementation of a lister which can be used to list the links of all the repositories present in a particular phabricator lister.
Details
- Reviewers
anlambert - Group Reviewers
Reviewers - Commits
- rDLSfedfd73c8e4b: swh.lister.phabricator
As I am new to this community and this is my first implementation of listers, I am facing trouble in writing tests. I need some help and guidance to do the same
Diff Detail
- Repository
- rDLS Listers
- Branch
- master
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 5554 Build 7558: tox-on-jenkins Jenkins Build 7557: arc lint + arc unit
Event Timeline
Still a couple of nitpicks to handle.
I also noticed that you should update the file swh/lister/cli.py in order to ease the database tables creation for the Phabricator lister
(that file starts to look really ugly by the way and should be refactored in a near future).
swh/lister/phabricator/lister.py | ||
---|---|---|
14 | Use forge_url instead of FORGEURL for code style consistency | |
23 | you can remove that line | |
57–58 | You should change that doc to: Return the index of the first repository hosted on the Phabricator instance | |
67–68 | Return url for a hosted repository from its uris attachments according to the following priority lists: * protocol: https > http * identifier: shortname > callsign > id |
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/103/ for more details.
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/104/ for more details.
This still needs some work as the lister is not robust enough yet. Testing it on https://developer.blender.org raise errors as some repositories correspond to subversion ones
and the repo url extraction process fails in that case, see sample repo data below.
{'attachments': {'uris': {'uris': [{'fields': {'builtin': {'identifier': None, 'protocol': None}, 'credentialPHID': None, 'dateCreated': '1467894515', 'dateModified': '1468574079', 'disabled': False, 'display': {'default': 'never', 'effective': 'always', 'raw': 'always'}, 'io': {'default': 'none', 'effective': 'observe', 'raw': 'observe'}, 'repositoryPHID': 'PHID-REPO-ge2icigfu5ijk2whqfbl', 'uri': {'display': 'https://svn.blender.org/svnroot/bf-blender/', 'effective': 'https://svn.blender.org/svnroot/bf-blender/', 'normalized': 'svn.blender.org/svnroot/bf-blender', 'raw': 'https://svn.blender.org/svnroot/bf-blender/'}}, 'id': '70', 'phid': 'PHID-RURI-h7zdbkud6why4xrb2s2e', 'type': 'RURI'}]}}, 'fields': {'almanacServicePHID': None, 'callsign': 'BL', 'dateCreated': 1385564674, 'dateModified': 1468574079, 'isImporting': False, 'name': 'Blender Libraries', 'policy': {'diffusion.push': 'PHID-PROJ-hclk7tvd6pmpjmqastjl', 'edit': 'admin', 'view': 'public'}, 'shortName': None, 'spacePHID': None, 'status': 'active', 'vcs': 'svn'}, 'id': 8, 'phid': 'PHID-REPO-ge2icigfu5ijk2whqfbl', 'type': 'REPO'}
One important thing, you should test you changes on real world examples. You can get a list of Phabricator instances used by open source projects on the Phabricator wikipedia page.
I encourage you to create accounts on multiple forges and test your lister implementation on them.
Let's continue that work next week.
README.md | ||
---|---|---|
211 ↗ | (On Diff #4778) | The config file should be named lister_phabricator.yml. My bad here I should have updated the README in rDLSdac7777cd6630a4b39eeb884b09badfcd732312d I will submit a diff to fix outdated stuffs in the README |
212–216 ↗ | (On Diff #4778) | That configuration is also totally outdated, it should rather be: storage: cls: 'remote' args: url: 'http://localhost:5002/' scheduler: cls: 'remote' args: url: 'http://localhost:5008/' lister: cls: 'local' args: db: 'postgresql:///lister-phabricator' I also do not think that we need to cache responses, considering the number of hosted repositories on a Phabricator instance I will update the other outdated listers configuration in the README in the diff I will submit. |
225 ↗ | (On Diff #4778) | would look better if the two instructions were on separate lines here |
swh/lister/cli.py | ||
12 ↗ | (On Diff #4778) | you need to add 'phabricator' to that list |
swh/lister/phabricator/lister.py | ||
13 | This constructor must be moved after the class variable (those in upper cases) definition | |
17 | we should check if forge_url contains a leading slash here and remove it if it is the case as requests to |
@anlambert I will test my lister on various phabricator instance and fix the bug. Also, I update the readme according to your recommendation. Thanks for your feedback.
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/105/ for more details.
@anlambert As you recommended I tested the lister on multiple forges. Some of the repos where it failed to fetch URL are -
{ "id": 1501, "type": "REPO", "phid": "PHID-REPO-zw3uo22a5pngavq2wesi", "fields": { "name": "Subversion - Mysql", "vcs": "svn", "callsign": "SVNM", "shortName": null, "status": "inactive", "isImporting": false, "almanacServicePHID": null, "spacePHID": null, "dateCreated": 1434555075, "dateModified": 1513096478, "policy": { "view": "public", "edit": "PHID-PROJ-rzvwdtume4to5fnuh3rj", "diffusion.push": "PHID-PROJ-rzvwdtume4to5fnuh3rj" } }, "attachments": { "uris": { "uris": [ { "id": "12194", "type": "RURI", "phid": "PHID-RURI-osaarm4jrioboxoertd3", "fields": { "repositoryPHID": "PHID-REPO-zw3uo22a5pngavq2wesi", "uri": { "raw": "ssh://id", "display": "svn+ssh://vcs@git-ssh.wikimedia.org/diffusion/1501/", "effective": "svn+ssh://vcs@git-ssh.wikimedia.org/diffusion/1501/", "normalized": "git-ssh.wikimedia.org/diffusion/1501" }, "io": { "raw": "read", "default": "readwrite", "effective": "read" }, "display": { "raw": "default", "default": "never", "effective": "never" }, "credentialPHID": null, "disabled": false, "builtin": { "protocol": "ssh", "identifier": "id" }, "dateCreated": "1463620787", "dateModified": "1463620787" } }, { "id": "12192", "type": "RURI", "phid": "PHID-RURI-cowcdep3l7d73ustyava", "fields": { "repositoryPHID": "PHID-REPO-zw3uo22a5pngavq2wesi", "uri": { "raw": "ssh://callsign", "display": "svn+ssh://vcs@git-ssh.wikimedia.org/diffusion/SVNM/", "effective": "svn+ssh://vcs@git-ssh.wikimedia.org/diffusion/SVNM/", "normalized": "git-ssh.wikimedia.org/diffusion/SVNM" }, "io": { "raw": "read", "default": "readwrite", "effective": "read" }, "display": { "raw": "default", "default": "always", "effective": "always" }, "credentialPHID": null, "disabled": false, "builtin": { "protocol": "ssh", "identifier": "callsign" }, "dateCreated": "1463620787", "dateModified": "1463620787" } } ] } } }
As in this example, there is only ssh protocol available for this repos.
{'attachments': {'uris': {'uris': [{'fields': {'builtin': {'identifier': None, 'protocol': None}, 'credentialPHID': None, 'dateCreated': '1467894515', 'dateModified': '1468574079', 'disabled': False, 'display': {'default': 'never', 'effective': 'always', 'raw': 'always'}, 'io': {'default': 'none', 'effective': 'observe', 'raw': 'observe'}, 'repositoryPHID': 'PHID-REPO-ge2icigfu5ijk2whqfbl', 'uri': {'display': 'https://svn.blender.org/svnroot/bf-blender/', 'effective': 'https://svn.blender.org/svnroot/bf-blender/', 'normalized': 'svn.blender.org/svnroot/bf-blender', 'raw': 'https://svn.blender.org/svnroot/bf-blender/'}}, 'id': '70', 'phid': 'PHID-RURI-h7zdbkud6why4xrb2s2e', 'type': 'RURI'}]}}, 'fields': {'almanacServicePHID': None, 'callsign': 'BL', 'dateCreated': 1385564674, 'dateModified': 1468574079, 'isImporting': False, 'name': 'Blender Libraries', 'policy': {'diffusion.push': 'PHID-PROJ-hclk7tvd6pmpjmqastjl', 'edit': 'admin', 'view': 'public'}, 'shortName': None, 'spacePHID': None, 'status': 'active', 'vcs': 'svn'}, 'id': 8, 'phid': 'PHID-REPO-ge2icigfu5ijk2whqfbl', 'type': 'REPO'}
In this repo, although there is an https URL stated, but the protocol section of api response it is recorded as null, and when you open this url stated (https://svn.blender.org/svnroot/bf-blender/) it is different from what should ideally appear.
I don't know how to deal with these cases.
For now, I have made a change in lister so it will return None when no url is found hence it not even consider the presence of these repos whose url are not according to the requirement, therefore will not create some error in further execution of the process.
(ie changed line 30 lin lister.py from return {} to return None)
@nahimilega , when the protocol field is equal to None (like for some subversion repositories), you should test is the effective uri starts with https or http and returns it if it is the case.
For instance, executing $ svn checkout https://svn.blender.org/svnroot/bf-blender/ works so this means we should be able to load that software origin into the archive.
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/106/ for more details.
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/107/ for more details.
@nahimilega , I added some inline comments regarding code style and naming.
Could you add a new test (or update the test_get_repo_url one) for the subversion repository url extraction ?
Also, I think you should rename the test file swh/lister/phabricator/tests/test_pb_lister.py
to swh/lister/phabricator/tests/test_lister.py. Our naming convention for the test files
is test_<module_name>.py.
swh/lister/phabricator/lister.py | ||
---|---|---|
82–85 | I would rather write here: for protocol in ('https', 'http'): if url.startswith(protocol): processed_urls[protocol]['undefined'] = url break | |
87 | Replace None by undefined here |
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/110/ for more details.
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/112/ for more details.
@nahimilega, while doing more tests, I noticed that the first repository data did not end up in the lister database and no scheduler task was created for it in order to load it into the archive.
Also thinking it back, the lister bootstrapping when no min_bound value is provided to the run method should be handled in the lister implementation
and not in the tasks ones.
I ended up with the following implementation fixing those, I copied the git diff here for commodity:
diff --git a/swh/lister/phabricator/lister.py b/swh/lister/phabricator/lister.py index 8b481bf..8824ceb 100644 --- a/swh/lister/phabricator/lister.py +++ b/swh/lister/phabricator/lister.py @@ -54,14 +54,45 @@ class PhabricatorLister(SWHIndexingHttpLister): repos = repos['result']['data'] return [self.get_model_from_repo(repo) for repo in repos] - def first_repository_index(self): + def _bootstrap_repositories_listing(self): """ - Return the index of the first repository hosted - on the Phabricator instance + Method called when no min_bound value has been provided + to the lister. Its purpose is to: + + 1. get the first repository data hosted on the Phabricator + instance + + 2. inject them into the lister database + + 3. return the first repository index to start the listing + after that value + + Returns: + int: The first repository index """ params = '&order=oldest&limit=1' response = self.safely_issue_request(params) - return response.json()['result']['data'][0]['id'] + models_list = self.transport_response_simplified(response) + self.max_index = models_list[0]['indexable'] + models_list = self.filter_before_inject(models_list) + injected = self.inject_repo_data_into_db(models_list) + self.create_missing_origins_and_tasks(models_list, injected) + return self.max_index + + def run(self, min_bound=None, max_bound=None): + """ + (Override) Run the lister on the specified Phabricator instance + + Args: + min_bound (int): Optional repository index to start the listing + after it + max_bound (int): Optional repository index to stop the listing + after it + """ + # initial call to the lister, we need to bootstrap it in that case + if min_bound is None: + min_bound = self._bootstrap_repositories_listing() + super().run(min_bound, max_bound) def get_repo_url(attachments): diff --git a/swh/lister/phabricator/tasks.py b/swh/lister/phabricator/tasks.py index 5a86dbe..83ecd67 100644 --- a/swh/lister/phabricator/tasks.py +++ b/swh/lister/phabricator/tasks.py @@ -14,19 +14,13 @@ def new_lister( @app.task(name=__name__ + '.IncrementalPhabricatorLister') def incremental_phabricator_lister(**lister_args): lister = new_lister(**lister_args) - last_index = lister.db_last_index() - if last_index is None: - last_index = lister.first_repository_index() - lister.max_index = last_index - lister.ingest_data(last_index) - lister.run(min_bound=last_index, max_bound=None) + lister.run(min_bound=lister.db_last_index()) @app.task(name=__name__ + '.FullPhabricatorLister') def full_phabricator_lister(**lister_args): lister = new_lister(**lister_args) - first_index = lister.first_repository_index() - lister.run(min_bound=first_index, max_bound=None) + lister.run() @app.task(name=__name__ + '.ping') diff --git a/swh/lister/phabricator/tests/test_tasks.py b/swh/lister/phabricator/tests/test_tasks.py index e4eb906..160efcc 100644 --- a/swh/lister/phabricator/tests/test_tasks.py +++ b/swh/lister/phabricator/tests/test_tasks.py @@ -26,4 +26,4 @@ def test_incremental(lister, swh_app, celery_session_worker): lister.assert_called_once_with( api_token='', forge_url='https://forge.softwareheritage.org') lister.db_last_index.assert_called_once_with() - lister.run.assert_called_once_with(min_bound=42, max_bound=None) + lister.run.assert_called_once_with(min_bound=42)
Once you have integrated those changes, you should rebase your feature branch to origin/master then squash all your commits into a single one before updating that diff.
For the commit message, you could write:
swh.lister.phabricator: Add a lister of all hosted repositories on a Phabricator instance Closes T808
The landing is coming !
swh/lister/phabricator/lister.py | ||
---|---|---|
18 | This is more readable if forge_url.endswith('/'): | |
69–72 | No need to indent the doc here |
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/113/ for more details.
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/114/ for more details.
@nahimilega, I found another issue when listing https://phabricator.wikimedia.org.
As we can not extract url for some repositories, the None value ends up in the models list
so it must be filtered before injection to remove None entries.
The diff below is sufficient to fix the issue:
diff --git a/swh/lister/phabricator/lister.py b/swh/lister/phabricator/lister.py index b5ed609..6cfdf2c 100644 --- a/swh/lister/phabricator/lister.py +++ b/swh/lister/phabricator/lister.py @@ -54,6 +54,14 @@ class PhabricatorLister(SWHIndexingHttpLister): repos = repos['result']['data'] return [self.get_model_from_repo(repo) for repo in repos] + def filter_before_inject(self, models_list): + """Overrides SWHIndexingLister.filter_before_inject + + Bounds query results by this Lister's set max_index. + """ + models_list = [m for m in models_list if m is not None] + return super().filter_before_inject(models_list) + def _bootstrap_repositories_listing(self): """ Method called when no min_bound value has been provided
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/115/ for more details.
@nahimilega , can you address my last comments and I will accept that diff. Also, you have to rebase on origin/master as I found a bug in the base lister while testing your last changes.
README.md | ||
---|---|---|
177 ↗ | (On Diff #4818) | You should write api_token='XXXX' in order to specify that parameter is mandatory |
swh/lister/phabricator/lister.py | ||
21–22 | Just a nitpick but for a long literal string, the following syntax is recommended in Python: api_endpoint = ('api/diffusion.repository.' 'search?api.token=%s') % api_token | |
41 | Could you add the following here: def request_headers(self): """(Override) Set requests headers to send when querying the Phabricator API """ return {'User-Agent': 'Software Heritage phabricator lister', 'Accept': 'application/json'} This enables to specify that Software Heritage is the API consumer. | |
107–110 | You should unindent this docstring for more consistency with the rest of the code |
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/118/ for more details.
@anlambert As you mentioned in your previous comment, to remove None from the list I have added the function filter_before_inject() in the lister as you recommended to do.
And I have rebased the branch on origin/master
Looks good to me ! Let's land this !
Please follow the Landing your changes onto master section in that wiki page.
To summarize, use git to merge you feature branch into the master one then push to origin/master (please do not use arc land as it messes with the commit message).
Thanks, @anlambert, for your help and guidance. As it was my first lister, I would have never been able to complete it without your help. You review assisted me in making this lister more robust and also helping me understand the basics of Lister.
Once again, thanks for your patience and guidance.