docs/tutorial-2017.rst
- This file was copied from docs/tutorial.rst.
.. _lister-tutorial-2017:

Tutorial: list the content of your favorite forge in just a few steps
=====================================================================

(the `original version
<https://www.softwareheritage.org/2017/03/24/list-the-content-of-your-favorite-forge-in-just-a-few-steps/>`_
of this article appeared on the Software Heritage blog)

.. (64 lines not shown)

By contrast, the new GitHub-specific code looks like this (34 lines of Python):

.. figure:: images/new_github_lister.png

And the new BitBucket-specific code is even shorter and looks like this (24 lines of Python):

.. figure:: images/new_bitbucket_lister.png

And now this is common shared code in a few abstract base classes, with some new
features and loads of docstring comments (in red):

.. figure:: images/new_base.png

So how does the lister code work now, and **how might a contributing developer
go about making a new one**?

The first thing to know is that we now have a generic lister base class and ORM
model. A subclass of the lister base should already be able to do almost

.. (119 lines not shown)

And this is the entire source code for the GitHub repository lister::

    # License: GNU General Public License version 3 or later
    # See top-level LICENSE file for more information

    import time

    from swh.lister.core.indexing_lister import IndexingHttpLister
    from swh.lister.github.models import GitHubModel


    class GitHubLister(IndexingHttpLister):
        PATH_TEMPLATE = '/repositories?since=%d'
        MODEL = GitHubModel

        def get_model_from_repo(self, repo):
            return {'uid': repo['id'],
                    'indexable': repo['id'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['html_url'],
                    'origin_url': repo['html_url'],
                    'origin_type': 'git',
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            if 'next' in response.links:
                next_url = response.links['next']['url']
                return int(next_url.split('since=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()
            return [self.get_model_from_repo(repo) for repo in repos]

        def request_headers(self):
            return {'Accept': 'application/vnd.github.v3+json'}

        def transport_quota_check(self, response):
            remain = int(response.headers['X-RateLimit-Remaining'])
            if response.status_code == 403 and remain == 0:
                reset_at = int(response.headers['X-RateLimit-Reset'])
                delay = min(reset_at - time.time(), 3600)
                return True, delay
            else:
                return False, 0

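To get a feel for what the two response-handling methods compute, here is a small sketch that runs the same logic without touching the network. ``FakeResponse`` is a hypothetical stand-in invented for this example (it is not part of ``swh.lister``), the two functions are standalone copies of the method bodies above, and the ``now`` parameter is an addition that only exists to make the result reproducible.

```python
import time


class FakeResponse:
    """Hypothetical stand-in for an HTTP response (illustration only)."""

    def __init__(self, status_code, headers, links):
        self.status_code = status_code
        self.headers = headers
        self.links = links


def next_target(response):
    # Same logic as GitHubLister.get_next_target_from_response above.
    if 'next' in response.links:
        next_url = response.links['next']['url']
        return int(next_url.split('since=')[1])
    return None


def quota_check(response, now=None):
    # Same logic as GitHubLister.transport_quota_check above; `now` is an
    # extra parameter (an assumption of this sketch) for reproducibility.
    now = time.time() if now is None else now
    remain = int(response.headers['X-RateLimit-Remaining'])
    if response.status_code == 403 and remain == 0:
        reset_at = int(response.headers['X-RateLimit-Reset'])
        return True, min(reset_at - now, 3600)
    return False, 0


# A normal page that links to the next one.
ok = FakeResponse(
    200,
    {'X-RateLimit-Remaining': '59'},
    {'next': {'url': 'https://api.github.com/repositories?since=369'}},
)
# A rate-limited response whose quota resets at (fake) timestamp 1000.
limited = FakeResponse(
    403,
    {'X-RateLimit-Remaining': '0', 'X-RateLimit-Reset': '1000'},
    {},
)

print(next_target(ok))                # 369
print(quota_check(ok))                # (False, 0)
print(quota_check(limited, now=940))  # (True, 60)
print(next_target(limited))           # None
```

The ``min(..., 3600)`` cap means the lister never sleeps for more than an hour on a single quota check, whatever the reset header says.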
We can see that there are some common elements:

* Both use the HTTP transport mixin (:class:`IndexingHttpLister
  <swh.lister.core.indexing_lister.IndexingHttpLister>`) just combines
  :class:`ListerHttpTransport
  <swh.lister.core.lister_transports.ListerHttpTransport>` and
  :class:`IndexingLister

.. (34 lines not shown)

a series of extremely reductionist pseudocode methods. This is not what the
code actually looks like (it's not even real code), but it does have the same
basic flow. Bear with me while I try to lay out lister operation in a
quasi-linear way…::

    # main task
    ghl = GitHubLister(lister_name='github.com',
                       api_baseurl='https://api.github.com')
    ghl.run()

⇓ (IndexingLister.run)::

    # IndexingLister.run
    identifier = None
    do
        response, repos = ListerBase.ingest_data(identifier)
        identifier = GitHubLister.get_next_target_from_response(response)
    while(identifier)

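Python has no ``do … while``, so if you want to play with that loop, one way to write it is ``while True`` with a ``break`` at the bottom. This is only a sketch of the control flow: ``run``, ``fake_ingest``, ``fake_next_target`` and the in-memory ``PAGES`` table are all made up for the example and stand in for the real lister methods.

```python
def run(ingest_data, get_next_target):
    """Emulate the do/while loop: fetch a page, then ask for the next index."""
    identifier = None
    pages = 0
    while True:
        response, repos = ingest_data(identifier)
        pages += 1
        identifier = get_next_target(response)
        if not identifier:
            break
    return pages


# Hypothetical in-memory "forge": index -> (repos on that page, next index).
PAGES = {None: (['a', 'b'], 2), 2: (['c', 'd'], 4), 4: (['e'], None)}
seen = []


def fake_ingest(identifier):
    repos, next_index = PAGES[identifier]
    seen.extend(repos)
    return next_index, repos  # the "response" here is just the next index


def fake_next_target(response):
    return response


print(run(fake_ingest, fake_next_target))  # 3
print(seen)  # ['a', 'b', 'c', 'd', 'e']
```

The loop always fetches at least one page, and stops as soon as the forge no longer advertises a next index.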
⇓ (ListerBase.ingest_data)::

    # ListerBase.ingest_data
    response = ListerBase.safely_issue_request(identifier)
    repos = GitHubLister.transport_response_simplified(response)
    injected = ListerBase.inject_repo_data_into_db(repos)
    return response, injected

⇓ (ListerBase.safely_issue_request)::

    # ListerBase.safely_issue_request
    repeat:
        resp = ListerHttpTransport.transport_request(identifier)
        retry, delay = ListerHttpTransport.transport_quota_check(resp)
        if retry:
            sleep(delay)
    until((not retry) or too_many_retries)
    return resp

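The retry loop is the part that keeps a lister polite when the forge says "slow down". Here is a runnable sketch of the same idea. It is not the real ``ListerBase`` code: the ``max_retries`` limit, the injectable ``sleep`` argument and the fake transport are assumptions chosen so the example runs instantly and without a network.

```python
import time


def safely_issue_request(transport_request, quota_check, identifier,
                         max_retries=3, sleep=time.sleep):
    """Retry a request while the quota check asks for it, up to max_retries."""
    retries = 0
    while True:
        resp = transport_request(identifier)
        retry, delay = quota_check(resp)
        if not retry or retries >= max_retries:
            return resp
        sleep(delay)
        retries += 1


# Hypothetical transport: rate-limited twice, then succeeds.
answers = iter(['limited', 'limited', 'ok'])
slept = []

resp = safely_issue_request(
    transport_request=lambda identifier: next(answers),
    quota_check=lambda r: (r == 'limited', 5),
    identifier=None,
    sleep=slept.append,  # record delays instead of actually sleeping
)
print(resp)   # ok
print(slept)  # [5, 5]
```

Passing ``sleep`` in as a parameter is purely a testing convenience here; with the default, the function really does wait for ``delay`` seconds between attempts.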
⇓ (ListerHttpTransport.transport_request)::

    # ListerHttpTransport.transport_request
    path = ListerBase.api_baseurl
           + ListerHttpTransport.PATH_TEMPLATE % identifier
    headers = ListerHttpTransport.request_headers()
    return http.get(path, headers)

(Oh look, there's our ``PATH_TEMPLATE``)

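If you want to see exactly what that step produces, the URL construction can be tried on its own. In this sketch ``build_request`` is a hypothetical helper, and the base URL is an assumption of the example; only ``PATH_TEMPLATE`` and the ``Accept`` header come from the GitHub lister shown earlier. Nothing is actually sent.

```python
API_BASEURL = 'https://api.github.com'  # assumed base URL for this sketch
PATH_TEMPLATE = '/repositories?since=%d'


def build_request(identifier):
    """Return the URL and headers that the transport step would send."""
    url = API_BASEURL + PATH_TEMPLATE % identifier
    headers = {'Accept': 'application/vnd.github.v3+json'}
    return url, headers


url, headers = build_request(369)
print(url)  # https://api.github.com/repositories?since=369
```

Note that ``%d`` needs an integer, so this sketch expects a numeric identifier.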
⇓ (ListerHttpTransport.request_headers)::

    # ListerHttpTransport.request_headers

.. (15 lines not shown)