docs/tutorial.rst
:orphan:

.. _lister-tutorial:

Tutorial: list the content of your favorite forge in just a few steps
=====================================================================

(the `original version
<https://www.softwareheritage.org/2017/03/24/list-the-content-of-your-favorite-forge-in-just-a-few-steps/>`_
of this article appeared on the Software Heritage blog)

Back in November 2016, Nicolas Dandrimont wrote about structural code changes
`leading to a massive (+15 million!) upswing in the number of repositories
archived by Software Heritage
<https://www.softwareheritage.org/2016/11/09/listing-47-million-repositories-refactoring-our-github-lister/>`_
through a combination of automatic linkage between the listing and loading
scheduler, new understanding of how to deal with extremely large repository
hosts like `GitHub <https://github.com/>`_, and activating a new set of
repositories that had previously been skipped over.

ardumont: I'd keep the original which i found clearer.

In the post, Nicolas outlined the three major phases of work in Software
Heritage's preservation process (listing, scheduling updates, loading) and
highlighted that the ability to preserve the world's free software heritage
depends on our ability to find and list the repositories.

At the time, Software Heritage was only able to list projects on
GitHub. Focusing early on GitHub, one of the largest and most active forges in
the world, allowed for a big value-to-effort ratio and a rapid launch for the
archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in
modern English parlance, "Perfect is the enemy of good," right? Right. So the
plan from the beginning was to implement a lister for GitHub, then maybe
implement another one, and then take a few giant steps backward and squint our
eyes.

ardumont: Please keep `backward` here, it's correct.

Why? Because source code hosting services don't behave according to a unified
standard. Each new service requires dedicated development time to implement a
new scraping client for the non-transferable requirements and intricacies of
that service's API. At the time, doing it in an extensible and adaptable way
required a level of exposure to the myriad differences between these services
that we just didn't think we had yet.

[... 110 lines not shown ...]

<swh.lister.core.indexing_lister.SWHIndexingLister>`. Along with the
requirements of the lister base, the indexing lister base adds one extra
requirement:

1. A method called ``get_next_target_from_response`` must be defined, which
   takes a complete request response and returns the index ('foo' above) of the
   next page.

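To make this requirement concrete, here is a minimal, self-contained sketch of
how an index-driven listing loop could consume
``get_next_target_from_response``. The ``FakeResponse`` class, the ``PAGES``
table and the ``list_all`` helper are hypothetical stand-ins invented for this
illustration; they are not part of ``swh.lister``, whose real loop lives in the
lister base classes.

```python
# Hypothetical stand-ins for illustration only; none of this is swh.lister code.
PAGES = {
    None: {"values": ["repo-a", "repo-b"], "next": "foo"},
    "foo": {"values": ["repo-c"], "next": "bar"},
    "bar": {"values": ["repo-d"]},  # last page: no 'next' key
}


class FakeResponse:
    """Mimics the small part of an HTTP response that the sketch needs."""

    def __init__(self, body):
        self._body = body

    def json(self):
        return self._body


def get_next_target_from_response(response):
    """Return the index of the next page, or None when there is none."""
    return response.json().get("next")


def list_all(first_index=None):
    """Follow the indices page by page and collect every listed item."""
    index, items = first_index, []
    while True:
        response = FakeResponse(PAGES[index])
        items.extend(response.json()["values"])
        index = get_next_target_from_response(response)
        if index is None:
            return items
```

Calling ``list_all()`` walks the three fake pages in order and stops as soon as
the method returns ``None``.
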
Note: You also need to add your lister to the main ``conftest.py``
(``swh/lister/core/tests/conftest.py``).

So those are all the basic requirements. There are, of course, a few other
little bits and pieces (covered for now in the code's docstring comments), but
for the most part that's it. It sounds like a lot of information to absorb and
implement, but remember that most of the implementation requirements mentioned
above are already provided for 99% of services by the HTTP mix-in module. It
looks much simpler when we look at the actual implementations of the two
new-style indexing listers we currently have…

douardda: This has nothing to do with tox, but with pytest. Please read the
documentations of these tools to have a better understanding of their
respective roles and usages.

ardumont: Please keep the general formatting of text as before. It's probably
wrapped around 79 or 80…

ardumont: When developing a new lister, it's important to test. For this, add
the tests (check `swh/lister/*/tests/`) and register the celery tasks in the
main conftest.py (`swh/lister....`). Another important step is to actually run
it within the docker-dev (:ref:`run-lister-tutorial`).

This is the entire source code for the BitBucket repository lister::

    # Copyright (C) 2017 the Software Heritage developers
    # License: GNU General Public License version 3 or later
    # See top-level LICENSE file for more information

    from urllib import parse
    from swh.lister.bitbucket.models import BitBucketModel
    from swh.lister.core.indexing_lister import SWHIndexingHttpLister

    class BitBucketLister(SWHIndexingHttpLister):
        PATH_TEMPLATE = '/repositories?after=%s'
        MODEL = BitBucketModel

        def get_model_from_repo(self, repo):
            return {'uid': repo['uuid'],
                    'indexable': repo['created_on'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['links']['html']['href'],
                    'origin_url': repo['links']['clone'][0]['href'],
                    'origin_type': repo['scm'],
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            body = response.json()
            if 'next' in body:
                return parse.unquote(body['next'].split('after=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()['values']
            return [self.get_model_from_repo(repo) for repo in repos]

ardumont: You say it already in the page, just propose it: After tests, it's
suggested to run your new lister in docker-dev: `How to run a lister
<Link_to_the_page>`.

ardumont: That paragraph is redundant with the previous one which is much
better (which also has sphinx ready link).

ardumont: `swh-lister`

ardumont: I'm wondering if we want to enter into so much details. @douardda
What do you think? Isn't docker-compose up enough?

nahimilega: Here I intentionally did this because docker-compose up will start
all the containers with could be harsh on the pc. Running all the docker
containers while working eats up all my RAM

ardumont: Yes, but that's a detail of your machine (as annoying as it is).
There should be a global note… Also, by exposing the service names here, that
pose a potential problem if we ever change it (docker-compose won't though).

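The cursor extraction done by ``get_next_target_from_response`` in the
BitBucket lister above can also be sketched with the standard library's query
string parser. This is only an alternative illustration, not the lister's code:
the ``next_cursor`` function name and the example URLs are assumptions made for
the sketch. One behavioural difference to keep in mind is that
``parse.parse_qs`` turns a literal ``+`` in the query string into a space,
whereas ``parse.unquote`` leaves it alone.

```python
from urllib import parse


def next_cursor(body):
    """Return the decoded 'after' cursor from a page body, or None.

    `body` is the decoded JSON of one page; only its optional 'next'
    URL is inspected. (Hypothetical helper, for illustration only.)
    """
    next_url = body.get("next")
    if next_url is None:
        return None
    # Parse the query string instead of splitting on 'after=' so that any
    # other parameters following it do not leak into the cursor.
    query = parse.urlsplit(next_url).query
    values = parse.parse_qs(query).get("after")
    return values[0] if values else None
```

Compared with splitting on ``after=``, parsing the query string keeps working
if the service appends further parameters after the cursor.
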
And this is the entire source code for the GitHub repository lister::

    [... 13 lines not shown ...]

        def get_model_from_repo(self, repo):
            return {'uid': repo['id'],
                    'indexable': repo['id'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['html_url'],
                    'origin_url': repo['html_url'],
                    'origin_type': 'git',
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            if 'next' in response.links:
                next_url = response.links['next']['url']
                return int(next_url.split('since=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()
            return [self.get_model_from_repo(repo) for repo in repos]

        def request_headers(self):
            return {'Accept': 'application/vnd.github.v3+json'}

        def transport_quota_check(self, response):
            remain = int(response.headers['X-RateLimit-Remaining'])
            if response.status_code == 403 and remain == 0:
                reset_at = int(response.headers['X-RateLimit-Reset'])
                delay = min(reset_at - time.time(), 3600)
                return True, delay
            else:
                return False, 0

ardumont: Truncate the output. Also, keep it formatted as the real cli do, it's
nicer to read with…

ardumont: `is creating new loading task not yet registered, you need to
register that task type as well.`

ardumont: Well, yes and no. We will try to avoid using python top-level, let's
use the scheduler instead. You need to schedule a task, for example with the
gnu lister: swh scheduler --url http://localhost:5008/ task add list-gnu-full
--policy oneshot

ardumont: It's missing a new chapter that explicits we are changing subject
here. Ok, i think i see now. Then add a new one to explicit how to run (and
test) lister within the docker-dev environment. Let's discuss this in irc...

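The rate-limit arithmetic in ``transport_quota_check`` above can be exercised
on its own. The following is a minimal sketch rather than the lister's code:
the ``quota_check`` function, its ``now`` and ``max_delay`` parameters, the
default used when the remaining-quota header is missing, and the clamp to zero
for a reset time already in the past are all assumptions added for the
illustration.

```python
import time


def quota_check(status_code, headers, now=None, max_delay=3600):
    """Return (must_retry, delay_in_seconds) from rate-limit headers.

    Hypothetical standalone helper; `now` is injectable so the
    computation can be checked without depending on the clock.
    """
    now = time.time() if now is None else now
    # Assumption: a missing header is treated as "quota still available".
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if status_code == 403 and remaining == 0:
        reset_at = int(headers["X-RateLimit-Reset"])
        # Never wait a negative amount, and cap the wait at max_delay.
        delay = max(0, min(reset_at - now, max_delay))
        return True, delay
    return False, 0
```

With a reset ten minutes away the sketch asks for a 600 second pause, and a
reset several hours away is capped at one hour, matching the ``3600`` bound in
the lister code.
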
We can see that there are some common elements:

* Both use the HTTP transport mixin (:class:`SWHIndexingHttpLister
[... 109 lines not shown ...]