docs/tutorial.rst
:orphan: | :orphan: | ||||
.. _lister-tutorial: | .. _lister-tutorial: | ||||
Tutorial: list the content of your favorite forge in just a few steps | Tutorial: list the content of your favorite forge in just a few steps | ||||
===================================================================== | ===================================================================== | ||||
(the `original version | (the `original version | ||||
<https://www.softwareheritage.org/2017/03/24/list-the-content-of-your-favorite-forge-in-just-a-few-steps/>`_ | <https://www.softwareheritage.org/2017/03/24/list-the-content-of-your-favorite-forge-in-just-a-few-steps/>`_ | ||||
of this article appeared on the Software Heritage blog) | of this article appeared on the Software Heritage blog) | ||||
Back in November 2016, Nicolas Dandrimont wrote about structural code changes | Back in November 2016, Nicolas Dandrimont wrote about structural code changes | ||||
`leading to a massive (+15 million!) upswing in the number of repositories | `leading to a massive (+15 million!) upswing in the number of repositories | ||||
archived by Software Heritage | archived by Software Heritage | ||||
<https://www.softwareheritage.org/2016/11/09/listing-47-million-repositories-refactoring-our-github-lister/>`_ | <https://www.softwareheritage.org/2016/11/09/listing-47-million-repositories-refactoring-our-github-lister/>`_ | ||||
through a combination of automatic linkage between the listing and loading | through a combination of automatic linkage between the listing and loading | ||||
scheduler, new understanding of how to deal with extremely large repository | scheduler, new understanding of how to deal with extremely large repository | ||||
ardumont: I'd keep the original which i found clearer.
hosts like `GitHub <https://github.com/>`_, and activating a new set of | hosts like `GitHub <https://github.com/>`_, and activating a new set of | ||||
repositories that had previously been skipped over. | repositories that had previously been skipped over. | ||||
In the post, Nicolas outlined the three major phases of work in Software | In the post, Nicolas outlined the three major phases of work in Software | ||||
Heritage's preservation process (listing, scheduling updates, loading) and | Heritage's preservation process (listing, scheduling updates, loading) and | ||||
highlighted that the ability to preserve the world's free software heritage | highlighted that the ability to preserve the world's free software heritage | ||||
depends on our ability to find and list the repositories. | depends on our ability to find and list the repositories. | ||||
At the time, Software Heritage was only able to list projects on | At the time, Software Heritage was only able to list projects on | ||||
GitHub. Focusing early on GitHub, one of the largest and most active forges in | GitHub. Focusing early on GitHub, one of the largest and most active forges in | ||||
the world, allowed for a big value-to-effort ratio and a rapid launch for the | the world, allowed for a big value-to-effort ratio and a rapid launch for the | ||||
archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in | archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in | ||||
modern English parlance, "Perfect is the enemy of good," right? Right. So the | modern English parlance, "Perfect is the enemy of good," right? Right. So the | ||||
plan from the beginning was to implement a lister for GitHub, then maybe | plan from the beginning was to implement a lister for GitHub, then maybe | ||||
implement another one, and then take a few giant steps backward and squint our | implement another one, and then take a few giant steps backward and squint our | ||||
ardumont: Please keep `backward` here, it's correct.
eyes. | eyes. | ||||
Why? Because source code hosting services don't behave according to a unified | Why? Because source code hosting services don't behave according to a unified | ||||
standard. Each new service requires dedicated development time to implement a | standard. Each new service requires dedicated development time to implement a | ||||
new scraping client for the non-transferable requirements and intricacies of | new scraping client for the non-transferable requirements and intricacies of | ||||
that service's API. At the time, doing it in an extensible and adaptable way | that service's API. At the time, doing it in an extensible and adaptable way | ||||
required a level of exposure to the myriad differences between these services | required a level of exposure to the myriad differences between these services | ||||
that we just didn't think we had yet. | that we just didn't think we had yet. | ||||
Fundamentally, a basic lister must follow these steps: | Fundamentally, a basic lister must follow these steps: | ||||
1. Issue a network request for a service endpoint. | 1. Issue a network request for a service endpoint. | ||||
2. Convert the response into a canonical format. | 2. Convert the response into a canonical format. | ||||
3. Populate a work queue for fetching and ingesting source repositories. | 3. Populate a work queue for fetching and ingesting source repositories. | ||||
Steps 1 and 3 are generic problems, so they can get generic solutions hidden | Steps 1 and 3 are generic problems, so they can get generic solutions hidden | ||||
away in base code, most of which never needs to change. That leaves us to | away in the base code, most of which never needs to change. That leaves us to | ||||
implement step 2, which can be trivially done now for services with clean web | implement step 2, which can be trivially done now for services with clean web | ||||
APIs. | APIs. | ||||
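The split between generic and service-specific work described above can be sketched roughly as follows. This is only an illustration, not the actual swh-lister base code: ``BaseLister``, ``ToyForgeLister``, ``fetch_page``, the page layout (``values`` / ``next``) and the ``forge.example`` URLs are all hypothetical names made up for this example.

```python
class BaseLister:
    """Hypothetical base class holding the generic plumbing (steps 1 and 3)."""

    def __init__(self, fetch_page, queue):
        # fetch_page(index) stands in for the network request (step 1);
        # queue stands in for the work queue of loading tasks (step 3).
        self.fetch_page = fetch_page
        self.queue = queue

    def canonicalize(self, repo):
        """Step 2: the only part a service-specific lister must provide."""
        raise NotImplementedError

    def run(self, start=None):
        index, count = start, 0
        while True:
            page = self.fetch_page(index)                    # step 1
            for repo in page.get("values", []):
                self.queue.append(self.canonicalize(repo))   # steps 2 and 3
                count += 1
            index = page.get("next")
            if index is None:                                # no further page
                return count


class ToyForgeLister(BaseLister):
    """Service-specific part: map one API record to a canonical dict."""

    def canonicalize(self, repo):
        return {"uid": repo["id"],
                "name": repo["name"],
                "origin_url": repo["clone_url"]}


# Two fake pages replace a real forge API so the sketch runs offline.
PAGES = {
    None: {"values": [{"id": 1, "name": "a",
                       "clone_url": "https://forge.example/a.git"}],
           "next": "p2"},
    "p2": {"values": [{"id": 2, "name": "b",
                       "clone_url": "https://forge.example/b.git"}],
           "next": None},
}

queue = []
listed = ToyForgeLister(PAGES.__getitem__, queue).run()
```

Under these assumptions, supporting a new forge only means writing another ``canonicalize`` method; the request loop and the queueing stay untouched in the base class.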
In the new code we've tried to hide away as much generic functionality as | In the new code, we've tried to hide away as much generic functionality as | ||||
possible, turning it into set-and-forget plumbing between a few simple | possible, turning it into set-and-forget plumbing between a few simple | ||||
customized elements. Different hosting services might use different network | customized elements. Different hosting services might use different network | ||||
protocols, rate-limit messages, or pagination schemes, but, as long as there is | protocols, rate-limit messages, or pagination schemes, but, as long as there is | ||||
some way to get a list of the hosted repositories, we think that the new base | some way to get a list of the hosted repositories, we think that the new base | ||||
code will make getting those repositories much easier. | code will make getting those repositories much easier. | ||||
First let me give you the 30,000 foot view… | First, let me give you the 30,000 foot view… | ||||
The old GitHub-specific lister code looked like this (265 lines of Python): | The old GitHub-specific lister code looked like this (265 lines of Python): | ||||
.. figure:: images/old_github_lister.png | .. figure:: images/old_github_lister.png | ||||
By contrast, the new GitHub-specific code looks like this (34 lines of Python): | By contrast, the new GitHub-specific code looks like this (34 lines of Python): | ||||
.. figure:: images/new_github_lister.png | .. figure:: images/new_github_lister.png | ||||
So those are all the basic requirements. There are, of course, a few other | So those are all the basic requirements. There are, of course, a few other | ||||
little bits and pieces (covered for now in the code's docstring comments), but | little bits and pieces (covered for now in the code's docstring comments), but | ||||
for the most part that's it. It sounds like a lot of information to absorb and | for the most part that's it. It sounds like a lot of information to absorb and | ||||
implement, but remember that most of the implementation requirements mentioned | implement, but remember that most of the implementation requirements mentioned | ||||
above are already provided for 99% of services by the HTTP mix-in module. It | above are already provided for 99% of services by the HTTP mix-in module. It | ||||
looks much simpler when we look at the actual implementations of the two | looks much simpler when we look at the actual implementations of the two | ||||
new-style indexing listers we currently have… | new-style indexing listers we currently have… | ||||
douardda: This has nothing to do with tox, but with pytest. Please read the documentations of these tools to have a better understanding of their respective roles and usages.
ardumont: Please keep the general formatting of text as before. It's probably wrapped around 79 or 80…
ardumont: When developing a new lister, it's important to test. For this, add the tests (check `swh/lister/*/tests/`) and register the celery tasks in the main conftest.py (`swh/lister....`). Another important step is to actually run it within the docker-dev (:ref:`run-lister-tutorial`).
Testing is an important part of writing a new lister. To register the | |||||
celery tasks of your new lister, you need to add your lister to the main | |||||
conftest.py (swh/lister/core/tests/conftest.py). | |||||
ardumont: You say it already in the page, just propose it: After tests, it's suggested to run your new lister in docker-dev: `How to run a lister <Link_to_the_page>`.
After testing, it is suggested to run your new lister in docker, as it provides | |||||
a good, almost production-like test. Here are the steps you need to follow to | |||||
run a new lister in docker. | |||||
ardumont: That paragraph is redundant with the previous one which is much better (which also has sphinx ready link).
1. You must write a docker-compose override file (`docker-compose.override.yml`). | |||||
An example is given in the `docker-compose.override.yml.example` file :: | |||||
version: '2' | |||||
services: | |||||
ardumont: `swh-lister`
swh-lister: | |||||
volumes: | |||||
- "$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister" | |||||
The file named `docker-compose.override.yml` will automatically be loaded by | |||||
`docker-compose`. For more details, you may refer to README.md present in | |||||
swh-docker-dev. | |||||
2. Follow the instructions under the headings Preparation steps and | |||||
Configuration file sample in README.md of swh-lister. | |||||
3. Make sure to run the storage (5002) and scheduler (5008) services locally. | |||||
You can run them with the following command:: | |||||
~/swh-environment/swh-docker-dev$ docker-compose up -d swh-scheduler-api \ | |||||
ardumont: I'm wondering if we want to enter into so much details. @douardda What do you think? Isn't docker-compose up enough?
nahimilega: Here I intentionally did this because docker-compose up will start all the containers, which could be harsh on the pc. Running all the docker containers while working eats up all my RAM.
ardumont: Yes, but that's a detail of your machine (as annoying as it is). There should be a global note… Also, by exposing the service names here, that pose a potential problem if we ever change it (docker-compose won't though).
swh-storage | |||||
4. Add the lister task type to the scheduler. For example, if you want to | |||||
add the pypi lister task type:: | |||||
~/swh-environment$swh-scheduler task-type add list-pypi recurring \ | |||||
"Full pypi lister" | |||||
You can list all the task types with:: | |||||
~/swh-environment$swh scheduler task-type list | |||||
Known task types: | |||||
list-bitbucket-incremental: | |||||
Incrementally list BitBucket | |||||
list-cran: | |||||
Full CRAN Lister | |||||
list-debian-distribution: | |||||
List a Debian distribution | |||||
list-github-full: | |||||
Full update of GitHub repos list | |||||
list-github-incremental: | |||||
... | |||||
If your lister creates a new loading task type that is not yet registered, you | |||||
need to register that task type as well. For example, for the GNU lister:: | |||||
~/swh-environment$swh scheduler task-type add load-gnu-full recurring \ | |||||
"GNU Loader" | |||||
5. Run your lister with the help of the scheduler CLI. You need to add the task | |||||
to the scheduler using its CLI. For example, you need to execute this command | |||||
ardumont: Truncate the output. Also, keep it formatted as the real cli do, it's nicer to read with…
to run the GNU lister:: | |||||
~/swh-environment$swh scheduler --url http://localhost:5008/ task add \ | |||||
ardumont: `is creating new loading task not yet registered, you need to register that task type as well.`
list-gnu-full --policy oneshot | |||||
After the lister has finished running, you can see the loading tasks it created:: | |||||
~/swh-environment/swh-lister$swh scheduler task list | |||||
This is the entire source code for the BitBucket repository lister:: | This is the entire source code for the BitBucket repository lister:: | ||||
# Copyright (C) 2017 the Software Heritage developers | # Copyright (C) 2017 the Software Heritage developers | ||||
# License: GNU General Public License version 3 or later | # License: GNU General Public License version 3 or later | ||||
# See top-level LICENSE file for more information | # See top-level LICENSE file for more information | ||||
from urllib import parse | from urllib import parse | ||||
from swh.lister.bitbucket.models import BitBucketModel | from swh.lister.bitbucket.models import BitBucketModel | ||||
ardumont: Well, yes and no. We will try to avoid using python top-level, let's use the scheduler instead. You need to schedule a task, for example with the gnu lister: swh scheduler --url http://localhost:5008/ task add list-gnu-full --policy oneshot
from swh.lister.core.indexing_lister import SWHIndexingHttpLister | from swh.lister.core.indexing_lister import SWHIndexingHttpLister | ||||
class BitBucketLister(SWHIndexingHttpLister): | class BitBucketLister(SWHIndexingHttpLister): | ||||
PATH_TEMPLATE = '/repositories?after=%s' | PATH_TEMPLATE = '/repositories?after=%s' | ||||
ardumont: It's missing a new chapter that explicits we are changing subject here. Ok, i think i see now. Then add a new one to explicit how to run (and test) lister within the docker-dev environment. Let's discuss this in irc...
MODEL = BitBucketModel | MODEL = BitBucketModel | ||||
def get_model_from_repo(self, repo): | def get_model_from_repo(self, repo): | ||||
return {'uid': repo['uuid'], | return {'uid': repo['uuid'], | ||||
'indexable': repo['created_on'], | 'indexable': repo['created_on'], | ||||
'name': repo['name'], | 'name': repo['name'], | ||||
'full_name': repo['full_name'], | 'full_name': repo['full_name'], | ||||
'html_url': repo['links']['html']['href'], | 'html_url': repo['links']['html']['href'], | ||||