diff --git a/docs/run_a_new_lister.rst b/docs/run_a_new_lister.rst
new file mode 100644
index 0000000..f2bbc8b
--- /dev/null
+++ b/docs/run_a_new_lister.rst
@@ -0,0 +1,90 @@
+
+:orphan:
+
+.. _run-lister-tutorial:
+
+Tutorial: run a lister within docker-dev in just a few steps
+=====================================================================
+
+It is good practice to run your new lister in docker-dev, as it provides an
+almost production-like environment. Testing the lister in docker-dev prior to
+deployment reduces the chances of encountering errors when turning it on in
+production. Here are the steps you need to follow to run a lister within your
+local environment.
+
+
+1. You must edit the docker-compose override file
+   (``docker-compose.override.yml``), following the sample provided::
+
+ version: '2'
+
+ services:
+ swh-lister:
+ volumes:
+ - "$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister"
+
+   The file named ``docker-compose.override.yml`` will automatically be loaded
+   by ``docker-compose``. Having an override makes it possible to run a docker
+   container with some swh packages installed from sources instead of using the
+   latest published packages from PyPI. For more details, you may refer to the
+   README.md present in ``swh-docker-dev``.
+2. Follow the instructions under the headings **Preparation steps** and
+   **Configuration file sample** in the README.md of swh-lister.
+3. Add the new ``task_modules`` and ``task_queues`` entries for your new
+   lister to the lister configuration. You need to amend the ``conf/lister.yml``
+   file to add them. Here is an example for the GNU lister::
+
+ celery:
+ task_broker: amqp://guest:guest@amqp//
+ task_modules:
+ ...
+ - swh.lister.gnu.tasks
+ task_queues:
+ ...
+ - swh.lister.gnu.tasks.GNUListerTask
+
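+   If you wonder what such a tasks module contains, here is a minimal,
+   hypothetical sketch (the base class, its module path, and the
+   ``new_lister`` hook are illustrative assumptions rather than the exact
+   swh-lister API; check an existing ``swh/lister/*/tasks.py`` for the real
+   pattern)::
+
+       # hypothetical swh/lister/gnu/tasks.py, for illustration only
+       from swh.lister.core.tasks import ListerTaskBase  # assumed base class
+       from swh.lister.gnu.lister import GNULister  # assumed lister class
+
+       class GNUListerTask(ListerTaskBase):
+           """One-shot task running a full listing of the GNU forge."""
+           task_queue = 'swh.lister.gnu.tasks.GNUListerTask'
+
+           def new_lister(self):
+               return GNULister()
+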
+4. Make sure the ``storage`` (port 5002) and ``scheduler`` (port 5008)
+   services are running locally. You may use the following command to start
+   them with docker::
+
+ ~/swh-environment/swh-docker-dev$ docker-compose up -d
+
+5. Add the lister task-type in the scheduler. For example, if you want to
+   add the GNU lister task-type::
+
+ ~/swh-environment$ swh scheduler task-type add list-gnu-full \
+ "swh.lister.gnu.tasks.GNUListerTask" "Full GNU lister" \
+ --default-interval '1 day' --backoff-factor 1
+
+   You can check all the task-types with::
+
+   ~/swh-environment$ swh scheduler task-type list
+ Known task types:
+ list-bitbucket-incremental:
+ Incrementally list BitBucket
+ list-cran:
+ Full CRAN Lister
+ list-debian-distribution:
+ List a Debian distribution
+ list-github-full:
+ Full update of GitHub repos list
+ list-github-incremental:
+ ...
+
+   If your lister creates new loading tasks that are not yet registered, you
+   need to register those task types as well.
+
+6. Run your lister with the help of the scheduler CLI, by adding its task to
+   the scheduler. For example, you need to execute this command to run the
+   GNU lister::
+
+ ~/swh-environment$ swh scheduler --url http://localhost:5008/ task add \
+ list-gnu-full --policy oneshot
+
+After the lister has finished executing, you can see the loading tasks it
+created::
+
+ ~/swh-environment/swh-lister$ swh scheduler task list
+
+You can also check the repositories listed by the lister in the database where
+the lister output is stored. To connect to the database::
+
+ ~/swh-environment/swh-docker-dev$ docker-compose exec swh-lister bash -c \
+ 'psql swh-listers'
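+
+If you prefer to check from Python rather than through psql, here is a minimal
+sketch, assuming ``psycopg2`` is available in the container and that your
+lister's ``MODEL`` created a table named ``gnu_repo`` (a hypothetical name;
+the actual table depends on your model declaration)::
+
+    # count the repositories the lister wrote to the listers database
+    import psycopg2
+
+    conn = psycopg2.connect(dbname='swh-listers')
+    with conn.cursor() as cur:
+        cur.execute('SELECT count(*) FROM gnu_repo')  # hypothetical table
+        print('repositories listed:', cur.fetchone()[0])
+    conn.close()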
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
index 6c656f2..8d91e86 100644
--- a/docs/tutorial.rst
+++ b/docs/tutorial.rst
@@ -1,425 +1,367 @@
:orphan:

.. _lister-tutorial:

Tutorial: list the content of your favorite forge in just a few steps
=====================================================================
(the `original version
`_
of this article appeared on the Software Heritage blog)
Back in November 2016, Nicolas Dandrimont wrote about structural code changes
`leading to a massive (+15 million!) upswing in the number of repositories
archived by Software Heritage
`_
through a combination of automatic linkage between the listing and loading
scheduler, new understanding of how to deal with extremely large repository
hosts like `GitHub `_, and activating a new set of
repositories that had previously been skipped over.
In the post, Nicolas outlined the three major phases of work in Software
Heritage's preservation process (listing, scheduling updates, loading) and
highlighted that the ability to preserve the world's free software heritage
depends on our ability to find and list the repositories.
At the time, Software Heritage was only able to list projects on
GitHub. Focusing early on GitHub, one of the largest and most active forges in
the world, allowed for a big value-to-effort ratio and a rapid launch for the
archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in
modern English parlance, "Perfect is the enemy of good," right? Right. So the
plan from the beginning was to implement a lister for GitHub, then maybe
implement another one, and then take a few giant steps backward and squint our
eyes.
Why? Because source code hosting services don't behave according to a unified
standard. Each new service requires dedicated development time to implement a
new scraping client for the non-transferable requirements and intricacies of
that service's API. At the time, doing it in an extensible and adaptable way
required a level of exposure to the myriad differences between these services
that we just didn't think we had yet.
Nicolas' post closed by saying "We haven't carved out a stable API yet that
allows you to just fill in the blanks, as we only have the GitHub lister
currently, and a proven API will emerge organically only once we have some
diversity."
That has since changed. As of March 6, 2017, the Software Heritage **lister
code has been aggressively restructured, abstracted, and commented** to make
creating new listers significantly easier. There may yet be a few kinks to iron
out, but **now making a new lister is practically like filling in the blanks**.
Fundamentally, a basic lister must follow these steps:
1. Issue a network request for a service endpoint.
2. Convert the response into a canonical format.
3. Populate a work queue for fetching and ingesting source repositories.
Steps 1 and 3 are generic problems, so they can get generic solutions hidden
away in the base code, most of which never needs to change. That leaves us to
implement step 2, which can now be done trivially for services with clean web
APIs.
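
To make that concrete, here is a minimal sketch of one such request/response
cycle in plain Python (the function name, the queue object, and the JSON keys
are illustrative assumptions, not the swh-lister API)::

    import requests

    def list_one_page(endpoint_url, work_queue):
        # step 1: issue a network request for a service endpoint
        response = requests.get(endpoint_url)
        # step 2: convert the response into a canonical format
        repos = [{'name': r['name'], 'origin_url': r['clone_url']}
                 for r in response.json()]
        # step 3: populate a work queue for fetching and ingesting
        # source repositories
        for repo in repos:
            work_queue.put(repo)
        return repos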
In the new code, we've tried to hide away as much generic functionality as
possible, turning it into set-and-forget plumbing between a few simple
customized elements. Different hosting services might use different network
protocols, rate-limit messages, or pagination schemes, but, as long as there is
some way to get a list of the hosted repositories, we think that the new base
code will make getting those repositories much easier.
First, let me give you the 30,000 foot view…
The old GitHub-specific lister code looked like this (265 lines of Python):

.. figure:: images/old_github_lister.png

By contrast, the new GitHub-specific code looks like this (34 lines of Python):

.. figure:: images/new_github_lister.png

And the new BitBucket-specific code is even shorter and looks like this (24 lines of Python):

.. figure:: images/new_bitbucket_lister.png

And now this is common shared code in a few abstract base classes, with some new features and loads of docstring comments (in red):

.. figure:: images/new_base.png
So how does the lister code work now, and **how might a contributing developer
go about making a new one**?
The first thing to know is that we now have a generic lister base class and ORM
model. A subclass of the lister base should already be able to do almost
everything needed to complete a listing task for a single service
request/response cycle with the following implementation requirements (a
combined sketch follows the list):
1. A member variable must be declared called ``MODEL``, which is equal to a
   subclass (note: the type, not an instance) of the base ORM model. The reason
   for using a subclass is mostly that different services use different,
   incompatible primary identifiers for their repositories. The model
   subclasses are typically only one or two additional variable declarations.
2. A method called ``transport_request`` must be implemented, which takes the
   complete target identifier (e.g., a URL) and tries to request it one time
   using whatever transport protocol is required for interacting with the
   service. It should not attempt to retry on timeouts or do anything else with
   the response (that is already done for you). It should just either return
   the response or raise a ``FetchError`` exception.
3. A method called ``transport_response_to_string`` must be implemented, which
   takes the entire response of the request in (2) and converts it to a string
   for logging purposes.
4. A method called ``transport_quota_check`` must be implemented, which takes
   the entire response of the request in (2) and checks to see if the process
   has run afoul of any query quotas or rate limits. If the service says to
   wait before making more requests, the method should return ``True`` and also
   the number of seconds to wait, otherwise it returns ``False``.
5. A method called ``transport_response_simplified`` must be implemented, which
   also takes the entire response of the request in (2) and converts it to a
   Python list of dicts (one dict for each repository) with keys given
   according to the aforementioned ``MODEL`` class members.
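
Putting the five requirements together, a subclass for some imaginary
'exampleforge' service might sketch out as follows (the module paths, the
location of ``FetchError``, and the model class are assumptions made for
illustration, not actual swh code)::

    import requests

    # assumed import locations for the base class, error type, and model
    from swh.lister.core.lister_base import FetchError, SWHListerBase
    from swh.lister.exampleforge.models import ExampleForgeModel

    class ExampleForgeLister(SWHListerBase):
        # requirement 1: MODEL is a subclass (the type itself) of the base model
        MODEL = ExampleForgeModel

        def transport_request(self, identifier):
            # requirement 2: a single request attempt, no retrying here
            try:
                return requests.get(identifier)
            except requests.RequestException as e:
                raise FetchError(e)

        def transport_response_to_string(self, response):
            # requirement 3: stringify the whole response for logging
            return response.text

        def transport_quota_check(self, response):
            # requirement 4: return (must_wait, seconds_to_wait)
            if response.status_code == 429:
                return True, int(response.headers.get('Retry-After', '60'))
            return False, 0

        def transport_response_simplified(self, response):
            # requirement 5: one dict per repository, keyed like MODEL columns
            return [{'uid': r['id'], 'name': r['name']}
                    for r in response.json()]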
Because 2, 3, and 4 are basically dependent only on the chosen network
protocol, we also have an HTTP mix-in module, which supplements the lister base
and provides default implementations for those methods along with optional
request header injection using the Python Requests library. The
``transport_quota_check`` method as provided follows the IETF standard for
communicating rate limits with `HTTP code 429
`_ which some hosting services
have chosen not to follow, so it's possible that a specific lister will need to
override it.
On top of all of that, we also provide another layer over the base lister class
which adds support for sequentially looping over indices. What are indices?
Well, some services (`BitBucket `_ and GitHub for
example) don't send you the entire list of all of their repositories at once,
because that server response would be unwieldy. Instead they paginate their
results, and they also allow you to query their APIs like this:
``https://server_address.tld/query_type?start_listing_from_id=foo``. Changing
the value of 'foo' lets you fetch a set of repositories starting from there. We
call 'foo' an index, and we call a service that works this way an indexing
service. GitHub uses the repository unique identifier and BitBucket uses the
repository creation time, but a service can really use anything as long as the
values monotonically increase with new repositories. A good indexing service
also includes the URL of the next page with a later 'foo' in its responses. For
these indexing services we provide another intermediate lister called the
indexing lister. Instead of inheriting from :class:`SWHListerBase`, the lister
class would inherit from :class:`SWHIndexingLister`. Along with the
requirements of the lister base, the indexing lister base adds one extra
requirement:
1. A method called ``get_next_target_from_response`` must be defined, which
   takes a complete request response and returns the index ('foo' above) of the
   next page.
So those are all the basic requirements. There are, of course, a few other
little bits and pieces (covered for now in the code's docstring comments), but
for the most part that's it. It sounds like a lot of information to absorb and
implement, but remember that most of the implementation requirements mentioned
above are already provided for 99% of services by the HTTP mix-in module. It
looks much simpler when we look at the actual implementations of the two
new-style indexing listers we currently have…
-An important aspect for making a new lister is its testing. To register the
-celery tasks of your new lister, you need to add your lister in the main
-conftest.py (swh/lister/core/tests/conftest.py)
-
-After testing, it is suggested to run your new lister in docker as it provides
-good, almost-production like test. Here are the steps you need to follow to run
-a new lister in docker.
-
-1. You must write a docker-compose override file (`docker-compose.override.yml`).
- An example is given in the `docker-compose.override.yml.example` file ::
-
- version: '2'
-
- services:
- swh-lister:
- volumes:
- - "$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister"
-
- The file named `docker-compose.override.yml` will automatically be loaded by
- `docker-compose`. For more details, you may refer to README.md present in
- swh-docker-dev.
-2. Follow the instruction mentioned under heading Preparation steps and
- Configuration file sample in README.md of swh-lister.
-3. Make sure to run storage (5002) and scheduler (5008) services locally.
- You can run them by the following command::
-
- ~/swh-environment/swh-docker-dev$ docker-compose up -d swh-scheduler-api \
- swh-storage
-4. Add the lister task-type in the scheduler. For example, if you want to
- add pypi lister task-type ::
-
- ~/swh-environment$swh-scheduler task-type add list-pypi recurring \
- "Full pypi lister"
-
- You can check all the task-type by::
-
- ~/swh-environment$swh scheduler task-type list
- Known task types:
- list-bitbucket-incremental:
- Incrementally list BitBucket
- list-cran:
- Full CRAN Lister
- list-debian-distribution:
- List a Debian distribution
- list-github-full:
- Full update of GitHub repos list
- list-github-incremental:
- ...
-
- If your lister is creating new loading task not yet registered, you need
- to register that task type as well. Like for GNU lister::
-
- ~/swh-environment$swh scheduler task-type add load-gnu-full recurring \
- "GNU Loader"
-
-5. Run your lister with the help of scheduler cli.You need to add the task in
- the schedular using its cli. For example you need to execute this command
- to run gnu lister ::
-
- ~/swh-environment$swh scheduler --url http://localhost:5008/ task add \
- list-gnu-full --policy oneshot
-
-After the execution of lister is complete you can see the loading task created.
- ~/swh-environment/swh-lister$swh scheduler task list
+When developing a new lister, it's important to test it. For this, add tests
+(check `swh/lister/*/tests/`) and register the celery tasks in the main
+conftest.py (`swh/lister/core/tests/conftest.py`).
+
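+For instance, that registration often boils down to adding your tasks module
+to the list of celery task modules used by the test suite; a hypothetical
+excerpt (the exact fixture in your checkout may differ)::
+
+    # swh/lister/core/tests/conftest.py (hypothetical excerpt)
+    import pytest
+
+    @pytest.fixture(scope='session')
+    def celery_includes():
+        # 'celery_includes' is a fixture recognized by celery's pytest plugin
+        return [
+            'swh.lister.bitbucket.tasks',
+            'swh.lister.github.tasks',
+            'swh.lister.newforge.tasks',  # your new lister's tasks module
+        ]
+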
+Another important step is to actually run it within docker-dev
+(:ref:`run-lister-tutorial`).
This is the entire source code for the BitBucket repository lister::

    # Copyright (C) 2017 the Software Heritage developers
    # License: GNU General Public License version 3 or later
    # See top-level LICENSE file for more information

    from urllib import parse

    from swh.lister.bitbucket.models import BitBucketModel
    from swh.lister.core.indexing_lister import SWHIndexingHttpLister


    class BitBucketLister(SWHIndexingHttpLister):
        PATH_TEMPLATE = '/repositories?after=%s'
        MODEL = BitBucketModel

        def get_model_from_repo(self, repo):
            return {'uid': repo['uuid'],
                    'indexable': repo['created_on'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['links']['html']['href'],
                    'origin_url': repo['links']['clone'][0]['href'],
                    'origin_type': repo['scm'],
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            body = response.json()
            if 'next' in body:
                return parse.unquote(body['next'].split('after=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()['values']
            return [self.get_model_from_repo(repo) for repo in repos]

And this is the entire source code for the GitHub repository lister::

    # Copyright (C) 2017 the Software Heritage developers
    # License: GNU General Public License version 3 or later
    # See top-level LICENSE file for more information

    import time

    from swh.lister.core.indexing_lister import SWHIndexingHttpLister
    from swh.lister.github.models import GitHubModel


    class GitHubLister(SWHIndexingHttpLister):
        PATH_TEMPLATE = '/repositories?since=%d'
        MODEL = GitHubModel

        def get_model_from_repo(self, repo):
            return {'uid': repo['id'],
                    'indexable': repo['id'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['html_url'],
                    'origin_url': repo['html_url'],
                    'origin_type': 'git',
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            if 'next' in response.links:
                next_url = response.links['next']['url']
                return int(next_url.split('since=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()
            return [self.get_model_from_repo(repo) for repo in repos]

        def request_headers(self):
            return {'Accept': 'application/vnd.github.v3+json'}

        def transport_quota_check(self, response):
            remain = int(response.headers['X-RateLimit-Remaining'])
            if response.status_code == 403 and remain == 0:
                reset_at = int(response.headers['X-RateLimit-Reset'])
                delay = min(reset_at - time.time(), 3600)
                return True, delay
            else:
                return False, 0

We can see that there are some common elements:
* Both use the HTTP transport mixin (:class:`SWHIndexingHttpLister` just
  combines :class:`SWHListerHttpTransport` and :class:`SWHIndexingLister`) to
  get most of the network request functionality for free.
* Both also define ``MODEL`` and ``PATH_TEMPLATE`` variables. It should be
  clear to developers that ``PATH_TEMPLATE``, when combined with the base
  service URL (e.g., ``https://some_service.com``) and passed a value (the
  'foo' index described earlier), results in a complete identifier for making
  API requests to these services (see the short illustration after this list).
  It is required by our HTTP module.
* Both services respond using JSON, so both implementations of
  ``transport_response_simplified`` are similar and quite short.
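
For instance, spelling that combination out (the base URL value here is an
assumption for illustration)::

    PATH_TEMPLATE = '/repositories?after=%s'
    api_baseurl = 'https://api.bitbucket.org/2.0'
    url = api_baseurl + PATH_TEMPLATE % '2017-01-01'
    # -> 'https://api.bitbucket.org/2.0/repositories?after=2017-01-01'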
We can also see that there are a few differences:
* GitHub sends the next URL as part of the response header, while BitBucket
  sends it in the response body.
* GitHub differentiates API versions with a request header (our HTTP transport
  mix-in will automatically use any headers provided by an optional
  request_headers method that we implement here), while BitBucket has it as
  part of their base service URL. BitBucket uses the IETF standard HTTP 429
  response code for their rate limit notifications (the HTTP transport mix-in
  automatically handles that), while GitHub uses their own custom response
  headers that need special treatment.
* But look at them! 58 lines of Python code, combined, to absorb all
  repositories from two of the largest and most influential source code
  hosting services.
Ok, so what is going on behind the scenes?
To trace the operation of the code, let's start with a sample instantiation and
progress from there to see which methods get called when. What follows will be
a series of extremely reductionist pseudocode methods. This is not what the
code actually looks like (it's not even real code), but it does have the same
basic flow. Bear with me while I try to lay out lister operation in a
quasi-linear way…::

    # main task
    ghl = GitHubLister(lister_name='github.com',
                       api_baseurl='https://github.com')
    ghl.run()

⇓ (SWHIndexingLister.run)::

    # SWHIndexingLister.run
    identifier = None
    do
        response, repos = SWHListerBase.ingest_data(identifier)
        identifier = GitHubLister.get_next_target_from_response(response)
    while(identifier)

⇓ (SWHListerBase.ingest_data)::

    # SWHListerBase.ingest_data
    response = SWHListerBase.safely_issue_request(identifier)
    repos = GitHubLister.transport_response_simplified(response)
    injected = SWHListerBase.inject_repo_data_into_db(repos)
    return response, injected

⇓ (SWHListerBase.safely_issue_request)::

    # SWHListerBase.safely_issue_request
    repeat:
        resp = SWHListerHttpTransport.transport_request(identifier)
        retry, delay = SWHListerHttpTransport.transport_quota_check(resp)
        if retry:
            sleep(delay)
    until((not retry) or too_many_retries)
    return resp

⇓ (SWHListerHttpTransport.transport_request)::

    # SWHListerHttpTransport.transport_request
    path = SWHListerBase.api_baseurl
           + SWHListerHttpTransport.PATH_TEMPLATE % identifier
    headers = SWHListerHttpTransport.request_headers()
    return http.get(path, headers)

(Oh look, there's our ``PATH_TEMPLATE``)

⇓ (SWHListerHttpTransport.request_headers)::

    # SWHListerHttpTransport.request_headers
    override → GitHubLister.request_headers

↑↑ (SWHListerBase.safely_issue_request)

⇓ (SWHListerHttpTransport.transport_quota_check)::

    # SWHListerHttpTransport.transport_quota_check
    override → GitHubLister.transport_quota_check

And then we're done. From start to finish, I hope this helps you understand how
the few customized pieces fit into the new shared plumbing.
Now you can go and write up a lister for a code hosting site we don't have yet!