diff --git a/docs/images/new_base.png b/docs/images/new_base.png
new file mode 100644
index 0000000..2a2e3fc
Binary files /dev/null and b/docs/images/new_base.png differ
diff --git a/docs/images/new_bitbucket_lister.png b/docs/images/new_bitbucket_lister.png
new file mode 100644
index 0000000..7c491bb
Binary files /dev/null and b/docs/images/new_bitbucket_lister.png differ
diff --git a/docs/images/new_github_lister.png b/docs/images/new_github_lister.png
new file mode 100644
index 0000000..e5a7fba
Binary files /dev/null and b/docs/images/new_github_lister.png differ
diff --git a/docs/images/old_github_lister.png b/docs/images/old_github_lister.png
new file mode 100644
index 0000000..65398a0
Binary files /dev/null and b/docs/images/old_github_lister.png differ
diff --git a/docs/index.rst b/docs/index.rst
index 9a991e8..653b85e 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,17 +1,22 @@
.. _swh-lister:
-Software Heritage - Development Documentation
-=============================================
+Software Heritage listers
+=========================
.. toctree::
:maxdepth: 2
:caption: Contents:
+Overview
+--------
+
+* :ref:`lister-tutorial`
+
Indices and tables
-==================
+------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
new file mode 100644
index 0000000..904be35
--- /dev/null
+++ b/docs/tutorial.rst
@@ -0,0 +1,357 @@
+.. _lister-tutorial:
+
+Tutorial: list the content of your favorite forge in just a few steps
+=====================================================================
+
+(the `original version
+`_
+of this article appeared on the Software Heritage blog)
+
+Back in November 2016, Nicolas Dandrimont wrote about structural code changes
+`leading to a massive (+15 million!) upswing in the number of repositories
+archived by Software Heritage
+`_
+through a combination of automatic linkage between the listing and loading
+scheduler, new understanding of how to deal with extremely large repository
+hosts like `GitHub `_, and activating a new set of
+repositories that had previously been skipped over.
+
+In the post, Nicolas outlined the three major phases of work in Software
+Heritage's preservation process (listing, scheduling updates, loading) and
+highlighted that the ability to preserve the world's free software heritage
+depends on our ability to find and list the repositories.
+
+At the time, Software Heritage was only able to list projects on
+GitHub. Focusing early on GitHub, one of the largest and most active forge in
+the world, allowed for a big value-to-effort ratio and a rapid launch for the
+archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in
+modern English parlance, "Perfect is the enemy of good," right? Right. So the
+plan from the beginning was to implement a lister for GitHub, then maybe
+implement another one, and then take a few giant steps backward and squint our
+eyes.
+
+Why? Because source code hosting services don't behave according to a unified
+standard. Each new service requires dedicated development time to implement a
+new scraping client for the non-transferable requirements and intricacies of
+that service's API. At the time, doing it in an extensible and adaptable way
+required a level of exposure to the myriad differences between these services
+that we just didn't think we had yet.
+
+Nicolas' post closed by saying "We haven't carved out a stable API yet that
+allows you to just fill in the blanks, as we only have the GitHub lister
+currently, and a proven API will emerge organically only once we have some
+diversity."
+
+That has since changed. As of March 6, 2017, the Software Heritage **lister
+code has been aggressively restructured, abstracted, and commented** to make
+creating new listers significantly easier. There may yet be a few kinks to iron
+out, but **now making a new lister is practically like filling in the blanks**.
+
+Fundamentally, a basic lister must follow these steps:
+
+1. Issue a network request for a service endpoint.
+2. Convert the response into a canonical format.
+3. Populate a work queue for fetching and ingesting source repositories.
+
+Steps 1 and 3 are generic problems, so they can get generic solutions hidden
+away in base code, most of which never needs to change. That leaves us to
+implement step 2, which can be trivially done now for services with clean web
+APIs.
+
+In the new code we've tried to hide away as much generic functionality as
+possible, turning it into set-and-forget plumbing between a few simple
+customized elements. Different hosting services might use different network
+protocols, rate-limit messages, or pagination schemes, but, as long as there is
+some way to get a list of the hosted repositories, we think that the new base
+code will make getting those repositories much easier.
+
+First let me give you the 30,000 foot view…
+
+The old GitHub-specific lister code looked like this (265 lines of Python):
+
+.. figure:: images/old_github_lister.png
+
+By contrast, the new GitHub-specific code looks like this (34 lines of Python):
+
+.. figure:: images/new_github_lister.png
+
+And the new BitBucket-specific code is even shorter and looks like this (24 lines of Python):
+
+.. figure:: images/new_bitbucket_lister.png
+
+And now this is common shared code in a few abstract base classes, with some new features and loads of docstring comments (in red):
+
+.. figure:: images/new_base.png
+
+So how does the lister code work now, and **how might a contributing developer
+go about making a new one**
+
+The first thing to know is that we now have a generic lister base class and ORM
+model. A subclass of the lister base should already be able to do almost
+everything needed to complete a listing task for a single service
+request/response cycle with the following implementation requirements:
+
+1. A member variable must be declared called ``MODEL``, which is equal to a
+ subclass (Note: type, not instance) of the base ORM model. The reasons for
+ using a subclass is mostly just because different services use different
+ incompatible primary identifiers for their repositories. The model
+ subclasses are typically only one or two additional variable declarations.
+
+2. A method called ``transport_request`` must be implemented, which takes the
+ complete target identifier (e.g., a URL) and tries to request it one time
+ using whatever transport protocol is required for interacting with the
+ service. It should not attempt to retry on timeouts or do anything else with
+ the response (that is already done for you). It should just either return
+ the response or raise a ``FetchError`` exception.
+
+3. A method called ``transport_response_to_string`` must be implemented, which
+ takes the entire response of the request in (1) and converts it to a string
+ for logging purposes.
+
+4. A method called ``transport_quota_check`` must be implemented, which takes
+ the entire response of the request in (1) and checks to see if the process
+ has run afoul of any query quotas or rate limits. If the service says to
+ wait before making more requests, the method should return ``True`` and also
+ the number of seconds to wait, otherwise it returns ``False``.
+
+5. A method called ``transport_response_simplified`` must be implemented, which
+ also takes the entire response of the request in (1) and converts it to a
+ Python list of dicts (one dict for each repository) with keys given
+ according to the aforementioned ``MODEL`` class members.
+
+Because 1, 2, 3, and 4 are basically dependent only on the chosen network
+protocol, we also have an HTTP mix-in module, which supplements the lister base
+and provides default implementations for those methods along with optional
+request header injection using the Python Requests library. The
+``transport_quota_check`` method as provided follows the IETF standard for
+communicating rate limits with `HTTP code 429
+`_ which some hosting services
+have chosen not to follow, so it's possible that a specific lister will need to
+override it.
+
+On top of all of that, we also provide another layer over the base lister class
+which adds support for sequentially looping over indices. What are indices?
+Well, some services (`BitBucket `_ and GitHub for
+example) don't send you the entire list of all of their repositories at once,
+because that server response would be unwieldy. Instead they paginate their
+results, and they also allow you to query their APIs like this:
+``https://server_address.tld/query_type?start_listing_from_id=foo``. Changing
+the value of 'foo' lets you fetch a set of repositories starting from there. We
+call 'foo' an index, and we call a service that works this way an indexing
+service. GitHub uses the repository unique identifier and BitBucket uses the
+repository creation time, but a service can really use anything as long as the
+values monotonically increase with new repositories. A good indexing service
+also includes the URL of the next page with a later 'foo' in its responses. For
+these indexing services we provide another intermediate lister called the
+indexing lister. Instead of inheriting from :class:`SWHListerBase
+`, the lister class would inherit
+from :class:`SWHIndexingLister
+`. Along with the
+requirements of the lister base, the indexing lister base adds one extra
+requirement:
+
+1. A method called ``get_next_target_from_response`` must be defined, which
+ takes a complete request response and returns the index ('foo' above) of the
+ next page.
+
+So those are all the basic requirements. There are, of course, a few other
+little bits and pieces (covered for now in the code's docstring comments), but
+for the most part that's it. It sounds like a lot of information to absorb and
+implement, but remember that most of the implementation requirements mentioned
+above are already provided for 99% of services by the HTTP mix-in module. It
+looks much simpler when we look at the actual implementations of the two
+new-style indexing listers we currently have…
+
+This is the entire source code for the BitBucket repository lister::
+
+ # Copyright (C) 2017 the Software Heritage developers
+ # License: GNU General Public License version 3 or later
+ # See top-level LICENSE file for more information
+
+ from urllib import parse
+ from swh.lister.bitbucket.models import BitBucketModel
+ from swh.lister.core.indexing_lister import SWHIndexingHttpLister
+
+ class BitBucketLister(SWHIndexingHttpLister):
+ PATH_TEMPLATE = '/repositories?after=%s'
+ MODEL = BitBucketModel
+
+ def get_model_from_repo(self, repo):
+ return {'uid': repo['uuid'],
+ 'indexable': repo['created_on'],
+ 'name': repo['name'],
+ 'full_name': repo['full_name'],
+ 'html_url': repo['links']['html']['href'],
+ 'origin_url': repo['links']['clone'][0]['href'],
+ 'origin_type': repo['scm'],
+ 'description': repo['description']}
+
+ def get_next_target_from_response(self, response):
+ body = response.json()
+ if 'next' in body:
+ return parse.unquote(body['next'].split('after=')[1])
+ else:
+ return None
+
+ def transport_response_simplified(self, response):
+ repos = response.json()['values']
+ return [self.get_model_from_repo(repo) for repo in repos]
+
+And this is the entire source code for the GitHub repository lister::
+
+ # Copyright (C) 2017 the Software Heritage developers
+ # License: GNU General Public License version 3 or later
+ # See top-level LICENSE file for more information
+
+ import time
+ from swh.lister.core.indexing_lister import SWHIndexingHttpLister
+ from swh.lister.github.models import GitHubModel
+
+ class GitHubLister(SWHIndexingHttpLister):
+ PATH_TEMPLATE = '/repositories?since=%d'
+ MODEL = GitHubModel
+
+ def get_model_from_repo(self, repo):
+ return {'uid': repo['id'],
+ 'indexable': repo['id'],
+ 'name': repo['name'],
+ 'full_name': repo['full_name'],
+ 'html_url': repo['html_url'],
+ 'origin_url': repo['html_url'],
+ 'origin_type': 'git',
+ 'description': repo['description']}
+
+ def get_next_target_from_response(self, response):
+ if 'next' in response.links:
+ next_url = response.links['next']['url']
+ return int(next_url.split('since=')[1])
+ else:
+ return None
+
+ def transport_response_simplified(self, response):
+ repos = response.json()
+ return [self.get_model_from_repo(repo) for repo in repos]
+
+ def request_headers(self):
+ return {'Accept': 'application/vnd.github.v3+json'}
+
+ def transport_quota_check(self, response):
+ remain = int(response.headers['X-RateLimit-Remaining'])
+ if response.status_code == 403 and remain == 0:
+ reset_at = int(response.headers['X-RateLimit-Reset'])
+ delay = min(reset_at - time.time(), 3600)
+ return True, delay
+ else:
+ return False, 0
+
+We can see that there are some common elements:
+
+* Both use the HTTP transport mixin (:class:`SWHIndexingHttpLister
+ `) just combines
+ :class:`SWHListerHttpTransport
+ ` and
+ :class:`SWHIndexingLister
+ `) to get most of the
+ network request functionality for free.
+
+* Both also define ``MODEL`` and ``PATH_TEMPLATE`` variables. It should be
+ clear to developers that ``PATH_TEMPLATE``, when combined with the base
+ service URL (e.g., ``https://some_service.com``) and passed a value (the
+ 'foo' index described earlier) results in a complete identifier for making
+ API requests to these services. It is required by our HTTP module.
+
+* Both services respond using JSON, so both implementations of
+ ``transport_response_simplified`` are similar and quite short.
+
+We can also see that there are a few differences:
+
+* GitHub sends the next URL as part of the response header, while BitBucket
+ sends it in the response body.
+
+* GitHub differentiates API versions with a request header (our HTTP transport
+mix-in will automatically use any headers provided by an optional
+request_headers method that we implement here), while BitBucket has it as part
+of their base service URL. BitBucket uses the IETF standard HTTP 429 response
+code for their rate limit notifications (the HTTP transport mix-in
+automatically handles that), while GitHub uses their own custom response
+headers that need special treatment.
+
+* But look at them! 58 lines of Python code, combined, to absorb all
+ repositories from two of the largest and most influential source code hosting
+ services.
+
+Ok, so what is going on behind the scenes?
+
+To trace the operation of the code, let's start with a sample instantiation and
+progress from there to see which methods get called when. What follows will be
+a series of extremely reductionist pseudocode methods. This is not what the
+code actually looks like (it's not even real code), but it does have the same
+basic flow. Bear with me while I try to lay out lister operation in a
+quasi-linear way…::
+
+ # main task
+
+ ghl = GitHubLister(lister_name='github.com',
+ api_baseurl='https://github.com')
+ ghl.run()
+
+⇓ (SWHIndexingLister.run)::
+
+ # SWHIndexingLister.run
+
+ identifier = None
+ do
+ response, repos = SWHListerBase.ingest_data(identifier)
+ identifier = GitHubLister.get_next_target_from_response(response)
+ while(identifier)
+
+⇓ (SWHListerBase.ingest_data)::
+
+ # SWHListerBase.ingest_data
+
+ response = SWHListerBase.safely_issue_request(identifier)
+ repos = GitHubLister.transport_response_simplified(response)
+ injected = SWHListerBase.inject_repo_data_into_db(repos)
+ return response, injected
+
+⇓ (SWHListerBase.safely_issue_request)::
+
+ # SWHListerBase.safely_issue_request
+
+ repeat:
+ resp = SWHListerHttpTransport.transport_request(identifier)
+ retry, delay = SWHListerHttpTransport.transport_quota_check(resp)
+ if retry:
+ sleep(delay)
+ until((not retry) or too_many_retries)
+ return resp
+
+⇓ (SWHListerHttpTransport.transport_request)::
+
+ # SWHListerHttpTransport.transport_request
+
+ path = SWHListerBase.api_baseurl
+ + SWHListerHttpTransport.PATH_TEMPLATE % identifier
+ headers = SWHListerHttpTransport.request_headers()
+ return http.get(path, headers)
+
+(Oh look, there's our ``PATH_TEMPLATE``)
+
+⇓ (SWHListerHttpTransport.request_headers)::
+
+ # SWHListerHttpTransport.request_headers
+
+ override → GitHubLister.request_headers
+
+↑↑ (SWHListerBase.safely_issue_request)
+
+⇓ (SWHListerHttpTransport.transport_quota_check)::
+
+ # SWHListerHttpTransport.transport_quota_check
+
+ override → GitHubLister.transport_quota_check
+
+And then we're done. From start to finish, I hope this helps you understand how
+the few customized pieces fit into the new shared plumbing.
+
+Now you can go and write up a lister for a code hosting site we don't have yet!