diff --git a/docs/run_a_new_lister.rst b/docs/run_a_new_lister.rst
new file mode 100644
index 0000000..f2bbc8b
--- /dev/null
+++ b/docs/run_a_new_lister.rst
@@ -0,0 +1,90 @@
+
+:orphan:
+
+.. _run-lister-tutorial:
+
+Tutorial: run a lister within docker-dev in just a few steps
+=====================================================================
+
+It is good practice to run your new lister in docker-dev. This provides an almost
+production-like environment. Testing the lister in docker-dev prior to deployment
+reduces the chances of encountering errors when turning it on in production.
+Here are the steps you need to follow to run a lister within your local environment.
+
+
+1. Edit the docker-compose override file (``docker-compose.override.yml``),
+   following the sample provided::
+
+     version: '2'
+
+     services:
+       swh-lister:
+         volumes:
+           - "$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister"
+
+   The file named ``docker-compose.override.yml`` will automatically be loaded by
+   ``docker-compose``. Having an override makes it possible to run a docker container
+   with some swh packages installed from sources instead of using the latest
+   published packages from pypi. For more details, you may refer to README.md
+   present in ``swh-docker-dev``.
+2. Follow the instructions under the headings **Preparation steps** and
+   **Configuration file sample** in the README.md of swh-lister.
+3. Add the new ``task_modules`` and ``task_queues`` entries for your new lister
+   to the lister configuration by amending the conf/lister.yml file.
+   Here is an example for the GNU lister::
+
+     celery:
+       task_broker: amqp://guest:guest@amqp//
+       task_modules:
+         ...
+         - swh.lister.gnu.tasks
+       task_queues:
+         ...
+         - swh.lister.gnu.tasks.GNUListerTask
+
+4. Make sure to run ``storage (5002)`` and ``scheduler (5008)`` services locally.
+   You may use the following command to run docker::
+
+     ~/swh-environment/swh-docker-dev$ docker-compose up -d
+
+5.
Add the lister task-type in the scheduler. For example, if you want to
+   add the GNU lister task-type::
+
+     ~/swh-environment$ swh scheduler task-type add list-gnu-full \
+       "swh.lister.gnu.tasks.GNUListerTask" "Full GNU lister" \
+       --default-interval '1 day' --backoff-factor 1
+
+   You can list all the task types with::
+
+     ~/swh-environment$ swh scheduler task-type list
+     Known task types:
+     list-bitbucket-incremental:
+       Incrementally list BitBucket
+     list-cran:
+       Full CRAN Lister
+     list-debian-distribution:
+       List a Debian distribution
+     list-github-full:
+       Full update of GitHub repos list
+     list-github-incremental:
+     ...
+
+   If your lister creates a new loading task that is not yet registered, you need
+   to register that task type as well.
+
+6. Run your lister with the help of the scheduler CLI. You need to add the task in
+   the scheduler using its CLI. For example, execute this command
+   to run the GNU lister::
+
+     ~/swh-environment$ swh scheduler --url http://localhost:5008/ task add \
+       list-gnu-full --policy oneshot
+
+After the execution of the lister is complete, you can see the loading task created::
+
+  ~/swh-environment/swh-lister$ swh scheduler task list
+
+You can also check the repositories listed by the lister from the database in
+which the lister output is stored. To connect to the database::
+
+  ~/swh-environment/swh-docker-dev$ docker-compose exec swh-lister bash -c \
+    'psql swh-listers'
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
index 6c656f2..8d91e86 100644
--- a/docs/tutorial.rst
+++ b/docs/tutorial.rst
@@ -1,425 +1,367 @@
 :orphan:

 .. _lister-tutorial:

 Tutorial: list the content of your favorite forge in just a few steps
 =====================================================================

 (the `original version `_ of this article appeared on the Software Heritage blog)

 Back in November 2016, Nicolas Dandrimont wrote about structural code changes `leading to a massive (+15 million!)
upswing in the number of repositories archived by Software Heritage `_ through a combination of automatic linkage between the listing and loading scheduler, new understanding of how to deal with extremely large repository hosts like `GitHub `_, and activating a new set of repositories that had previously been skipped over.

 In the post, Nicolas outlined the three major phases of work in Software Heritage's preservation process (listing, scheduling updates, loading) and highlighted that the ability to preserve the world's free software heritage depends on our ability to find and list the repositories.

 At the time, Software Heritage was only able to list projects on GitHub. Focusing early on GitHub, one of the largest and most active forges in the world, allowed for a big value-to-effort ratio and a rapid launch for the archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in modern English parlance, "Perfect is the enemy of good," right?

 Right. So the plan from the beginning was to implement a lister for GitHub, then maybe implement another one, and then take a few giant steps backward and squint our eyes.

 Why? Because source code hosting services don't behave according to a unified standard. Each new service requires dedicated development time to implement a new scraping client for the non-transferable requirements and intricacies of that service's API. At the time, doing it in an extensible and adaptable way required a level of exposure to the myriad differences between these services that we just didn't think we had yet.

 Nicolas' post closed by saying "We haven't carved out a stable API yet that allows you to just fill in the blanks, as we only have the GitHub lister currently, and a proven API will emerge organically only once we have some diversity."

 That has since changed.
 As of March 6, 2017, the Software Heritage **lister code has been aggressively restructured, abstracted, and commented** to make creating new listers significantly easier. There may yet be a few kinks to iron out, but **now making a new lister is practically like filling in the blanks**.

 Fundamentally, a basic lister must follow these steps:

 1. Issue a network request for a service endpoint.
 2. Convert the response into a canonical format.
 3. Populate a work queue for fetching and ingesting source repositories.

 Steps 1 and 3 are generic problems, so they can get generic solutions hidden away in the base code, most of which never needs to change. That leaves us to implement step 2, which can be trivially done now for services with clean web APIs.

 In the new code, we've tried to hide away as much generic functionality as possible, turning it into set-and-forget plumbing between a few simple customized elements. Different hosting services might use different network protocols, rate-limit messages, or pagination schemes, but, as long as there is some way to get a list of the hosted repositories, we think that the new base code will make getting those repositories much easier.

 First, let me give you the 30,000 foot view…

 The old GitHub-specific lister code looked like this (265 lines of Python):

 .. figure:: images/old_github_lister.png

 By contrast, the new GitHub-specific code looks like this (34 lines of Python):

 .. figure:: images/new_github_lister.png

 And the new BitBucket-specific code is even shorter and looks like this (24 lines of Python):

 .. figure:: images/new_bitbucket_lister.png

 And now this is common shared code in a few abstract base classes, with some new features and loads of docstring comments (in red):

 .. figure:: images/new_base.png

 So how does the lister code work now, and **how might a contributing developer go about making a new one?**

 The first thing to know is that we now have a generic lister base class and ORM model.
A subclass of the lister base should already be able to do almost everything needed to complete a listing task for a single service request/response cycle with the following implementation requirements:

 1. A member variable must be declared called ``MODEL``, which is equal to a subclass (Note: type, not instance) of the base ORM model. The reason for using a subclass is mostly that different services use different, incompatible primary identifiers for their repositories. The model subclasses are typically only one or two additional variable declarations.
 2. A method called ``transport_request`` must be implemented, which takes the complete target identifier (e.g., a URL) and tries to request it one time using whatever transport protocol is required for interacting with the service. It should not attempt to retry on timeouts or do anything else with the response (that is already done for you). It should just either return the response or raise a ``FetchError`` exception.
 3. A method called ``transport_response_to_string`` must be implemented, which takes the entire response of the request in (2) and converts it to a string for logging purposes.
 4. A method called ``transport_quota_check`` must be implemented, which takes the entire response of the request in (2) and checks to see if the process has run afoul of any query quotas or rate limits. If the service says to wait before making more requests, the method should return ``True`` and also the number of seconds to wait, otherwise it returns ``False``.
 5. A method called ``transport_response_simplified`` must be implemented, which also takes the entire response of the request in (2) and converts it to a Python list of dicts (one dict for each repository) with keys given according to the aforementioned ``MODEL`` class members.
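As a rough illustration of the five requirements above, here is a minimal, self-contained sketch. It does not use the real ``swh.lister`` base classes: ``ToyLister``, ``ToyModel``, the canned ``pages`` dict, and the ``Retry-After`` handling are all assumptions made up for this example. Only the method names, ``MODEL``, and ``FetchError`` come from the text, and ``FetchError`` is redefined locally so the snippet runs on its own.

```python
import json


class FetchError(RuntimeError):
    """Local stand-in for the FetchError exception named in the text."""


class ToyModel:
    """Hypothetical stand-in for an ORM model subclass; it only names its fields."""
    fields = ('uid', 'name', 'origin_url')


class ToyLister:
    # Requirement 1: MODEL is a class (a type), not an instance.
    MODEL = ToyModel

    def __init__(self, pages):
        # `pages` maps a target identifier to a canned (status, headers, body)
        # tuple, standing in for a real network service (assumption).
        self.pages = pages

    # Requirement 2: try once, no retries; return the response or raise FetchError.
    def transport_request(self, identifier):
        try:
            return self.pages[identifier]
        except KeyError:
            raise FetchError(identifier) from None

    # Requirement 3: turn the response into a string for logging.
    def transport_response_to_string(self, response):
        status, _headers, body = response
        return '%s %s' % (status, body)

    # Requirement 4: report (must_wait, seconds_to_wait).
    def transport_quota_check(self, response):
        status, headers, _body = response
        if status == 429:
            return True, int(headers.get('Retry-After', 1))
        return False, 0

    # Requirement 5: one dict per repository, keyed by the MODEL fields.
    def transport_response_simplified(self, response):
        _status, _headers, body = response
        return [{key: repo[key] for key in self.MODEL.fields}
                for repo in json.loads(body)]
```

The canned-response design keeps the sketch runnable without network access; a real lister would talk to the service in ``transport_request`` instead.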
 Because 2, 3, and 4 are basically dependent only on the chosen network protocol, we also have an HTTP mix-in module, which supplements the lister base and provides default implementations for those methods along with optional request header injection using the Python Requests library. The ``transport_quota_check`` method as provided follows the IETF standard for communicating rate limits with `HTTP code 429 `_ which some hosting services have chosen not to follow, so it's possible that a specific lister will need to override it.

 On top of all of that, we also provide another layer over the base lister class which adds support for sequentially looping over indices. What are indices? Well, some services (`BitBucket `_ and GitHub for example) don't send you the entire list of all of their repositories at once, because that server response would be unwieldy. Instead they paginate their results, and they also allow you to query their APIs like this: ``https://server_address.tld/query_type?start_listing_from_id=foo``. Changing the value of 'foo' lets you fetch a set of repositories starting from there. We call 'foo' an index, and we call a service that works this way an indexing service. GitHub uses the repository unique identifier and BitBucket uses the repository creation time, but a service can really use anything as long as the values monotonically increase with new repositories. A good indexing service also includes the URL of the next page with a later 'foo' in its responses. For these indexing services we provide another intermediate lister called the indexing lister. Instead of inheriting from :class:`SWHListerBase `, the lister class would inherit from :class:`SWHIndexingLister `.

 Along with the requirements of the lister base, the indexing lister base adds one extra requirement:

 1. A method called ``get_next_target_from_response`` must be defined, which takes a complete request response and returns the index ('foo' above) of the next page.
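To make the index idea concrete, here is a small sketch of how the next index might be pulled out of a "next page" URL. The helper name ``next_index_from_url`` is hypothetical and not part of the lister code; it only uses the standard library, and the query parameter name is taken from the example URL above.

```python
from urllib import parse


def next_index_from_url(next_url, param):
    """Return the value of query parameter `param` in `next_url`, or None.

    None means the service did not advertise a next page.
    """
    if not next_url:
        return None
    query = parse.urlsplit(next_url).query
    # parse_qs percent-decodes values and returns a list per parameter.
    values = parse.parse_qs(query).get(param)
    return values[0] if values else None


# Example call with the placeholder URL used in the text.
index = next_index_from_url(
    'https://server_address.tld/query_type?start_listing_from_id=foo',
    'start_listing_from_id')
```

A ``get_next_target_from_response`` implementation could delegate to a helper like this once it has located the next-page URL in the response header or body.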
 So those are all the basic requirements. There are, of course, a few other little bits and pieces (covered for now in the code's docstring comments), but for the most part that's it. It sounds like a lot of information to absorb and implement, but remember that most of the implementation requirements mentioned above are already provided for 99% of services by the HTTP mix-in module. It looks much simpler when we look at the actual implementations of the two new-style indexing listers we currently have…

-An important aspect for making a new lister is its testing. To register the
-celery tasks of your new lister, you need to add your lister in the main
-conftest.py (swh/lister/core/tests/conftest.py)
-
-After testing, it is suggested to run your new lister in docker as it provides
-good, almost-production like test. Here are the steps you need to follow to run
-a new lister in docker.
-
-1. You must write a docker-compose override file (`docker-compose.override.yml`).
-   An example is given in the `docker-compose.override.yml.example` file ::
-
-     version: '2'
-
-     services:
-       swh-lister:
-         volumes:
-           - "$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister"
-
-   The file named `docker-compose.override.yml` will automatically be loaded by
-   `docker-compose`. For more details, you may refer to README.md present in
-   swh-docker-dev.
-2. Follow the instruction mentioned under heading Preparation steps and
-   Configuration file sample in README.md of swh-lister.
-3. Make sure to run storage (5002) and scheduler (5008) services locally.
-   You can run them by the following command::
-
-     ~/swh-environment/swh-docker-dev$ docker-compose up -d swh-scheduler-api \
-       swh-storage
-4. Add the lister task-type in the scheduler.
For example, if you want to
-   add pypi lister task-type ::
-
-     ~/swh-environment$swh-scheduler task-type add list-pypi recurring \
-       "Full pypi lister"
-
-   You can check all the task-type by::
-
-     ~/swh-environment$swh scheduler task-type list
-     Known task types:
-     list-bitbucket-incremental:
-       Incrementally list BitBucket
-     list-cran:
-       Full CRAN Lister
-     list-debian-distribution:
-       List a Debian distribution
-     list-github-full:
-       Full update of GitHub repos list
-     list-github-incremental:
-     ...
-
-   If your lister is creating new loading task not yet registered, you need
-   to register that task type as well. Like for GNU lister::
-
-     ~/swh-environment$swh scheduler task-type add load-gnu-full recurring \
-       "GNU Loader"
-
-5. Run your lister with the help of scheduler cli.You need to add the task in
-   the schedular using its cli. For example you need to execute this command
-   to run gnu lister ::
-
-     ~/swh-environment$swh scheduler --url http://localhost:5008/ task add \
-       list-gnu-full --policy oneshot
-
-After the execution of lister is complete you can see the loading task created.
-  ~/swh-environment/swh-lister$swh scheduler task list
+When developing a new lister, it's important to test. For this, add the tests
+(check `swh/lister/*/tests/`) and register the celery tasks in the main
+conftest.py (`swh/lister/core/tests/conftest.py`).
+
+Another important step is to actually run it within the
+docker-dev (:ref:`run-lister-tutorial`).
 This is the entire source code for the BitBucket repository lister::

     # Copyright (C) 2017 the Software Heritage developers
     # License: GNU General Public License version 3 or later
     # See top-level LICENSE file for more information

     from urllib import parse

     from swh.lister.bitbucket.models import BitBucketModel
     from swh.lister.core.indexing_lister import SWHIndexingHttpLister


     class BitBucketLister(SWHIndexingHttpLister):
         PATH_TEMPLATE = '/repositories?after=%s'
         MODEL = BitBucketModel

         def get_model_from_repo(self, repo):
             return {'uid': repo['uuid'],
                     'indexable': repo['created_on'],
                     'name': repo['name'],
                     'full_name': repo['full_name'],
                     'html_url': repo['links']['html']['href'],
                     'origin_url': repo['links']['clone'][0]['href'],
                     'origin_type': repo['scm'],
                     'description': repo['description']}

         def get_next_target_from_response(self, response):
             body = response.json()
             if 'next' in body:
                 return parse.unquote(body['next'].split('after=')[1])
             else:
                 return None

         def transport_response_simplified(self, response):
             repos = response.json()['values']
             return [self.get_model_from_repo(repo) for repo in repos]

 And this is the entire source code for the GitHub repository lister::

     # Copyright (C) 2017 the Software Heritage developers
     # License: GNU General Public License version 3 or later
     # See top-level LICENSE file for more information

     import time

     from swh.lister.core.indexing_lister import SWHIndexingHttpLister
     from swh.lister.github.models import GitHubModel


     class GitHubLister(SWHIndexingHttpLister):
         PATH_TEMPLATE = '/repositories?since=%d'
         MODEL = GitHubModel

         def get_model_from_repo(self, repo):
             return {'uid': repo['id'],
                     'indexable': repo['id'],
                     'name': repo['name'],
                     'full_name': repo['full_name'],
                     'html_url': repo['html_url'],
                     'origin_url': repo['html_url'],
                     'origin_type': 'git',
                     'description': repo['description']}

         def get_next_target_from_response(self, response):
             if 'next' in response.links:
                 next_url = response.links['next']['url']
                 return int(next_url.split('since=')[1])
             else:
                 return None

         def transport_response_simplified(self, response):
             repos = response.json()
             return [self.get_model_from_repo(repo) for repo in repos]

         def request_headers(self):
             return {'Accept': 'application/vnd.github.v3+json'}

         def transport_quota_check(self, response):
             remain = int(response.headers['X-RateLimit-Remaining'])
             if response.status_code == 403 and remain == 0:
                 reset_at = int(response.headers['X-RateLimit-Reset'])
                 delay = min(reset_at - time.time(), 3600)
                 return True, delay
             else:
                 return False, 0

 We can see that there are some common elements:

 * Both use the HTTP transport mixin (:class:`SWHIndexingHttpLister ` just combines :class:`SWHListerHttpTransport ` and :class:`SWHIndexingLister `) to get most of the network request functionality for free.
 * Both also define ``MODEL`` and ``PATH_TEMPLATE`` variables. It should be clear to developers that ``PATH_TEMPLATE``, when combined with the base service URL (e.g., ``https://some_service.com``) and passed a value (the 'foo' index described earlier), results in a complete identifier for making API requests to these services. It is required by our HTTP module.
 * Both services respond using JSON, so both implementations of ``transport_response_simplified`` are similar and quite short.

 We can also see that there are a few differences:

 * GitHub sends the next URL as part of the response header, while BitBucket sends it in the response body.
 * GitHub differentiates API versions with a request header (our HTTP transport mix-in will automatically use any headers provided by an optional ``request_headers`` method that we implement here), while BitBucket has it as part of their base service URL. BitBucket uses the IETF standard HTTP 429 response code for their rate limit notifications (the HTTP transport mix-in automatically handles that), while GitHub uses their own custom response headers that need special treatment.
 * But look at them!
58 lines of Python code, combined, to absorb all repositories from two of the largest and most influential source code hosting services.

 Ok, so what is going on behind the scenes?

 To trace the operation of the code, let's start with a sample instantiation and progress from there to see which methods get called when. What follows will be a series of extremely reductionist pseudocode methods. This is not what the code actually looks like (it's not even real code), but it does have the same basic flow. Bear with me while I try to lay out lister operation in a quasi-linear way…::

     # main task

     ghl = GitHubLister(lister_name='github.com',
                        api_baseurl='https://github.com')
     ghl.run()

 ⇓ (SWHIndexingLister.run)::

     # SWHIndexingLister.run

     identifier = None
     do
         response, repos = SWHListerBase.ingest_data(identifier)
         identifier = GitHubLister.get_next_target_from_response(response)
     while(identifier)

 ⇓ (SWHListerBase.ingest_data)::

     # SWHListerBase.ingest_data

     response = SWHListerBase.safely_issue_request(identifier)
     repos = GitHubLister.transport_response_simplified(response)
     injected = SWHListerBase.inject_repo_data_into_db(repos)
     return response, injected

 ⇓ (SWHListerBase.safely_issue_request)::

     # SWHListerBase.safely_issue_request

     repeat:
         resp = SWHListerHttpTransport.transport_request(identifier)
         retry, delay = SWHListerHttpTransport.transport_quota_check(resp)
         if retry:
             sleep(delay)
     until((not retry) or too_many_retries)
     return resp

 ⇓ (SWHListerHttpTransport.transport_request)::

     # SWHListerHttpTransport.transport_request

     path = SWHListerBase.api_baseurl
         + SWHListerHttpTransport.PATH_TEMPLATE % identifier
     headers = SWHListerHttpTransport.request_headers()
     return http.get(path, headers)

 (Oh look, there's our ``PATH_TEMPLATE``)

 ⇓ (SWHListerHttpTransport.request_headers)::

     # SWHListerHttpTransport.request_headers

     override → GitHubLister.request_headers

 ↑↑ (SWHListerBase.safely_issue_request)

 ⇓ (SWHListerHttpTransport.transport_quota_check)::

     #
SWHListerHttpTransport.transport_quota_check

     override → GitHubLister.transport_quota_check

 And then we're done. From start to finish, I hope this helps you understand how the few customized pieces fit into the new shared plumbing.

 Now you can go and write up a lister for a code hosting site we don't have yet!
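For readers who would like to step through that flow, here is a runnable approximation of the pseudocode trace. It is only a sketch under stated assumptions: the function names mirror the pseudocode, but the signatures, the callables passed in (``fetch``, ``quota_check``, ``simplify``, ``next_target``, ``store``), and the ``max_retries`` default are invented for this example and are not the real ``swh.lister`` API.

```python
import time


def safely_issue_request(identifier, fetch, quota_check,
                         max_retries=3, sleep=time.sleep):
    """Fetch one page, waiting and retrying while the quota check says so."""
    attempts = 0
    while True:
        response = fetch(identifier)
        must_wait, delay = quota_check(response)
        attempts += 1
        # Mirrors: until((not retry) or too_many_retries)
        if not must_wait or attempts >= max_retries:
            return response
        sleep(delay)


def run(fetch, quota_check, simplify, next_target, store, sleep=time.sleep):
    """Walk every page, starting from identifier None, until there is no next one."""
    identifier = None
    while True:
        response = safely_issue_request(identifier, fetch, quota_check,
                                        sleep=sleep)
        store(simplify(response))
        identifier = next_target(response)
        # The pseudocode tests truthiness; `is None` also lets index 0 through.
        if identifier is None:
            break


# Tiny in-memory "service" with two pages (hypothetical data).
demo_pages = {None: {'repos': ['a', 'b'], 'next': 2},
              2: {'repos': ['c'], 'next': None}}
demo_stored = []
run(fetch=demo_pages.__getitem__,
    quota_check=lambda response: (False, 0),
    simplify=lambda response: response['repos'],
    next_target=lambda response: response['next'],
    store=demo_stored.extend)
```

Passing the customized pieces in as plain callables keeps the shared loop independent of any one service, which is the same separation the base classes and the HTTP mix-in aim for.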