diff --git a/docs/images/new_base.png b/docs/images/new_base.png
new file mode 100644
index 0000000..2a2e3fc
Binary files /dev/null and b/docs/images/new_base.png differ
diff --git a/docs/images/new_bitbucket_lister.png b/docs/images/new_bitbucket_lister.png
new file mode 100644
index 0000000..7c491bb
Binary files /dev/null and b/docs/images/new_bitbucket_lister.png differ
diff --git a/docs/images/new_github_lister.png b/docs/images/new_github_lister.png
new file mode 100644
index 0000000..e5a7fba
Binary files /dev/null and b/docs/images/new_github_lister.png differ
diff --git a/docs/images/old_github_lister.png b/docs/images/old_github_lister.png
new file mode 100644
index 0000000..65398a0
Binary files /dev/null and b/docs/images/old_github_lister.png differ
diff --git a/docs/index.rst b/docs/index.rst
index 9a991e8..653b85e 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,17 +1,22 @@
 .. _swh-lister:

-Software Heritage - Development Documentation
-=============================================
+Software Heritage listers
+=========================

 .. toctree::
    :maxdepth: 2
    :caption: Contents:

+Overview
+--------
+
+* :ref:`lister-tutorial`
+
 Indices and tables
-==================
+------------------

 * :ref:`genindex`
 * :ref:`modindex`
 * :ref:`search`
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
new file mode 100644
index 0000000..904be35
--- /dev/null
+++ b/docs/tutorial.rst
@@ -0,0 +1,357 @@
.. _lister-tutorial:

Tutorial: list the content of your favorite forge in just a few steps
======================================================================

(the original version of this article appeared on the Software Heritage blog)

Back in November 2016, Nicolas Dandrimont wrote about structural code changes
leading to a massive (+15 million!) upswing in the number of repositories
archived by Software Heritage, through a combination of automatic linkage
between the listing and loading scheduler, a new understanding of how to deal
with extremely large repository hosts like `GitHub <https://github.com>`_, and
the activation of a new set of repositories that had previously been skipped
over.

In that post, Nicolas outlined the three major phases of work in Software
Heritage's preservation process (listing, scheduling updates, loading) and
highlighted that the ability to preserve the world's free software heritage
depends on our ability to find and list the repositories.

At the time, Software Heritage was only able to list projects on GitHub.
Focusing early on GitHub, one of the largest and most active forges in the
world, allowed for a big value-to-effort ratio and a rapid launch for the
archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in
modern English parlance, "Perfect is the enemy of good," right? Right. So the
plan from the beginning was to implement a lister for GitHub, then maybe
implement another one, and then take a few giant steps backward and squint our
eyes.

Why? Because source code hosting services don't behave according to a unified
standard. Each new service requires dedicated development time to implement a
new scraping client for the non-transferable requirements and intricacies of
that service's API. At the time, doing it in an extensible and adaptable way
required a level of exposure to the myriad differences between these services
that we just didn't think we had yet.

Nicolas' post closed by saying "We haven't carved out a stable API yet that
allows you to just fill in the blanks, as we only have the GitHub lister
currently, and a proven API will emerge organically only once we have some
diversity."

That has since changed. As of March 6, 2017, the Software Heritage **lister
code has been aggressively restructured, abstracted, and commented** to make
creating new listers significantly easier. There may yet be a few kinks to
iron out, but **now making a new lister is practically like filling in the
blanks**.

Fundamentally, a basic lister must follow these steps:

1. Issue a network request for a service endpoint.
2. Convert the response into a canonical format.
3. Populate a work queue for fetching and ingesting source repositories.

Steps 1 and 3 are generic problems, so they can get generic solutions hidden
away in base code, most of which never needs to change. That leaves us to
implement step 2, which can now be done trivially for services with clean web
APIs.

In the new code we've tried to hide away as much generic functionality as
possible, turning it into set-and-forget plumbing between a few simple
customized elements. Different hosting services might use different network
protocols, rate-limit messages, or pagination schemes, but, as long as there
is some way to get a list of the hosted repositories, we think that the new
base code will make getting those repositories much easier.

First, let me give you the 30,000-foot view…

The old GitHub-specific lister code looked like this (265 lines of Python):

.. figure:: images/old_github_lister.png

By contrast, the new GitHub-specific code looks like this (34 lines of
Python):

.. figure:: images/new_github_lister.png

And the new BitBucket-specific code is even shorter and looks like this (24
lines of Python):

.. figure:: images/new_bitbucket_lister.png

And this is now common shared code in a few abstract base classes, with some
new features and loads of docstring comments (in red):

.. figure:: images/new_base.png

So how does the lister code work now, and **how might a contributing
developer go about making a new one?**

The first thing to know is that we now have a generic lister base class and
ORM model. A subclass of the lister base should already be able to do almost
everything needed to complete a listing task for a single service
request/response cycle, given the following implementation requirements:

1. A member variable called ``MODEL`` must be declared, equal to a subclass
   (note: the type, not an instance) of the base ORM model. The reason for
   using a subclass is mostly that different services use different,
   incompatible primary identifiers for their repositories. The model
   subclasses typically add only one or two variable declarations.

2. A method called ``transport_request`` must be implemented, which takes the
   complete target identifier (e.g., a URL) and tries to request it one time
   using whatever transport protocol is required for interacting with the
   service. It should not attempt to retry on timeouts or do anything else
   with the response (that is already done for you). It should just either
   return the response or raise a ``FetchError`` exception.

3. A method called ``transport_response_to_string`` must be implemented,
   which takes the entire response of the request made in (2) and converts it
   to a string for logging purposes.

4. A method called ``transport_quota_check`` must be implemented, which takes
   the entire response of the request made in (2) and checks whether the
   process has run afoul of any query quotas or rate limits. If the service
   says to wait before making more requests, the method should return
   ``True`` along with the number of seconds to wait; otherwise it should
   return ``False`` and a zero delay.

5. A method called ``transport_response_simplified`` must be implemented,
   which also takes the entire response of the request made in (2) and
   converts it to a Python list of dicts (one dict for each repository) whose
   keys match the aforementioned ``MODEL`` class members.

Because requirements 2, 3, and 4 basically depend only on the chosen network
protocol, we also have an HTTP mix-in module, which supplements the lister
base and provides default implementations of those methods, along with
optional request header injection using the Python Requests library. The
``transport_quota_check`` method as provided follows the IETF standard for
communicating rate limits with `HTTP code 429
<https://tools.ietf.org/html/rfc6585>`_, which some hosting services have
chosen not to follow, so it's possible that a specific lister will need to
override it.

On top of all of that, we also provide another layer over the base lister
class which adds support for sequentially looping over indices. What are
indices? Well, some services (`BitBucket <https://bitbucket.org>`_ and GitHub,
for example) don't send you the entire list of all of their repositories at
once, because that server response would be unwieldy. Instead they paginate
their results, and they also allow you to query their APIs like this:
``https://server_address.tld/query_type?start_listing_from_id=foo``. Changing
the value of 'foo' lets you fetch a set of repositories starting from there.
We call 'foo' an index, and we call a service that works this way an indexing
service. GitHub uses the repository unique identifier and BitBucket uses the
repository creation time, but a service can really use anything as long as
the values monotonically increase with new repositories. A good indexing
service also includes the URL of the next page, with a later 'foo', in its
responses. For these indexing services we provide another intermediate lister
called the indexing lister. Instead of inheriting from
:class:`SWHListerBase`, the lister class would inherit from
:class:`SWHIndexingLister`. Along with the requirements of the lister base,
the indexing lister base adds one extra requirement:

1. A method called ``get_next_target_from_response`` must be defined, which
   takes a complete request response and returns the index ('foo' above) of
   the next page.

So those are all the basic requirements. There are, of course, a few other
little bits and pieces (covered for now in the code's docstring comments), but
for the most part that's it. It sounds like a lot of information to absorb and
implement, but remember that most of the implementation requirements mentioned
above are already provided for 99% of services by the HTTP mix-in module.
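
To make the contract concrete, here is a bare-bones sketch of a lister that
spells out all five requirements by hand. It is illustrative only: the
service, the JSON field names, the ``ExampleModel`` class, and the exact
import path of :class:`SWHListerBase` and ``FetchError`` are assumptions made
for this example, and for an ordinary HTTP service the three protocol-level
methods (``transport_request``, ``transport_response_to_string`` and
``transport_quota_check``) would normally come from the HTTP mix-in instead of
being written out like this::

    # Illustrative sketch only.  ``ExampleModel``, the JSON field names and
    # the import paths below are assumptions, not quotes from the code base.

    import requests

    from swh.lister.core.lister_base import SWHListerBase, FetchError  # path assumed
    from example.models import ExampleModel  # hypothetical MODEL subclass

    class ExampleLister(SWHListerBase):
        MODEL = ExampleModel  # requirement 1: a model class, not an instance

        def transport_request(self, identifier):
            # requirement 2: one attempt, no retries; failures become FetchError
            try:
                return requests.get(identifier)
            except requests.RequestException as e:
                raise FetchError(e)

        def transport_response_to_string(self, response):
            # requirement 3: a loggable rendering of the raw response
            return response.text

        def transport_quota_check(self, response):
            # requirement 4: (must_wait, seconds_to_wait)
            if response.status_code == 429:  # assumes a numeric Retry-After
                return True, int(response.headers.get('Retry-After', '60'))
            return False, 0

        def transport_response_simplified(self, response):
            # requirement 5: one dict per repository, keyed like MODEL's columns
            return [{'uid': repo['id'], 'name': repo['name']}
                    for repo in response.json()]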

It looks much simpler when we look at the actual implementations of the two
new-style indexing listers we currently have…

This is the entire source code for the BitBucket repository lister::

    # Copyright (C) 2017 the Software Heritage developers
    # License: GNU General Public License version 3 or later
    # See top-level LICENSE file for more information

    from urllib import parse
    from swh.lister.bitbucket.models import BitBucketModel
    from swh.lister.core.indexing_lister import SWHIndexingHttpLister

    class BitBucketLister(SWHIndexingHttpLister):
        PATH_TEMPLATE = '/repositories?after=%s'
        MODEL = BitBucketModel

        def get_model_from_repo(self, repo):
            return {'uid': repo['uuid'],
                    'indexable': repo['created_on'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['links']['html']['href'],
                    'origin_url': repo['links']['clone'][0]['href'],
                    'origin_type': repo['scm'],
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            body = response.json()
            if 'next' in body:
                return parse.unquote(body['next'].split('after=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()['values']
            return [self.get_model_from_repo(repo) for repo in repos]

And this is the entire source code for the GitHub repository lister::

    # Copyright (C) 2017 the Software Heritage developers
    # License: GNU General Public License version 3 or later
    # See top-level LICENSE file for more information

    import time
    from swh.lister.core.indexing_lister import SWHIndexingHttpLister
    from swh.lister.github.models import GitHubModel

    class GitHubLister(SWHIndexingHttpLister):
        PATH_TEMPLATE = '/repositories?since=%d'
        MODEL = GitHubModel

        def get_model_from_repo(self, repo):
            return {'uid': repo['id'],
                    'indexable': repo['id'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['html_url'],
                    'origin_url': repo['html_url'],
                    'origin_type': 'git',
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            if 'next' in response.links:
                next_url = response.links['next']['url']
                return int(next_url.split('since=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()
            return [self.get_model_from_repo(repo) for repo in repos]

        def request_headers(self):
            return {'Accept': 'application/vnd.github.v3+json'}

        def transport_quota_check(self, response):
            remain = int(response.headers['X-RateLimit-Remaining'])
            if response.status_code == 403 and remain == 0:
                reset_at = int(response.headers['X-RateLimit-Reset'])
                delay = min(reset_at - time.time(), 3600)
                return True, delay
            else:
                return False, 0

We can see that there are some common elements:

* Both use the HTTP transport mix-in (:class:`SWHIndexingHttpLister` just
  combines :class:`SWHListerHttpTransport` and :class:`SWHIndexingLister`) to
  get most of the network request functionality for free.

* Both also define ``MODEL`` and ``PATH_TEMPLATE`` variables. It should be
  clear to developers that ``PATH_TEMPLATE``, when combined with the base
  service URL (e.g., ``https://some_service.com``) and passed a value (the
  'foo' index described earlier), results in a complete identifier for making
  API requests to these services. It is required by our HTTP module.

* Both services respond using JSON, so both implementations of
  ``transport_response_simplified`` are similar and quite short.

We can also see that there are a few differences:

* GitHub sends the next URL as part of the response header, while BitBucket
  sends it in the response body.

* GitHub differentiates API versions with a request header (our HTTP transport
  mix-in will automatically use any headers provided by an optional
  ``request_headers`` method that we implement here), while BitBucket has the
  version as part of its base service URL.

* BitBucket uses the IETF-standard HTTP 429 response code for its rate-limit
  notifications (the HTTP transport mix-in automatically handles that), while
  GitHub uses its own custom response headers that need special treatment.

* But look at them! 58 lines of Python code, combined, to absorb all
  repositories from two of the largest and most influential source code
  hosting services.

OK, so what is going on behind the scenes?

To trace the operation of the code, let's start with a sample instantiation
and progress from there to see which methods get called when. What follows
will be a series of extremely reductionist pseudocode methods. This is not
what the code actually looks like (it's not even real code), but it does have
the same basic flow. Bear with me while I try to lay out lister operation in a
quasi-linear way…::

    # main task

    ghl = GitHubLister(lister_name='github.com',
                       api_baseurl='https://api.github.com')
    ghl.run()

⇓ (SWHIndexingLister.run)::

    # SWHIndexingLister.run

    identifier = None
    do
        response, repos = SWHListerBase.ingest_data(identifier)
        identifier = GitHubLister.get_next_target_from_response(response)
    while(identifier)

⇓ (SWHListerBase.ingest_data)::

    # SWHListerBase.ingest_data

    response = SWHListerBase.safely_issue_request(identifier)
    repos = GitHubLister.transport_response_simplified(response)
    injected = SWHListerBase.inject_repo_data_into_db(repos)
    return response, injected

⇓ (SWHListerBase.safely_issue_request)::

    # SWHListerBase.safely_issue_request

    repeat:
        resp = SWHListerHttpTransport.transport_request(identifier)
        retry, delay = SWHListerHttpTransport.transport_quota_check(resp)
        if retry:
            sleep(delay)
    until((not retry) or too_many_retries)
    return resp

⇓ (SWHListerHttpTransport.transport_request)::

    # SWHListerHttpTransport.transport_request

    path = SWHListerBase.api_baseurl
           + SWHListerHttpTransport.PATH_TEMPLATE % identifier
    headers = SWHListerHttpTransport.request_headers()
    return http.get(path, headers)

(Oh look, there's our ``PATH_TEMPLATE``.)

⇓ (SWHListerHttpTransport.request_headers)::

    # SWHListerHttpTransport.request_headers

    override → GitHubLister.request_headers

↑↑ (SWHListerBase.safely_issue_request)

⇓ (SWHListerHttpTransport.transport_quota_check)::

    # SWHListerHttpTransport.transport_quota_check

    override → GitHubLister.transport_quota_check

And then we're done. From start to finish, I hope this helps you understand
how the few customized pieces fit into the new shared plumbing.

Now you can go and write up a lister for a code hosting site we don't have
yet!
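
If you want a starting point, here is a sketch of what such a lister might
look like for a hypothetical forge, following the same pattern as the
BitBucket and GitHub listers above. The forge, its JSON field names, and
``ExampleForgeModel`` are all invented for the sake of the example; only the
structure comes from the real listers::

    # A sketch for an imaginary forge; only the structure mirrors real code.
    # ``ExampleForgeModel`` and the JSON field names are made up.

    from swh.lister.core.indexing_lister import SWHIndexingHttpLister
    from example.models import ExampleForgeModel  # hypothetical MODEL subclass

    class ExampleForgeLister(SWHIndexingHttpLister):
        PATH_TEMPLATE = '/api/repositories?from=%d'  # appended to api_baseurl
        MODEL = ExampleForgeModel

        def get_model_from_repo(self, repo):
            # map the forge's JSON fields onto the MODEL columns
            return {'uid': repo['id'],
                    'indexable': repo['id'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['web_url'],
                    'origin_url': repo['clone_url'],
                    'origin_type': 'git',
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            # the index ('foo') at which the next page starts, or None when done
            return response.json().get('next_from')

        def transport_response_simplified(self, response):
            repos = response.json()['repositories']
            return [self.get_model_from_repo(repo) for repo in repos]

Instantiated with a ``lister_name`` and an ``api_baseurl`` and started with
``run()``, it would walk the imaginary forge's pages exactly as in the trace
above.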