diff --git a/docs/tutorial.rst b/docs/tutorial.rst
--- a/docs/tutorial.rst
+++ b/docs/tutorial.rst
@@ -14,7 +14,7 @@
 archived by Software Heritage
 `_ through a combination of
 automatic linkage between the listing and loading
-scheduler, new understanding of how to deal with extremely large repository
+scheduler, a new understanding of how to deal with extremely large repository
 hosts like `GitHub `_, and activating a new set of
 repositories that had previously been skipped over.
 
@@ -29,7 +29,7 @@
 archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in
 modern English parlance, "Perfect is the enemy of good," right? Right. So the
 plan from the beginning was to implement a lister for GitHub, then maybe
-implement another one, and then take a few giant steps backward and squint our
+implement another one, and then take a few giant steps backwards and squint our
 eyes.
 
 Why? Because source code hosting services don't behave according to a unified
@@ -56,18 +56,18 @@
 3. Populate a work queue for fetching and ingesting source repositories.
 
 Steps 1 and 3 are generic problems, so they can get generic solutions hidden
-away in base code, most of which never needs to change. That leaves us to
-implement step 2, which can be trivially done now for services with clean web
+away in the base code, most of which never needs to change. That leaves us to
+implement step 2, which can be trivially done now for services with clean web
 APIs.
 
-In the new code we've tried to hide away as much generic functionality as
+In the new code, we've tried to hide away as much generic functionality as
 possible, turning it into set-and-forget plumbing between a few simple
 customized elements. Different hosting services might use different network
 protocols, rate-limit messages, or pagination schemes, but, as long as there is
 some way to get a list of the hosted repositories, we think that the new base
 code will make getting those repositories much easier.
 
-First let me give you the 30,000 foot view…
+First, let me give you the 30,000-foot view…
 
 The old GitHub-specific lister code looked like this (265 lines of Python):
 
@@ -164,6 +164,84 @@
 looks much simpler when we look at the actual implementations of the two
 new-style indexing listers we currently have…
 
+An important aspect of making a new lister is testing it. To register the
+celery tasks of your new lister, you need to add your lister in the main
+conftest.py (swh/lister/core/tests/conftest.py).
+
+After testing, it is suggested to run your new lister in Docker, as it provides
+a good, almost production-like test. Here are the steps you need to follow to
+run a new lister in Docker.
+
+1. You must write a docker-compose override file (`docker-compose.override.yml`).
+   An example is given in the `docker-compose.override.yml.example` file ::
+
+     version: '2'
+
+     services:
+       swh-objstorage:
+         volumes:
+           - "$SWH_ENVIRONMENT_HOME/swh-lister:/src/swh-lister"
+
+   The file named `docker-compose.override.yml` will automatically be loaded by
+   `docker-compose`. For more details, you may refer to the README.md present
+   in swh-docker-dev.
+2. Follow the instructions under the headings Preparation steps and
+   Configuration file sample in the README.md of swh-lister.
+3. Make sure to run the storage (5002) and scheduler (5008) services locally.
+   You can run them with the following command::
+
+     ~/swh-environment/swh-docker-dev$ docker-compose up -d swh-scheduler-api \
+       swh-storage
+4. Add the lister task-type in the scheduler. For example, if you want to
+   add the pypi lister task-type ::
+
+     ~/swh-environment$ swh-scheduler task-type add list-pypi recurring \
+       "Full pypi lister"
+
+   You can list all the task types with::
+
+     ~/swh-environment$ swh scheduler task-type list
+     Known task types:
+     load-svn-from-archive:
+       Loading svn repositories from svn dump
+     load-svn:
+       Create dump of a remote svn repository, mount it and load it
+     load-deposit:
+       Loading deposit archive into swh through swh-loader-tar
+     check-deposit:
+       Pre-checking deposit step before loading into swh archive
+     cook-vault-bundle:
+       Cook a Vault bundle
+     load-hg:
+       Loading mercurial repository swh-loader-mercurial
+     load-hg-from-archive:
+       Loading archive mercurial repository swh-loader-mercurial
+     load-git:
+       Update an origin of type git
+     list-github-incremental:
+       Incrementally list GitHub
+     list-github-full:
+       Full update of GitHub repos list
+     ...
+
+   If your lister creates a loading task whose type is not in the task type
+   list, then you need to add that too. For example, for the GNU lister::
+
+     ~/swh-environment$ swh scheduler task-type add load-gnu recurring \
+       "GNU Loader"
+
+5. Run your lister by importing the lister task and executing it. For example,
+   you need to run these lines to run the pypi lister ::
+
+     import logging
+     from swh.lister.pypi.tasks import pypi_lister
+
+     logging.basicConfig(level=logging.DEBUG)
+     pypi_lister()
+
+After the lister has finished, you can see the loading tasks it created by
+running `swh scheduler task list` from ~/swh-environment/swh-lister.
+
 This is the entire source code for the BitBucket repository lister::
 
     # Copyright (C) 2017 the Software Heritage developers