diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -13,33 +13,128 @@
 This uses docker with docker-compose, so ensure you have a working docker
 environment and docker-compose is installed.

-## How to use
+## Quick start
+
+First, start the containers:

 ```
-docker-compose up
+~/swh-environment/swh-docker-dev$ docker-compose up -d
+[...]
+Creating swh-docker-dev_amqp_1 ... done
+Creating swh-docker-dev_zookeeper_1 ... done
+Creating swh-docker-dev_kafka_1 ... done
+Creating swh-docker-dev_flower_1 ... done
+Creating swh-docker-dev_swh-scheduler-db_1 ... done
+[...]
 ```

 This will build docker images and run them.

-Press Ctrl-C when you want to stop it.
+Check that everything is running fine with:
+
+```
+~/swh-environment/swh-docker-dev$ docker-compose ps
+              Name                            Command               State                                     Ports
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+swh-docker-dev_amqp_1             docker-entrypoint.sh rabbi ...   Up      15671/tcp, 0.0.0.0:5018->15672/tcp, 25672/tcp, 4369/tcp, 5671/tcp, 5672/tcp
+swh-docker-dev_flower_1           flower --broker=amqp://gue ...   Up      0.0.0.0:5555->5555/tcp
+swh-docker-dev_kafka_1            start-kafka.sh                   Up      0.0.0.0:9092->9092/tcp
+swh-docker-dev_swh-deposit-db_1   docker-entrypoint.sh postgres    Up      5432/tcp
+swh-docker-dev_swh-deposit_1      /entrypoint.sh                   Up      0.0.0.0:5006->5006/tcp
+[...]
+```
+
+Note: if a container failed to start, its status will be marked as `Exit 1`
+instead of `Up`. You can check why using the `docker-compose logs` command. For
+example:
+
+```
+~/swh-environment/swh-docker-dev$ docker-compose logs swh-lister-debian
+Attaching to swh-docker-dev_swh-lister-debian_1
+[...]
+swh-lister-debian_1 | Processing /src/swh-scheduler
+swh-lister-debian_1 | Could not install packages due to an EnvironmentError: [('/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz', '/tmp/pip-req-build-pm7nsax3/.hypothesis/unicodedata/8.0.0/charmap.json.gz', "[Errno 13] Permission denied: '/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz'")]
+swh-lister-debian_1 |
+```
+
+Once all the containers are running, you can use the web interface by opening
+http://localhost:5080/ in your web browser.
+
+At this point, the archive is empty and needs to be filled with content. To do
+so, you can create tasks that will scrape a forge. For example, to inject the
+code from the https://0xacab.org gitlab forge:
+
+```
+~/swh-environment/swh-docker-dev$ docker-compose run swh-scheduler-api \
+    swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \
+    task add swh-lister-gitlab-full -p oneshot api_baseurl=https://0xacab.org/api/v4
+
+Created 1 tasks
+
+Task 1
+  Next run: just now (2018-12-19 14:58:49+00:00)
+  Interval: 90 days, 0:00:00
+  Type: swh-lister-gitlab-full
+  Policy: oneshot
+  Args:
+  Keyword args:
+    api_baseurl=https://0xacab.org/api/v4
+```
+
+This task will scrape the forge's project list and create subtasks to inject
+each git repository found there.
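+The same pattern works for any GitLab instance; only the `api_baseurl` keyword
+argument changes. As a rough sketch (the forge URL below is a made-up
+placeholder; the actual `task add` invocation is the one shown above):

```shell
# Derive the api_baseurl for an arbitrary GitLab forge.
# FORGE_URL is a placeholder; substitute the forge you want to list.
FORGE_URL=https://gitlab.example.org
API_BASEURL="${FORGE_URL}/api/v4"
echo "api_baseurl=${API_BASEURL}"

# With the containers up, the task would then be created with:
#   docker-compose run swh-scheduler-api \
#       swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \
#       task add swh-lister-gitlab-full -p oneshot api_baseurl="$API_BASEURL"
```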
-To run them in a detached (background) mode:
+This will take a bit of time to complete, but you can follow Celery activity on
+the flower instance: http://localhost:5080/flower/
+
+To increase the speed at which git repositories are imported, you can spawn more
+`swh-loader-git` workers:

 ```
-docker-compose up -d
+~/swh-environment/swh-docker-dev$ export CELERY_BROKER_URL=amqp://:5072//
+~/swh-environment/swh-docker-dev$ celery status
+mercurial@8f63da914c26: OK
+debian@8a1c6ced237b: OK
+debian@d4be158f1759: OK
+pypi@41187053b90d: OK
+dir@52a19b9ba606: OK
+pypi@9be0cdcb484c: OK
+github@101d702d6e1d: OK
+bitbucket@1770d3b81da8: OK
+svn@9b2e473d466b: OK
+git@ae6ddafca382: OK
+tar@e17c0bc4392d: OK
+npm@ccfc73f73c4b: OK
+gitlab@280a937595f3: OK
+
+~/swh-environment/swh-docker-dev$ celery control pool_grow 3 -d git@ae6ddafca382
+-> git@ae6ddafca382: OK
+        pool will grow
+~/swh-environment/swh-docker-dev$ celery inspect -d git@ae6ddafca382 stats | grep prefetch_count
+    "prefetch_count": 4,
 ```

-To run only the objstorage API:
+Note: these commands assume you have `celery` available on your host
+machine.
+
+Now there are 4 worker processes ingesting git repositories.
+You can also increase the number of `swh-loader-git` containers:

 ```
-docker-compose up swh-objstorage
+~/swh-environment/swh-docker-dev$ docker-compose up -d --scale swh-loader-git=4
+[...]
+Creating swh-docker-dev_swh-loader-git_2 ... done
+Creating swh-docker-dev_swh-loader-git_3 ... done
+Creating swh-docker-dev_swh-loader-git_4 ... done
 ```
+

 ### Install a package from sources

-It is possible to run a docker with some swh packages installed from sources
-instead of from pypi. To do this you must write a docker-compose override
-file. An example is given in docker-compose.override.yml.example:
+It is possible to run a docker container with some swh packages installed from
+sources instead of using the latest published packages from pypi.
+To do this you
+must write a docker-compose override file (`docker-compose.override.yml`). An
+example is given in the `docker-compose.override.yml.example` file:

 ```
 version: '2'
@@ -50,12 +145,12 @@
       - "/home/ddouard/src/swh-environment/swh-objstorage:/src/swh-objstorage"
 ```

-A file named docker-compose.override.yml will automatically be loaded by
-docker-compose.
+The file named `docker-compose.override.yml` will automatically be loaded by
+`docker-compose`.

-This example shows the simple case of the swh-objstorage package: you just have to
-mount it in the container in /src and the entrypoint will ensure every
-swh-* package found in /src/ is installed (using `pip install -e` so you can
+This example shows the simple case of the `swh-objstorage` package: you just have to
+mount it in the container in `/src` and the entrypoint will ensure every
+swh-* package found in `/src/` is installed (using `pip install -e` so you can
 easily hack your code. If the application you play with have autoreload
 support, there is even no need for restarting the impacted container.)

@@ -88,21 +183,82 @@
 That means, you can start doing the ingestion using those services using the
 same setup described in the getting-started starting
-directly at [1]. Yes, even browsing the web app!
+directly at [1].

 [1] https://docs.softwareheritage.org/devel/getting-started.html#step-4-ingest-repositories

+### Exposed Ports

-## Importing contents
+Several services have their listening ports exposed on the host:
+
+- amqp: 5072
+- kafka: 5092
+- nginx: 5080
+
+And for SWH services:
+
+- scheduler API: 5008
+- storage API: 5002
+- object storage API: 5003
+- indexer API: 5007
+- web app: 5004
+- deposit app: 5006
+
+Beware that these ports are not the same as the ports used from within the
+docker network. This means that the same command executed from the host or from
+a docker container will not use the same URLs to access services.
+For example,
+to use the `celery` utility from the host, you may type:
+
+```
+~/swh-environment/swh-docker-dev$ CELERY_BROKER_URL=amqp://:5072// celery status
+dir@52a19b9ba606: OK
+[...]
+```
+
+To run the same command from within a container:
+
+```
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api bash
+root@01dba49adf37:/# CELERY_BROKER_URL=amqp://amqp:5672// celery status
+dir@52a19b9ba606: OK
+[...]
+```
+
+## Managing tasks
+
+One of the main components of the Software Heritage platform is the task
+system. It is used to manage everything related to background processes:
+discovering new git repositories to import, ingesting them, checking that a
+known repository is up to date, etc.
+
+The task system is based on Celery but uses a custom database-based scheduler.
+The term 'task' may therefore designate either a Celery task or a SWH one
+(i.e. the entity in the database). When this documentation simply says "task",
+it means the SWH task.
+
+When a SWH task is ready to be executed, a Celery task is created to handle the
+actual SWH task's job. Note that not all Celery tasks are directly linked to a
+SWH task (some SWH tasks are implemented using a Celery task that spawns Celery
+subtasks).
+
+A (SWH) task can be `recurring` or `oneshot`. `oneshot` tasks are only executed
+once, whereas `recurring` tasks are executed regularly. The scheduling
+configuration of these recurring tasks can be set via the fields
+`current_interval` and `priority` (which can be 'high', 'normal' or 'low') of
+the task database entity.

 ### Inserting a new lister task

-To list the content of a source code provider like github or the Debian
+To list the content of a source code provider like github or a Debian
 distribution, you may add a new task for this.

-This task should then spawn a series of loader tasks.
+This task will (generally) scrape a web page or use a public API to identify
+the list of published software artefacts (git repos, debian source packages,
+etc.)
+
+Then, for each repository found, a new task will be created to ingest it and
+keep it up to date.

 For example, to add a (one shot) task that will list git repos on the
 0xacab.org gitlab instance, one can do (from this git repository):

@@ -202,8 +358,9 @@
 ### Monitoring activity

-You can monitor the workers activity by connecting to the RabbitMQ console
-on `http://localhost:5018`
+You can monitor the workers' activity by connecting to the RabbitMQ console on
+`http://localhost:5018` or the Celery dashboard (flower) on
+`http://localhost:5555`.

 If you cannot see any task being in fact executed, check the logs of the
 `swh-scheduler-runner` service (here is an example of failure due to the