README.md
# swh-docker-dev | # swh-docker-dev | ||||
[Work in progress] | [Work in progress] | ||||
This repo contains Dockerfiles to allow developers to run a small | This repo contains Dockerfiles to allow developers to run a small | ||||
Software Heritage instance on their development computer. | Software Heritage instance on their development computer. | ||||
The end goal is to smooth the contributors/developers workflow. Focus | The end goal is to smooth the contributors/developers workflow. Focus | ||||
on coding, not configuring! | on coding, not configuring! | ||||
## Dependencies | ## Dependencies | ||||
This uses docker with docker-compose, so ensure you have a working | This uses docker with docker-compose, so ensure you have a working | ||||
docker environment and docker-compose is installed. | docker environment and docker-compose is installed. | ||||
## How to use | ## Quick start | ||||
First, start containers: | |||||
``` | ``` | ||||
docker-compose up | ~/swh-environment/swh-docker-dev$ docker-compose up -d | ||||
[...] | |||||
Creating swh-docker-dev_amqp_1 ... done | |||||
Creating swh-docker-dev_zookeeper_1 ... done | |||||
Creating swh-docker-dev_kafka_1 ... done | |||||
Creating swh-docker-dev_flower_1 ... done | |||||
Creating swh-docker-dev_swh-scheduler-db_1 ... done | |||||
[...] | |||||
``` | ``` | ||||
This will build docker images and run them. | This will build docker images and run them. | ||||
Press Ctrl-C when you want to stop it. | Check everything is running fine with: | ||||
``` | |||||
~/swh-environment/swh-docker-dev$ docker-compose ps | |||||
Name Command State Ports | |||||
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | |||||
swh-docker-dev_amqp_1 docker-entrypoint.sh rabbi ... Up 15671/tcp, 0.0.0.0:5018->15672/tcp, 25672/tcp, 4369/tcp, 5671/tcp, 5672/tcp | |||||
swh-docker-dev_flower_1 flower --broker=amqp://gue ... Up 0.0.0.0:5555->5555/tcp | |||||
swh-docker-dev_kafka_1 start-kafka.sh Up 0.0.0.0:9092->9092/tcp | |||||
swh-docker-dev_swh-deposit-db_1 docker-entrypoint.sh postgres Up 5432/tcp | |||||
swh-docker-dev_swh-deposit_1 /entrypoint.sh Up 0.0.0.0:5006->5006/tcp | |||||
[...] | |||||
``` | |||||
To run them in a detached (background) mode: | Note: if a container failed to start, its status will be marked as `Exit 1` | ||||
instead of `Up`. You can check why using the `docker-compose logs` command. For | |||||
example: | |||||
``` | ``` | ||||
docker-compose up -d | ~/swh-environment/swh-docker-dev$ docker-compose logs swh-lister-debian | ||||
Attaching to swh-docker-dev_swh-lister-debian_1 | |||||
[...] | |||||
swh-lister-debian_1 | Processing /src/swh-scheduler | |||||
swh-lister-debian_1 | Could not install packages due to an EnvironmentError: [('/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz', '/tmp/pip-req-build-pm7nsax3/.hypothesis/unicodedata/8.0.0/charmap.json.gz', "[Errno 13] Permission denied: '/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz'")] | |||||
swh-lister-debian_1 | | |||||
``` | ``` | ||||
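The status check above can be scripted to spot failed containers quickly. The following is a minimal sketch, not part of this repository: the `failed_containers` function name is hypothetical, and it assumes the `docker-compose ps` layout shown above (two header lines, the container name in the first column, and the word `Up` in the state of every healthy container).

```shell
# Hypothetical helper: print the names of containers whose state is not "Up".
# Reads `docker-compose ps` output on stdin and skips the two header lines.
failed_containers() {
    awk 'NR > 2 && NF > 0 && $0 !~ /(^|[ \t])Up([ \t]|$)/ { print $1 }'
}

# Usage (from the swh-docker-dev directory):
#   docker-compose ps | failed_containers
```

Each name it prints can then be passed to `docker-compose logs`, after dropping the project prefix and the numeric suffix to get the service name.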
To run only the objstorage API: | Once all the containers are running, you can use the web interface by opening | ||||
http://localhost:5080/ in your web browser. | |||||
At this point, the archive is empty and needs to be filled with content. To do | |||||
so, you can create tasks that will scrape a forge. For example, to inject the | |||||
code from the https://0xacab.org gitlab forge: | |||||
``` | ``` | ||||
docker-compose up swh-objstorage | ~/swh-environment/swh-docker-dev$ docker-compose run swh-scheduler-api \ | ||||
swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \ | |||||
task add swh-lister-gitlab-full -p oneshot api_baseurl=https://0xacab.org/api/v4 | |||||
Created 1 tasks | |||||
Task 1 | |||||
Next run: just now (2018-12-19 14:58:49+00:00) | |||||
Interval: 90 days, 0:00:00 | |||||
Type: swh-lister-gitlab-full | |||||
Policy: oneshot | |||||
Args: | |||||
Keyword args: | |||||
api_baseurl=https://0xacab.org/api/v4 | |||||
``` | ``` | ||||
This task will scrape the forge's project list and create subtasks to inject | |||||
each git repository found there. | |||||
This will take a bit of time to complete, but you can follow Celery activity on | |||||
the flower instance: http://localhost:5080/flower/ | |||||
To increase the speed at which git repositories are imported, you can spawn more | |||||
`swh-loader-git` workers: | |||||
``` | |||||
~/swh-environment/swh-docker-dev$ export CELERY_BROKER_URL=amqp://:5072// | |||||
~/swh-environment/swh-docker-dev$ celery status | |||||
mercurial@8f63da914c26: OK | |||||
debian@8a1c6ced237b: OK | |||||
debian@d4be158f1759: OK | |||||
pypi@41187053b90d: OK | |||||
dir@52a19b9ba606: OK | |||||
pypi@9be0cdcb484c: OK | |||||
github@101d702d6e1d: OK | |||||
bitbucket@1770d3b81da8: OK | |||||
svn@9b2e473d466b: OK | |||||
git@ae6ddafca382: OK | |||||
tar@e17c0bc4392d: OK | |||||
npm@ccfc73f73c4b: OK | |||||
gitlab@280a937595f3: OK | |||||
~/swh-environment/swh-docker-dev$ celery control pool_grow 3 -d git@ae6ddafca382 | |||||
-> git@ae6ddafca382: OK | |||||
pool will grow | |||||
~/swh-environment/swh-docker-dev$ celery inspect -d git@ae6ddafca382 stats | grep prefetch_count | |||||
"prefetch_count": 4, | |||||
``` | |||||
Note: these commands assume you have `celery` available on your host | |||||
machine. | |||||
Now there are 4 workers ingesting git repositories. | |||||
You can also increase the number of `swh-loader-git` containers: | |||||
``` | |||||
~/swh-environment/swh-docker-dev$ docker-compose up -d --scale swh-loader-git=4 | |||||
[...] | |||||
Creating swh-docker-dev_swh-loader-git_2 ... done | |||||
Creating swh-docker-dev_swh-loader-git_3 ... done | |||||
Creating swh-docker-dev_swh-loader-git_4 ... done | |||||
``` | |||||
### Install a package from sources | ### Install a package from sources | ||||
It is possible to run a docker with some swh packages installed from sources | It is possible to run a docker container with some swh packages installed from | ||||
instead of from pypi. To do this you must write a docker-compose override | sources instead of using the latest published packages from pypi. To do this you | ||||
file. An example is given in docker-compose.override.yml.example: | must write a docker-compose override file (`docker-compose.override.yml`). An | ||||
example is given in the `docker-compose.override.yml.example` file: | |||||
``` | ``` | ||||
version: '2' | version: '2' | ||||
services: | services: | ||||
swh-objstorage: | swh-objstorage: | ||||
volumes: | volumes: | ||||
- "/home/ddouard/src/swh-environment/swh-objstorage:/src/swh-objstorage" | - "/home/ddouard/src/swh-environment/swh-objstorage:/src/swh-objstorage" | ||||
``` | ``` | ||||
A file named docker-compose.override.yml will automatically be loaded by | The file named `docker-compose.override.yml` will automatically be loaded by | ||||
docker-compose. | `docker-compose`. | ||||
This example shows the simple case of the swh-objstorage package: you just have to | This example shows the simple case of the `swh-objstorage` package: you just have to | ||||
mount it in the container in /src and the entrypoint will ensure every | mount it in the container in `/src` and the entrypoint will ensure every | ||||
swh-* package found in /src/ is installed (using `pip install -e` so you can | swh-* package found in `/src/` is installed (using `pip install -e` so you can | ||||
easily hack your code). If the application you play with has autoreload support, | easily hack your code). If the application you play with has autoreload support, | ||||
there is not even any need to restart the impacted container. | there is not even any need to restart the impacted container. | ||||
## Details | ## Details | ||||
This runs the following services on their respective standard ports; | This runs the following services on their respective standard ports; | ||||
all of these services are configured to communicate with each | all of these services are configured to communicate with each | ||||
other: | other: | ||||
[...] | [...] | ||||
- swh-loaders: celery workers dedicated to importing/updating source code | - swh-loaders: celery workers dedicated to importing/updating source code | ||||
content (VCS repos, source packages, etc.), | content (VCS repos, source packages, etc.), | ||||
- swh-journal: Persistent logger of changes to the archive, with | - swh-journal: Persistent logger of changes to the archive, with | ||||
publish-subscribe support. | publish-subscribe support. | ||||
That means you can start the ingestion with those services, | That means you can start the ingestion with those services, | ||||
following the getting-started guide, starting | following the getting-started guide, starting | ||||
directly at [1]. Yes, even browsing the web app! | directly at [1]. | ||||
[1] https://docs.softwareheritage.org/devel/getting-started.html#step-4-ingest-repositories | [1] https://docs.softwareheritage.org/devel/getting-started.html#step-4-ingest-repositories | ||||
### Exposed Ports | |||||
Several services have their listening ports exposed on the host: | |||||
- amqp: 5072 | |||||
- kafka: 5092 | |||||
- nginx: 5080 | |||||
And for SWH services: | |||||
- scheduler API: 5008 | |||||
- storage API: 5002 | |||||
- object storage API: 5003 | |||||
- indexer API: 5007 | |||||
- web app: 5004 | |||||
- deposit app: 5006 | |||||
Beware that these ports are not the same as the ports used from within the | |||||
docker network. This means that the same command executed from the host or from | |||||
a docker container will not use the same urls to access services. For example, | |||||
to use the `celery` utility from the host, you may type: | |||||
``` | |||||
~/swh-environment/swh-docker-dev$ CELERY_BROKER_URL=amqp://:5072// celery status | |||||
dir@52a19b9ba606: OK | |||||
[...] | |||||
``` | |||||
To run the same command from within a container: | |||||
``` | |||||
~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api bash | |||||
root@01dba49adf37:/# CELERY_BROKER_URL=amqp://amqp:5672// celery status | |||||
dir@52a19b9ba606: OK | |||||
[...] | |||||
``` | |||||
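The host-side ports can be captured in a small lookup so that host commands do not hard-code them. This is a minimal sketch based on the port list above; the `swh_host_port` function and its short service names are hypothetical and not part of this repository.

```shell
# Hypothetical helper: print the host-side port of a service, following the
# "Exposed Ports" list above. The short names are made up for this sketch.
swh_host_port() {
    case "$1" in
        amqp)       echo 5072 ;;
        kafka)      echo 5092 ;;
        nginx)      echo 5080 ;;
        scheduler)  echo 5008 ;;
        storage)    echo 5002 ;;
        objstorage) echo 5003 ;;
        indexer)    echo 5007 ;;
        webapp)     echo 5004 ;;
        deposit)    echo 5006 ;;
        *)          echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Build the broker URL used from the host in the example above.
CELERY_BROKER_URL="amqp://:$(swh_host_port amqp)//"
export CELERY_BROKER_URL
```

Inside a container this lookup does not apply: services are reached by their docker network name and internal port, as in `amqp://amqp:5672//`.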
## Managing tasks | |||||
vlorentz: components | |||||
One of the main components of the Software Heritage platform is the task system. | |||||
Tasks are used to manage everything related to background processes, like | |||||
discovering new git repositories to import, ingesting them, checking that a known | |||||
repository is up to date, etc. | |||||
The task system is based on Celery but uses a custom database-based scheduler. | |||||
## Importing contents | So the term 'task' may designate either a Celery task or a | ||||
SWH one (i.e. the entity in the database). When the documentation refers simply | |||||
to a "task", it means the SWH task. | |||||
When a SWH task is ready to be executed, a Celery task is created to handle the | |||||
actual SWH task's job. Note that not all Celery tasks are directly linked to a | |||||
SWH task (some SWH tasks are implemented using a Celery task that spawns Celery | |||||
subtasks). | |||||
vlorentz: executed
A (SWH) task can be `recurring` or `oneshot`. `oneshot` tasks are only executed | |||||
once, whereas `recurring` tasks are executed regularly. The scheduling configuration | |||||
of these recurring tasks can be set via the fields `current_interval` and | |||||
`priority` (can be 'high', 'normal' or 'low') of the task database entity. | |||||
### Inserting a new lister task | ### Inserting a new lister task | ||||
To list the content of a source code provider like github or the Debian | To list the content of a source code provider like github or a Debian | ||||
distribution, you may add a new task for this. | distribution, you may add a new task for this. | ||||
This task should then spawn a series of loader tasks. | This task will (generally) scrape a web page or use a public API to identify | ||||
the list of published software artefacts (git repos, debian source packages, | |||||
etc.) | |||||
Then, for each repository, a new task will be created to ingest this repository | |||||
and keep it up to date. | |||||
For example, to add a (one shot) task that will list git repos on the | For example, to add a (one shot) task that will list git repos on the | ||||
0xacab.org gitlab instance, one can do (from this git repository): | 0xacab.org gitlab instance, one can do (from this git repository): | ||||
``` | ``` | ||||
$ docker-compose run swh-scheduler-api \ | $ docker-compose run swh-scheduler-api \ | ||||
swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \ | swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \ | ||||
task add swh-lister-gitlab-full -p oneshot api_baseurl=https://0xacab.org/api/v4 | task add swh-lister-gitlab-full -p oneshot api_baseurl=https://0xacab.org/api/v4 | ||||
[...] | [...] | ||||
indexer_origin_metadata: | indexer_origin_metadata: | ||||
Origin Metadata indexer task | Origin Metadata indexer task | ||||
``` | ``` | ||||
### Monitoring activity | ### Monitoring activity | ||||
You can monitor the workers' activity by connecting to the RabbitMQ console | You can monitor the workers' activity by connecting to the RabbitMQ console on | ||||
on `http://localhost:5018` | `http://localhost:5002` or the Celery dashboard (flower) on | ||||
`http://localhost:5003`. | |||||
If you cannot see any task actually being executed, check the logs of the | If you cannot see any task actually being executed, check the logs of the | ||||
`swh-scheduler-runner` service (here is an example of failure due to the | `swh-scheduler-runner` service (here is an example of failure due to the | ||||
debian lister task not being properly registered on the swh-scheduler-runner | debian lister task not being properly registered on the swh-scheduler-runner | ||||
service): | service): | ||||
``` | ``` | ||||
$ docker-compose logs --tail=10 swh-scheduler-runner | $ docker-compose logs --tail=10 swh-scheduler-runner | ||||
[...] | [...] | ||||
``` | ``` | ||||