D1221.id3850.diff

diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1,30 +1,62 @@
# swh-docker-dev
-[Work in progress]
-
This repo contains Dockerfiles to allow developers to run a small
Software Heritage instance on their development computer.
The end goal is to smooth the contributors/developers workflow. Focus
on coding, not configuring!
+```eval_rst
+.. note:: Running a Software Heritage instance on your machine can consume
+ quite a bit of resources: if you play a bit too hard (e.g., if you
+ try to list all GitHub repositories with the corresponding lister),
+ you may fill your hard drive, and consume a lot of CPU, memory and
+ network bandwidth.
+```
+
## Dependencies
This uses docker with docker-compose, so ensure you have a working
docker environment and docker-compose is installed.
-## Warning
+We recommend using the latest version of docker, so please read
+https://docs.docker.com/install/linux/docker-ce/debian/ for more details on how
+to install docker on your machine.
-Running a Software Heritage instance on your machine can be quickly quite
-ressource consuming: if you play a bit too hard (eg. if you try the github
-lister), you may fill your hard drive pretty quick, and consume a lot of CPU,
-memory and network bandwidth.
+On a Debian system, docker-compose can be installed from the Debian
+repositories. On a stable (stretch) machine, it is recommended to install the version from
+[backports](https://backports.debian.org/Instructions/):
+
+``` bash
+~$ sudo apt install -t stretch-backports docker-compose
+```
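+
+You can quickly check that your docker environment is functional before going
+further (a minimal sanity check using standard commands; the output will of
+course differ on your machine):
+
+``` bash
+~$ docker --version
+[...]
+~$ docker-compose --version
+[...]
+~$ docker run --rm hello-world
+[...]
+```
+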
## Quick start
-First, start containers:
+First, clone this repository.
+If you have already followed the [develop setup guide], then you should already
+have a copy of the swh-docker-dev git repository. Use it:
+
+``` bash
+~$ cd swh-environment/swh-docker-dev
```
+
+Otherwise, we suggest creating a `swh-environment` directory in which this repo
+will be cloned, so that you can later run some components in docker containers
+with code overridden from your local repositories (see
+[below](#using-the-docker-setup-for-development-and-integration-testing)):
+
+``` bash
+~$ mkdir swh-environment
+~$ cd swh-environment
+~/swh-environment$ git clone https://forge.softwareheritage.org/source/swh-docker-dev.git
+~/swh-environment$ cd swh-docker-dev
+```
+
+Then, start containers:
+
+``` bash
~/swh-environment/swh-docker-dev$ docker-compose up -d
[...]
Creating swh-docker-dev_amqp_1 ... done
@@ -36,10 +68,9 @@
```
This will build docker images and run them.
-
Check everything is running fine with:
-```
+``` bash
~/swh-environment/swh-docker-dev$ docker-compose ps
Name Command State Ports
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
@@ -51,11 +82,14 @@
[...]
```
-Note: if a container failed to start, it's status will be marked as `Exit 1`
-instead of `Up`. You can check why using the `docker-compose logs` command. For
-example:
+At the time of writing this guide, the startup of some containers may fail the
+first time due to dependency-related problems. If some containers fail to start,
+just run the `docker-compose up -d` command again.
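+
+A quick way to spot containers that are not up is to filter the
+`docker-compose ps` output (just a convenience one-liner; failed containers
+show a state such as `Exit 1` instead of `Up`):
+
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose ps | grep -v Up
+[...]
+```
+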
-```
+If a container really refuses to start properly, you can check why using the
+`docker-compose logs` command. For example:
+
+``` bash
~/swh-environment/swh-docker-dev$ docker-compose logs swh-lister-debian
Attaching to swh-docker-dev_swh-lister-debian_1
[...]
@@ -64,17 +98,17 @@
swh-lister-debian_1 |
```
-Once all the containers are running, you can use the web interface by opening
+Once all containers are running, you can use the web interface by opening
http://localhost:5080/ in your web browser.
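+
+If you prefer the command line, you can also query the web API directly, for
+example the statistics counters endpoint (assuming `curl` is installed and the
+standard web API routes are enabled in this setup):
+
+``` bash
+~/swh-environment/swh-docker-dev$ curl http://localhost:5080/api/1/stat/counters/
+[...]
+```
+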
At this point, the archive is empty and needs to be filled with some content.
To do so, you can create tasks that will scrape a forge. For example, to inject
the code from the https://0xacab.org gitlab forge:
-```
-$ ~/swh-environment/swh-docker-dev$ docker-compose run swh-scheduler-api \
- swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \
- task add swh-lister-gitlab-full -p oneshot api_baseurl=https://0xacab.org/api/v4
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api \
+ swh-scheduler task add swh-lister-gitlab-full \
+ -p oneshot api_baseurl=https://0xacab.org/api/v4
Created 1 tasks
@@ -96,83 +130,33 @@
To increase the speed at which git repositories are imported, you can spawn more
`swh-loader-git` workers:
-```
-~/swh-environment/swh-docker-dev$ export CELERY_BROKER_URL=amqp://:5072//
-~/swh-environment/swh-docker-dev$ celery status
-mercurial@8f63da914c26: OK
-debian@8a1c6ced237b: OK
-debian@d4be158f1759: OK
-pypi@41187053b90d: OK
-dir@52a19b9ba606: OK
-pypi@9be0cdcb484c: OK
-github@101d702d6e1d: OK
-bitbucket@1770d3b81da8: OK
-svn@9b2e473d466b: OK
-git@ae6ddafca382: OK
-tar@e17c0bc4392d: OK
-npm@ccfc73f73c4b: OK
-gitlab@280a937595f3: OK
-
-~/swh-environment/swh-docker-dev$ celery control pool_grow 3 -d git@ae6ddafca382
--> git@ae6ddafca382: OK
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api \
+ celery status
+listers@50ac2185c6c9: OK
+loader@b164f9055637: OK
+indexer@33bc6067a5b8: OK
+vault@c9fef1bbfdc1: OK
+
+4 nodes online.
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api \
+ celery control pool_grow 3 -d loader@b164f9055637
+-> loader@b164f9055637: OK
pool will grow
-~/swh-environment/swh-docker-dev$ celery inspect -d git@ae6ddafca382 stats | grep prefetch_count
- "prefetch_count": 4,
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api \
+ celery inspect -d loader@b164f9055637 stats | grep prefetch_count
+ "prefetch_count": 4
```
-Note: this later command assumes you have `celery` available on your host
-machine.
-
Now there are 4 workers ingesting git repositories.
You can also increase the number of `swh-loader-git` containers:
-```
-~/swh-environment/swh-docker-dev$ docker-compose up -d --scale swh-loader-git=4
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose up -d --scale swh-loader=4
[...]
-Creating swh-docker-dev_swh-loader-git_2 ... done
-Creating swh-docker-dev_swh-loader-git_3 ... done
-Creating swh-docker-dev_swh-loader-git_4 ... done
-```
-
-
-### Install a package from sources
-
-It is possible to run a docker container with some swh packages installed from
-sources instead of using lastest published packages from pypi. To do this you
-must write a docker-compose override file (`docker-compose.override.yml`). An
-example is given in the `docker-compose.override.yml.example` file:
-
-```
-version: '2'
-
-services:
- swh-objstorage:
- volumes:
- - "/home/ddouard/src/swh-environment/swh-objstorage:/src/swh-objstorage"
-```
-
-The file named `docker-compose.override.yml` will automatically be loaded by
-`docker-compose`.
-
-This example shows the simple case of the `swh-objstorage` package: you just have to
-mount it in the container in `/src` and the entrypoint will ensure every
-swh-* package found in `/src/` is installed (using `pip install -e` so you can
-easily hack your code. If the application you play with have autoreload support,
-there is even no need for restarting the impacted container.)
-
-Note: if the docker fails to start when using local sources for one or more swh
-package, it's most probably due to permission problems on cache files. For
-example, if you have executed tests locally (using pytest or tox), you have
-cache files (__pycache__ etc.) that will prevent `pip install` from working
-within the docker.
-
-The solution is to clean these files and directories before trying to spawn the
-docker.
-
-```
-~/swh-environment$ find . -type d -name __pycache__ -exec rm -rf {} \;
-~/swh-environment$ find . -type d -name .tox -exec rm -rf {} \;
-~/swh-environment$ find . -type d -name .hypothesis -exec rm -rf {} \;
+Creating swh-docker-dev_swh-loader_2 ... done
+Creating swh-docker-dev_swh-loader_3 ... done
+Creating swh-docker-dev_swh-loader_4 ... done
```
@@ -231,18 +215,17 @@
a docker container will not use the same urls to access services. For example,
to use the `celery` utility from the host, you may type:
-```
+``` bash
~/swh-environment/swh-docker-dev$ CELERY_BROKER_URL=amqp://:5072// celery status
-dir@52a19b9ba606: OK
+loader@61704103668c: OK
[...]
```
To run the same command from within a container:
-```
-~/swh-environment/swh-docker-dev$ celery-compose exec swh-scheduler-api bash
-root@01dba49adf37:/# CELERY_BROKER_URL=amqp://amqp:5672// celery status
-dir@52a19b9ba606: OK
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api celery status
+loader@61704103668c: OK
[...]
```
@@ -285,10 +268,10 @@
For example, to add a (one shot) task that will list git repos on the
0xacab.org gitlab instance, one can do (from this git repository):
-```
-$ docker-compose run swh-scheduler-api \
- swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \
- task add swh-lister-gitlab-full -p oneshot api_baseurl=https://0xacab.org/api/v4
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api \
+ swh-scheduler task add swh-lister-gitlab-full \
+ -p oneshot api_baseurl=https://0xacab.org/api/v4
Created 1 tasks
@@ -305,10 +288,9 @@
This will insert a new task in the scheduler. To list existing tasks for a
given task type:
-```
-$ docker-compose run swh-scheduler-api \
- swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \
- task list-pending swh-lister-gitlab-full
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api \
+ swh-scheduler task list-pending swh-lister-gitlab-full
Found 1 swh-lister-gitlab-full tasks
@@ -324,10 +306,9 @@
To list all existing task types:
-```
-$ docker-compose run swh-scheduler-api \
- swh-scheduler -c remote -u http://swh-scheduler-api:5008/ \
- task --list-types
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose exec swh-scheduler-api \
+ swh-scheduler task --list-types
Known task types:
swh-loader-mount-dump-and-load-svn-repository:
@@ -381,16 +362,16 @@
### Monitoring activity
You can monitor the workers activity by connecting to the RabbitMQ console on
-`http://localhost:5002` or the Celery dashboard (flower) on
-`http://localhost:5003`.
+`http://localhost:5080/rabbitmq` or the grafana dashboard on
+`http://localhost:5080/grafana`.
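+
+You can also quickly check from the command line that these consoles respond
+(assuming `curl` is available on your machine):
+
+``` bash
+~/swh-environment/swh-docker-dev$ curl -sI http://localhost:5080/grafana | head -1
+[...]
+```
+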
If you do not see any task actually being executed, check the logs of the
`swh-scheduler-runner` service (here is an example of a failure due to the
debian lister task not being properly registered on the swh-scheduler-runner
service):
-```
-$ docker-compose logs --tail=10 swh-scheduler-runner
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose logs --tail=10 swh-scheduler-runner
Attaching to swh-docker-dev_swh-scheduler-runner_1
swh-scheduler-runner_1 | "__main__", mod_spec)
swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
@@ -403,3 +384,232 @@
swh-scheduler-runner_1 | raise self.NotRegistered(key)
swh-scheduler-runner_1 | celery.exceptions.NotRegistered: 'swh.lister.debian.tasks.DebianListerTask'
```
+
+
+## Using the docker setup for development and integration testing
+
+If you hack the code of one or more components of the archive with a virtual
+env based setup as described in the [develop setup guide],
+you may want to test your modifications in a working Software Heritage
+instance. The simplest way of achieving this is to use this docker-based
+environment.
+
+If you haven't followed the [develop setup guide], you must clone the
+[swh-environment] repo in your `swh-environment` directory:
+
+``` bash
+~/swh-environment$ git clone https://forge.softwareheritage.org/source/swh-environment.git .
+```
+
+Note the `.` at the end of this command: we want the git repository to be
+cloned directly in the `~/swh-environment` directory, not in a subdirectory.
+Also note that if you haven't done it yet and you want to hack the source code
+of one or more Software Heritage packages, you really should read the
+[develop setup guide].
+
+From there, we will check out or update all the swh packages:
+
+``` bash
+~/swh-environment$ ./bin/update
+```
+
+### Install a swh package from sources in a container
+
+It is possible to run a docker container with some swh packages installed from
+sources instead of using the latest published packages from PyPI. To do this you
+must write a docker-compose override file (`docker-compose.override.yml`). An
+example is given in the `docker-compose.override.yml.example` file:
+
+``` yaml
+version: '2'
+
+services:
+ swh-objstorage:
+ volumes:
+ - "/home/ddouard/src/swh-environment/swh-objstorage:/src/swh-objstorage"
+```
+
+The file named `docker-compose.override.yml` will automatically be loaded by
+`docker-compose`.
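+
+To check that your override file is actually taken into account, you can ask
+`docker-compose` to display the merged configuration (the extra volume you
+declared should appear in the output):
+
+``` bash
+~/swh-environment/swh-docker-dev$ docker-compose config
+[...]
+```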
+
+This example shows the simple case of the `swh-objstorage` package: you just
+have to mount it in the container under `/src`, and the entrypoint will ensure
+every swh-* package found in `/src/` is installed (using `pip install -e`), so
+you can easily hack your code. If the application you are working on has
+autoreload support, there is no need to even restart the impacted container.
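+
+Conceptually, the install step performed by the entrypoint can be pictured as a
+loop along these lines (an illustrative sketch only, not the actual entrypoint
+script shipped in the images):
+
+``` bash
+# install every swh package mounted under /src in editable mode
+for pkg in /src/swh-*; do
+    if [ -d "$pkg" ]; then
+        pip install -e "$pkg"
+    fi
+done
+```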
+
+Note: if a container fails to start when using local sources for one or more swh
+packages, it is most probably due to permission problems on cache files. For
+example, if you have executed tests locally (using pytest or tox), you may have
+cache files (`__pycache__`, etc.) that will prevent `pip install` from working
+within the container.
+
+The solution is to clean these files and directories before trying to spawn the
+containers again:
+
+``` bash
+~/swh-environment$ find . -type d -name __pycache__ -exec rm -rf {} \;
+~/swh-environment$ find . -type d -name .tox -exec rm -rf {} \;
+~/swh-environment$ find . -type d -name .hypothesis -exec rm -rf {} \;
+```
+
+### Using locally installed swh tools with docker
+
+In all the examples above, we have executed swh commands from within a running
+container. Since these swh commands are also available locally in our virtual
+env, we can use them to interact with the swh services running in docker
+containers.
+
+For this, we just need to configure a few environment variables. First, ensure
+your Software Heritage virtualenv is activated (here, using virtualenvwrapper):
+
+``` bash
+~$ workon swh
+(swh) ~/swh-environment$ export SWH_SCHEDULER_URL=http://127.0.0.1:5008/
+(swh) ~/swh-environment$ export CELERY_BROKER_URL=amqp://127.0.0.1:5072/
+```
+
+Now we can use the `celery` command directly to control the celery system
+running in the docker environment:
+
+``` bash
+(swh) ~/swh-environment$ celery status
+vault@c9fef1bbfdc1: OK
+listers@ba66f18e7d02: OK
+indexer@cb14c33cbbfb: OK
+loader@61704103668c: OK
+
+4 nodes online.
+(swh) ~/swh-environment$ celery control -d loader@61704103668c pool_grow 3
+```
+
+And we can use the `swh-scheduler` command in the same way:
+
+``` bash
+(swh) ~/swh-environment$ swh-scheduler task-type list
+Known task types:
+indexer_fossology_license:
+ Fossology license indexer task
+indexer_mimetype:
+ Mimetype indexer task
+[...]
+```
+
+### Make your life a bit easier
+
+When you use virtualenvwrapper, you can add postactivation commands:
+
+``` bash
+(swh) ~/swh-environment$ cat >>$VIRTUAL_ENV/bin/postactivate <<'EOF'
+# unfortunately, the interface cmd for the click autocompletion
+# depends on the shell
+# https://click.palletsprojects.com/en/7.x/bashcomplete/#activation
+
+shell=$(basename $SHELL)
+case "$shell" in
+ "zsh")
+ autocomplete_cmd=source_zsh
+ ;;
+ *)
+ autocomplete_cmd=source
+ ;;
+esac
+
+eval "$(_SWH_SCHEDULER_COMPLETE=$autocomplete_cmd swh-scheduler)"
+export SWH_SCHEDULER_URL=http://127.0.0.1:5008/
+export CELERY_BROKER_URL=amqp://127.0.0.1:5072/
+export COMPOSE_FILE=~/swh-environment/swh-docker-dev/docker-compose.yml:~/swh-environment/swh-docker-dev/docker-compose.override.yml
+alias doco=docker-compose
+
+function swhclean {
+ find ~/swh-environment -type d -name __pycache__ -exec rm -rf {} \;
+ find ~/swh-environment -type d -name .tox -exec rm -rf {} \;
+ find ~/swh-environment -type d -name .hypothesis -exec rm -rf {} \;
+}
+EOF
+```
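+
+After editing the `postactivate` file, reactivate the virtualenv so that the
+new hooks are sourced (a quick sketch; `workon` re-runs `postactivate`):
+
+``` bash
+(swh) ~/swh-environment$ deactivate
+~/swh-environment$ workon swh
+```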
+
+This postactivate script does:
+
+- install a shell completion handler for the swh-scheduler command,
+- preset a bunch of environment variables
+
+ - `SWH_SCHEDULER_URL` so that you can just run `swh-scheduler` against the
+ scheduler API instance running in docker, without having to specify the
+ endpoint URL,
+
+ - `CELERY_BROKER_URL` so you can execute the `celery` tool without options
+ against the rabbitmq server running in the docker environment,
+
+ - `COMPOSE_FILE` so you can run `docker-compose` from everywhere,
+
+- create an alias `doco` for `docker-compose`, because the latter is way too
+ long to type,
+
+- add a `swhclean` shell function to clean your source directories so that
+ there is no conflict with docker containers using local swh repositories (see
+ above). This will delete any `.tox`, `__pycache__` and `.hypothesis`
+ directory found in your swh-environment directory.
+
+So now you can easily:
+
+* Start the SWH platform:
+
+ ``` bash
+ (swh) ~/swh-environment$ docker-compose up -d
+ [...]
+ ```
+
+* Check celery:
+
+ ``` bash
+ (swh) ~/swh-environment$ celery status
+ listers@50ac2185c6c9: OK
+ loader@b164f9055637: OK
+ indexer@33bc6067a5b8: OK
+ ```
+
+* List task-types:
+
+ ``` bash
+ (swh) ~/swh-environment$ swh-scheduler task-type list
+ [...]
+ ```
+
+* Get more info on a task type:
+
+ ``` bash
+ (swh) ~/swh-environment$ swh-scheduler task-type list -v -t origin-update-hg
+ Known task types:
+ origin-update-hg: swh.loader.mercurial.tasks.LoadMercurial
+ Loading mercurial repository swh-loader-mercurial
+ interval: 1 day, 0:00:00 [1 day, 0:00:00, 1 day, 0:00:00]
+ backoff_factor: 1.0
+ max_queue_length: 1000
+ num_retries: None
+ retry_delay: None
+ ```
+
+* Add a new task:
+
+ ``` bash
+ (swh) ~/swh-environment$ swh-scheduler task add origin-update-hg \
+ origin_url=https://hg.logilab.org/master/cubicweb
+ Created 1 tasks
+ Task 1
+ Next run: just now (2019-02-06 12:36:58+00:00)
+ Interval: 1 day, 0:00:00
+ Type: origin-update-hg
+ Policy: recurring
+ Args:
+ Keyword args:
+ origin_url: https://hg.logilab.org/master/cubicweb
+ ```
+
+* Respawn a task:
+
+ ``` bash
+ (swh) ~/swh-environment$ swh-scheduler task respawn 1
+ ```
+
+
+[develop setup guide]: https://docs.softwareheritage.org/devel/developer-setup.html
+[swh-environment]: https://forge.softwareheritage.org/source/swh-environment.git
diff --git a/docker-compose.yml b/docker-compose.yml
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -74,6 +74,7 @@
env_file: ./env/scheduler.env
environment:
SWH_CONFIG_FILENAME: /scheduler.yml
+ SWH_SCHEDULER_CONFIG_FILE: /scheduler.yml
depends_on:
- swh-scheduler-db
ports:
@@ -85,6 +86,9 @@
image: swh/scheduler-worker
build: ./dockerfiles/swh-scheduler-worker
env_file: ./env/scheduler.env
+ environment:
+ SWH_CONFIG_FILENAME: /scheduler.yml
+ SWH_SCHEDULER_CONFIG_FILE: /scheduler.yml
command: listener
depends_on:
- swh-scheduler-api
@@ -96,6 +100,9 @@
image: swh/scheduler-worker
build: ./dockerfiles/swh-scheduler-worker
env_file: ./env/scheduler.env
+ environment:
+ SWH_CONFIG_FILENAME: /scheduler.yml
+ SWH_SCHEDULER_CONFIG_FILE: /scheduler.yml
command: runner -p 10
depends_on:
- swh-scheduler-api
@@ -291,15 +298,6 @@
volumes:
- "./conf/indexer.yml:/indexer.yml:ro"
- swh-indexer-journal-client:
- image: swh/indexer-journal-client
- build: ./dockerfiles/swh-indexer-journal-client
- depends_on:
- - swh-journal-publisher
- - swh-scheduler-api
- volumes:
- - "./conf/journal_client.yml:/etc/softwareheritage/indexer/journal_client.yml:ro"
-
# Journal related
swh-storage-listener:
@@ -328,3 +326,12 @@
- swh-journal-publisher
volumes:
- "./conf/journal_client.yml:/etc/softwareheritage/journal/logger.yml:ro"
+
+ swh-indexer-journal-client:
+ image: swh/indexer-journal-client
+ build: ./dockerfiles/swh-indexer-journal-client
+ depends_on:
+ - swh-journal-publisher
+ - swh-scheduler-api
+ volumes:
+ - "./conf/journal_client.yml:/etc/softwareheritage/indexer/journal_client.yml:ro"
diff --git a/env/scheduler.env b/env/scheduler.env
--- a/env/scheduler.env
+++ b/env/scheduler.env
@@ -4,3 +4,4 @@
PGUSER=postgres
SWH_WORKER_INSTANCE=scheduler
LOGLEVEL=INFO
+CELERY_BROKER_URL=amqp://amqp//
