D5438.id.diff
diff --git a/docker/README.md b/docker/README.md
deleted file mode 100644
--- a/docker/README.md
+++ /dev/null
@@ -1,701 +0,0 @@
-# Docker environment
-
-This directory contains Dockerfiles to run a small Software Heritage instance
-on development machines. The end goal is to smooth the contributors/developers
-workflow. Focus on coding, not configuring!
-
-WARNING: Running a Software Heritage instance on your machine can consume
- quite a bit of resources: if you play a bit too hard (e.g., if you
- try to list all GitHub repositories with the corresponding lister),
- you may fill your hard drive, and consume a lot of CPU, memory and
- network bandwidth.
-
-
-## Dependencies
-
-This uses docker with docker-compose, so ensure you have a working
-docker environment and that docker-compose is installed.
-
-We recommend using the latest version of docker, so please read
-https://docs.docker.com/install/linux/docker-ce/debian/ for more details on how
-to install docker on your machine.
-
-On a Debian system, docker-compose can be installed from Debian repositories:
-
-```
-~$ sudo apt install docker-compose
-```
-
-## Quick start
-
-First, change to the docker dir if you aren't there yet:
-
-```
-~$ cd swh-environment/docker
-```
-
-Then, start containers:
-
-```
-~/swh-environment/docker$ docker-compose up -d
-[...]
-Creating docker_amqp_1 ... done
-Creating docker_zookeeper_1 ... done
-Creating docker_kafka_1 ... done
-Creating docker_flower_1 ... done
-Creating docker_swh-scheduler-db_1 ... done
-[...]
-```
-
-This will build docker images and run them.
-Check everything is running fine with:
-
-```
-~/swh-environment/docker$ docker-compose ps
- Name Command State Ports
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-docker_amqp_1 docker-entrypoint.sh rabbi ... Up 15671/tcp, 0.0.0.0:5018->15672/tcp, 25672/tcp, 4369/tcp, 5671/tcp, 5672/tcp
-docker_flower_1 flower --broker=amqp://gue ... Up 0.0.0.0:5555->5555/tcp
-docker_kafka_1 start-kafka.sh Up 0.0.0.0:9092->9092/tcp
-docker_swh-deposit-db_1 docker-entrypoint.sh postgres Up 5432/tcp
-docker_swh-deposit_1 /entrypoint.sh Up 0.0.0.0:5006->5006/tcp
-[...]
-```
-
-The startup of some containers may fail the first time due to dependency-related
-problems. If some containers fail to start, just run the `docker-compose up
--d` command again.
-
-If a container really refuses to start properly, you can check why using the
-`docker-compose logs` command. For example:
-
-```
-~/swh-environment/docker$ docker-compose logs swh-lister
-Attaching to docker_swh-lister_1
-[...]
-swh-lister_1 | Processing /src/swh-scheduler
-swh-lister_1 | Could not install packages due to an EnvironmentError: [('/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz', '/tmp/pip-req-build-pm7nsax3/.hypothesis/unicodedata/8.0.0/charmap.json.gz', "[Errno 13] Permission denied: '/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz'")]
-swh-lister_1 |
-```
-
-Once all containers are running, you can use the web interface by opening
-http://localhost:5080/ in your web browser.
-
-At this point, the archive is empty and needs to be filled with some content.
-To do so, you can create tasks that will scrape a forge. For example, to inject
-the code from the https://0xacab.org gitlab forge:
-
-```
-~/swh-environment/docker$ docker-compose exec swh-scheduler \
- swh scheduler task add list-gitlab-full \
- -p oneshot url=https://0xacab.org/api/v4
-
-Created 1 tasks
-
-Task 1
- Next run: just now (2018-12-19 14:58:49+00:00)
- Interval: 90 days, 0:00:00
- Type: list-gitlab-full
- Policy: oneshot
- Args:
- Keyword args:
- url=https://0xacab.org/api/v4
-```
-
-This task will scrape the forge's project list and create subtasks to inject
-each git repository found there.
-
-This will take a bit of time to complete.
-
-To increase the speed at which git repositories are imported, you can spawn more
-`swh-loader-git` workers:
-
-```
-~/swh-environment/docker$ docker-compose exec swh-scheduler \
- celery status
-listers@50ac2185c6c9: OK
-loader@b164f9055637: OK
-indexer@33bc6067a5b8: OK
-vault@c9fef1bbfdc1: OK
-
-4 nodes online.
-~/swh-environment/docker$ docker-compose exec swh-scheduler \
- celery control pool_grow 3 -d loader@b164f9055637
--> loader@b164f9055637: OK
- pool will grow
-~/swh-environment/docker$ docker-compose exec swh-scheduler \
- celery inspect -d loader@b164f9055637 stats | grep prefetch_count
- "prefetch_count": 4
-```
-
-Now there are 4 workers ingesting git repositories.
-You can also increase the number of `swh-loader-git` containers:
-
-```
-~/swh-environment/docker$ docker-compose up -d --scale swh-loader=4
-[...]
-Creating docker_swh-loader_2 ... done
-Creating docker_swh-loader_3 ... done
-Creating docker_swh-loader_4 ... done
-```
-
-## Updating the docker image
-
-All containers started by `docker-compose` are bound to a docker image named
-`swh/stack` including all the software components of Software Heritage. When
-new versions of these components are released, the docker image will not be
-automatically updated. In order to update all Software Heritage components to
-their latest version, the docker image needs to be explicitly rebuilt by
-issuing the following command from within the `docker` directory:
-
-```
-~/swh-environment/docker$ docker build --no-cache -t swh/stack .
-```
-
-## Details
-
-This runs the following services on their respective standard ports; all of
-the following services are configured to communicate with each other:
-
-- swh-storage-db: a `softwareheritage` instance db that stores the Merkle DAG,
-
-- swh-objstorage: Content-addressable object storage,
-
-- swh-storage: Abstraction layer over the archive, allowing access to all
-  stored source code artifacts as well as their metadata,
-
-- swh-web: the Software Heritage web user interface,
-
-- swh-scheduler: the API service as well as 2 utilities,
- the runner and the listener,
-
-- swh-lister: celery workers dedicated to running lister tasks,
-
-- swh-loaders: celery workers dedicated to importing/updating source code
- content (VCS repos, source packages, etc.),
-
-- swh-journal: Persistent logger of changes to the archive, with
- publish-subscribe support.
-
-That means you can start ingesting content with those services, following the
-same setup described in the getting-started guide, starting directly at
-https://docs.softwareheritage.org/devel/getting-started.html#step-4-ingest-repositories
-
-
-### Exposed Ports
-
-Several services have their listening ports exposed on the host:
-
-- amqp: 5072
-- kafka: 5092
-- nginx: 5080
-
-And for SWH services:
-
-- scheduler API: 5008
-- storage API: 5002
-- object storage API: 5003
-- indexer API: 5007
-- web app: 5004
-- deposit app: 5006
-
-Beware that these ports are not the same as the ports used from within the
-docker network. This means that the same command executed from the host or from
-a docker container will not use the same URLs to access services. For example,
-to use the `celery` utility from the host, you may type:
-
-```
-~/swh-environment/docker$ CELERY_BROKER_URL=amqp://:5072// celery status
-loader@61704103668c: OK
-[...]
-```
-
-To run the same command from within a container:
-
-```
-~/swh-environment/docker$ docker-compose exec swh-scheduler celery status
-loader@61704103668c: OK
-[...]
-```
-
-
-## Managing tasks
-
-One of the main components of the Software Heritage platform is the task system.
-These are used to manage everything related to background processes, like
-discovering new git repositories to import, ingesting them, checking that a
-known repository is up to date, etc.
-
-The task system is based on Celery but uses a custom database-based scheduler.
-
-So the term 'task' may designate either a Celery task or a SWH one (i.e., the
-entity in the database). When we simply refer to a "task" in the
-documentation, it designates the SWH task.
-
-When a SWH task is ready to be executed, a Celery task is created to handle the
-actual SWH task's job. Note that not all Celery tasks are directly linked to a
-SWH task (some SWH tasks are implemented using a Celery task that spawns Celery
-subtasks).
-
-A (SWH) task can be `recurring` or `oneshot`. `oneshot` tasks are only
-executed once, whereas `recurring` tasks are executed at regular intervals.
-The scheduling configuration of these recurring tasks can be set via the
-fields `current_interval` and `priority` (can be 'high', 'normal' or 'low') of
-the task database entity.
-
-
-### Inserting a new lister task
-
-To list the content of a source code provider like GitHub or a Debian
-distribution, you can add a new lister task.
-
-This task will (generally) scrape a web page or use a public API to identify
-the list of published software artefacts (git repos, debian source packages,
-etc.)
-
-Then, for each repository, a new task will be created to ingest this repository
-and keep it up to date.
-
-For example, to add a (one shot) task that will list git repos on the
-0xacab.org gitlab instance, one can do (from this git repository):
-
-```
-~/swh-environment/docker$ docker-compose exec swh-scheduler \
- swh scheduler task add list-gitlab-full \
- -p oneshot url=https://0xacab.org/api/v4
-
-Created 1 tasks
-
-Task 12
- Next run: just now (2018-12-19 14:58:49+00:00)
- Interval: 90 days, 0:00:00
- Type: list-gitlab-full
- Policy: oneshot
- Args:
- Keyword args:
- url=https://0xacab.org/api/v4
-```
-
-This will insert a new task in the scheduler. To list existing tasks for a
-given task type:
-
-```
-~/swh-environment/docker$ docker-compose exec swh-scheduler \
- swh scheduler task list-pending list-gitlab-full
-
-Found 1 list-gitlab-full tasks
-
-Task 12
- Next run: 2 minutes ago (2018-12-19 14:58:49+00:00)
- Interval: 90 days, 0:00:00
- Type: list-gitlab-full
- Policy: oneshot
- Args:
- Keyword args:
- url=https://0xacab.org/api/v4
-```
-
-To list all existing task types:
-
-```
-~/swh-environment/docker$ docker-compose exec swh-scheduler \
- swh scheduler task-type list
-
-Known task types:
-load-svn-from-archive:
- Loading svn repositories from svn dump
-load-svn:
- Create dump of a remote svn repository, mount it and load it
-load-deposit:
- Loading deposit archive into swh through swh-loader-tar
-check-deposit:
- Pre-checking deposit step before loading into swh archive
-cook-vault-bundle:
- Cook a Vault bundle
-load-hg:
- Loading mercurial repository swh-loader-mercurial
-load-hg-from-archive:
- Loading archive mercurial repository swh-loader-mercurial
-load-git:
- Update an origin of type git
-list-github-incremental:
- Incrementally list GitHub
-list-github-full:
- Full update of GitHub repos list
-list-debian-distribution:
- List a Debian distribution
-list-gitlab-incremental:
- Incrementally list a Gitlab instance
-list-gitlab-full:
- Full update of a Gitlab instance's repos list
-list-pypi:
- Full pypi lister
-load-pypi:
- Load Pypi origin
-index-mimetype:
- Mimetype indexer task
-index-mimetype-for-range:
- Mimetype Range indexer task
-index-fossology-license:
- Fossology license indexer task
-index-fossology-license-for-range:
- Fossology license range indexer task
-index-origin-head:
- Origin Head indexer task
-index-revision-metadata:
- Revision Metadata indexer task
-index-origin-metadata:
- Origin Metadata indexer task
-
-```
-
-
-### Monitoring activity
-
-You can monitor the workers' activity by connecting to the RabbitMQ console on
-`http://localhost:5080/rabbitmq` or the grafana dashboard on
-`http://localhost:5080/grafana`.
-
-If you cannot see any task being executed, check the logs of the
-`swh-scheduler-runner` service (here is a failure example due to the
-debian lister task not being properly registered on the
-swh-scheduler-runner service):
-
-```
-~/swh-environment/docker$ docker-compose logs --tail=10 swh-scheduler-runner
-Attaching to docker_swh-scheduler-runner_1
-swh-scheduler-runner_1 | "__main__", mod_spec)
-swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
-swh-scheduler-runner_1 | exec(code, run_globals)
-swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/swh/scheduler/celery_backend/runner.py", line 107, in <module>
-swh-scheduler-runner_1 | run_ready_tasks(main_backend, main_app)
-swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/swh/scheduler/celery_backend/runner.py", line 81, in run_ready_tasks
-swh-scheduler-runner_1 | task_types[task['type']]['backend_name']
-swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/celery/app/registry.py", line 21, in __missing__
-swh-scheduler-runner_1 | raise self.NotRegistered(key)
-swh-scheduler-runner_1 | celery.exceptions.NotRegistered: 'swh.lister.debian.tasks.DebianListerTask'
-```
-
-
-## Using the docker setup for development and integration testing
-
-If you hack the code of one or more archive components with a virtual
-env based setup as described in the
-[[https://docs.softwareheritage.org/devel/developer-setup.html|developer
-setup guide]], you may want to test your modifications in a working
-Software Heritage instance. The simplest way to achieve this is to use
-this docker-based environment.
-
-If you haven't followed the
-[[https://docs.softwareheritage.org/devel/developer-setup.html|developer setup guide]],
-you must clone the [swh-environment] repo in your `swh-environment`
-directory:
-
-```
-~/swh-environment$ git clone https://forge.softwareheritage.org/source/swh-environment.git .
-```
-
-Note the `.` at the end of this command: we want the git repository to be
-cloned directly in the `~/swh-environment` directory, not in a subdirectory.
-Also note that if you haven't done it yet and you want to hack the source code
-of one or more Software Heritage packages, you really should read the
-[[https://docs.softwareheritage.org/devel/developer-setup.html|developer setup guide]].
-
-From there, we will check out or update all the swh packages:
-
-```
-~/swh-environment$ ./bin/update
-```
-
-
-### Install a swh package from sources in a container
-
-It is possible to run a docker container with some swh packages installed from
-sources instead of using the latest published packages from pypi. To do this
-you must write a docker-compose override file (`docker-compose.override.yml`).
-An example is given in the `docker-compose.override.yml.example` file:
-
-``` yaml
-version: '2'
-
-services:
- swh-objstorage:
- volumes:
- - "$HOME/swh-environment/swh-objstorage:/src/swh-objstorage"
-```
-
-The file named `docker-compose.override.yml` will automatically be loaded by
-`docker-compose`.
-
-This example shows the simplest case of the `swh-objstorage` package:
-you just have to mount it in the container in `/src` and the
-entrypoint will ensure every swh-* package found in `/src/` is
-installed (using `pip install -e` so you can easily hack your
-code). If the application you play with has autoreload support, there
-is no need to restart the impacted container.
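-
-For instance, a sketch of an override file mounting two packages at once
-(assuming the source checkouts live in `~/swh-environment`, as in the
-developer setup guide):
-
-``` yaml
-version: '2'
-
-services:
-  swh-objstorage:
-    volumes:
-      - "$HOME/swh-environment/swh-objstorage:/src/swh-objstorage"
-  swh-storage:
-    volumes:
-      - "$HOME/swh-environment/swh-storage:/src/swh-storage"
-```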
-
-
-### Using locally installed swh tools with docker
-
-In all examples above, we have executed swh commands from within a running
-container. Now that these swh commands are also available locally in our
-virtual env, we can use them to interact with swh services running in docker
-containers.
-
-For this, we just need to configure a few environment variables. First, ensure
-your Software Heritage virtualenv is activated (here, using virtualenvwrapper):
-
-```
-~$ workon swh
-(swh) ~/swh-environment$ export SWH_SCHEDULER_URL=http://127.0.0.1:5008/
-(swh) ~/swh-environment$ export CELERY_BROKER_URL=amqp://127.0.0.1:5072/
-```
-
-Now we can use the `celery` command directly to control the celery system
-running in the docker environment:
-
-```
-(swh) ~/swh-environment$ celery status
-vault@c9fef1bbfdc1: OK
-listers@ba66f18e7d02: OK
-indexer@cb14c33cbbfb: OK
-loader@61704103668c: OK
-
-4 nodes online.
-(swh) ~/swh-environment$ celery control -d loader@61704103668c pool_grow 3
-```
-
-And we can use the `swh-scheduler` command all the same:
-
-```
-(swh) ~/swh-environment$ swh scheduler task-type list
-Known task types:
-index-fossology-license:
- Fossology license indexer task
-index-mimetype:
- Mimetype indexer task
-[...]
-```
-
-
-### Make your life a bit easier
-
-When you use virtualenvwrapper, you can add postactivation commands:
-
-```
-(swh) ~/swh-environment$ cat >>$VIRTUAL_ENV/bin/postactivate <<'EOF'
-# unfortunately, the interface cmd for the click autocompletion
-# depends on the shell
-# https://click.palletsprojects.com/en/7.x/bashcomplete/#activation
-
-shell=$(basename $SHELL)
-case "$shell" in
- "zsh")
- autocomplete_cmd=source_zsh
- ;;
- *)
- autocomplete_cmd=source
- ;;
-esac
-
-eval "$(_SWH_COMPLETE=$autocomplete_cmd swh)"
-export SWH_SCHEDULER_URL=http://127.0.0.1:5008/
-export CELERY_BROKER_URL=amqp://127.0.0.1:5072/
-export COMPOSE_FILE=~/swh-environment/docker/docker-compose.yml:~/swh-environment/docker/docker-compose.override.yml
-alias doco=docker-compose
-
-EOF
-```
-
-This postactivate script:
-
-- installs a shell completion handler for the swh-scheduler command,
-- presets a bunch of environment variables
-
-  - `SWH_SCHEDULER_URL` so that you can just run `swh scheduler` against the
-    scheduler API instance running in docker, without having to specify the
-    endpoint URL,
-
-  - `CELERY_BROKER_URL` so you can execute the `celery` tool (without cli options)
-    against the rabbitmq server running in the docker environment,
-
-  - `COMPOSE_FILE` so you can run `docker-compose` from everywhere,
-
-- creates an alias `doco` for `docker-compose` because this is way too
-  long to type.
-
-So now you can easily:
-
-* Start the SWH platform:
-
-```
- (swh) ~/swh-environment$ doco up -d
- [...]
-```
-
-* Check celery:
-
-```
- (swh) ~/swh-environment$ celery status
- listers@50ac2185c6c9: OK
- loader@b164f9055637: OK
- indexer@33bc6067a5b8: OK
-```
-
-* List task-types:
-
-```
- (swh) ~/swh-environment$ swh scheduler task-type list
- [...]
-```
-
-* Get more info on a task type:
-
-```
- (swh) ~/swh-environment$ swh scheduler task-type list -v -t load-hg
- Known task types:
- load-hg: swh.loader.mercurial.tasks.LoadMercurial
- Loading mercurial repository swh-loader-mercurial
- interval: 1 day, 0:00:00 [1 day, 0:00:00, 1 day, 0:00:00]
- backoff_factor: 1.0
- max_queue_length: 1000
- num_retries: None
- retry_delay: None
-```
-
-* Add a new task:
-
-```
- (swh) ~/swh-environment$ swh scheduler task add load-hg \
- origin_url=https://hg.logilab.org/master/cubicweb
- Created 1 tasks
- Task 1
- Next run: just now (2019-02-06 12:36:58+00:00)
- Interval: 1 day, 0:00:00
- Type: load-hg
- Policy: recurring
- Args:
- Keyword args:
- origin_url: https://hg.logilab.org/master/cubicweb
-```
-
-* Respawn a task:
-
-```
- (swh) ~/swh-environment$ swh scheduler task respawn 1
-```
-
-## Data persistence for a development setting
-
-The default `docker-compose.yml` configuration is not geared towards data persistence,
-but application testing.
-
-Volumes defined in associated images are anonymous and may be left unused or
-removed on the next `docker-compose up`.
-
-One way to make sure these volumes persist is to use named volumes.
-The volumes may be defined as follows in a `docker-compose.override.yml`.
-Note that volume definitions are merged with other compose files based on
-destination path.
-
-```
-services:
- swh-storage-db:
- volumes:
- - "swh_storage_data:/var/lib/postgresql/data"
- swh-objstorage:
- volumes:
- - "swh_objstorage_data:/srv/softwareheritage/objects"
-
-volumes:
- swh_storage_data:
- swh_objstorage_data:
-```
-
-This way, `docker-compose down` without the `-v` flag will not remove those volumes
-and data will persist.
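-
-To check that the named volumes are indeed retained, you can list them
-(volume names below assume the default compose project name derived from this
-`docker` directory; yours may differ):
-
-```
-~/swh-environment/docker$ docker volume ls
-DRIVER              VOLUME NAME
-local               docker_swh_objstorage_data
-local               docker_swh_storage_data
-```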
-
-## Starting a kafka-powered mirror of the storage
-
-This repo comes with an optional `docker-compose.storage-mirror.yml`
-docker compose file that can be used to test the kafka-powered mirror
-mechanism for the main storage.
-
-This can be used like:
-
-```
-~/swh-environment/docker$ docker-compose -f docker-compose.yml -f docker-compose.storage-mirror.yml up -d
-[...]
-```
-
-Compared to the original compose file, this will:
-
-- override the swh-storage service to activate the kafka direct writer
-  on swh.journal.objects prefixed topics using the swh.storage.master ID,
-- override the swh-web service to make it use the mirror instead of the
-  master storage,
-- start a db for the mirror,
-- start a storage service based on this db,
-- start a replayer service that runs the process listening to kafka to
-  keep the mirror in sync.
-
-When using it, you will have a setup in which the master storage is used by
-workers and most other services, whereas the storage mirror will be used
-by the web application and should be kept in sync with the master storage
-by kafka.
-
-
-Note that the object storage is not replicated here, only the graph storage.
-
-
-## Starting the backfiller
-
-The backfiller reads the objects <object-type> within the range
-[start-object, end-object] from the storage and sends them to the kafka
-topics.
-
-```
-(swh)$ docker-compose \
- -f docker-compose.yml \
- -f docker-compose.storage-mirror.yml \
- -f docker-compose.storage-mirror.override.yml \
- run \
- swh-journal-backfiller \
- snapshot \
- --start-object 000000 \
- --end-object 000001 \
- --dry-run
-```
-
-
-## Using Sentry
-
-All entrypoints to SWH code (CLI, gunicorn, celery, ...) are, or should be,
-instrumented using Sentry. By default this is disabled, but if you run your
-own Sentry instance, you can use it.
-
-To do so, you must get a DSN from your Sentry instance, and set it as the
-value of `SWH_SENTRY_DSN` in the file `env/common_python.env`.
-You may also set it per-service in the `environment` section of each service
-in `docker-compose.override.yml`.
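-
-For example (the DSN below is a placeholder; use the one issued by your own
-Sentry instance):
-
-```
-# in env/common_python.env
-SWH_SENTRY_DSN=https://<key>@sentry.example.org/<project>
-```
-
-or, per service, in `docker-compose.override.yml`:
-
-``` yaml
-services:
-  swh-web:
-    environment:
-      SWH_SENTRY_DSN: https://<key>@sentry.example.org/<project>
-```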
-
-
-## Caveats
-
-Running a lister task can lead to a lot of loading tasks, which can fill your
-hard drive pretty fast. Make sure to monitor your available storage space
-regularly when playing with this stack.
-
-Also, a few containers (`swh-storage`, `swh-xxx-db`) use a volume for storing
-the blobs or the database files. With the default configuration provided in the
-`docker-compose.yml` file, these volumes are not persistent. So removing the
-containers will delete the volumes!
-
-Also note that for the `swh-objstorage`, since the volume can be pretty big,
-the remove operation can be quite long (several minutes is not uncommon), which
-may mess a bit with the `docker-compose` command.
-
-If you have an error message like:
-
- Error response from daemon: removal of container 928de3110381 is already in progress
-
-it means that you need to wait for this process to finish before being able to
-(re)start your docker stack again.
diff --git a/docker/README.rst b/docker/README.rst
new file mode 100644
--- /dev/null
+++ b/docker/README.rst
@@ -0,0 +1,666 @@
+Docker environment
+==================
+
+``swh-environment/docker/`` contains Dockerfiles to run a small Software Heritage
+instance on development machines. The end goal is to smooth the
+contributors/developers workflow. Focus on coding, not configuring!
+
+.. warning::
+ Running a Software Heritage instance on your machine can
+ consume quite a bit of resources: if you play a bit too hard (e.g., if
+ you try to list all GitHub repositories with the corresponding lister),
+ you may fill your hard drive, and consume a lot of CPU, memory and
+ network bandwidth.
+
+Dependencies
+------------
+
+This uses docker with docker-compose, so ensure you have a working
+docker environment and that docker-compose is installed.
+
+We recommend using the latest version of docker, so please read
+https://docs.docker.com/install/linux/docker-ce/debian/ for more details
+on how to install docker on your machine.
+
+On a Debian system, docker-compose can be installed from Debian
+repositories::
+
+ ~$ sudo apt install docker-compose
+
+Quick start
+-----------
+
+First, change to the docker dir if you aren’t there yet::
+
+ ~$ cd swh-environment/docker
+
+Then, start containers::
+
+ ~/swh-environment/docker$ docker-compose up -d
+ [...]
+ Creating docker_amqp_1 ... done
+ Creating docker_zookeeper_1 ... done
+ Creating docker_kafka_1 ... done
+ Creating docker_flower_1 ... done
+ Creating docker_swh-scheduler-db_1 ... done
+ [...]
+
+This will build docker images and run them. Check everything is running
+fine with::
+
+ ~/swh-environment/docker$ docker-compose ps
+ Name Command State Ports
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ docker_amqp_1 docker-entrypoint.sh rabbi ... Up 15671/tcp, 0.0.0.0:5018->15672/tcp, 25672/tcp, 4369/tcp, 5671/tcp, 5672/tcp
+ docker_flower_1 flower --broker=amqp://gue ... Up 0.0.0.0:5555->5555/tcp
+ docker_kafka_1 start-kafka.sh Up 0.0.0.0:9092->9092/tcp
+ docker_swh-deposit-db_1 docker-entrypoint.sh postgres Up 5432/tcp
+ docker_swh-deposit_1 /entrypoint.sh Up 0.0.0.0:5006->5006/tcp
+ [...]
+
+The startup of some containers may fail the first time due to
+dependency-related problems. If some containers fail to start, just
+run the ``docker-compose up -d`` command again.
+
+If a container really refuses to start properly, you can check why using
+the ``docker-compose logs`` command. For example::
+
+ ~/swh-environment/docker$ docker-compose logs swh-lister
+ Attaching to docker_swh-lister_1
+ [...]
+ swh-lister_1 | Processing /src/swh-scheduler
+ swh-lister_1 | Could not install packages due to an EnvironmentError: [('/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz', '/tmp/pip-req-build-pm7nsax3/.hypothesis/unicodedata/8.0.0/charmap.json.gz', "[Errno 13] Permission denied: '/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz'")]
+ swh-lister_1 |
+
+Once all containers are running, you can use the web interface by
+opening http://localhost:5080/ in your web browser.
+
+At this point, the archive is empty and needs to be filled with some
+content. To do so, you can create tasks that will scrape a forge. For
+example, to inject the code from the https://0xacab.org gitlab forge::
+
+ ~/swh-environment/docker$ docker-compose exec swh-scheduler \
+ swh scheduler task add list-gitlab-full \
+ -p oneshot url=https://0xacab.org/api/v4
+
+ Created 1 tasks
+
+ Task 1
+ Next run: just now (2018-12-19 14:58:49+00:00)
+ Interval: 90 days, 0:00:00
+ Type: list-gitlab-full
+ Policy: oneshot
+ Args:
+ Keyword args:
+ url=https://0xacab.org/api/v4
+
+This task will scrape the forge’s project list and create subtasks to
+inject each git repository found there.
+
+This will take a bit of time to complete.
+
+To increase the speed at which git repositories are imported, you can
+spawn more ``swh-loader-git`` workers::
+
+ ~/swh-environment/docker$ docker-compose exec swh-scheduler \
+ celery status
+ listers@50ac2185c6c9: OK
+ loader@b164f9055637: OK
+ indexer@33bc6067a5b8: OK
+ vault@c9fef1bbfdc1: OK
+
+ 4 nodes online.
+ ~/swh-environment/docker$ docker-compose exec swh-scheduler \
+ celery control pool_grow 3 -d loader@b164f9055637
+ -> loader@b164f9055637: OK
+ pool will grow
+ ~/swh-environment/docker$ docker-compose exec swh-scheduler \
+ celery inspect -d loader@b164f9055637 stats | grep prefetch_count
+ "prefetch_count": 4
+
+Now there are 4 workers ingesting git repositories. You can also
+increase the number of ``swh-loader-git`` containers::
+
+ ~/swh-environment/docker$ docker-compose up -d --scale swh-loader=4
+ [...]
+ Creating docker_swh-loader_2 ... done
+ Creating docker_swh-loader_3 ... done
+ Creating docker_swh-loader_4 ... done
+
+Updating the docker image
+-------------------------
+
+All containers started by ``docker-compose`` are bound to a docker image
+named ``swh/stack`` including all the software components of Software
+Heritage. When new versions of these components are released, the docker
+image will not be automatically updated. In order to update all Software
+Heritage components to their latest version, the docker image needs to
+be explicitly rebuilt by issuing the following command from within the
+``docker`` directory::
+
+ ~/swh-environment/docker$ docker build --no-cache -t swh/stack .
+
+Details
+-------
+
+This runs the following services on their respective standard ports;
+all of the following services are configured to communicate with each
+other:
+
+- swh-storage-db: a ``softwareheritage`` instance db that stores the
+ Merkle DAG,
+
+- swh-objstorage: Content-addressable object storage,
+
+- swh-storage: Abstraction layer over the archive, allowing access to
+  all stored source code artifacts as well as their metadata,
+
+- swh-web: the Software Heritage web user interface,
+
+- swh-scheduler: the API service as well as 2 utilities, the runner and
+ the listener,
+
+- swh-lister: celery workers dedicated to running lister tasks,
+
+- swh-loaders: celery workers dedicated to importing/updating source
+ code content (VCS repos, source packages, etc.),
+
+- swh-journal: Persistent logger of changes to the archive, with
+ publish-subscribe support.
+
+That means you can start ingesting content with those services,
+following the same setup described in the getting-started guide,
+starting directly at
+https://docs.softwareheritage.org/devel/getting-started.html#step-4-ingest-repositories
+
+Exposed Ports
+~~~~~~~~~~~~~
+
+Several services have their listening ports exposed on the host:
+
+- amqp: 5072
+- kafka: 5092
+- nginx: 5080
+
+And for SWH services:
+
+- scheduler API: 5008
+- storage API: 5002
+- object storage API: 5003
+- indexer API: 5007
+- web app: 5004
+- deposit app: 5006
+
+Beware that these ports are not the same as the ports used from within
+the docker network. This means that the same command executed from the
+host or from a docker container will not use the same URLs to access
+services. For example, to use the ``celery`` utility from the host, you
+may type::
+
+ ~/swh-environment/docker$ CELERY_BROKER_URL=amqp://:5072// celery status
+ loader@61704103668c: OK
+ [...]
+
+To run the same command from within a container::
+
+ ~/swh-environment/docker$ docker-compose exec swh-scheduler celery status
+ loader@61704103668c: OK
+ [...]
+
+Managing tasks
+--------------
+
+One of the main components of the Software Heritage platform is the task
+system. These are used to manage everything related to background
+processes, like discovering new git repositories to import, ingesting
+them, checking that a known repository is up to date, etc.
+
+The task system is based on Celery but uses a custom database-based
+scheduler.
+
+So the term ‘task’ may designate either a Celery task or a SWH one
+(i.e., the entity in the database). When we simply refer to a “task” in
+the documentation, it designates the SWH task.
+
+When a SWH task is ready to be executed, a Celery task is created to
+handle the actual SWH task’s job. Note that not all Celery tasks are
+directly linked to a SWH task (some SWH tasks are implemented using a
+Celery task that spawns Celery subtasks).
+
+A (SWH) task can be ``recurring`` or ``oneshot``. ``oneshot`` tasks are
+only executed once, whereas ``recurring`` tasks are executed at regular
+intervals. The scheduling configuration of these recurring tasks can be
+set via the fields ``current_interval`` and ``priority`` (can be ‘high’,
+‘normal’ or ‘low’) of the task database entity.
+
+Inserting a new lister task
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To list the content of a source code provider like GitHub or a Debian
+distribution, you can add a new lister task.
+
+This task will (generally) scrape a web page or use a public API to
+identify the list of published software artefacts (git repos, debian
+source packages, etc.)
+
+Then, for each repository, a new task will be created to ingest this
+repository and keep it up to date.
+
+For example, to add a (oneshot) task that will list git repos on the
+0xacab.org gitlab instance, one can run (from the ``docker``
+directory)::
+
+ ~/swh-environment/docker$ docker-compose exec swh-scheduler \
+ swh scheduler task add list-gitlab-full \
+ -p oneshot url=https://0xacab.org/api/v4
+
+ Created 1 tasks
+
+ Task 12
+ Next run: just now (2018-12-19 14:58:49+00:00)
+ Interval: 90 days, 0:00:00
+ Type: list-gitlab-full
+ Policy: oneshot
+ Args:
+ Keyword args:
+ url=https://0xacab.org/api/v4
+
+This will insert a new task in the scheduler. To list existing tasks for
+a given task type::
+
+ ~/swh-environment/docker$ docker-compose exec swh-scheduler \
+ swh scheduler task list-pending list-gitlab-full
+
+ Found 1 list-gitlab-full tasks
+
+ Task 12
+ Next run: 2 minutes ago (2018-12-19 14:58:49+00:00)
+ Interval: 90 days, 0:00:00
+ Type: list-gitlab-full
+ Policy: oneshot
+ Args:
+ Keyword args:
+ url=https://0xacab.org/api/v4
+
+To list all existing task types::
+
+ ~/swh-environment/docker$ docker-compose exec swh-scheduler \
+ swh scheduler task-type list
+
+ Known task types:
+ load-svn-from-archive:
+ Loading svn repositories from svn dump
+ load-svn:
+ Create dump of a remote svn repository, mount it and load it
+ load-deposit:
+ Loading deposit archive into swh through swh-loader-tar
+ check-deposit:
+ Pre-checking deposit step before loading into swh archive
+ cook-vault-bundle:
+ Cook a Vault bundle
+ load-hg:
+ Loading mercurial repository swh-loader-mercurial
+ load-hg-from-archive:
+ Loading archive mercurial repository swh-loader-mercurial
+ load-git:
+ Update an origin of type git
+ list-github-incremental:
+ Incrementally list GitHub
+ list-github-full:
+ Full update of GitHub repos list
+ list-debian-distribution:
+ List a Debian distribution
+ list-gitlab-incremental:
+ Incrementally list a Gitlab instance
+ list-gitlab-full:
+ Full update of a Gitlab instance's repos list
+ list-pypi:
+ Full pypi lister
+ load-pypi:
+ Load Pypi origin
+ index-mimetype:
+ Mimetype indexer task
+ index-mimetype-for-range:
+ Mimetype Range indexer task
+ index-fossology-license:
+ Fossology license indexer task
+ index-fossology-license-for-range:
+ Fossology license range indexer task
+ index-origin-head:
+ Origin Head indexer task
+ index-revision-metadata:
+ Revision Metadata indexer task
+ index-origin-metadata:
+ Origin Metadata indexer task
+
+Monitoring activity
+~~~~~~~~~~~~~~~~~~~
+
+You can monitor the workers activity by connecting to the RabbitMQ
+console on ``http://localhost:5080/rabbitmq`` or the grafana dashboard
+on ``http://localhost:5080/grafana``.
+
+If you cannot see any task being executed, check the logs of the
+``swh-scheduler-runner`` service (below is an example of a failure due
+to the debian lister task not being properly registered on the
+swh-scheduler-runner service)::
+
+ ~/swh-environment/docker$ docker-compose logs --tail=10 swh-scheduler-runner
+ Attaching to docker_swh-scheduler-runner_1
+ swh-scheduler-runner_1 | "__main__", mod_spec)
+ swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
+ swh-scheduler-runner_1 | exec(code, run_globals)
+ swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/swh/scheduler/celery_backend/runner.py", line 107, in <module>
+ swh-scheduler-runner_1 | run_ready_tasks(main_backend, main_app)
+ swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/swh/scheduler/celery_backend/runner.py", line 81, in run_ready_tasks
+ swh-scheduler-runner_1 | task_types[task['type']]['backend_name']
+ swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/celery/app/registry.py", line 21, in __missing__
+ swh-scheduler-runner_1 | raise self.NotRegistered(key)
+ swh-scheduler-runner_1 | celery.exceptions.NotRegistered: 'swh.lister.debian.tasks.DebianListerTask'
+
+Using the docker setup for development and integration testing
+--------------------------------------------------------------
+
+If you hack the code of one or more archive components with a virtual
+env based setup as described in the `developer setup guide
+<https://docs.softwareheritage.org/devel/developer-setup.html>`_, you
+may want to test your modifications in a working Software Heritage
+instance. The simplest way to achieve this is to use this docker-based
+environment.
+
+If you haven’t followed the `developer setup guide
+<https://docs.softwareheritage.org/devel/developer-setup.html>`_, you
+must clone the swh-environment repo in your ``swh-environment``
+directory::
+
+ ~/swh-environment$ git clone https://forge.softwareheritage.org/source/swh-environment.git .
+
+Note the ``.`` at the end of this command: we want the git repository
+to be cloned directly in the ``~/swh-environment`` directory, not in a
+subdirectory. Also note that if you haven’t done it yet and you want to
+hack the source code of one or more Software Heritage packages, you
+really should read the `developer setup guide
+<https://docs.softwareheritage.org/devel/developer-setup.html>`_.
+
+From there, we will check out or update all the swh packages::
+
+ ~/swh-environment$ ./bin/update
+
+Install a swh package from sources in a container
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to run a docker container with some swh packages
+installed from sources instead of using the latest published packages
+from PyPI. To do this you must write a docker-compose override file
+(``docker-compose.override.yml``). An example is given in the
+``docker-compose.override.yml.example`` file:
+
+.. code:: yaml
+
+ version: '2'
+
+ services:
+ swh-objstorage:
+ volumes:
+ - "$HOME/swh-environment/swh-objstorage:/src/swh-objstorage"
+
+The file named ``docker-compose.override.yml`` will automatically be
+loaded by ``docker-compose``.
+
+This example shows the simplest case of the ``swh-objstorage`` package:
+you just have to mount it in the container under ``/src`` and the
+entrypoint will ensure every swh-\* package found in ``/src/`` is
+installed (using ``pip install -e``, so you can easily hack your code).
+If the application you play with has autoreload support, there is no
+need to restart the impacted container.
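Several packages can be mounted this way at once. For example, a hypothetical override mounting both the objstorage and storage sources (assuming the corresponding source trees exist under ``$HOME/swh-environment``) could look like:

```yaml
version: '2'

services:
  swh-objstorage:
    volumes:
      - "$HOME/swh-environment/swh-objstorage:/src/swh-objstorage"
  swh-storage:
    volumes:
      - "$HOME/swh-environment/swh-storage:/src/swh-storage"
```

Each listed service then installs the mounted packages found under its ``/src/`` at startup.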
+
+Using locally installed swh tools with docker
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In all the examples above, we have executed swh commands from within a
+running container. Now that these swh commands are also available
+locally in our virtual env, we can use them to interact with swh
+services running in docker containers.
+
+For this, we just need to configure a few environment variables. First,
+ensure your Software Heritage virtualenv is activated (here, using
+virtualenvwrapper)::
+
+ ~$ workon swh
+ (swh) ~/swh-environment$ export SWH_SCHEDULER_URL=http://127.0.0.1:5008/
+ (swh) ~/swh-environment$ export CELERY_BROKER_URL=amqp://127.0.0.1:5072/
+
+Now we can use the ``celery`` command directly to control the celery
+system running in the docker environment::
+
+ (swh) ~/swh-environment$ celery status
+ vault@c9fef1bbfdc1: OK
+ listers@ba66f18e7d02: OK
+ indexer@cb14c33cbbfb: OK
+ loader@61704103668c: OK
+
+ 4 nodes online.
+ (swh) ~/swh-environment$ celery control -d loader@61704103668c pool_grow 3
+
+And we can use the ``swh scheduler`` command just the same::
+
+ (swh) ~/swh-environment$ swh scheduler task-type list
+ Known task types:
+ index-fossology-license:
+ Fossology license indexer task
+ index-mimetype:
+ Mimetype indexer task
+ [...]
+
+Make your life a bit easier
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When you use virtualenvwrapper, you can add postactivate commands::
+
+ (swh) ~/swh-environment$ cat >>$VIRTUAL_ENV/bin/postactivate <<'EOF'
+ # unfortunately, the interface cmd for the click autocompletion
+ # depends on the shell
+ # https://click.palletsprojects.com/en/7.x/bashcomplete/#activation
+
+ shell=$(basename $SHELL)
+ case "$shell" in
+ "zsh")
+ autocomplete_cmd=source_zsh
+ ;;
+ *)
+ autocomplete_cmd=source
+ ;;
+ esac
+
+ eval "$(_SWH_COMPLETE=$autocomplete_cmd swh)"
+ export SWH_SCHEDULER_URL=http://127.0.0.1:5008/
+ export CELERY_BROKER_URL=amqp://127.0.0.1:5072/
+ export COMPOSE_FILE=~/swh-environment/docker/docker-compose.yml:~/swh-environment/docker/docker-compose.override.yml
+ alias doco=docker-compose
+
+ EOF
+
+This postactivate script:
+
+- installs a shell completion handler for the ``swh`` command,
+- presets a bunch of environment variables:
+
+  - ``SWH_SCHEDULER_URL`` so that you can just run ``swh scheduler``
+    against the scheduler API instance running in docker, without
+    having to specify the endpoint URL,
+
+  - ``CELERY_BROKER_URL`` so you can execute the ``celery`` tool
+    (without cli options) against the rabbitmq server running in the
+    docker environment,
+
+  - ``COMPOSE_FILE`` so you can run ``docker-compose`` from
+    anywhere,
+
+- creates an alias ``doco`` for ``docker-compose``, because the full
+  command is way too long to type.
+
+So now you can easily:
+
+- Start the SWH platform::
+
+ (swh) ~/swh-environment$ doco up -d
+ [...]
+
+- Check celery::
+
+ (swh) ~/swh-environment$ celery status
+ listers@50ac2185c6c9: OK
+ loader@b164f9055637: OK
+ indexer@33bc6067a5b8: OK
+
+- List task-types::
+
+ (swh) ~/swh-environment$ swh scheduler task-type list
+ [...]
+
+- Get more info on a task type::
+
+ (swh) ~/swh-environment$ swh scheduler task-type list -v -t load-hg
+ Known task types:
+ load-hg: swh.loader.mercurial.tasks.LoadMercurial
+ Loading mercurial repository swh-loader-mercurial
+ interval: 1 day, 0:00:00 [1 day, 0:00:00, 1 day, 0:00:00]
+ backoff_factor: 1.0
+ max_queue_length: 1000
+ num_retries: None
+ retry_delay: None
+
+- Add a new task::
+
+ (swh) ~/swh-environment$ swh scheduler task add load-hg \
+ origin_url=https://hg.logilab.org/master/cubicweb
+ Created 1 tasks
+ Task 1
+ Next run: just now (2019-02-06 12:36:58+00:00)
+ Interval: 1 day, 0:00:00
+ Type: load-hg
+ Policy: recurring
+ Args:
+ Keyword args:
+ origin_url: https://hg.logilab.org/master/cubicweb
+
+- Respawn a task::
+
+ (swh) ~/swh-environment$ swh scheduler task respawn 1
+
+Data persistence for a development setting
+------------------------------------------
+
+The default ``docker-compose.yml`` configuration is not geared towards
+data persistence, but application testing.
+
+Volumes defined in the associated images are anonymous and may be left
+unused or removed on the next ``docker-compose up``.
+
+One way to make sure these volumes persist is to use named volumes. The
+volumes may be defined as follows in a ``docker-compose.override.yml``.
+Note that volume definitions are merged with other compose files based
+on destination path.
+
+::
+
+ services:
+ swh-storage-db:
+ volumes:
+ - "swh_storage_data:/var/lib/postgresql/data"
+ swh-objstorage:
+ volumes:
+ - "swh_objstorage_data:/srv/softwareheritage/objects"
+
+ volumes:
+ swh_storage_data:
+ swh_objstorage_data:
+
+This way, ``docker-compose down`` without the ``-v`` flag will not
+remove those volumes and data will persist.
+
+Starting a kafka-powered mirror of the storage
+----------------------------------------------
+
+This repo comes with an optional ``docker-compose.storage-mirror.yml``
+compose file that can be used to test the kafka-powered mirror
+mechanism for the main storage.
+
+This can be used like::
+
+ ~/swh-environment/docker$ docker-compose -f docker-compose.yml -f docker-compose.storage-mirror.yml up -d
+ [...]
+
+Compared to the original compose file, this will:
+
+- override the swh-storage service to activate the kafka direct writer
+  on swh.journal.objects prefixed topics using the swh.storage.master
+  ID,
+- override the swh-web service to make it use the mirror instead of
+  the master storage,
+- start a db for the mirror,
+- start a storage service based on this db,
+- start a replayer service that runs the process that listens to kafka
+  to keep the mirror in sync.
+
+When using it, you will have a setup in which the master storage is
+used by workers and most other services, whereas the storage mirror is
+used by the web application and should be kept in sync with the master
+storage by kafka.
+
+Note that the object storage is not replicated here, only the graph
+storage.
+
+Starting the backfiller
+-----------------------
+
+The backfiller reads the objects within the range [start-object,
+end-object] from the storage and sends them to the kafka topics.
+
+::
+
+ (swh)$ docker-compose \
+ -f docker-compose.yml \
+ -f docker-compose.storage-mirror.yml \
+ -f docker-compose.storage-mirror.override.yml \
+ run \
+ swh-journal-backfiller \
+ snapshot \
+ --start-object 000000 \
+ --end-object 000001 \
+ --dry-run
+
+Using Sentry
+------------
+
+All entrypoints to SWH code (CLI, gunicorn, celery, …) are, or should
+be, instrumented using Sentry. By default this is disabled, but if you
+run your own Sentry instance, you can use it.
+
+To do so, you must get a DSN from your Sentry instance, and set it as
+the value of ``SWH_SENTRY_DSN`` in the file ``env/common_python.env``.
+You may also set it per-service in the ``environment`` section of each
+service in ``docker-compose.override.yml``.
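For example, a hypothetical ``docker-compose.override.yml`` fragment enabling Sentry only for the web frontend could look like the following (the DSN shown is a placeholder, not a real endpoint):

```yaml
version: '2'

services:
  swh-web:
    environment:
      SWH_SENTRY_DSN: "https://<key>@sentry.example.com/<project-id>"
```

Setting the variable per-service this way takes precedence over the value from ``env/common_python.env`` for that service only.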
+
+Caveats
+-------
+
+Running a lister task can lead to a lot of loading tasks, which can fill
+your hard drive pretty fast. Make sure to monitor your available storage
+space regularly when playing with this stack.
+
+Also, a few containers (``swh-storage``, ``swh-xxx-db``) use a volume
+for storing the blobs or the database files. With the default
+configuration provided in the ``docker-compose.yml`` file, these volumes
+are not persistent, so removing the containers will delete the volumes!
+
+Also note that for the ``swh-objstorage``, since the volume can be
+pretty big, the remove operation can be quite long (several minutes is
+not uncommon), which may mess a bit with the ``docker-compose`` command.
+
+If you have an error message like::
+
+   Error response from daemon: removal of container 928de3110381 is already
+   in progress
+
+it means that you need to wait for this process to finish before being
+able to (re)start your docker stack again.
