diff --git a/docker/README.md b/docker/README.md index 0b59f8d..38db0f5 100644 --- a/docker/README.md +++ b/docker/README.md @@ -1,662 +1,673 @@ # Docker environment This directory contains Dockerfiles to run a small Software Heritage instance on development machines. The end goal is to smooth the contributors'/developers' workflow. Focus on coding, not configuring! WARNING: Running a Software Heritage instance on your machine can consume quite a bit of resources: if you play a bit too hard (e.g., if you try to list all GitHub repositories with the corresponding lister), you may fill your hard drive, and consume a lot of CPU, memory and network bandwidth. ## Dependencies This uses docker with docker-compose, so ensure you have a working docker environment and docker-compose installed. We recommend using the latest version of docker, so please read https://docs.docker.com/install/linux/docker-ce/debian/ for more details on how to install docker on your machine. On a Debian system, docker-compose can be installed from the Debian repositories: ``` ~$ sudo apt install docker-compose ``` ## Quick start First, change to the docker dir if you aren't there yet: ``` ~$ cd swh-environment/docker ``` Then, start the containers: ``` ~/swh-environment/docker$ docker-compose up -d [...] Creating docker_amqp_1 ... done Creating docker_zookeeper_1 ... done Creating docker_kafka_1 ... done Creating docker_flower_1 ... done Creating docker_swh-scheduler-db_1 ... done [...] ``` This will build the docker images and run them. Check that everything is running fine with: ``` ~/swh-environment/docker$ docker-compose ps Name Command State Ports ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- docker_amqp_1 docker-entrypoint.sh rabbi ... Up 15671/tcp, 0.0.0.0:5018->15672/tcp, 25672/tcp, 4369/tcp, 5671/tcp, 5672/tcp docker_flower_1 flower --broker=amqp://gue ...
Up 0.0.0.0:5555->5555/tcp docker_kafka_1 start-kafka.sh Up 0.0.0.0:9092->9092/tcp docker_swh-deposit-db_1 docker-entrypoint.sh postgres Up 5432/tcp docker_swh-deposit_1 /entrypoint.sh Up 0.0.0.0:5006->5006/tcp [...] ``` The startup of some containers may fail the first time for dependency-related problems. If some containers fail to start, just run the `docker-compose up -d` command again. If a container really refuses to start properly, you can check why using the `docker-compose logs` command. For example: ``` ~/swh-environment/docker$ docker-compose logs swh-lister Attaching to docker_swh-lister_1 [...] swh-lister_1 | Processing /src/swh-scheduler swh-lister_1 | Could not install packages due to an EnvironmentError: [('/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz', '/tmp/pip-req-build-pm7nsax3/.hypothesis/unicodedata/8.0.0/charmap.json.gz', "[Errno 13] Permission denied: '/src/swh-scheduler/.hypothesis/unicodedata/8.0.0/charmap.json.gz'")] swh-lister_1 | ``` Once all containers are running, you can use the web interface by opening http://localhost:5080/ in your web browser. At this point, the archive is empty and needs to be filled with some content. To do so, you can create tasks that will scrape a forge. For example, to inject the code from the https://0xacab.org gitlab forge: ``` ~/swh-environment/docker$ docker-compose exec swh-scheduler \ swh scheduler task add list-gitlab-full \ -p oneshot url=https://0xacab.org/api/v4 Created 1 tasks Task 1 Next run: just now (2018-12-19 14:58:49+00:00) Interval: 90 days, 0:00:00 Type: list-gitlab-full Policy: oneshot Args: Keyword args: url=https://0xacab.org/api/v4 ``` This task will scrape the forge's project list and create subtasks to inject each git repository found there. This will take a bit of time to complete.
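While ingestion is running, you can check how the archive grows. As a quick sketch (assuming the web API is reachable behind nginx on port 5080 and serves the stat counters endpoint; adjust the base URL to your setup), a few lines of Python can fetch and summarize the object counters:

```python
import json
import urllib.request


def summarize_counters(counters: dict) -> str:
    """Render the per-object-type counters as a one-line summary."""
    return ", ".join(f"{name}: {count}" for name, count in sorted(counters.items()))


def fetch_counters(base_url: str = "http://localhost:5080") -> dict:
    """Fetch archive object counters from the web API (URL is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/1/stat/counters/") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires the docker environment to be up and reachable.
    print(summarize_counters(fetch_counters()))
```

Rerun it periodically; the counters should increase as loader tasks complete.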
To increase the speed at which git repositories are imported, you can spawn more `swh-loader-git` workers: ``` ~/swh-environment/docker$ docker-compose exec swh-scheduler \ celery status listers@50ac2185c6c9: OK loader@b164f9055637: OK indexer@33bc6067a5b8: OK vault@c9fef1bbfdc1: OK 4 nodes online. ~/swh-environment/docker$ docker-compose exec swh-scheduler \ celery control pool_grow 3 -d loader@b164f9055637 -> loader@b164f9055637: OK pool will grow ~/swh-environment/docker$ docker-compose exec swh-scheduler \ celery inspect -d loader@b164f9055637 stats | grep prefetch_count "prefetch_count": 4 ``` Now there are 4 workers ingesting git repositories. You can also increase the number of `swh-loader-git` containers: ``` ~/swh-environment/docker$ docker-compose up -d --scale swh-loader=4 [...] Creating docker_swh-loader_2 ... done Creating docker_swh-loader_3 ... done Creating docker_swh-loader_4 ... done ``` ## Updating the docker image All containers started by `docker-compose` are bound to a docker image named `swh/stack` including all the software components of Software Heritage. When new versions of these components are released, the docker image will not be automatically updated. In order to update all Software Heritage components to their latest version, the docker image needs to be explicitly rebuilt by issuing the following command from within the `docker` directory: ``` ~/swh-environment/docker$ docker build --no-cache -t swh/stack . 
``` ## Details This runs the following services on their respective standard ports; all of these services are configured to communicate with each other: - swh-storage-db: a `softwareheritage` instance db that stores the Merkle DAG, - swh-objstorage: Content-addressable object storage, - swh-storage: Abstraction layer over the archive, allowing access to all stored source code artifacts as well as their metadata, - swh-web: the Software Heritage web user interface, - swh-scheduler: the API service as well as 2 utilities, the runner and the listener, - swh-lister: celery workers dedicated to running lister tasks, - swh-loaders: celery workers dedicated to importing/updating source code content (VCS repos, source packages, etc.), - swh-journal: Persistent logger of changes to the archive, with publish-subscribe support. That means you can start the ingestion with those services, following the same setup described in the getting-started guide, starting directly at https://docs.softwareheritage.org/devel/getting-started.html#step-4-ingest-repositories ### Exposed Ports Several services have their listening ports exposed on the host: - amqp: 5072 - kafka: 5092 - nginx: 5080 And for SWH services: - scheduler API: 5008 - storage API: 5002 - object storage API: 5003 - indexer API: 5007 - web app: 5004 - deposit app: 5006 Beware that these ports are not the same as the ports used from within the docker network. This means that the same command executed from the host or from a docker container will not use the same URLs to access services. For example, to use the `celery` utility from the host, you may type: ``` ~/swh-environment/docker$ CELERY_BROKER_URL=amqp://127.0.0.1:5072// celery status loader@61704103668c: OK [...] ``` To run the same command from within a container: ``` ~/swh-environment/docker$ docker-compose exec swh-scheduler celery status loader@61704103668c: OK [...]
``` ## Managing tasks One of the main components of the Software Heritage platform is the task system. It is used to manage everything related to background processing, like discovering new git repositories to import, ingesting them, checking that a known repository is up to date, etc. The task system is based on Celery but uses a custom database-based scheduler. The term 'task' may therefore designate either a Celery task or a SWH one (i.e. the entity in the database); when this documentation simply says "task", it means the SWH task. When a SWH task is ready to be executed, a Celery task is created to handle the actual SWH task's job. Note that not all Celery tasks are directly linked to a SWH task (some SWH tasks are implemented using a Celery task that spawns Celery subtasks). A (SWH) task can be `recurring` or `oneshot`. `oneshot` tasks are only executed once, whereas `recurring` tasks are executed regularly. The scheduling configuration of these recurring tasks can be set via the fields `current_interval` and `priority` (can be 'high', 'normal' or 'low') of the task database entity. ### Inserting a new lister task To list the content of a source code provider like GitHub or a Debian distribution, you may add a new task for this. This task will (generally) scrape a web page or use a public API to identify the list of published software artefacts (git repos, Debian source packages, etc.). Then, for each repository, a new task will be created to ingest this repository and keep it up to date.
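That fan-out can be sketched as a toy model (illustration only; the real listers register their tasks through the swh-scheduler API, and `plan_load_tasks` is a hypothetical helper):

```python
def plan_load_tasks(repo_urls):
    """Toy model of what a lister does after scraping a forge: create one
    recurring loading task per discovered repository. `load-git` is the
    task type used to update an origin of type git."""
    return [
        {"type": "load-git", "policy": "recurring", "url": url}
        for url in repo_urls
    ]


# Two repositories discovered on a (hypothetical) forge listing:
tasks = plan_load_tasks(["https://0xacab.org/a/repo1.git",
                         "https://0xacab.org/b/repo2.git"])
```

Each resulting task is then scheduled independently, which is why a single lister task can fill the loader queues with many loading tasks.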
For example, to add a (one shot) task that will list git repos on the 0xacab.org gitlab instance, one can do (from this git repository): ``` ~/swh-environment/docker$ docker-compose exec swh-scheduler \ swh scheduler task add list-gitlab-full \ -p oneshot url=https://0xacab.org/api/v4 Created 1 tasks Task 12 Next run: just now (2018-12-19 14:58:49+00:00) Interval: 90 days, 0:00:00 Type: list-gitlab-full Policy: oneshot Args: Keyword args: url=https://0xacab.org/api/v4 ``` This will insert a new task in the scheduler. To list existing tasks for a given task type: ``` ~/swh-environment/docker$ docker-compose exec swh-scheduler \ swh scheduler task list-pending list-gitlab-full Found 1 list-gitlab-full tasks Task 12 Next run: 2 minutes ago (2018-12-19 14:58:49+00:00) Interval: 90 days, 0:00:00 Type: list-gitlab-full Policy: oneshot Args: Keyword args: url=https://0xacab.org/api/v4 ``` To list all existing task types: ``` ~/swh-environment/docker$ docker-compose exec swh-scheduler \ swh scheduler task-type list Known task types: load-svn-from-archive: Loading svn repositories from svn dump load-svn: Create dump of a remote svn repository, mount it and load it load-deposit: Loading deposit archive into swh through swh-loader-tar check-deposit: Pre-checking deposit step before loading into swh archive cook-vault-bundle: Cook a Vault bundle load-hg: Loading mercurial repository swh-loader-mercurial load-hg-from-archive: Loading archive mercurial repository swh-loader-mercurial load-git: Update an origin of type git list-github-incremental: Incrementally list GitHub list-github-full: Full update of GitHub repos list list-debian-distribution: List a Debian distribution list-gitlab-incremental: Incrementally list a Gitlab instance list-gitlab-full: Full update of a Gitlab instance's repos list list-pypi: Full pypi lister load-pypi: Load Pypi origin index-mimetype: Mimetype indexer task index-mimetype-for-range: Mimetype Range indexer task index-fossology-license: Fossology 
license indexer task index-fossology-license-for-range: Fossology license range indexer task index-origin-head: Origin Head indexer task index-revision-metadata: Revision Metadata indexer task index-origin-metadata: Origin Metadata indexer task ``` ### Monitoring activity You can monitor the workers' activity by connecting to the RabbitMQ console on `http://localhost:5080/rabbitmq` or the grafana dashboard on `http://localhost:5080/grafana`. If you cannot see any task being executed, check the logs of the `swh-scheduler-runner` service (here is a failure example due to the Debian lister task not being properly registered on the swh-scheduler-runner service): ``` ~/swh-environment/docker$ docker-compose logs --tail=10 swh-scheduler-runner Attaching to docker_swh-scheduler-runner_1 swh-scheduler-runner_1 | "__main__", mod_spec) swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code swh-scheduler-runner_1 | exec(code, run_globals) swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/swh/scheduler/celery_backend/runner.py", line 107, in swh-scheduler-runner_1 | run_ready_tasks(main_backend, main_app) swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/swh/scheduler/celery_backend/runner.py", line 81, in run_ready_tasks swh-scheduler-runner_1 | task_types[task['type']]['backend_name'] swh-scheduler-runner_1 | File "/usr/local/lib/python3.7/site-packages/celery/app/registry.py", line 21, in __missing__ swh-scheduler-runner_1 | raise self.NotRegistered(key) swh-scheduler-runner_1 | celery.exceptions.NotRegistered: 'swh.lister.debian.tasks.DebianListerTask' ``` ## Using the docker setup for development and integration testing If you hack the code of one or more archive components with a virtualenv-based setup as described in the [[https://docs.softwareheritage.org/devel/developer-setup.html|developer setup guide]], you may want to test your modifications in a working Software Heritage instance.
The simplest way to achieve this is to use this docker-based environment. If you haven't followed the [[https://docs.softwareheritage.org/devel/developer-setup.html|developer setup guide]], you must clone the [swh-environment] repo in your `swh-environment` directory: ``` ~/swh-environment$ git clone https://forge.softwareheritage.org/source/swh-environment.git . ``` Note the `.` at the end of this command: we want the git repository to be cloned directly in the `~/swh-environment` directory, not in a subdirectory. Also note that if you haven't done it yet and you want to hack the source code of one or more Software Heritage packages, you really should read the [[https://docs.softwareheritage.org/devel/developer-setup.html|developer setup guide]]. From there, we will check out or update all the swh packages: ``` ~/swh-environment$ ./bin/update ``` ### Install a swh package from sources in a container It is possible to run a docker container with some swh packages installed from sources instead of using the latest published packages from pypi. To do this you must write a docker-compose override file (`docker-compose.override.yml`). An example is given in the `docker-compose.override.yml.example` file: ``` yaml version: '2' services: swh-objstorage: volumes: - "$HOME/swh-environment/swh-objstorage:/src/swh-objstorage" ``` The file named `docker-compose.override.yml` will automatically be loaded by `docker-compose`. This example shows the simplest case of the `swh-objstorage` package: you just have to mount it in the container in `/src` and the entrypoint will ensure every swh-* package found in `/src/` is installed (using `pip install -e` so you can easily hack your code). If the application you play with has autoreload support, there is no need to restart the impacted container. Note: if a container fails to start when using local sources for one or more swh packages, it's most probably due to permission problems on cache files.
For example, if you have executed tests locally (using pytest or tox), you have cache files (__pycache__ etc.) that will prevent `pip install` from working within the docker container. The solution is to clean these files and directories before trying to spawn the containers. ``` ~/swh-environment$ find . -type d -name __pycache__ -exec rm -rf {} \; ~/swh-environment$ find . -type d -name .tox -exec rm -rf {} \; ~/swh-environment$ find . -type d -name .hypothesis -exec rm -rf {} \; ``` ### Using locally installed swh tools with docker In all the examples above, we have executed swh commands from within a running container. Now that these swh commands are also available locally in our virtualenv, we can use them to interact with the swh services running in docker containers. For this, we just need to configure a few environment variables. First, ensure your Software Heritage virtualenv is activated (here, using virtualenvwrapper): ``` ~$ workon swh (swh) ~/swh-environment$ export SWH_SCHEDULER_URL=http://127.0.0.1:5008/ (swh) ~/swh-environment$ export CELERY_BROKER_URL=amqp://127.0.0.1:5072/ ``` Now we can use the `celery` command directly to control the celery system running in the docker environment: ``` (swh) ~/swh-environment$ celery status vault@c9fef1bbfdc1: OK listers@ba66f18e7d02: OK indexer@cb14c33cbbfb: OK loader@61704103668c: OK 4 nodes online. (swh) ~/swh-environment$ celery control -d loader@61704103668c pool_grow 3 ``` And we can use the `swh scheduler` command in the same way: ``` (swh) ~/swh-environment$ swh scheduler task-type list Known task types: index-fossology-license: Fossology license indexer task index-mimetype: Mimetype indexer task [...]
``` ### Make your life a bit easier When you use virtualenvwrapper, you can add postactivation commands: ``` (swh) ~/swh-environment$ cat >>$VIRTUAL_ENV/bin/postactivate <<'EOF' # unfortunately, the interface cmd for the click autocompletion # depends on the shell # https://click.palletsprojects.com/en/7.x/bashcomplete/#activation shell=$(basename $SHELL) case "$shell" in "zsh") autocomplete_cmd=source_zsh ;; *) autocomplete_cmd=source ;; esac eval "$(_SWH_COMPLETE=$autocomplete_cmd swh)" export SWH_SCHEDULER_URL=http://127.0.0.1:5008/ export CELERY_BROKER_URL=amqp://127.0.0.1:5072/ export COMPOSE_FILE=~/swh-environment/docker/docker-compose.yml:~/swh-environment/docker/docker-compose.override.yml alias doco=docker-compose function swhclean { find ~/swh-environment -type d -name __pycache__ -exec rm -rf {} \; find ~/swh-environment -type d -name .tox -exec rm -rf {} \; find ~/swh-environment -type d -name .hypothesis -exec rm -rf {} \; } EOF ``` This postactivate script: - installs a shell completion handler for the `swh` command, - presets a bunch of environment variables: - `SWH_SCHEDULER_URL` so that you can just run `swh scheduler` against the scheduler API instance running in docker, without having to specify the endpoint URL, - `CELERY_BROKER_URL` so you can execute the `celery` tool (without cli options) against the rabbitmq server running in the docker environment, - `COMPOSE_FILE` so you can run `docker-compose` from everywhere, - creates an alias `doco` for `docker-compose` because this is way too long to type, - adds a `swhclean` shell function to clean your source directories so that there is no conflict with docker containers using local swh repositories (see below). This will delete any `.tox`, `__pycache__` and `.hypothesis` directory found in your swh-environment directory. So now you can easily: * Start the SWH platform: ``` (swh) ~/swh-environment$ doco up -d [...]
``` * Check celery: ``` (swh) ~/swh-environment$ celery status listers@50ac2185c6c9: OK loader@b164f9055637: OK indexer@33bc6067a5b8: OK ``` * List task-types: ``` (swh) ~/swh-environment$ swh scheduler task-type list [...] ``` * Get more info on a task type: ``` (swh) ~/swh-environment$ swh scheduler task-type list -v -t load-hg Known task types: load-hg: swh.loader.mercurial.tasks.LoadMercurial Loading mercurial repository swh-loader-mercurial interval: 1 day, 0:00:00 [1 day, 0:00:00, 1 day, 0:00:00] backoff_factor: 1.0 max_queue_length: 1000 num_retries: None retry_delay: None ``` * Add a new task: ``` (swh) ~/swh-environment$ swh scheduler task add load-hg \ origin_url=https://hg.logilab.org/master/cubicweb Created 1 tasks Task 1 Next run: just now (2019-02-06 12:36:58+00:00) Interval: 1 day, 0:00:00 Type: load-hg Policy: recurring Args: Keyword args: origin_url: https://hg.logilab.org/master/cubicweb ``` * Respawn a task: ``` (swh) ~/swh-environment$ swh scheduler task respawn 1 ``` ## Starting a kafka-powered replica of the storage This repo comes with an optional `docker-compose.storage-replica.yml` docker compose file that can be used to test the kafka-powered replication mechanism for the main storage. This can be used like: ``` ~/swh-environment/docker$ docker-compose -f docker-compose.yml -f docker-compose.storage-replica.yml up -d [...] ``` Compared to the original compose file, this will: - override the swh-storage service to activate the kafka direct writer on the swh.journal.objects prefixed topics using the swh.storage.master ID, - override the swh-web service to make it use the replica instead of the master storage, - start a db for the replica, - start a storage service based on this db, - start a replayer service that runs the process that listens to kafka to keep the replica in sync.
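The replayer's role can be pictured with a toy in-memory model; the real service consumes the swh.journal.objects topics from kafka via the swh.journal client, so everything below (names included) is purely illustrative:

```python
def replay(journal, replica):
    """Apply journal entries (object_type, object_id, payload) to the
    replica in order, so the replica converges to the master's state."""
    for object_type, object_id, payload in journal:
        replica.setdefault(object_type, {})[object_id] = payload
    return replica


# A fake journal: two successive writes for the same object; the replica
# keeps the latest version, as with replaying a kafka log.
journal = [
    ("revision", "rev1", {"message": "initial commit"}),
    ("revision", "rev1", {"message": "initial commit", "synthetic": False}),
]
replica = replay(journal, {})
```

As long as the replayer keeps consuming, any write that reaches the journal eventually reaches the replica, which is what makes it safe to point read-only services like swh-web at the replica.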
When using it, you will have a setup in which the master storage is used by workers and most other services, whereas the storage replica will be used by the web application and should be kept in sync with the master storage by kafka. Note that the object storage is not replicated here, only the graph storage. ## Starting the backfiller The backfiller reads the objects within the range [start-object, end-object] from the storage and sends them to the kafka topics: ``` (swh)$ docker-compose \ -f docker-compose.yml \ -f docker-compose.storage-replica.yml \ -f docker-compose.storage-replica.override.yml \ run \ swh-journal-backfiller \ snapshot \ --start-object 000000 \ --end-object 000001 \ --dry-run ``` + +## Using Sentry + +All entrypoints to SWH code (CLI, gunicorn, celery, ...) are, or should be, +instrumented using Sentry. By default this is disabled, but if you run your +own Sentry instance, you can use it. + +To do so, you must get a DSN from your Sentry instance, and set it as the +value of `SWH_SENTRY_DSN` in the file `env/common_python.env`. +You may also set it per service in the `environment` section of each service +in `docker-compose.override.yml`.
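Per process, that wiring amounts to something like the following sketch (`maybe_init_sentry` is a hypothetical helper shown for illustration; the actual instrumentation lives in the swh entrypoints):

```python
import os


def maybe_init_sentry() -> bool:
    """Enable Sentry reporting only when SWH_SENTRY_DSN is set in the
    environment; return whether reporting was enabled."""
    dsn = os.environ.get("SWH_SENTRY_DSN")
    if not dsn:
        return False
    import sentry_sdk  # imported lazily: only needed when a DSN is configured
    sentry_sdk.init(dsn=dsn)
    return True
```

With `SWH_SENTRY_DSN` unset (the default), this is a no-op, which matches the disabled-by-default behaviour described above.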
diff --git a/docker/docker-compose.yml b/docker/docker-compose.yml index 43a2392..8ebb4c9 100644 --- a/docker/docker-compose.yml +++ b/docker/docker-compose.yml @@ -1,407 +1,422 @@ version: '2' services: amqp: image: rabbitmq:3.6-management ports: - 5072:5672 # flower: # image: mher/flower # command: --broker=amqp://guest:guest@amqp:5672// --url_prefix=flower # ports: # - 5055:5555 # depends_on: # - amqp zookeeper: image: wurstmeister/zookeeper restart: always kafka: image: wurstmeister/kafka ports: - "5092:9092" env_file: ./env/kafka.env depends_on: - zookeeper kafka-manager: image: hlebalbau/kafka-manager:stable ports: - "5093:9000" environment: ZK_HOSTS: zookeeper:2181 APPLICATION_SECRET: random-secret command: -Dpidfile.path=/dev/null prometheus: image: prom/prometheus depends_on: - prometheus-statsd-exporter command: # Needed for the reverse-proxy - "--web.external-url=/prometheus" - "--config.file=/etc/prometheus/prometheus.yml" volumes: - "./conf/prometheus.yml:/etc/prometheus/prometheus.yml:ro" restart: unless-stopped prometheus-statsd-exporter: image: prom/statsd-exporter command: - "--statsd.mapping-config=/etc/prometheus/statsd-mapping.yml" volumes: - "./conf/prometheus-statsd-mapping.yml:/etc/prometheus/statsd-mapping.yml:ro" restart: unless-stopped prometheus-rabbitmq-exporter: image: kbudde/rabbitmq-exporter restart: unless-stopped environment: SKIP_QUEUES: "RPC_.*" MAX_QUEUES: 5000 RABBIT_URL: http://amqp:15672 grafana: image: grafana/grafana restart: unless-stopped depends_on: - prometheus environment: GF_SERVER_ROOT_URL: http://localhost:5080/grafana volumes: - "./conf/grafana/provisioning:/etc/grafana/provisioning:ro" - "./conf/grafana/dashboards:/var/lib/grafana/dashboards" nginx: image: nginx volumes: - "./conf/nginx.conf:/etc/nginx/nginx.conf:ro" ports: - 5081:80 # Scheduler swh-scheduler-db: image: postgres:11 env_file: + - ./env/common_python.env - ./env/scheduler-db.env environment: # unset PGHOST as db service crashes otherwise PGHOST: 
swh-scheduler: image: swh/stack build: ./ env_file: + - ./env/common_python.env - ./env/scheduler-db.env - ./env/scheduler.env - ./env/statsd.env environment: SWH_CONFIG_FILENAME: /scheduler.yml SWH_SCHEDULER_CONFIG_FILE: /scheduler.yml entrypoint: /entrypoint.sh depends_on: - swh-scheduler-db ports: - 5008:5008 volumes: - "./conf/scheduler.yml:/scheduler.yml:ro" - "./services/swh-scheduler/entrypoint.sh:/entrypoint.sh:ro" swh-scheduler-listener: image: swh/stack build: ./ env_file: + - ./env/common_python.env - ./env/scheduler-db.env - ./env/scheduler.env - ./env/statsd.env environment: SWH_CONFIG_FILENAME: /scheduler.yml SWH_SCHEDULER_CONFIG_FILE: /scheduler.yml entrypoint: /entrypoint.sh command: start-listener depends_on: - swh-scheduler - amqp volumes: - "./conf/scheduler.yml:/scheduler.yml:ro" - "./services/swh-scheduler-worker/entrypoint.sh:/entrypoint.sh:ro" swh-scheduler-runner: image: swh/stack build: ./ env_file: + - ./env/common_python.env - ./env/scheduler-db.env - ./env/scheduler.env - ./env/statsd.env environment: SWH_CONFIG_FILENAME: /scheduler.yml SWH_SCHEDULER_CONFIG_FILE: /scheduler.yml entrypoint: /entrypoint.sh command: start-runner -p 10 depends_on: - swh-scheduler - amqp volumes: - "./conf/scheduler.yml:/scheduler.yml:ro" - "./services/swh-scheduler-worker/entrypoint.sh:/entrypoint.sh:ro" # Graph storage swh-storage-db: image: postgres:11 env_file: - ./env/storage-db.env environment: # unset PGHOST as db service crashes otherwise PGHOST: swh-storage: image: swh/stack build: ./ ports: - 5002:5002 depends_on: - swh-storage-db - swh-objstorage - kafka env_file: + - ./env/common_python.env - ./env/storage-db.env - ./env/statsd.env environment: SWH_CONFIG_FILENAME: /storage.yml STORAGE_BACKEND: postgresql entrypoint: /entrypoint.sh volumes: - "./conf/storage.yml:/storage.yml:ro" - "./services/swh-storage/entrypoint.sh:/entrypoint.sh:ro" # Object storage swh-objstorage: build: ./ image: swh/stack ports: - 5003:5003 env_file: + - 
./env/common_python.env - ./env/statsd.env environment: SWH_CONFIG_FILENAME: /objstorage.yml entrypoint: /entrypoint.sh volumes: - "./conf/objstorage.yml:/objstorage.yml:ro" - "./services/swh-objstorage/entrypoint.sh:/entrypoint.sh:ro" # Indexer storage swh-idx-storage-db: image: postgres:11 env_file: - ./env/indexers-db.env environment: # unset PGHOST as db service crashes otherwise PGHOST: swh-idx-storage: image: swh/stack build: ./ ports: - 5007:5007 depends_on: - swh-idx-storage-db env_file: + - ./env/common_python.env - ./env/indexers-db.env environment: SWH_CONFIG_FILENAME: /indexer_storage.yml entrypoint: /entrypoint.sh volumes: - "./conf/indexer_storage.yml:/indexer_storage.yml:ro" - "./services/swh-indexer-storage/entrypoint.sh:/entrypoint.sh:ro" # Web interface swh-web: build: ./ image: swh/stack ports: - 5004:5004 depends_on: - swh-objstorage - swh-storage - swh-idx-storage env_file: + - ./env/common_python.env - ./env/statsd.env environment: VERBOSITY: 3 DJANGO_SETTINGS_MODULE: swh.web.settings.development SWH_CONFIG_FILENAME: /web.yml entrypoint: /entrypoint.sh volumes: - "./conf/web.yml:/web.yml:ro" - "./services/swh-web/entrypoint.sh:/entrypoint.sh:ro" swh-deposit-db: image: postgres:11 env_file: - ./env/deposit-db.env environment: # unset PGHOST as db service crashes otherwise PGHOST: swh-deposit: image: swh/stack build: ./ ports: - 5006:5006 depends_on: - swh-deposit-db - swh-scheduler env_file: + - ./env/common_python.env - ./env/statsd.env + - ./env/deposit-db.env environment: VERBOSITY: 3 SWH_CONFIG_FILENAME: /deposit.yml DJANGO_SETTINGS_MODULE: swh.deposit.settings.production - env_file: - - ./env/deposit-db.env entrypoint: /entrypoint.sh volumes: - "./conf/deposit.yml:/deposit.yml:ro" - "./services/swh-deposit/entrypoint.sh:/entrypoint.sh:ro" swh-vault-db: image: postgres:11 env_file: - ./env/vault-db.env environment: # unset PGHOST as db service crashes otherwise PGHOST: swh-vault: image: swh/stack build: ./ env_file: + - 
./env/common_python.env - ./env/vault-db.env - ./env/statsd.env environment: SWH_CONFIG_FILENAME: /vault.yml command: server ports: - 5005:5005 depends_on: - swh-vault-db - swh-objstorage - swh-storage - swh-scheduler entrypoint: /entrypoint.sh volumes: - "./conf/vault.yml:/vault.yml:ro" - "./services/swh-vault/entrypoint.sh:/entrypoint.sh:ro" swh-vault-worker: image: swh/stack build: ./ command: worker env_file: + - ./env/common_python.env - ./env/statsd.env environment: SWH_CONFIG_FILENAME: /cooker.yml depends_on: - swh-vault - swh-storage entrypoint: /entrypoint.sh volumes: - "./conf/vault-worker.yml:/cooker.yml:ro" - "./services/swh-vault/entrypoint.sh:/entrypoint.sh:ro" # Lister Celery workers swh-listers-db: image: postgres:11 env_file: - ./env/listers-db.env environment: # unset PGHOST as db service crashes otherwise PGHOST: swh-lister: image: swh/stack build: ./ env_file: + - ./env/common_python.env - ./env/listers-db.env - ./env/listers.env - ./env/statsd.env user: swh environment: SWH_WORKER_INSTANCE: listers SWH_CONFIG_FILENAME: /lister.yml depends_on: - swh-listers-db - swh-scheduler - swh-scheduler-runner - swh-storage - amqp entrypoint: /entrypoint.sh volumes: - "./conf/lister.yml:/lister.yml:ro" - "./services/swh-listers-worker/entrypoint.sh:/entrypoint.sh:ro" # Loader Celery workers swh-loader: image: swh/stack build: ./ env_file: + - ./env/common_python.env - ./env/listers.env - ./env/statsd.env user: swh environment: SWH_WORKER_INSTANCE: loader SWH_CONFIG_FILENAME: /loader.yml entrypoint: /entrypoint.sh depends_on: - swh-storage - swh-scheduler - amqp volumes: - "./conf/loader.yml:/loader.yml:ro" - "./services/swh-loaders-worker/entrypoint.sh:/entrypoint.sh:ro" # Indexer Celery workers swh-indexer: image: swh/stack build: ./ user: swh env_file: + - ./env/common_python.env - ./env/indexers-db.env - ./env/indexers.env - ./env/statsd.env entrypoint: /entrypoint.sh depends_on: - swh-scheduler-runner - swh-idx-storage - swh-storage - swh-objstorage - 
       - amqp
     volumes:
       - "./conf/indexer.yml:/indexer.yml:ro"
       - "./services/swh-indexer-worker/entrypoint.sh:/entrypoint.sh:ro"
 
   # Journal related
   swh-indexer-journal-client:
     image: swh/stack
     build: ./
     entrypoint: /entrypoint.sh
+    env_file:
+      - ./env/common_python.env
     depends_on:
       - kafka
       - swh-storage
       - swh-scheduler
     volumes:
       - "./conf/indexer_journal_client.yml:/etc/softwareheritage/indexer/journal_client.yml:ro"
       - "./services/swh-indexer-journal-client/entrypoint.sh:/entrypoint.sh:ro"
diff --git a/docker/env/common_python.env b/docker/env/common_python.env
new file mode 100644
index 0000000..2e2011e
--- /dev/null
+++ b/docker/env/common_python.env
@@ -0,0 +1,5 @@
+# Forces Python's stdout to be shown in docker logs before they exit:
+PYTHONUNBUFFERED=1
+
+# Uncomment and set this to a Sentry DSN to report errors to Sentry:
+# SWH_SENTRY_DSN=
diff --git a/docker/services/swh-deposit/entrypoint.sh b/docker/services/swh-deposit/entrypoint.sh
index 8b6994f..07b7cf7 100755
--- a/docker/services/swh-deposit/entrypoint.sh
+++ b/docker/services/swh-deposit/entrypoint.sh
@@ -1,35 +1,36 @@
 #!/bin/bash
 
 set -ex
 
 source /srv/softwareheritage/utils/pyutils.sh
 setup_pip
 
 source /srv/softwareheritage/utils/pgsql.sh
 setup_pgsql
 
 if [ "$1" = 'shell' ] ; then
     exec bash -i
 else
     wait_pgsql
 
     echo "Migrating db"
     django-admin migrate --settings=${DJANGO_SETTINGS_MODULE}
 
     swh-deposit admin user exists test || \
         swh-deposit admin user create \
             --username test \
             --password test \
             --provider-url https://softwareheritage.org \
             --domain softwareheritage.org
 
     echo "starting swh-deposit server"
     exec gunicorn --bind 0.0.0.0:5006 \
          --reload \
          --threads 2 \
          --workers 2 \
          --log-level DEBUG \
          --timeout 3600 \
+         --config 'python:swh.core.api.gunicorn_config' \
          'django.core.wsgi:get_wsgi_application()'
 fi
diff --git a/docker/services/swh-indexer-storage/entrypoint.sh b/docker/services/swh-indexer-storage/entrypoint.sh
index ec5d92c..69484fe 100755
--- a/docker/services/swh-indexer-storage/entrypoint.sh
+++ b/docker/services/swh-indexer-storage/entrypoint.sh
@@ -1,32 +1,33 @@
 #!/bin/bash
 
 set -e
 
 source /srv/softwareheritage/utils/pyutils.sh
 setup_pip
 
 source /srv/softwareheritage/utils/pgsql.sh
 setup_pgsql
 
 case "$1" in
     "shell")
         exec bash -i
         ;;
     *)
         wait_pgsql
 
         echo Setup the database
         PGPASSWORD=${POSTGRES_PASSWORD} swh db-init \
             --db-name ${POSTGRES_DB} indexer
 
         echo Starting the swh-indexer-storage API server
         exec gunicorn --bind 0.0.0.0:5007 \
              --reload \
              --threads 2 \
              --workers 2 \
              --log-level DEBUG \
              --timeout 3600 \
+             --config 'python:swh.core.api.gunicorn_config' \
              'swh.indexer.storage.api.server:make_app_from_configfile()'
         ;;
 esac
diff --git a/docker/services/swh-objstorage/entrypoint.sh b/docker/services/swh-objstorage/entrypoint.sh
index 30fcefc..370fc20 100755
--- a/docker/services/swh-objstorage/entrypoint.sh
+++ b/docker/services/swh-objstorage/entrypoint.sh
@@ -1,24 +1,25 @@
 #!/bin/bash
 
 set -e
 
 source /srv/softwareheritage/utils/pyutils.sh
 setup_pip
 
 echo Installed Python packages:
 pip list
 
 if [ "$1" = 'shell' ] ; then
     exec bash -i
 else
     echo Starting the swh-objstorage API server
     exec gunicorn --bind 0.0.0.0:5003 \
          --worker-class aiohttp.worker.GunicornWebWorker \
          --log-level DEBUG \
          --threads 4 \
          --workers 2 \
          --reload \
          --timeout 3600 \
+         --config 'python:swh.core.api.gunicorn_config' \
          'swh.objstorage.api.server:make_app_from_configfile()'
 fi
diff --git a/docker/services/swh-scheduler/entrypoint.sh b/docker/services/swh-scheduler/entrypoint.sh
index b0a3381..ebdd30c 100755
--- a/docker/services/swh-scheduler/entrypoint.sh
+++ b/docker/services/swh-scheduler/entrypoint.sh
@@ -1,34 +1,35 @@
 #!/bin/bash
 
 set -e
 
 source /srv/softwareheritage/utils/pyutils.sh
 setup_pip
 
 source /srv/softwareheritage/utils/pgsql.sh
 setup_pgsql
 
 case "$1" in
     "shell")
         exec bash -i
         ;;
     "swh-scheduler")
         exec $@
         ;;
     *)
         wait_pgsql
 
         echo Setup the swh-scheduler API database
         PGPASSWORD=${POSTGRES_PASSWORD} swh db-init \
             --db-name ${POSTGRES_DB} scheduler
 
         echo Starting the swh-scheduler API server
         exec gunicorn --bind 0.0.0.0:5008 \
              --log-level DEBUG \
              --threads 2 \
              --workers 2 \
              --reload \
              --timeout 3600 \
+             --config 'python:swh.core.api.gunicorn_config' \
              'swh.scheduler.api.server:make_app_from_configfile()'
 esac
diff --git a/docker/services/swh-search/entrypoint.sh b/docker/services/swh-search/entrypoint.sh
index 2c44fb7..551a578 100755
--- a/docker/services/swh-search/entrypoint.sh
+++ b/docker/services/swh-search/entrypoint.sh
@@ -1,31 +1,32 @@
 #!/bin/bash
 
 set -e
 
 source /srv/softwareheritage/utils/pyutils.sh
 setup_pip
 
 case "$1" in
     "shell")
         exec bash -i
         ;;
     *)
         echo Starting the swh-search API server
         wait-for-it elasticsearch:9200 -s --timeout=0
         echo "Waiting for ElasticSearch cluster to be up"
         cat << EOF | python3
 import elasticsearch
 es = elasticsearch.Elasticsearch(['elasticsearch:9200'])
 es.cluster.health(wait_for_status='yellow')
 EOF
         echo "ElasticSearch cluster is up"
         swh search -C $SWH_CONFIG_FILENAME initialize
         exec gunicorn --bind 0.0.0.0:5010 \
              --reload \
              --threads 4 \
              --workers 2 \
              --log-level DEBUG \
              --timeout 3600 \
+             --config 'python:swh.core.api.gunicorn_config' \
              'swh.search.api.server:make_app_from_configfile()'
         ;;
 esac
diff --git a/docker/services/swh-storage-replayer/entrypoint.sh b/docker/services/swh-storage-replayer/entrypoint.sh
index 382d52c..105e9c0 100755
--- a/docker/services/swh-storage-replayer/entrypoint.sh
+++ b/docker/services/swh-storage-replayer/entrypoint.sh
@@ -1,28 +1,28 @@
 #!/bin/bash
 
 set -e
 
 source /srv/softwareheritage/utils/pyutils.sh
 setup_pip
 
 source /srv/softwareheritage/utils/pgsql.sh
 setup_pgsql
 
 case "$1" in
     "shell")
         exec bash -i
         ;;
     *)
         wait_pgsql
 
         echo Setup the database
         PGPASSWORD=${POSTGRES_PASSWORD} swh db-init \
             --db-name ${POSTGRES_DB} storage
 
         echo Starting the swh-storage Kafka storage replayer
-        exec swh-journal replay \
+        exec swh journal replay \
             --broker kafka \
             --prefix swh.journal.objects \
             --consumer-id swh.storage.replica
         ;;
 esac
diff --git a/docker/services/swh-storage/entrypoint.sh b/docker/services/swh-storage/entrypoint.sh
index 359880c..5b3f3e4 100755
--- a/docker/services/swh-storage/entrypoint.sh
+++ b/docker/services/swh-storage/entrypoint.sh
@@ -1,45 +1,46 @@
 #!/bin/bash
 
 set -e
 
 source /srv/softwareheritage/utils/pyutils.sh
 setup_pip
 
 if [ "$STORAGE_BACKEND" = "postgresql" ]; then
     source /srv/softwareheritage/utils/pgsql.sh
     setup_pgsql
 elif [ "$STORAGE_BACKEND" = "cassandra" ]; then
     echo Waiting for Cassandra to start
     wait-for-it ${CASSANDRA_SEED}:9042 -s --timeout=0
     echo Creating keyspace
     cat << EOF | python3
 from swh.storage.cassandra import create_keyspace
 create_keyspace(['cassandra-seed'], 'swh')
 EOF
 fi
 
 case "$1" in
     "shell")
         exec bash -i
         ;;
     *)
         if [ "$STORAGE_BACKEND" = "postgresql" ]; then
             wait_pgsql
 
             echo Setup the database
             PGPASSWORD=${POSTGRES_PASSWORD} swh db-init \
                 --db-name ${POSTGRES_DB} storage
         fi
 
         echo Starting the swh-storage API server
         exec gunicorn --bind 0.0.0.0:5002 \
              --reload \
              --threads 4 \
              --workers 2 \
              --log-level DEBUG \
              --timeout 3600 \
+             --config 'python:swh.core.api.gunicorn_config' \
              'swh.storage.api.server:make_app_from_configfile()'
         ;;
 esac
diff --git a/docker/services/swh-web/entrypoint.sh b/docker/services/swh-web/entrypoint.sh
index d33a904..7affc60 100755
--- a/docker/services/swh-web/entrypoint.sh
+++ b/docker/services/swh-web/entrypoint.sh
@@ -1,38 +1,39 @@
 #!/bin/bash
 
 set -e
 
 create_admin_script="
 from django.contrib.auth import get_user_model;
 
 username = 'admin';
 password = 'admin';
 email = 'admin@swh-web.org';
 
 User = get_user_model();
 
 if not User.objects.filter(username = username).exists():
     User.objects.create_superuser(username, email, password);
 "
 
 source /srv/softwareheritage/utils/pyutils.sh
 setup_pip
 
 case "$1" in
     "shell")
         exec bash -i
         ;;
     *)
         echo "Migrating db using ${DJANGO_SETTINGS_MODULE}"
         django-admin migrate --settings=${DJANGO_SETTINGS_MODULE}
         echo "Creating admin user"
         echo "$create_admin_script" | python3 -m swh.web.manage shell
         echo "starting the swh-web server"
         exec gunicorn --bind 0.0.0.0:5004 \
              --threads 2 \
              --workers 2 \
              --timeout 3600 \
+             --config 'python:swh.web.gunicorn_config' \
              'django.core.wsgi:get_wsgi_application()'
 esac
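
Note on the recurring change above: every `--config 'python:<module>'` line relies on gunicorn's ability to load its settings from an importable Python module rather than a file path. The actual contents of `swh.core.api.gunicorn_config` and `swh.web.gunicorn_config` are not shown in this patch; the following is only a hedged sketch of what such a module looks like in general (`GUNICORN_WORKERS` and `GUNICORN_TIMEOUT` are made-up environment variable names for illustration):

```python
# Hypothetical gunicorn configuration module, loadable with
# `gunicorn --config python:<module> ...`. Gunicorn imports the module
# and picks up module-level names that match its settings; functions
# named after server hooks (post_fork, on_starting, ...) are called at
# the corresponding lifecycle points.
import logging
import os

# Plain settings: gunicorn reads these module-level variables.
workers = int(os.environ.get("GUNICORN_WORKERS", "2"))
timeout = int(os.environ.get("GUNICORN_TIMEOUT", "3600"))


def post_fork(server, worker):
    # Gunicorn server hook: runs in each worker process right after fork.
    # A shared config module could use a hook like this for per-worker
    # initialization (e.g. setting up error reporting when a Sentry DSN
    # such as SWH_SENTRY_DSN is present in the environment).
    logging.getLogger(__name__).info("worker forked: %s", worker)
```

Centralizing these settings in one importable module is what lets a single `--config` flag, repeated across all the entrypoints touched by this patch, give every service the same logging and error-reporting behavior.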