
Elastic worker infrastructure
Started, Work in Progress, High, Public

Description

staging first:

  • D7606: Reuse rancher cluster (used for our gitlab in-house experiment)
  • D7600: elastic worker node needs a specific role with docker prepared
  • D7624, P1342: Upgrade proxmox vm template
  • D7625, P1343: Declare new vm template with the zfs dependency ready (so the automation does not require a reboot in the middle)
  • D7607: Register the VMs to the rancher cluster
  • Build more recent image (softwareheritage/loaders:2022-04-27) [3]
  • Push to the softwareheritage Docker Hub registry (no CI just yet)
  • Correctness: the VMs run docker container images of the lister/loader services [4]
  • Properly declare the VMs to run docker images of the lister/loader services
  • Monitor services: install prometheus, grafana [2]
  • Federate the "elastic" prometheus into the main swh prometheus (see the sketch after this list)
  • Push the "elastic" services' logs to the main swh log infrastructure
  • (Optional) Rework puppet manifest to actually run the registration command [1]
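
For the federation item above: the prometheus shipped by rancher-monitoring exposes the standard /federate endpoint, which the main swh prometheus can then scrape with honor_labels: true and a match[] selector. A quick sanity check that the endpoint answers, assuming the service follows the same naming scheme as the grafana one in [2] (rancher-monitoring-prometheus in cattle-monitoring-system; to adjust if it differs):

$ kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-prometheus 9090:9090 &
$ curl -sG 'http://localhost:9090/federate' --data-urlencode 'match[]={job=~".+"}' | head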

End goal:

  • listing and loading happen
  • the resulting logs are pushed to our standard kibana logs infrastructure
  • stats are pushed to our standard grafana

Annex:

  • CI builds the swh docker images (we can reuse the existing ones at first)

[1] It's currently done through proxmox, but later we'll have to do it without proxmox, on bare-metal machines.

[2] https://rancher.euwest.azure.internal.softwareheritage.org/k8s/clusters/c-t85mz/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-grafana:80/proxy/d/rancher-home-1/home?orgId=1&from=1651067146437&to=1651070746437
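
That URL goes through the rancher API proxy; the same grafana service it names (rancher-monitoring-grafana, port 80, namespace cattle-monitoring-system) can also be reached with a plain port-forward when that is more convenient:

$ kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-grafana 3000:80
$ # then browse http://localhost:3000/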

[3] Built out of swh-environment's swh/stack image for now (and simply re-tagged as softwareheritage/loaders)

[4]

$ cd $SWH_ENVIRONMENT_HOME/snippets/sysadmin/T3592-elastic-workers
$ cat loader-pypi.staging.values.yaml
# Default values for worker.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

amqp:
  username: <redacted>
  password: <redacted>
  host: scheduler0.internal.staging.swh.network
  queue_threshold: 10  # spawn worker per increment of `value` messages
  queues:
      - swh.loader.package.pypi.tasks.LoadPyPI

storage:
  host: storage1.internal.staging.swh.network

swh:
  loader:
    image: softwareheritage/loaders
    version: latest
$ helm install -f ./loader-pypi.staging.values.yaml  workers ./worker
NAME: workers
LAST DEPLOYED: Wed Apr 27 18:57:03 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ kubectl get pods -w
NAME                       READY   STATUS    RESTARTS       AGE
loaders-6bf6ddd897-gjh2w   1/1     Running   0              10m
loaders-6bf6ddd897-kkb7s   1/1     Running   0              10m
loaders-6bf6ddd897-lxl6b   1/1     Running   0              10m
loaders-6bf6ddd897-sjl26   1/1     Running   0              10m
loaders-6bf6ddd897-t59p7   1/1     Running   0              10m
...
$ kubectl logs loaders-6bf6ddd897-t59p7 | tail
[2022-04-27 17:05:00,462: INFO/MainProcess] Task swh.loader.package.pypi.tasks.LoadPyPI[a54e5fb2-8bc0-4a42-a586-58b5f8d3ebc1] received
[2022-04-27 17:05:04,767: INFO/MainProcess] sync with celery@loaders-6bf6ddd897-kkb7s
[2022-04-27 17:05:04,773: INFO/MainProcess] sync with celery@loaders-6bf6ddd897-jflck
[2022-04-27 17:05:27,170: INFO/MainProcess] missed heartbeat from celery@loaders-6bf6ddd897-jflck
[2022-04-27 17:06:01,504: INFO/ForkPoolWorker-1] Task swh.loader.package.pypi.tasks.LoadPyPI[b1d7bc2f-1294-47fd-8c8b-00775fe6a990] succeeded in 61.043102499999804s: {'status': 'eventful', 'snapshot_id': '7ca9564774a0fc2bfc2cf1234c8816c5193e33c2'}
[2022-04-27 17:06:01,514: INFO/MainProcess] Task swh.loader.package.pypi.tasks.LoadPyPI[186abb71-8595-4e97-a26c-830fc472a5dc] received
[2022-04-27 17:06:56,132: INFO/ForkPoolWorker-1] Task swh.loader.package.pypi.tasks.LoadPyPI[a54e5fb2-8bc0-4a42-a586-58b5f8d3ebc1] succeeded in 54.61585931099944s: {'status': 'eventful', 'snapshot_id': 'f2eaeb4d4d729bb4dcdb26eadd48e6ead2af5c9b'}
[2022-04-27 17:06:57,748: INFO/MainProcess] Task swh.loader.package.pypi.tasks.LoadPyPI[89e482b0-7d46-477c-9737-ba286dca5f31] received
[2022-04-27 17:10:20,229: INFO/ForkPoolWorker-2] Task swh.loader.package.pypi.tasks.LoadPyPI[186abb71-8595-4e97-a26c-830fc472a5dc] succeeded in 202.43778917999953s: {'status': 'eventful', 'snapshot_id': '52565afbd0c3483cb76782f1d57303f6c02e52ed'}
[2022-04-27 17:10:20,242: INFO/MainProcess] Task swh.loader.package.pypi.tasks.LoadPyPI[6418b3f5-f8e3-41f2-91b8-a19d0220b746] received
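
Not shown in the transcript above: inspecting and adjusting the deployment afterwards. A few commands that may help (the deployment name "loaders" is guessed from the pod names above, and the replica count is arbitrary):

$ helm get values workers                          # values currently applied to the release
$ kubectl scale deployment loaders --replicas=8    # bump the number of loader pods by hand
$ helm uninstall workers                           # tear the release down when done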

Related Objects

Status             Assigned   Task
Work in Progress   ardumont
Resolved           ardumont

Event Timeline

ardumont triaged this task as Normal priority. Apr 14 2022, 3:51 PM
ardumont created this task.
ardumont raised the priority of this task from Normal to High. Apr 14 2022, 6:02 PM
ardumont updated the task description.
ardumont changed the task status from Open to Work in Progress. Apr 19 2022, 3:51 PM
ardumont moved this task from Backlog to in-progress on the System administration board.
ardumont updated the task description.

I concur with @vsellier [1].

Reproduced it to make sure my machine is able to communicate with the rancher cluster as well.

$ ./test-inter-node-network.sh
Error from server (AlreadyExists): error when creating "overlay-test.yaml": daemonsets.apps "overlaytest" already exists
=> Start network overlay test
elastic-worker1 can reach elastic-worker1
elastic-worker1 can reach elastic-worker0
elastic-worker1 can reach elastic-worker2
elastic-worker0 can reach elastic-worker1
elastic-worker0 can reach elastic-worker0
elastic-worker0 can reach elastic-worker2
elastic-worker2 can reach elastic-worker1
elastic-worker2 can reach elastic-worker0
elastic-worker2 can reach elastic-worker2

Source: P1350

[1] T3592#83795
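
P1350 is not copied here; the gist of such an overlay test is a daemonset (one test pod per node) plus a loop that pings every pod IP from every pod. A rough sketch of the idea, assuming the daemonset declared in overlay-test.yaml carries a name=overlaytest label (the file and daemonset names are taken from the output above, the label is a guess):

$ kubectl apply -f overlay-test.yaml    # daemonset "overlaytest": one test pod per node
$ kubectl get pods -l name=overlaytest \
    -o 'jsonpath={range .items[*]}{.metadata.name}{" "}{.status.podIP}{" "}{.spec.nodeName}{"\n"}{end}' \
    > /tmp/overlay-pods
$ while read pod ip node; do
    while read frompod fromip fromnode; do
      kubectl exec "$frompod" -- ping -c 1 "$ip" >/dev/null 2>&1 </dev/null \
        && echo "$fromnode can reach $node" \
        || echo "$fromnode cannot reach $node"
    done < /tmp/overlay-pods
  done < /tmp/overlay-pods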

@vsellier here is a quick summary of:

  • how I currently build and push the container image (a sketch for pinning the helm releases to the dated tag follows at the end of this comment):
$ cd $SWH_ENVIRONMENT_HOME/swh-environment/docker
$ swh-doco-rebuild --no-cache
+ DOCKER_CMD=/nix/store/5a5i251w81licm57ikbiysc9fcpw391s-docker-20.10.14/bin/docker
+ cd /home/tony/work/inria/repo/swh/swh-environment/docker
+ /nix/store/5a5i251w81licm57ikbiysc9fcpw391s-docker-20.10.14/bin/docker build -f Dockerfile -t swh/stack --no-cache .
Sending build context to Docker daemon     99MB
...
Successfully built 73acae6847b4
Successfully tagged swh/stack:latest
$ docker tag swh/stack:latest softwareheritage/loaders:2022-04-29
$ docker tag swh/stack:latest softwareheritage/loaders:latest
$ docker login
$ docker push softwareheritage/loaders:2022-04-29
$ docker push softwareheritage/loaders:latest
$ docker run -it softwareheritage/loaders:2022-04-29 pip list | grep swh.loader.
swh.loader.bzr        1.3.1
swh.loader.core       3.3.0
swh.loader.cvs        0.2.2
swh.loader.git        1.6.0
swh.loader.mercurial  3.1.1
swh.loader.metadata   0.0.2
swh.loader.svn        1.3.2
  • What's running:
$ pwd
$SWH_ENVIRONMENT_HOME/snippets/sysadmin/T3592-elastic-workers/worker
$ export KUBECONFIG=~/.config/swh/staging-workers.yml  # find it in rancher cluster instance
$ for TYPE in bzr cvs git maven pypi npm svn; do REL=workers-$TYPE; helm install -f ./loader-$TYPE.staging.values.yaml $REL ./; done
...
$ helm list
NAME            NAMESPACE       REVISION        UPDATED                                         STATUS          CHART           APP VERSION
workers-bzr     default         2               2022-04-29 15:32:29.955838425 +0200 CEST        deployed        worker-0.1.0    1.16.0
workers-cvs     default         2               2022-04-29 15:32:36.201907985 +0200 CEST        deployed        worker-0.1.0    1.16.0
workers-git     default         5               2022-04-29 15:32:41.997367337 +0200 CEST        deployed        worker-0.1.0    1.16.0
workers-maven   default         1               2022-04-29 15:35:59.969255742 +0200 CEST        deployed        worker-0.1.0    1.16.0
workers-npm     default         2               2022-04-29 15:32:58.461646407 +0200 CEST        deployed        worker-0.1.0    1.16.0
workers-pypi    default         2               2022-04-29 15:32:50.474776347 +0200 CEST        deployed        worker-0.1.0    1.16.0
workers-svn     default         2               2022-04-29 15:33:09.9233799 +0200 CEST          deployed        worker-0.1.0    1.16.0
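
With the dated tag pushed above, the releases can be pinned to it instead of latest by overriding the image version at upgrade time; a sketch, assuming the chart reads the swh.loader.version value shown in the values file from [4]:

$ for TYPE in bzr cvs git maven pypi npm svn; do REL=workers-$TYPE; helm upgrade -f ./loader-$TYPE.staging.values.yaml --set swh.loader.version=2022-04-29 $REL ./; done
$ helm list   # the REVISION column should increment for each release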