Oct 12 2021
Oct 11 2021
solved by T3487#71814
Oct 8 2021
Sure, we should have authentication / rate limiting on this.
But if I'm not mistaken, the target is to test the mirroring with ENEA.
If we add authentication, we need to improve objstorage-replayer / objstorage to support it.
rSENV646f62805ef564bceed4d3a4d84d8fb6890f2d19 declares the new certificate for the vagrant tests (wrong task referenced in the commit message)
Oct 6 2021
update commit message
- factorize the exported configuration
- use the right exporter port on met
rebase
I think the issue can be closed.
The pros are:
- it simplifies the cluster management (creation, configuration and, most of all, kubernetes upgrades)
- it centralizes the global view of the cluster and what is running on it
- OSS and transparent policy
The proposed plan looks too naive, as some zookeeper configuration also needs to be updated.
The recommended way is to add a new node to the cluster, migrate the partitions to the new node and shut down the old one (sketched below).
Proceeding this way ensures all the data and configuration get migrated correctly and without downtime, which is not negligible as the ENEA mirror tests are in progress.
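For the record, a minimal sketch of such a partition move with Kafka's stock tooling; the new broker id (4), the topic listed in topics.json and the bootstrap server are assumptions for illustration, not the actual plan:

# list the topics to move (one journal topic here, as an example)
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "swh.journal.objects.origin"}]}
EOF
# generate a reassignment plan targeting the new broker (hypothetical id 4)
/opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 \
    --topics-to-move-json-file topics.json --broker-list "4" --generate
# save the proposed plan as reassignment.json, then apply it and watch it complete
/opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 \
    --reassignment-json-file reassignment.json --execute
/opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 \
    --reassignment-json-file reassignment.json --verify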
The loaders were finally stabilized. The problem was a wrong celery configuration.
Changing the pool from solo to prefork solved it, even though the concurrency is kept at one.
Solo looked appropriate for an environment like the POC but, in hindsight, it was not working as expected.
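The change boils down to the worker invocation below (a minimal sketch; the celery application path is an assumption):

# solo runs the task inline in the main process, which blocks heartbeats and
# remote control for the whole duration of a long-running load:
celery -A swh.scheduler.celery_backend.config worker --pool=solo

# prefork hands the task to a forked child, keeping the main process
# responsive even with a single child:
celery -A swh.scheduler.celery_backend.config worker --pool=prefork --concurrency=1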
Oct 5 2021
Credentials added to the credentials database under the refs:
- operations/kafka/credentials/staging/swh-enea
- operations/kafka/credentials/production/swh-enea
Production credentials created:
+ export zookeeper_servers=kafka1.internal.softwareheritage.org:2181
+ zookeeper_servers=kafka1.internal.softwareheritage.org:2181
+ export bootstrap_servers=kafka1.internal.softwareheritage.org:9092
+ bootstrap_servers=kafka1.internal.softwareheritage.org:9092
+ '[' -z swh-enea -o -z redacted ']'
+ set -eu
+ /opt/kafka/bin/kafka-configs.sh --zookeeper kafka1.internal.softwareheritage.org:2181/kafka/softwareheritage --alter --add-config 'SCRAM-SHA-256=[iterations=8192,password=redacted],SCRAM-SHA-512=[password=redacted]' --entity-type users --entity-name swh-enea
Warning: --zookeeper is deprecated and will be removed in a future version of Kafka. Use --bootstrap-server instead to specify a broker to connect to.
Completed updating config for entity: user-principal 'swh-enea'.
+ /opt/kafka/bin/kafka-acls.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --add --resource-pattern-type PREFIXED --topic swh.journal.objects. --allow-principal User:swh-enea --operation READ
Adding ACLs for resource `ResourcePattern(resourceType=TOPIC, name=swh.journal.objects., patternType=PREFIXED)`: (principal=User:swh-enea, host=*, operation=READ, permissionType=ALLOW)
Credentials created in staging:
ACLs for principal `User:swh-enea`
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=swh.journal.objects., patternType=PREFIXED)`:
  (principal=User:swh-enea, host=*, operation=DESCRIBE, permissionType=ALLOW)
  (principal=User:swh-enea, host=*, operation=READ, permissionType=ALLOW)
export username=swh-enea
export password=XXXXX
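A quick way to check the credentials from the consumer side; a sketch only: the SASL endpoint/port and the exact topic name are assumptions, and the password placeholder is kept as-is:

# client config with the SCRAM credentials above
cat > client.properties <<EOF
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="${username}" password="${password}";
EOF
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka1.internal.softwareheritage.org:9093 \
    --consumer.config client.properties --topic swh.journal.objects.origin --from-beginning --max-messages 1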
Actions to perform for the migration:
- add the role::swh_kafka_broker role to the new server in puppet and deploy
- stop the staging workers
- stop kafka on journal0
- rsync the content from journal0 to the new server (see the sketch after this list)
- update the staging configurations to use the new server as journal
- update the public NAT rule for the staging journal on the firewall (https://192.168.50.1/firewall_nat.php)
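A rough sketch of the stop-and-copy step; the kafka data path and the new host name are assumptions:

# on journal0, once the staging workers are stopped
systemctl stop kafka
# copy the kafka data verbatim, preserving ownership, ACLs and xattrs
rsync -aHAX --delete /srv/kafka/ journal1.internal.staging.swh.network:/srv/kafka/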
LGTM thanks
Oct 4 2021
keda looks promising. P1193 is an example of a configuration working for the docker environment. It's able to scale to 0 when no messages are present on the queue.
When messages are present, loaders are launched progressively until the cpu/memory limit of the host is reached or the maximum number of allowed workers is reached.
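For reference, a minimal sketch of what such a ScaledObject looks like (not P1193 itself; the deployment name, queue name and thresholds are assumptions):

kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: loaders
spec:
  scaleTargetRef:
    name: loaders            # deployment running the celery loaders
  minReplicaCount: 0         # scale to zero when the queue is empty
  maxReplicaCount: 10        # hard cap on the number of workers
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: swh.loader.git.tasks
        mode: QueueLength
        value: "10"          # target messages per replica
        hostFromEnv: RABBITMQ_URL
EOF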
Oct 1 2021
Intermediate status:
- the bench lab is easily deployable on g5k on several workers to distribute the load [1]
- it works well as long as the load stays moderate. When the number of workers is increased, the workers seem to have trouble talking to rabbitmq:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-p9ds5
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-n6pvm
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-mrcjj
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-7bn4s
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-lg2bd
and also an unexplained time drift:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-lxjpl may mean clocks are out of sync. Current drift is
[loaders-77cdd444df-flcv9 loaders] 356 seconds. [orig: 2021-09-30 23:46:55.447181 recv: 2021-09-30 23:40:59.633444]
[loaders-77cdd444df-flcv9 loaders]
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-jd6v9 may mean clocks are out of sync. Current drift is
[loaders-77cdd444df-flcv9 loaders] 355 seconds. [orig: 2021-09-30 23:46:55.447552 recv: 2021-09-30 23:41:00.723983]
[loaders-77cdd444df-flcv9 loaders]
Intermediate status:
- We have successfully run loaders in staging using the helm chart we wrote [1] and a hardcoded number of workers. It also makes things like rolling upgrades possible.
- We have tried the integrated horizontal pod autoscaler [2]; it works pretty well but it's not suited to our worker scenario. It decides whether the number of running pods must be scaled up or down based on the pods' cpu consumption (in our test [3]; other metrics are possible). That is very useful for classical load such as a gunicorn container, but not for long running tasks (see the one-liner after this list).
- Kubernetes also has some functionality to reduce the pressure on a node when some limits are reached, but it looks more like emergency action than proper scaling management. It's configured at the kubelet level and not dynamic at all [4]. We tested it briefly but lost the node to OOM before node eviction started.
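The cpu-based autoscaler from [2] boils down to something like this (deployment name and thresholds are assumptions):

# scale the loaders deployment between 1 and 10 replicas, targeting 75% average cpu
kubectl autoscale deployment loaders --cpu-percent=75 --min=1 --max=10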
Sep 30 2021
cc @seirl
Sep 29 2021
After a hard time, we have solved several issues:
- The rancher initialization problem came from using a version of k3s incompatible with rancher.
We had installed rancher 2.5.9 on a recent version of k3s shipping kubernetes 1.22.2. Following rancher's compatibility matrix [1], switching to an older version of k3s solved the problem and the clusters start correctly after that.
Sep 28 2021
The zfs pool and dataset are configured:
- pool configuration
## nvme drives pool
# zpool create data mirror nvme-eui.36315030525005540025384500000003 nvme-eui.36315030525005800025384500000003 mirror nvme-eui.36315030525005620025384500000003 nvme-eui.36315030525005890025384500000003
I forgot to mention there is a gift from dell on the server: an additional 600GB 10krpm disk
The server is installed. A few tasks remain to be performed manually:
- configure the zfs datasets (I will configure 2 mirror vdevs for ~12TB available; tell me if that's not what's expected)
- build a few missing packages for bullseye (related to monitoring: prometheus-rabbitmq-exporter, prometheus-statsd-exporter, journalbeat)
- configure a rabbitmq admin user (see the sketch after this list)
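For the dataset and admin user bits, something along these lines; the dataset names, properties and the admin user name are assumptions, and the real values belong in puppet and the credentials store:

# datasets on the pool created earlier
zfs create -o compression=lz4 -o atime=off data/postgresql
zfs create -o compression=lz4 data/rabbitmq
# rabbitmq admin user
rabbitmqctl add_user admin "${ADMIN_PASSWORD}"
rabbitmqctl set_user_tags admin administrator
rabbitmqctl set_permissions -p / admin '.*' '.*' '.*'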
Sep 27 2021
Yes, pgbouncer will be used, and it's configured by default for 2000 parallel connections.
I don't know what kind of load the provenance client will generate, but the default 100 connections allowed by postgres will probably be too low and will need to be increased too.
As discussed with @aeviso, we will install the following components on the server (the OS will be debian 11):
- rabbitmq
- postgresql:13
- a default swh-storage database will be managed by puppet
- 1000 parallel connections allowed
- shared_buffers 50GB (see the snippet after this list for both postgres settings)
- docker
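A sketch of the two postgres settings above, assuming they are applied via ALTER SYSTEM rather than through the puppet manifests:

psql -U postgres -c "ALTER SYSTEM SET max_connections = 1000;"
psql -U postgres -c "ALTER SYSTEM SET shared_buffers = '50GB';"
# both settings only take effect after a restart
systemctl restart postgresql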
Sep 24 2021
The puppet code looks ok to me
Sep 23 2021
Interesting documentation on how to manage jobs:
- the different job patterns: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns
- Using controllers to manage the workload: https://kubernetes.io/docs/concepts/architecture/controller/
Sep 22 2021
- Remove useless trailing '/'
- define what VPN / private mean
Sep 21 2021
RabbitMq GUI -> RabbitMQ