
Oct 12 2021

vsellier committed rDDOCd6e02eecf52a: sysadm: add network architecture section (authored by vsellier).
sysadm: add network architecture section
Oct 12 2021, 12:39 PM

Oct 11 2021

vsellier closed T3616: Create a prometheus-statsd-exporter package for bullseye, a subtask of T3487: Installation of the new provenance server, as Resolved.
Oct 11 2021, 1:23 PM · System administration
vsellier closed T3616: Create a prometheus-statsd-exporter package for bullseye as Resolved.

solved by T3487#71814

Oct 11 2021, 1:23 PM · System administration
vsellier closed T3617: Create a journalbeat package for bulleye, a subtask of T3487: Installation of the new provenance server, as Resolved.
Oct 11 2021, 1:23 PM · System administration
vsellier closed T3617: Create a journalbeat package for bulleye as Resolved.

solved by T3487#71814

Oct 11 2021, 1:23 PM · System administration

Oct 8 2021

vsellier updated subscribers of T3621: Create a production read-only objstorage.

Sure, we should have authentication / rate limit on this.
But if I'm not wrong, the target is to test the mirroring with ENEA.
If we add authentication, we need to improve the objstorage-replayer / objstorage to support it.

Oct 8 2021, 6:00 PM · System administration
vsellier added a revision to T3621: Create a production read-only objstorage: D6448: Deploy a read-only objstorage on moma.
Oct 8 2021, 5:14 PM · System administration
vsellier requested review of D6448: Deploy a read-only objstorage on moma.
Oct 8 2021, 5:14 PM
vsellier added a comment to T3621: Create a production read-only objstorage.

rSENV646f62805ef564bceed4d3a4d84d8fb6890f2d19 declares the new certificate for the vagrant tests (the commit message references the wrong task)

Oct 8 2021, 2:21 PM · System administration
vsellier committed rSENV646f62805ef5: Add read-only storage self-signed certificate (authored by vsellier).
Add read-only storage self-signed certificate
Oct 8 2021, 2:20 PM

Oct 6 2021

vsellier closed T3615: Adapt rabbitmq monitoring for bullseye, a subtask of T3487: Installation of the new provenance server, as Resolved.
Oct 6 2021, 6:19 PM · System administration
vsellier closed T3615: Adapt rabbitmq monitoring for bullseye as Resolved.
Oct 6 2021, 6:19 PM · System administration
vsellier closed D6367: Adapt the prometheus rabbitmq plugin for bullseye.
Oct 6 2021, 6:13 PM · System administration
vsellier committed rSPSITE6277c45abc11: Adapt the prometheus rabbitmq plugin for bullseye (authored by ardumont).
Adapt the prometheus rabbitmq plugin for bullseye
Oct 6 2021, 6:13 PM
vsellier updated the diff for D6367: Adapt the prometheus rabbitmq plugin for bullseye.

update commit message

Oct 6 2021, 6:11 PM · System administration
vsellier changed the status of T3621: Create a production read-only objstorage from Open to Work in Progress.
Oct 6 2021, 6:02 PM · System administration
vsellier retitled D6367: Adapt the prometheus rabbitmq plugin for bullseye from wip: Adapt the prometheus rabbitmq plugin for bullseye to Adapt the prometheus rabbitmq plugin for bullseye.
Oct 6 2021, 5:48 PM · System administration
vsellier updated the diff for D6367: Adapt the prometheus rabbitmq plugin for bullseye.
  • factorize the exported configuration
  • use the right exporter port on met
Oct 6 2021, 5:45 PM · System administration
vsellier updated the diff for D6367: Adapt the prometheus rabbitmq plugin for bullseye.

rebase

Oct 6 2021, 4:14 PM · System administration
vsellier commandeered D6367: Adapt the prometheus rabbitmq plugin for bullseye.
Oct 6 2021, 4:13 PM · System administration
vsellier closed T3320: Test rancher pros/cons as Resolved.

I think the issue can be closed.
The pros are:

  • it simplifies cluster management (creation, configuration and, most of all, kubernetes upgrades)
  • it centralizes the global view of the cluster and of what is running on it
  • it is OSS with a transparent policy
Oct 6 2021, 2:36 PM · System administration
vsellier added a comment to T3630: staging - journal0 needs more space.

The proposed plan looks too naive, as some zk configuration also needs to be updated.
A recommended way is to add a new node to the cluster, migrate the partitions to the new node, and shut down the old one.
Doing it this way ensures all the data and configuration are correctly migrated, and without downtime, which is not negligible as the ENEA mirror tests are in progress.
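
The node replacement described above relies on Kafka's partition reassignment. Below is a minimal sketch of building such a plan in the JSON layout read by `kafka-reassign-partitions.sh --reassignment-json-file`. The broker ids, the topic name and the partition layout are hypothetical and are not taken from the staging cluster.

```python
import json


def build_reassignment(topics, old_broker, new_broker):
    """Return a reassignment plan that swaps old_broker for new_broker.

    `topics` maps a topic name to {partition number: [replica broker ids]}.
    """
    partitions = []
    for topic, assignment in sorted(topics.items()):
        for partition, replicas in sorted(assignment.items()):
            if old_broker not in replicas:
                continue  # nothing to move for this partition
            moved = [new_broker if b == old_broker else b for b in replicas]
            partitions.append(
                {"topic": topic, "partition": partition, "replicas": moved}
            )
    return {"version": 1, "partitions": partitions}


# Hypothetical layout: every partition currently lives on broker 1.
current = {"swh.journal.objects.example": {0: [1], 1: [1]}}
plan = build_reassignment(current, old_broker=1, new_broker=2)
print(json.dumps(plan, indent=2))
```

Once the reassignment has completed and the old broker no longer holds any replica, it can be stopped without interrupting consumers.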

Oct 6 2021, 12:46 PM · System administration
vsellier committed rDSNIP0b2c9ffb779e: grid5000/cassandra: increase objstorage capacity (authored by vsellier).
grid5000/cassandra: increase objstorage capacity
Oct 6 2021, 2:46 AM
vsellier committed rDSNIP2a7a2efac087: grid5000/cassandra: improve git loader benchmark stability (authored by vsellier).
grid5000/cassandra: improve git loader benchmark stability
Oct 6 2021, 2:21 AM
vsellier added a comment to T3577: Parallel loaders performances .

The loaders were finally stabilized. The instability was due to a wrong celery configuration.
Changing the pool configuration from solo to prefork solved the problem, even though the concurrency is kept at one.
Solo looked appropriate for an environment like the POC but, for obvious reasons, it was not working as expected:

Oct 6 2021, 2:11 AM · System administration, Storage manager

Oct 5 2021

vsellier closed T3633: staging/production - Kafka access for ENEA mirror as Resolved.

credentials added to the credential database under the refs:

  • operations/kafka/credentials/staging/swh-enea
  • operations/kafka/credentials/production/swh-enea
Oct 5 2021, 4:51 PM · System administration
vsellier added a comment to T3633: staging/production - Kafka access for ENEA mirror.

Production credentials created:

+ export zookeeper_servers=kafka1.internal.softwareheritage.org:2181
+ zookeeper_servers=kafka1.internal.softwareheritage.org:2181
+ export bootstrap_servers=kafka1.internal.softwareheritage.org:9092
+ bootstrap_servers=kafka1.internal.softwareheritage.org:9092
+ '[' -z swh-enea -o -z redacted ']'
+ set -eu
+ /opt/kafka/bin/kafka-configs.sh --zookeeper kafka1.internal.softwareheritage.org:2181/kafka/softwareheritage --alter --add-config 'SCRAM-SHA-256=[iterations=8192,password=redacted],SCRAM-SHA-512=[password=redacted]' --entity-type users --entity-name swh-enea
Warning: --zookeeper is deprecated and will be removed in a future version of Kafka.
Use --bootstrap-server instead to specify a broker to connect to.
Completed updating config for entity: user-principal 'swh-enea'.
+ /opt/kafka/bin/kafka-acls.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --add --resource-pattern-type PREFIXED --topic swh.journal.objects. --allow-principal User:swh-enea --operation READ
Adding ACLs for resource `ResourcePattern(resourceType=TOPIC, name=swh.journal.objects., patternType=PREFIXED)`: 
 	(principal=User:swh-enea, host=*, operation=READ, permissionType=ALLOW)
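
For readers who want to replay the trace above, here is a minimal sketch that rebuilds the two command lines as argument lists. It only uses the flags visible in the trace. The function name and its parameters are hypothetical, and the password is a placeholder.

```python
import shlex

KAFKA_BIN = "/opt/kafka/bin"


def credential_commands(username, password, zookeeper, bootstrap_server,
                        topic_prefix="swh.journal.objects."):
    """Build the two command lines seen in the trace as argv lists."""
    # Same guard as the `[ -z ... -o -z ... ]` check in the trace.
    if not username or not password:
        raise ValueError("username and password must both be set")
    scram = (
        f"SCRAM-SHA-256=[iterations=8192,password={password}],"
        f"SCRAM-SHA-512=[password={password}]"
    )
    create_user = [
        f"{KAFKA_BIN}/kafka-configs.sh", "--zookeeper", zookeeper,
        "--alter", "--add-config", scram,
        "--entity-type", "users", "--entity-name", username,
    ]
    grant_read = [
        f"{KAFKA_BIN}/kafka-acls.sh", "--bootstrap-server", bootstrap_server,
        "--add", "--resource-pattern-type", "PREFIXED",
        "--topic", topic_prefix,
        "--allow-principal", f"User:{username}", "--operation", "READ",
    ]
    return create_user, grant_read


create_user, grant_read = credential_commands(
    "swh-enea", "redacted",
    zookeeper="kafka1.internal.softwareheritage.org:2181/kafka/softwareheritage",
    bootstrap_server="kafka1.internal.softwareheritage.org:9092",
)
for argv in (create_user, grant_read):
    print(shlex.join(argv))
```

Keeping the commands as lists avoids quoting problems with the brackets in the SCRAM configuration string when they are passed to `subprocess.run`.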
Oct 5 2021, 4:49 PM · System administration
vsellier renamed T3633: staging/production - Kafka access for ENEA mirror from staging - Kafka access for ENEA mirror to staging/production - Kafka access for ENEA mirror.
Oct 5 2021, 4:24 PM · System administration
vsellier added a comment to T3633: staging/production - Kafka access for ENEA mirror.

Credentials created in staging:

ACLs for principal `User:swh-enea`
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=swh.journal.objects., patternType=PREFIXED)`: 
 	(principal=User:swh-enea, host=*, operation=DESCRIBE, permissionType=ALLOW)
	(principal=User:swh-enea, host=*, operation=READ, permissionType=ALLOW)
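
A PREFIXED resource pattern covers every topic whose name starts with the given string. The sketch below is a simplified model of that matching for the two rules listed above. It ignores deny rules, host filters and implied operations, and the topic names in the usage lines are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixedAcl:
    principal: str
    prefix: str
    operation: str


# The two allow rules shown in the listing above.
ACLS = [
    PrefixedAcl("User:swh-enea", "swh.journal.objects.", "DESCRIBE"),
    PrefixedAcl("User:swh-enea", "swh.journal.objects.", "READ"),
]


def is_allowed(principal, topic, operation, acls=ACLS):
    """True if some PREFIXED allow rule covers (principal, topic, operation)."""
    return any(
        acl.principal == principal
        and acl.operation == operation
        and topic.startswith(acl.prefix)
        for acl in acls
    )


# Illustrative topic names, not taken from the cluster.
print(is_allowed("User:swh-enea", "swh.journal.objects.example", "READ"))
print(is_allowed("User:swh-enea", "swh.journal.other.example", "READ"))
```

Note that the trailing dot in the prefix matters: a topic named `swh.journal.objects_something` would not be covered.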
Oct 5 2021, 3:50 PM · System administration
vsellier added a comment to T3633: staging/production - Kafka access for ENEA mirror.
export username=swh-enea
export password=XXXXX
Oct 5 2021, 3:49 PM · System administration
vsellier updated subscribers of T3633: staging/production - Kafka access for ENEA mirror.
Oct 5 2021, 3:28 PM · System administration
vsellier moved T3633: staging/production - Kafka access for ENEA mirror from Backlog to in-progress on the System administration board.
Oct 5 2021, 3:28 PM · System administration
vsellier changed the status of T3633: staging/production - Kafka access for ENEA mirror from Open to Work in Progress.
Oct 5 2021, 3:28 PM · System administration
vsellier added a comment to T3630: staging - journal0 needs more space.

Actions to perform for the migration:

  • add the role::swh_kafka_broker role to the new server in puppet and deploy
  • stop the staging workers
  • stop kafka on journal0
  • rsync the content from journal0 to the new server
  • update the staging configurations to use the new server as journal
  • update the public NAT rule for the staging journal on the firewall (https://192.168.50.1/firewall_nat.php)
Oct 5 2021, 11:36 AM · System administration
vsellier triaged T3630: staging - journal0 needs more space as High priority.
Oct 5 2021, 11:28 AM · System administration
vsellier accepted D6407: Adapt logrotate configuration so extra directory is also logrotated.

LGTM thanks

Oct 5 2021, 11:26 AM
vsellier added inline comments to D6407: Adapt logrotate configuration so extra directory is also logrotated.
Oct 5 2021, 11:11 AM
vsellier closed D6406: provenance: Configure the postgresql max_connections.
Oct 5 2021, 10:56 AM
vsellier committed rSPSITEefb36f766516: provenance: Configure the postgresql max_connections (authored by vsellier).
provenance: Configure the postgresql max_connections
Oct 5 2021, 10:56 AM

Oct 4 2021

vsellier added a revision to T3487: Installation of the new provenance server: D6406: provenance: Configure the postgresql max_connections.
Oct 4 2021, 5:59 PM · System administration
vsellier requested review of D6406: provenance: Configure the postgresql max_connections.
Oct 4 2021, 5:59 PM
vsellier added a comment to T3592: POC elastic worker infrastructure.

keda looks promising. P1193 is an example of a configuration that works for the docker environment. It is able to scale to 0 when no messages are present in the queue.
When messages are present, the loaders are launched progressively until the cpu/memory limit of the host or the maximum number of allowed workers is reached.
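
The behaviour described above can be pictured with a small model. This is a simplified sketch and not KEDA's actual algorithm or the content of P1193. The function name and its parameters are hypothetical, and the host cpu/memory limit is folded into `max_workers`.

```python
import math


def desired_workers(queue_length, messages_per_worker, max_workers):
    """Queue-driven scaling: no worker on an empty queue, then one worker
    per `messages_per_worker` pending messages, capped at `max_workers`."""
    if queue_length <= 0:
        return 0
    wanted = math.ceil(queue_length / messages_per_worker)
    return min(wanted, max_workers)


for pending in (0, 1, 25, 1000):
    print(pending, desired_workers(pending, messages_per_worker=10, max_workers=5))
```

Driving the count from the queue length rather than from cpu usage is what allows the pool to drop to zero when there is nothing to load.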

Oct 4 2021, 9:21 AM · System administration
vsellier created P1193 keda configuration for docker environment.
Oct 4 2021, 9:19 AM

Oct 1 2021

vsellier added a comment to T3577: Parallel loaders performances .

intermediary status:

  • the bench lab is easily deployable on g5k on several workers to distribute the load [1]
  • it works well when the load is not too high. When the number of workers is increased, the workers seem to have trouble talking to rabbitmq:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-p9ds5                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-n6pvm                    
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-mrcjj                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-7bn4s                                                                                       
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-lg2bd

and also an unexplained time drift:

[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-lxjpl may mean clocks are out of sync.  Current drift is 
[loaders-77cdd444df-flcv9 loaders] 356 seconds.  [orig: 2021-09-30 23:46:55.447181 recv: 2021-09-30 23:40:59.633444]                                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders]                                                                                                                                                                            
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-jd6v9 may mean clocks are out of sync.  Current drift is                                                                                              
[loaders-77cdd444df-flcv9 loaders] 355 seconds.  [orig: 2021-09-30 23:46:55.447552 recv: 2021-09-30 23:41:00.723983]                                  
[loaders-77cdd444df-flcv9 loaders]
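
The drift values in the warnings can be recomputed from the two timestamps each one prints. The sketch below truncates both timestamps to whole seconds before subtracting. That truncation is an assumption, chosen because it reproduces the logged figures.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def drift_seconds(orig, recv):
    """Whole-second gap between the two timestamps of a drift warning."""
    a = datetime.strptime(orig, FMT).replace(microsecond=0)
    b = datetime.strptime(recv, FMT).replace(microsecond=0)
    return abs(int((a - b).total_seconds()))


# Timestamps copied from the two warnings above.
print(drift_seconds("2021-09-30 23:46:55.447181", "2021-09-30 23:40:59.633444"))  # 356
print(drift_seconds("2021-09-30 23:46:55.447552", "2021-09-30 23:41:00.723983"))  # 355
```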
Oct 1 2021, 5:07 PM · System administration, Storage manager
vsellier committed rDSNIP7cc495e333e2: grid5000/cassandra: kubernetes configuration for massive parallel loader test (authored by vsellier).
grid5000/cassandra: kubernetes configuration for massive parallel loader test
Oct 1 2021, 4:37 PM
vsellier added a comment to T3592: POC elastic worker infrastructure.

Intermediary status:

  • We have successfully run loaders in staging using the helm chart we wrote [1] with a hardcoded number of workers. It adds the possibility to perform rolling upgrades, for example
  • We have tried the integrated horizontal pod autoscaler [2]. It works pretty well but is not suited to our worker scenario. It relies on the cpu consumption of the pod (in our test [3], though other metrics can be used) to decide whether the number of running pods must be scaled up or down. It can be very useful to manage classical load, such as a gunicorn container, but not for the scenario of long-running tasks
  • Kubernetes also has some functionality to reduce the pressure on a node when some limits are reached, but it looks more like emergency action than proper scaling management. It is configured at the kubelet level and is not dynamic at all [4]. It was quickly tested, but we lost the node to an OOM before the node eviction started.
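
To illustrate why a cpu-driven autoscaler fits long-running tasks poorly, here is a minimal sketch of the ratio rule given in the Kubernetes documentation for the horizontal pod autoscaler (desired = ceil(current replicas × current metric / target metric)). The tolerance band and stabilization windows of the real controller are left out, and the numbers are made up.

```python
import math


def hpa_desired_replicas(current_replicas, current_cpu, target_cpu,
                         min_replicas=1, max_replicas=10):
    """Replica count from the ratio rule, clamped to the allowed range."""
    ratio = current_cpu / target_cpu
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(desired, max_replicas))


# Busy pods: average cpu at 90% for a 50% target, so the pool grows.
print(hpa_desired_replicas(4, current_cpu=90, target_cpu=50))
# Pods with low cpu usage: the pool shrinks, whatever the pods are doing.
print(hpa_desired_replicas(4, current_cpu=10, target_cpu=50))
```

In the second call the pool is cut down purely because cpu usage is low, with no knowledge of whether a loader is in the middle of a long task or how many messages are still queued.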
Oct 1 2021, 4:18 PM · System administration

Sep 30 2021

vsellier updated subscribers of T3624: Update swh-graph from 0.3.0 to 0.5.0 on granet.

cc @seirl

Sep 30 2021, 5:10 PM · Compressed graph service, System administration
vsellier updated the task description for T3487: Installation of the new provenance server.
Sep 30 2021, 2:27 PM · System administration
vsellier closed D6378: provenance: Declare 10 pre-provisioned databases for the different experiments.
Sep 30 2021, 2:23 PM
vsellier committed rSPSITEe42b581fc789: provenance: Declare 10 pre-provisioned databases for the different experiments (authored by vsellier).
provenance: Declare 10 pre-provisioned databases for the different experiments
Sep 30 2021, 2:23 PM
vsellier updated the task description for T3487: Installation of the new provenance server.
Sep 30 2021, 12:50 PM · System administration
vsellier requested review of D6378: provenance: Declare 10 pre-provisioned databases for the different experiments.
Sep 30 2021, 12:50 PM
vsellier added a revision to T3487: Installation of the new provenance server: D6378: provenance: Declare 10 pre-provisioned databases for the different experiments.
Sep 30 2021, 12:50 PM · System administration

Sep 29 2021

vsellier triaged T3621: Create a production read-only objstorage as Normal priority.
Sep 29 2021, 5:39 PM · System administration
vsellier added a comment to T3592: POC elastic worker infrastructure.

After a hard time, we have solved several issues:

  • The rancher initialization problem occurred because we were using a version of k3s that did not match rancher's compatibility matrix.

We had installed rancher 2.5.9 on a recent version of k3s, which installs kubernetes 1.22.2. Following rancher's compatibility matrix [1], using an older version of k3s solved the problem, and the clusters now start correctly

Sep 29 2021, 10:42 AM · System administration

Sep 28 2021

vsellier triaged T3617: Create a journalbeat package for bulleye as Normal priority.
Sep 28 2021, 4:08 PM · System administration
vsellier triaged T3616: Create a prometheus-statsd-exporter package for bullseye as Normal priority.
Sep 28 2021, 4:03 PM · System administration
vsellier triaged T3615: Adapt rabbitmq monitoring for bullseye as Normal priority.
Sep 28 2021, 4:02 PM · System administration
vsellier accepted D6365: Adapt postgresql connection information on the provenance server.
Sep 28 2021, 2:54 PM
vsellier closed D6364: provenance: declare rabbitmq users.
Sep 28 2021, 2:49 PM
vsellier committed rSPSITEd38d468c061c: provenance: declare rabbitmq users (authored by vsellier).
provenance: declare rabbitmq users
Sep 28 2021, 2:49 PM
vsellier updated the diff for D6364: provenance: declare rabbitmq users.

rebase

Sep 28 2021, 2:48 PM
vsellier committed rSPPRIVC2362ec691041: Generate censored data from uncensored repository (authored by vsellier).
Generate censored data from uncensored repository
Sep 28 2021, 2:47 PM
vsellier accepted D6363: Adapt postgresql connection information on the provenance server.
Sep 28 2021, 2:38 PM
vsellier requested review of D6364: provenance: declare rabbitmq users.
Sep 28 2021, 2:36 PM
vsellier added a revision to T3487: Installation of the new provenance server: D6364: provenance: declare rabbitmq users.
Sep 28 2021, 2:36 PM · System administration
vsellier added a comment to T3487: Installation of the new provenance server.

The zfs pool and dataset are configured:

  • pool configuration
## nvme drives pool
#zpool create data mirror nvme-eui.36315030525005540025384500000003 nvme-eui.36315030525005800025384500000003 mirror nvme-eui.36315030525005620025384500000003 nvme-eui.36315030525005890025384500000003
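
The command above builds a single pool from two mirror vdevs of two drives each. Here is a rough sketch of how the usable size of such a layout can be estimated. The drive size used is hypothetical, and the estimate ignores ZFS metadata overhead.

```python
def striped_mirror_capacity(vdevs):
    """Approximate usable size: each mirror holds as much as its smallest
    disk, and the pool stripes data across all mirrors."""
    return sum(min(disks) for disks in vdevs)


# Hypothetical drive size in GB; the real drive sizes are not stated here.
DRIVE_GB = 6400
pool = [[DRIVE_GB, DRIVE_GB], [DRIVE_GB, DRIVE_GB]]
print(striped_mirror_capacity(pool), "GB usable out of", 4 * DRIVE_GB, "GB raw")
```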
Sep 28 2021, 11:45 AM · System administration
vsellier added a comment to T3487: Installation of the new provenance server.

I forgot to mention there is a gift from Dell on the server: an additional 600GB 10rpm disk

Sep 28 2021, 9:45 AM · System administration
vsellier added a comment to T3487: Installation of the new provenance server.

The server is installed. A few tasks remain to be performed manually:

  • configure the zfs datasets (I will configure a pool made of 2 mirrors for ~12TB available; tell me if that is not what is expected)
  • build a few missing packages for bullseye (related to the monitoring: prometheus-rabbitmq-exporter, prometheus-statsd-exporter, journalbeat)
  • configure a rabbitmq admin user
Sep 28 2021, 9:38 AM · System administration

Sep 27 2021

vsellier committed rSPPRIVC44570ad137d7: Generate censored data from uncensored repository (authored by vsellier).
Generate censored data from uncensored repository
Sep 27 2021, 7:45 PM
vsellier committed rSPSITE5a7dc21e8403: Fix database reference name (authored by vsellier).
Fix database reference name
Sep 27 2021, 7:38 PM
vsellier closed D6359: Prepare the configuration of the provenance server.
Sep 27 2021, 7:35 PM
vsellier committed rSPSITEa35c3550f9a6: Prepare the configuration of the provenance server (authored by vsellier).
Prepare the configuration of the provenance server
Sep 27 2021, 7:35 PM
vsellier closed D6356: Upgrade the debian sid release name.
Sep 27 2021, 7:35 PM
vsellier committed rSPSITE27a4f043fb04: Upgrade the debian sid release name (authored by vsellier).
Upgrade the debian sid release name
Sep 27 2021, 7:35 PM
vsellier closed D6354: Improve vagrant initialization time.
Sep 27 2021, 7:32 PM
vsellier committed rSENVfce61dca6c18: add provenance server (authored by vsellier).
add provenance server
Sep 27 2021, 7:32 PM
vsellier committed rSENV5105d6c4ada7: Improve vagrant initialization time (authored by vsellier).
Improve vagrant initialization time
Sep 27 2021, 7:32 PM
vsellier requested review of D6359: Prepare the configuration of the provenance server.
Sep 27 2021, 5:02 PM
vsellier added a revision to T3487: Installation of the new provenance server: D6359: Prepare the configuration of the provenance server.
Sep 27 2021, 5:02 PM · System administration
vsellier added a comment to T3487: Installation of the new provenance server.

Yes, pgbouncer will be used, and it is configured by default for 2000 parallel connections.
I don't know what kind of load the provenance client will generate, but the default 100 connections allowed by postgres will probably be too low and will need to be increased too.

Sep 27 2021, 4:55 PM · System administration
vsellier added a comment to T3487: Installation of the new provenance server.

As discussed with @aeviso, we will install the following components on the server (the OS will be Debian 11):

  • rabbitmq
  • postgresql:13
    • a default swh-storage database will be managed by puppet
    • 1000 parallel connections allowed
    • shared_buffers 50GB
  • docker
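
For illustration, the two postgresql values listed above map to the `max_connections` and `shared_buffers` settings. The sketch below renders them as `postgresql.conf` lines. It is not the puppet code that manages the server, and the helper name is hypothetical.

```python
# Values taken from the list above.
SETTINGS = {
    "max_connections": 1000,
    "shared_buffers": "50GB",
}


def render_postgresql_conf(settings):
    """Render settings as `name = value` lines, quoting string values."""
    lines = []
    for name, value in settings.items():
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines) + "\n"


print(render_postgresql_conf(SETTINGS), end="")
```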
Sep 27 2021, 4:08 PM · System administration
vsellier requested review of D6356: Upgrade the debian sid release name.
Sep 27 2021, 3:38 PM
vsellier added a revision to T3579: Meta-task: upgrade infrastructure to Debian Bullseye: D6356: Upgrade the debian sid release name.
Sep 27 2021, 3:38 PM · System administration (Component upgrades)
vsellier requested review of D6354: Improve vagrant initialization time.
Sep 27 2021, 3:01 PM
vsellier closed D6350: service urls: Fix the public url of the staging brocker.
Sep 27 2021, 11:25 AM
vsellier committed rDDOCef4409979126: service urls: Fix the public url of the staging brocker (authored by vsellier).
service urls: Fix the public url of the staging brocker
Sep 27 2021, 11:25 AM
vsellier requested review of D6350: service urls: Fix the public url of the staging brocker.
Sep 27 2021, 10:57 AM
vsellier added a revision to T3408: Provide read-only access to production servers: D6350: service urls: Fix the public url of the staging brocker.
Sep 27 2021, 10:57 AM · System administration

Sep 24 2021

vsellier accepted D6305: opam: Install and maintain up-to-date shared opam root directories.

It looks ok for the puppet code

Sep 24 2021, 4:15 PM

Sep 23 2021

vsellier added a comment to T3592: POC elastic worker infrastructure.

Interesting documentation on how to manage jobs:

Sep 23 2021, 10:32 AM · System administration

Sep 22 2021

vsellier closed D6308: Add a documentation page to list the services urls.
Sep 22 2021, 12:35 PM · System administration
vsellier committed rDDOC3a935af020e7: add a documentation page to list the services urls (authored by vsellier).
add a documentation page to list the services urls
Sep 22 2021, 12:35 PM
vsellier committed rDENV2507d723a1e6: POC a default smaller profile equivalent to the default docker-compose (authored by vsellier).
POC a default smaller profile equivalent to the default docker-compose
Sep 22 2021, 12:28 PM
vsellier committed rDENV46090bb0b458: Upgrade the registry-ui and fix a CORS issue blocking the access to the registry (authored by vsellier).
Upgrade the registry-ui and fix a CORS issue blocking the access to the registry
Sep 22 2021, 12:28 PM
vsellier updated the task description for T3592: POC elastic worker infrastructure.
Sep 22 2021, 12:03 PM · System administration
vsellier updated the diff for D6308: Add a documentation page to list the services urls.
  • Remove useless ending '/'
  • define VPN / private meanings
Sep 22 2021, 11:25 AM · System administration
vsellier added inline comments to D6308: Add a documentation page to list the services urls.
Sep 22 2021, 10:56 AM · System administration
vsellier added inline comments to D6308: Add a documentation page to list the services urls.
Sep 22 2021, 10:41 AM · System administration

Sep 21 2021

vsellier updated the diff for D6308: Add a documentation page to list the services urls.

RabbitMq GUI -> RabbitMQ

Sep 21 2021, 11:49 AM · System administration