Oct 14 2021

vsellier added a comment to T3630: staging - journal0 needs more space.

zfs dataset created on storage1:

root@storage1:~# zfs create -o mountpoint=/srv/kafka -o atime=off -o relatime=on data/kafka
root@storage1:~# zfs list
NAME           USED  AVAIL     REFER  MOUNTPOINT
data          5.39T  21.0T       96K  /data
data/kafka      96K  21.0T       96K  /srv/kafka
data/objects  5.38T  21.0T     5.38T  /srv/softwareheritage/objects
Oct 14 2021, 4:27 PM · System administration
vsellier added a comment to T3630: staging - journal0 needs more space.

Once D6477 is validated and applied, the move action will be:

Oct 14 2021, 4:22 PM · System administration
vsellier created P1201 staging: topics to move to storage1.
Oct 14 2021, 4:08 PM
vsellier requested review of D6477: staging/journal: Declare a new kafka node to migrate journal0.
Oct 14 2021, 4:07 PM
vsellier added a revision to T3630: staging - journal0 needs more space: D6477: staging/journal: Declare a new kafka node to migrate journal0.
Oct 14 2021, 4:07 PM · System administration
vsellier committed rSENV5650b21f5abc: vagrant: add kafka self-signed certificate for storage1 (test only) (authored by vsellier).
vagrant: add kafka self-signed certificate for storage1 (test only)
Oct 14 2021, 4:06 PM
vsellier committed rSPPRIVCc282d80f009c: Add storage1 broker password (authored by vsellier).
Add storage1 broker password
Oct 14 2021, 4:04 PM
vsellier committed rSPPRIVC94b6d450ce0e: add aeviso password (authored by vsellier).
add aeviso password
Oct 14 2021, 4:04 PM
vsellier committed rDDOC932896f50fd1: sysadm/network: remove a typo (authored by vsellier).
sysadm/network: remove a typo
Oct 14 2021, 11:53 AM
vsellier changed the status of T3630: staging - journal0 needs more space from Open to Work in Progress.
Oct 14 2021, 11:28 AM · System administration
vsellier closed D6465: sphinx: fix pip cache directory permissions.
Oct 14 2021, 11:16 AM
vsellier committed rCDFJb2eae5d37d3e: sphinx: fix pip cache directory permissions (authored by vsellier).
sphinx: fix pip cache directory permissions
Oct 14 2021, 11:16 AM
vsellier closed D6466: Proposal for network page.
Oct 14 2021, 11:14 AM
vsellier committed rDDOC449f7ad4d973: Proposal for network page (authored by vsellier).
Proposal for network page
Oct 14 2021, 11:14 AM
vsellier updated the diff for D6466: Proposal for network page.

rebase

Oct 14 2021, 11:13 AM
vsellier updated the diff for D6466: Proposal for network page.

update according to the feedback

Oct 14 2021, 11:06 AM
vsellier added a comment to T3573: [cassandra] directory and content read benchmarks.

Some flame graphs of the storage were captured during the ingestion with 50 workers in parallel

Oct 14 2021, 10:08 AM · System administration, Storage manager

Oct 13 2021

vsellier updated the summary of D6466: Proposal for network page.
Oct 13 2021, 4:56 PM
vsellier requested review of D6466: Proposal for network page.
Oct 13 2021, 4:55 PM
vsellier added a revision to T3154: sysadm docs: Move relevant and public doc from intranet to swh-docs: D6466: Proposal for network page.
Oct 13 2021, 4:55 PM · System administration, Documentation
vsellier updated the test plan for D6465: sphinx: fix pip cache directory permissions.
Oct 13 2021, 3:41 PM
vsellier updated the test plan for D6465: sphinx: fix pip cache directory permissions.
Oct 13 2021, 3:41 PM
vsellier requested review of D6465: sphinx: fix pip cache directory permissions.
Oct 13 2021, 3:41 PM

Oct 12 2021

vsellier added a comment to T3577: Parallel loaders performances .

Some runs with the fix:
Overall, it improves the stability of the benchmark by reducing the timeouts.

Oct 12 2021, 6:27 PM · System administration, Storage manager
vsellier closed T3407: Upgrade sphinx docker image to use a more recent version of plantuml as Resolved.
Oct 12 2021, 5:59 PM · System administration, Documentation
vsellier closed D6462: sphinx: update the plantuml version installed by the debian package.
Oct 12 2021, 5:59 PM
vsellier committed rCDFJ727112cbfeac: sphinx: update the plantuml version installed by the debian package (authored by vsellier).
sphinx: update the plantuml version installed by the debian package
Oct 12 2021, 5:59 PM
vsellier updated the diff for D6462: sphinx: update the plantuml version installed by the debian package.

Remove an unnecessary linefeed

Oct 12 2021, 5:34 PM
vsellier added a revision to T3407: Upgrade sphinx docker image to use a more recent version of plantuml: D6462: sphinx: update the plantuml version installed by the debian package.
Oct 12 2021, 5:14 PM · System administration, Documentation
vsellier requested review of D6462: sphinx: update the plantuml version installed by the debian package.
Oct 12 2021, 5:14 PM
vsellier changed the status of T3407: Upgrade sphinx docker image to use a more recent version of plantuml from Open to Work in Progress.
Oct 12 2021, 5:04 PM · System administration, Documentation
vsellier committed rDDOC31a362fe33b9: sysadm/life-cycle: complete the tools life-cycle page (authored by vsellier).
sysadm/life-cycle: complete the tools life-cycle page
Oct 12 2021, 3:15 PM
vsellier committed rDDOCd05e4669e749: sysadm/network: fix missing anchor in devel doc (authored by vsellier).
sysadm/network: fix missing anchor in devel doc
Oct 12 2021, 2:51 PM
vsellier committed rDDOCb3c6ac084e69: sysadm/lifecycle: add work in progress markers (authored by vsellier).
sysadm/lifecycle: add work in progress markers
Oct 12 2021, 2:35 PM
vsellier committed rDDOCacf1b90b164a: sysadm/network: add work in progress markers (authored by vsellier).
sysadm/network: add work in progress markers
Oct 12 2021, 2:35 PM
vsellier committed rDDOC583e71a8974a: sysadm: add life-cycle-management section (authored by vsellier).
sysadm: add life-cycle-management section
Oct 12 2021, 12:39 PM
vsellier committed rDDOCd6e02eecf52a: sysadm: add network architecture section (authored by vsellier).
sysadm: add network architecture section
Oct 12 2021, 12:39 PM

Oct 11 2021

vsellier closed T3616: Create a prometheus-statsd-exporter package for bullseye, a subtask of T3487: Installation of the new provenance server, as Resolved.
Oct 11 2021, 1:23 PM · System administration
vsellier closed T3616: Create a prometheus-statsd-exporter package for bullseye as Resolved.

solved by T3487#71814

Oct 11 2021, 1:23 PM · System administration
vsellier closed T3617: Create a journalbeat package for bullseye, a subtask of T3487: Installation of the new provenance server, as Resolved.
Oct 11 2021, 1:23 PM · System administration
vsellier closed T3617: Create a journalbeat package for bullseye as Resolved.

solved by T3487#71814

Oct 11 2021, 1:23 PM · System administration

Oct 8 2021

vsellier updated subscribers of T3621: Create a production read-only objstorage.

Sure, we should have authentication / rate limiting on this.
But if I'm not mistaken, the goal is to test the mirroring with ENEA.
If we add authentication, we need to improve the objstorage-replayer / objstorage to support it.

Oct 8 2021, 6:00 PM · System administration
vsellier added a revision to T3621: Create a production read-only objstorage: D6448: Deploy a read-only objstorage on moma.
Oct 8 2021, 5:14 PM · System administration
vsellier requested review of D6448: Deploy a read-only objstorage on moma.
Oct 8 2021, 5:14 PM
vsellier added a comment to T3621: Create a production read-only objstorage.

rSENV646f62805ef564bceed4d3a4d84d8fb6890f2d19 declares the new certificate for the vagrant tests (wrong task on the commit message)

Oct 8 2021, 2:21 PM · System administration
vsellier committed rSENV646f62805ef5: Add read-only storage self-signed certificate (authored by vsellier).
Add read-only storage self-signed certificate
Oct 8 2021, 2:20 PM

Oct 6 2021

vsellier closed T3615: Adapt rabbitmq monitoring for bullseye, a subtask of T3487: Installation of the new provenance server, as Resolved.
Oct 6 2021, 6:19 PM · System administration
vsellier closed T3615: Adapt rabbitmq monitoring for bullseye as Resolved.
Oct 6 2021, 6:19 PM · System administration
vsellier closed D6367: Adapt the prometheus rabbitmq plugin for bullseye.
Oct 6 2021, 6:13 PM · System administration
vsellier committed rSPSITE6277c45abc11: Adapt the prometheus rabbitmq plugin for bullseye (authored by ardumont).
Adapt the prometheus rabbitmq plugin for bullseye
Oct 6 2021, 6:13 PM
vsellier updated the diff for D6367: Adapt the prometheus rabbitmq plugin for bullseye.

update commit message

Oct 6 2021, 6:11 PM · System administration
vsellier changed the status of T3621: Create a production read-only objstorage from Open to Work in Progress.
Oct 6 2021, 6:02 PM · System administration
vsellier retitled D6367: Adapt the prometheus rabbitmq plugin for bullseye from wip: Adapt the prometheus rabbitmq plugin for bullseye to Adapt the prometheus rabbitmq plugin for bullseye.
Oct 6 2021, 5:48 PM · System administration
vsellier updated the diff for D6367: Adapt the prometheus rabbitmq plugin for bullseye.
  • factorize the exported configuration
  • use the right exporter port on met
Oct 6 2021, 5:45 PM · System administration
vsellier updated the diff for D6367: Adapt the prometheus rabbitmq plugin for bullseye.

rebase

Oct 6 2021, 4:14 PM · System administration
vsellier commandeered D6367: Adapt the prometheus rabbitmq plugin for bullseye.
Oct 6 2021, 4:13 PM · System administration
vsellier closed T3320: Test rancher pros/cons as Resolved.

I think the issue can be closed.
The pros are:

  • it simplifies cluster management (creation, configuration and, most of all, kubernetes upgrades)
  • it centralizes the global view of the cluster and what is running on it
  • OSS and transparent policy
Oct 6 2021, 2:36 PM · System administration
vsellier added a comment to T3630: staging - journal0 needs more space.

The proposed plan looks too naive, as some zk configuration also needs to be updated.
The recommended way is to add a new node to the cluster, migrate the partitions to the new node and shut down the old one.
Doing it this way ensures all the data and configuration are correctly migrated, and without downtime, which is not negligible as the ENEA mirror tests are in progress.
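As a minimal sketch of the partition-migration step: Kafka's `kafka-reassign-partitions.sh` tool accepts a `--topics-to-move-json-file` listing the topics to relocate, together with a `--broker-list` of target broker ids. The helper below only builds that JSON document. The topic names are hypothetical placeholders (only the `swh.journal.objects.` prefix appears elsewhere in this feed), and the real list of topics is not reproduced here.

```python
import json


def topics_to_move(topics):
    """Build the document expected by --topics-to-move-json-file.

    Duplicates are dropped and names are sorted so the output is stable.
    """
    return {
        "version": 1,
        "topics": [{"topic": name} for name in sorted(set(topics))],
    }


# Hypothetical topic names, for illustration only.
example_topics = [
    "swh.journal.objects.content",
    "swh.journal.objects.origin",
]

print(json.dumps(topics_to_move(example_topics), indent=2))
```

The resulting file would typically be passed to the tool in `--generate` mode first, so the proposed assignment can be reviewed before it is executed.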

Oct 6 2021, 12:46 PM · System administration
vsellier committed rDSNIP0b2c9ffb779e: grid5000/cassandra: increase objstorage capacity (authored by vsellier).
grid5000/cassandra: increase objstorage capacity
Oct 6 2021, 2:46 AM
vsellier committed rDSNIP2a7a2efac087: grid5000/cassandra: improve git loader benchmark stability (authored by vsellier).
grid5000/cassandra: improve git loader benchmark stability
Oct 6 2021, 2:21 AM
vsellier added a comment to T3577: Parallel loaders performances .

The loaders were finally stabilized. The instability was due to a wrong celery configuration.
Changing the pool configuration from solo to prefork solved the problem, even though the concurrency is kept at one.
Solo looked appropriate for an environment like the POC but, for obvious reasons, it was not working as expected:

Oct 6 2021, 2:11 AM · System administration, Storage manager
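To illustrate the solo-to-prefork change described in the T3577 comment above, here is a small sketch that builds a `celery worker` command line with an explicit pool and a concurrency of one. The `--pool` and `--concurrency` options are standard celery worker options; the application name `loader_app` is a hypothetical placeholder, and the actual benchmark configuration may set the pool differently (for example through the `worker_pool` setting).

```python
KNOWN_POOLS = {"prefork", "solo", "threads", "eventlet", "gevent"}


def worker_argv(app, pool="prefork", concurrency=1):
    """Return the argv list for a celery worker using the given pool."""
    if pool not in KNOWN_POOLS:
        raise ValueError(f"unknown pool: {pool}")
    return [
        "celery",
        "-A", app,  # hypothetical application module name
        "worker",
        f"--pool={pool}",
        f"--concurrency={concurrency}",
    ]


print(" ".join(worker_argv("loader_app")))
```

With prefork, the task runs in a child process, so the main worker process stays free for its own housekeeping even when the concurrency is one.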

Oct 5 2021

vsellier closed T3633: staging/production - Kafka access for ENEA mirror as Resolved.

Credentials added to the credential database under the refs:

  • operations/kafka/credentials/staging/swh-enea
  • operations/kafka/credentials/production/swh-enea
Oct 5 2021, 4:51 PM · System administration
vsellier added a comment to T3633: staging/production - Kafka access for ENEA mirror.

Production credentials created:

+ export zookeeper_servers=kafka1.internal.softwareheritage.org:2181
+ zookeeper_servers=kafka1.internal.softwareheritage.org:2181
+ export bootstrap_servers=kafka1.internal.softwareheritage.org:9092
+ bootstrap_servers=kafka1.internal.softwareheritage.org:9092
+ '[' -z swh-enea -o -z redacted ']'
+ set -eu
+ /opt/kafka/bin/kafka-configs.sh --zookeeper kafka1.internal.softwareheritage.org:2181/kafka/softwareheritage --alter --add-config 'SCRAM-SHA-256=[iterations=8192,password=redacted],SCRAM-SHA-512=[password=redacted]' --entity-type users --entity-name swh-enea
Warning: --zookeeper is deprecated and will be removed in a future version of Kafka.
Use --bootstrap-server instead to specify a broker to connect to.
Completed updating config for entity: user-principal 'swh-enea'.
+ /opt/kafka/bin/kafka-acls.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 --add --resource-pattern-type PREFIXED --topic swh.journal.objects. --allow-principal User:swh-enea --operation READ
Adding ACLs for resource `ResourcePattern(resourceType=TOPIC, name=swh.journal.objects., patternType=PREFIXED)`: 
 	(principal=User:swh-enea, host=*, operation=READ, permissionType=ALLOW)
Oct 5 2021, 4:49 PM · System administration
vsellier renamed T3633: staging/production - Kafka access for ENEA mirror from staging - Kafka access for ENEA mirror to staging/production - Kafka access for ENEA mirror.
Oct 5 2021, 4:24 PM · System administration
vsellier added a comment to T3633: staging/production - Kafka access for ENEA mirror.

Credentials created in staging:

ACLs for principal `User:swh-enea`
Current ACLs for resource `ResourcePattern(resourceType=TOPIC, name=swh.journal.objects., patternType=PREFIXED)`: 
 	(principal=User:swh-enea, host=*, operation=DESCRIBE, permissionType=ALLOW)
	(principal=User:swh-enea, host=*, operation=READ, permissionType=ALLOW)
Oct 5 2021, 3:50 PM · System administration
vsellier added a comment to T3633: staging/production - Kafka access for ENEA mirror.
export username=swh-enea
export password=XXXXX
Oct 5 2021, 3:49 PM · System administration
vsellier updated subscribers of T3633: staging/production - Kafka access for ENEA mirror.
Oct 5 2021, 3:28 PM · System administration
vsellier moved T3633: staging/production - Kafka access for ENEA mirror from Backlog to in-progress on the System administration board.
Oct 5 2021, 3:28 PM · System administration
vsellier changed the status of T3633: staging/production - Kafka access for ENEA mirror from Open to Work in Progress.
Oct 5 2021, 3:28 PM · System administration
vsellier added a comment to T3630: staging - journal0 needs more space.

Actions to perform for the migration:

  • add the role::swh_kafka_broker role to the new server in puppet and deploy
  • stop the staging workers
  • stop kafka on journal0
  • rsync the content from journal0 to the new server
  • update the staging configurations to use the new server as journal
  • update the public NAT rule for the staging journal on the firewall (https://192.168.50.1/firewall_nat.php)
Oct 5 2021, 11:36 AM · System administration
vsellier triaged T3630: staging - journal0 needs more space as High priority.
Oct 5 2021, 11:28 AM · System administration
vsellier accepted D6407: Adapt logrotate configuration so extra directory is also logrotated.

LGTM thanks

Oct 5 2021, 11:26 AM
vsellier added inline comments to D6407: Adapt logrotate configuration so extra directory is also logrotated.
Oct 5 2021, 11:11 AM
vsellier closed D6406: provenance: Configure the postgresql max_connections.
Oct 5 2021, 10:56 AM
vsellier committed rSPSITEefb36f766516: provenance: Configure the postgresql max_connections (authored by vsellier).
provenance: Configure the postgresql max_connections
Oct 5 2021, 10:56 AM

Oct 4 2021

vsellier added a revision to T3487: Installation of the new provenance server: D6406: provenance: Configure the postgresql max_connections.
Oct 4 2021, 5:59 PM · System administration
vsellier requested review of D6406: provenance: Configure the postgresql max_connections.
Oct 4 2021, 5:59 PM
vsellier added a comment to T3592: POC elastic worker infrastructure.

keda looks promising. P1193 is an example of a configuration that works for the docker environment. It's able to scale to 0 when no messages are present on the queue.
When messages are present, the loaders are launched progressively until the cpu/memory limit of the host is reached or the maximum number of allowed workers is reached.
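The scaling behaviour described above can be modelled roughly as follows. This is a simplified sketch, not keda's actual implementation: it assumes a target number of queued messages per worker (`messages_per_worker`, a hypothetical parameter) and a hard upper bound, and it ignores both the progressive ramp-up and the host cpu/memory limit.

```python
import math


def desired_workers(queue_length, messages_per_worker, max_workers):
    """Return how many workers a queue-length-based scaler would ask for."""
    if queue_length <= 0:
        # Empty queue: scale down to zero workers.
        return 0
    wanted = math.ceil(queue_length / messages_per_worker)
    # Never exceed the configured maximum.
    return min(max_workers, wanted)


for queued in (0, 1, 12, 1000):
    print(queued, desired_workers(queued, messages_per_worker=5, max_workers=10))
```

The key difference from a CPU-based policy is that the decision depends on the pending work in the queue rather than on what the running pods are currently consuming.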

Oct 4 2021, 9:21 AM · System administration
vsellier created P1193 keda configuration for docker environment.
Oct 4 2021, 9:19 AM

Oct 1 2021

vsellier added a comment to T3577: Parallel loaders performances .

intermediary status:

  • the bench lab is easily deployable on g5k on several workers to distribute the load [1]
  • it's working well when the load is not so high. When the number of workers is increased, the workers seem to have trouble talking to rabbitmq:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-p9ds5
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-n6pvm
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-mrcjj
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-7bn4s
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-lg2bd

and also an unexplained time drift:

[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-lxjpl may mean clocks are out of sync.  Current drift is
[loaders-77cdd444df-flcv9 loaders] 356 seconds.  [orig: 2021-09-30 23:46:55.447181 recv: 2021-09-30 23:40:59.633444]
[loaders-77cdd444df-flcv9 loaders]
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-jd6v9 may mean clocks are out of sync.  Current drift is
[loaders-77cdd444df-flcv9 loaders] 355 seconds.  [orig: 2021-09-30 23:46:55.447552 recv: 2021-09-30 23:41:00.723983]
[loaders-77cdd444df-flcv9 loaders]
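The drift values in the warnings above can be cross-checked from the `orig` and `recv` timestamps. The sketch below assumes the reported drift is simply the absolute difference between the two timestamps; this is consistent with the two log entries but is not taken from celery's source.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def drift_seconds(orig, recv):
    """Absolute difference, in seconds, between two log timestamps."""
    delta = datetime.strptime(orig, FMT) - datetime.strptime(recv, FMT)
    return abs(delta.total_seconds())


first = drift_seconds("2021-09-30 23:46:55.447181", "2021-09-30 23:40:59.633444")
second = drift_seconds("2021-09-30 23:46:55.447552", "2021-09-30 23:41:00.723983")
print(round(first), round(second))  # 356 355
```

In both entries the `recv` timestamp is almost six minutes older than `orig`.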
Oct 1 2021, 5:07 PM · System administration, Storage manager
vsellier committed rDSNIP7cc495e333e2: grid5000/cassandra: kubernetes configuration for massive parallel loader test (authored by vsellier).
grid5000/cassandra: kubernetes configuration for massive parallel loader test
Oct 1 2021, 4:37 PM
vsellier added a comment to T3592: POC elastic worker infrastructure.

Intermediary status:

  • We have successfully run loaders in staging using the helm chart we wrote [1] and a hardcoded number of workers. It adds the possibility to perform rolling upgrades, for example
  • We tried the integrated horizontal pod autoscaler [2]. It works pretty well, but it is not adapted to our worker scenario. It decides whether the number of running pods must be scaled up or down based on the pods' cpu consumption (in our test [3]; other metrics are possible). It can be very useful to manage classical load, such as a gunicorn container, but not for long-running tasks
  • Kubernetes also has some functionality to reduce the pressure on a node when some limits are reached, but it looks more like emergency action than proper scaling management. It's configured at the kubelet level and is not dynamic at all [4]. It was quickly tested, but we lost the node to an OOM before the node eviction started.
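For reference, the core rule documented for the horizontal pod autoscaler is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The sketch below applies that formula to cpu utilization percentages; the numbers are made up for illustration, and the tolerance band and min/max replica bounds of the real controller are omitted.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Apply the documented HPA ratio rule (no tolerance, no bounds)."""
    return math.ceil(current_replicas * current_metric / target_metric)


# Illustrative values: 4 pods, target of 50% average cpu utilization.
print(hpa_desired_replicas(4, 90, 50))  # busy pods: scale up
print(hpa_desired_replicas(4, 10, 50))  # mostly idle pods: scale down
```

The decision depends only on the metric ratio, not on whether a pod is in the middle of a long-running task, which is in line with the observation that this mode suits request-serving containers better than loaders.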
Oct 1 2021, 4:18 PM · System administration

Sep 30 2021

vsellier updated subscribers of T3624: Update swh-graph from 0.3.0 to 0.5.0 on granet.

cc @seirl

Sep 30 2021, 5:10 PM · Compressed graph service, System administration
vsellier updated the task description for T3487: Installation of the new provenance server.
Sep 30 2021, 2:27 PM · System administration
vsellier closed D6378: provenance: Declare 10 pre-provisioned databases for the different experiments.
Sep 30 2021, 2:23 PM
vsellier committed rSPSITEe42b581fc789: provenance: Declare 10 pre-provisioned databases for the different experiments (authored by vsellier).
provenance: Declare 10 pre-provisioned databases for the different experiments
Sep 30 2021, 2:23 PM
vsellier updated the task description for T3487: Installation of the new provenance server.
Sep 30 2021, 12:50 PM · System administration
vsellier requested review of D6378: provenance: Declare 10 pre-provisioned databases for the different experiments.
Sep 30 2021, 12:50 PM
vsellier added a revision to T3487: Installation of the new provenance server: D6378: provenance: Declare 10 pre-provisioned databases for the different experiments.
Sep 30 2021, 12:50 PM · System administration

Sep 29 2021

vsellier triaged T3621: Create a production read-only objstorage as Normal priority.
Sep 29 2021, 5:39 PM · System administration
vsellier added a comment to T3592: POC elastic worker infrastructure.

After a hard time, we have solved several issues:

  • The rancher initialization problem was because we were using a wrong version of k3s compared to the compatibility matrix of rancher.

We had installed rancher 2.5.9 on a recent version of k3s, which installs kubernetes 1.22.2. Using an older version of k3s, in line with the rancher compatibility matrix [1], solved the problem, and the clusters start correctly after that

Sep 29 2021, 10:42 AM · System administration

Sep 28 2021

vsellier triaged T3617: Create a journalbeat package for bullseye as Normal priority.
Sep 28 2021, 4:08 PM · System administration
vsellier triaged T3616: Create a prometheus-statsd-exporter package for bullseye as Normal priority.
Sep 28 2021, 4:03 PM · System administration
vsellier triaged T3615: Adapt rabbitmq monitoring for bullseye as Normal priority.
Sep 28 2021, 4:02 PM · System administration
vsellier accepted D6365: Adapt postgresql connection information on the provenance server.
Sep 28 2021, 2:54 PM
vsellier closed D6364: provenance: declare rabbitmq users.
Sep 28 2021, 2:49 PM
vsellier committed rSPSITEd38d468c061c: provenance: declare rabbitmq users (authored by vsellier).
provenance: declare rabbitmq users
Sep 28 2021, 2:49 PM
vsellier updated the diff for D6364: provenance: declare rabbitmq users.

rebase

Sep 28 2021, 2:48 PM
vsellier committed rSPPRIVC2362ec691041: Generate censored data from uncensored repository (authored by vsellier).
Generate censored data from uncensored repository
Sep 28 2021, 2:47 PM
vsellier accepted D6363: Adapt postgresql connection information on the provenance server.
Sep 28 2021, 2:38 PM