Sep 20 2022
Sep 19 2022
Reduce queueThreshold to ensure running at max loaders when needed
update commit message
Sep 18 2022
Sep 17 2022
Sep 16 2022
For example, these logs were seen in the Reaper logs when several repairs for different keyspaces are scheduled at the same time:
INFO [2022-09-16 15:02:34,268] [archive_production:67b1d310-35c5-11ed-8ea7-4b43418aeab2:67b9e963-35c5-11ed-8ea7-4b43418aeab2] i.c.s.SegmentRunner - Repair for segment 67b9e963-35c5-11ed-8ea7-4b43418aeab2 started, status wait will timeout in 1800000 millis
INFO [2022-09-16 15:02:58,602] [archive_production:9a773740-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Maximum number of concurrent repairs reached. Repair 9a773740-35cf-11ed-8ea7-4b43418aeab2 will resume later.
INFO [2022-09-16 15:02:58,602] [archive_production:9a773740-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Current active repair runners: [(67b1d310-35c5-11ed-8ea7-4b43418aeab2,1663335748289), (76982be0-35cf-11ed-8ea7-4b43418aeab2,1663340068254), (9a773740-35cf-11ed-8ea7-4b43418aeab2,1663340128436), (9a9cc0a0-35cf-1
INFO [2022-09-16 15:02:58,787] [archive_production:9a9cc0a0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Maximum number of concurrent repairs reached. Repair 9a9cc0a0-35cf-11ed-8ea7-4b43418aeab2 will resume later.
INFO [2022-09-16 15:02:58,787] [archive_production:9a9cc0a0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Current active repair runners: [(67b1d310-35c5-11ed-8ea7-4b43418aeab2,1663335748289), (76982be0-35cf-11ed-8ea7-4b43418aeab2,1663340068254), (9a773740-35cf-11ed-8ea7-4b43418aeab2,1663340128436), (9a9cc0a0-35cf-1
INFO [2022-09-16 15:02:59,336] [archive_production:9aeae0a0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Maximum number of concurrent repairs reached. Repair 9aeae0a0-35cf-11ed-8ea7-4b43418aeab2 will resume later.
INFO [2022-09-16 15:02:59,336] [archive_production:9aeae0a0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Current active repair runners: [(67b1d310-35c5-11ed-8ea7-4b43418aeab2,1663335748289), (76982be0-35cf-11ed-8ea7-4b43418aeab2,1663340068254), (9a773740-35cf-11ed-8ea7-4b43418aeab2,1663340128436), (9a9cc0a0-35cf-1
INFO [2022-09-16 15:02:59,555] [archive_production:9b0c7260-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Maximum number of concurrent repairs reached. Repair 9b0c7260-35cf-11ed-8ea7-4b43418aeab2 will resume later.
INFO [2022-09-16 15:02:59,555] [archive_production:9b0c7260-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Current active repair runners: [(67b1d310-35c5-11ed-8ea7-4b43418aeab2,1663335748289), (76982be0-35cf-11ed-8ea7-4b43418aeab2,1663340068254), (9a773740-35cf-11ed-8ea7-4b43418aeab2,1663340128436), (9a9cc0a0-35cf-1
INFO [2022-09-16 15:02:59,779] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Attempting to run new segment...
INFO [2022-09-16 15:02:59,813] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Next segment to run : 76998b71-35cf-11ed-8ea7-4b43418aeab2
INFO [2022-09-16 15:02:59,849] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2:76998b71-35cf-11ed-8ea7-4b43418aeab2] i.c.j.JmxProxy - Triggering repair of range (-5797115047693728403,-5671075333212739092] for keyspace "reaper_db" on host 192.168.100.182, with repair parallelism dc_parallel, in cluster with Cas
INFO [2022-09-16 15:02:59,851] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2:76998b71-35cf-11ed-8ea7-4b43418aeab2] i.c.j.JmxProxy - Triggering repair for ranges -5797115047693728403:-5671075333212739092
INFO [2022-09-16 15:02:59,863] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2:76998b71-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Triggered repair of segment 76998b71-35cf-11ed-8ea7-4b43418aeab2 via host 192.168.100.182
INFO [2022-09-16 15:02:59,863] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2:76998b71-35cf-11ed-8ea7-4b43418aeab2] i.c.s.SegmentRunner - Repair for segment 76998b71-35cf-11ed-8ea7-4b43418aeab2 started, status wait will timeout in 1800000 millis
INFO [2022-09-16 15:03:04,227] [archive_production:67b1d310-35c5-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Attempting to run new segment...
INFO [2022-09-16 15:03:04,254] [archive_production:67b1d310-35c5-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments.
INFO [2022-09-16 15:03:04,262] [archive_production:67b1d310-35c5-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments.
Perhaps a slight concern regarding the length of the group_id, but nothing blocking.
Reaper was manually deployed and running.
The main functionalities for now are the scheduling of the different repair types and the orchestration of the segments to repair, to avoid repairing the same segment on different replicas.
Secondary functionalities can also be useful, like repair progress tracking and stop / resume: http://cassandra-reaper.io/docs/concepts/
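As a rough sketch of how those secondary functionalities could be driven, Reaper also exposes them over its HTTP API; the endpoint and parameter names below are assumptions based on the documentation linked above and should be double-checked:

# hypothetical sketch; host, cluster, keyspace and owner are placeholders
curl -X POST "http://<reaper-host>:8080/repair_run?clusterName=<cluster>&keyspace=<keyspace>&owner=<owner>"
# pausing / resuming an existing run (exact state values to be confirmed against the docs)
curl -X PUT "http://<reaper-host>:8080/repair_run/<run-id>/state?state=PAUSED"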
Sep 15 2022
The group id of the authenticated consumers will probably have to be updated to match the Kafka ACLs.
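As a hedged illustration of what that could look like with the stock Kafka tooling (the broker, principal and group names below are placeholders, not our actual values):

# grant the authenticated consumer read access on its consumer group
kafka-acls.sh --bootstrap-server <broker>:9092 --command-config admin.properties \
  --add --allow-principal User:<consumer-user> \
  --operation Read --group <consumer-group-id>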
rebase
Example during the loading of https://github.com/torvalds/linux by a pod:
% /usr/sbin/zfs list data/docker data/kubelet
NAME           USED  AVAIL  REFER  MOUNTPOINT
data/docker   3.81G  40.4G  83.2M  /var/lib/docker
data/kubelet  3.71G  40.4G  3.71G  /var/lib/kubelet
Compression is not as useful as it is for docker:
% /usr/sbin/zfs get compressratio data/kubelet data/docker
NAME          PROPERTY       VALUE  SOURCE
data/docker   compressratio  2.95x  -
data/kubelet  compressratio  1.07x  -
rebase
heh sorry for the title mess
The kubelet dataset will need to be created manually on all the rancher nodes (except staging worker2 and worker3, which are already configured) before applying D8482; a rough sketch follows below.
- cluster-argo
- archive-staging
- archive-production
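For the kubelet dataset creation mentioned above, a minimal sketch (assuming the same data zpool that backs data/docker; D8482 remains the authoritative version):

# create the dataset and mount it where rancher expects kubelet data
zfs create -o mountpoint=/var/lib/kubelet data/kubelet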
- rebase
- fix a typo in the jmxremote.access file name
- configure the jvm to use it
Sep 14 2022
It works \o/
Rancher seems to create emptyDir volumes in /var/lib/kubelet. Except for the /var/lib/kubelet/pki directory, everything in this directory is ephemeral, so we could easily use a partition backed by a local storage disk.
It will also remove unnecessary pressure on Ceph for pod-related data.
The /var/lib/docker directory could also be moved to this local partition as everything in docker can be lost.
I will try that manually on one staging node to check whether it works before changing the terraform / puppet code.
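A quick way to check the emptyDir assumption on a node beforehand (the paths follow the standard kubelet layout; this is just a sanity check, not part of the change):

# list the emptyDir volumes kubelet created for the running pods
find /var/lib/kubelet/pods -maxdepth 4 -type d -name 'kubernetes.io~empty-dir'
# confirm that only pki holds long-lived data
ls /var/lib/kubelet/pki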
rebase
In order to test the local storage on nodes declared on uffizi, I configured a new scratch storage on this hypervisor.
Following T3707#73522 and https://pve.proxmox.com/wiki/Storage:_LVM_Thin
root@uffizi:~# lvcreate -L200G -n proxmox-scratch vg-louvre
  Logical volume "scratch" created.
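Following the linked wiki, the next steps would roughly be to turn that LV into a thin pool and register it as a storage in Proxmox; the exact commands below are assumptions to be checked against the wiki:

# convert the new LV to a thin pool, then declare it as an lvmthin storage
lvconvert --type thin-pool vg-louvre/proxmox-scratch
pvesm add lvmthin proxmox-scratch --vgname vg-louvre --thinpool proxmox-scratch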
I am closing this issue because, after @vlorentz's analysis, it seems there isn't much left to improve.
Sep 13 2022
These are the results of the tests of the different algorithms for directory_add (with 20 directory replayers):
- one-by-one
postgres=# select count(*) from pg_stat_activity where query like '%UNNEST(%';
 count
-------
    64
(1 row)
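To see whether those backends are actually busy or mostly idle, the same query can be broken down by state (a quick follow-up check, not something that was run at the time):

postgres=# select state, count(*) from pg_stat_activity where query like '%UNNEST(%' group by state;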
Sep 12 2022
All the indexers were stopped at 20:00 FR because something was consuming all the bandwidth of the VPN between azure and our infra.
root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "puppet agent --disable 'stop indexer to avoid bandwith consumption'"
root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "systemctl stop swh-indexer-journal-client@*"