
Sep 14 2022

vsellier created P1453 (An Untitled Masterwork).
Sep 14 2022, 4:55 PM
vsellier added a comment to T4506: Use local hypervisor storage in the loader pods.

Rancher seems to create emptyDir volumes in /var/lib/kubelet. Except for the /var/lib/kubelet/pki directory, everything in this directory is ephemeral, so we could easily use a partition backed by a local storage disk.
It will also remove unnecessary pressure on ceph for the pod-related data.
The /var/lib/docker directory could also be moved to this local partition, as everything in docker can be lost.
I will manually try that on one staging node to check whether it works before changing the terraform / puppet code.

Sep 14 2022, 3:40 PM · System administration
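
As a rough illustration of the T4506 comment above, the sketch below lists the top-level entries of a kubelet directory other than pki, i.e. what the comment describes as ephemeral. The function name `list_ephemeral_entries` and its default path argument are hypothetical; nothing here is taken from the actual terraform / puppet code.

```shell
# Hypothetical helper, not part of the SWH tooling.
# Lists the top-level entries of a kubelet data directory, skipping pki,
# which the comment identifies as the only non-ephemeral subdirectory.
list_ephemeral_entries() {
    kubelet_dir="${1:-/var/lib/kubelet}"
    find "$kubelet_dir" -mindepth 1 -maxdepth 1 ! -name pki -exec basename {} \; | sort
}
```

Running `list_ephemeral_entries /var/lib/kubelet` on a node would show which directories would move to the local partition under this assumption.
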
vsellier committed R260:c76b657d58b2: graphql: Declare startup and liveness probes (authored by vsellier).
graphql: Declare startup and liveness probes
Sep 14 2022, 2:25 PM
vsellier closed D8469: graphql: Declare startup and liveness probes.
Sep 14 2022, 2:25 PM
vsellier committed R260:aec926905060: graphql: change the sentry secret to optional (authored by vsellier).
graphql: change the sentry secret to optional
Sep 14 2022, 2:25 PM
vsellier updated the diff for D8469: graphql: Declare startup and liveness probes.

rebase

Sep 14 2022, 2:25 PM
vsellier added a comment to T4506: Use local hypervisor storage in the loader pods.

In order to test the local storage on nodes declared on uffizi, I configured a new scratch storage on this hypervisor.
Following T3707#73522 and https://pve.proxmox.com/wiki/Storage:_LVM_Thin

root@uffizi:~# lvcreate -L200G -n proxmox-scratch vg-louvre
  Logical volume "scratch" created.
Sep 14 2022, 1:14 PM · System administration
vsellier changed the status of T4506: Use local hypervisor storage in the loader pods, a subtask of T4144: Elastic worker infrastructure, from Open to Work in Progress.
Sep 14 2022, 11:01 AM · meta-task, System administration, Roadmap 2022
vsellier changed the status of T4506: Use local hypervisor storage in the loader pods from Open to Work in Progress.
Sep 14 2022, 11:01 AM · System administration
vsellier added a subtask for T4523: Dynamic infrastructure: T4534: Evaluate MetalLB as inbound loadbalancer.
Sep 14 2022, 10:57 AM · meta-task, System administration
vsellier added a parent task for T4534: Evaluate MetalLB as inbound loadbalancer: T4523: Dynamic infrastructure.
Sep 14 2022, 10:57 AM · System administration
vsellier triaged T4534: Evaluate MetalLB as inbound loadbalancer as Normal priority.
Sep 14 2022, 10:57 AM · System administration
vsellier requested review of D8472: cassandra: Allow to configure the jmx for remote or local only access.
Sep 14 2022, 10:51 AM
vsellier added a revision to T4458: Test reaper to automate the cassandra repair actions: D8472: cassandra: Allow to configure the jmx for remote or local only access.
Sep 14 2022, 10:51 AM · System administration
vsellier closed T4510: [cassandra] Profile the replayer cpu consumption, a subtask of T4373: [cassandra] Test the new hardware, as Resolved.
Sep 14 2022, 9:47 AM · Storage manager, System administration
vsellier closed T4510: [cassandra] Profile the replayer cpu consumption as Resolved.

I am closing this issue because, after @vlorentz's analysis, it seems there is not much to improve.

Sep 14 2022, 9:47 AM · Storage manager, System administration

Sep 13 2022

vsellier committed R260:b37028d1df3c: increase origin_visit_replayers (authored by vsellier).
increase origin_visit_replayers
Sep 13 2022, 8:22 PM
vsellier accepted D8470: Improve icinga2 prometheus metric checks.

neat

Sep 13 2022, 7:37 PM
vsellier accepted D8471: Summary: Bump local prometheus retention down to 1 month.
Sep 13 2022, 7:29 PM
vsellier committed R260:ad7fb9e0d846: cassandra: redispatch replayers (authored by vsellier).
cassandra: redispatch replayers
Sep 13 2022, 5:34 PM
vsellier requested review of D8469: graphql: Declare startup and liveness probes.
Sep 13 2022, 5:23 PM
vsellier added a comment to T4510: [cassandra] Profile the replayer cpu consumption.

These are the results of the different algorithms tests for the directory_add (with 20 directory replayers)

  • one-by-one
Sep 13 2022, 4:23 PM · Storage manager, System administration
vsellier raised the priority of T4455: Upgrade elk stack to a more recent version from Normal to High.
Sep 13 2022, 2:29 PM · System administration
vsellier created P1450 errors on webapp.
Sep 13 2022, 11:04 AM
vsellier added a comment to P1449 queries running on belvedere.
postgres=# select count(*) from pg_stat_activity where query like '%UNNEST(%';
 count 
-------
    64
(1 row)
Sep 13 2022, 10:41 AM
vsellier added a comment to P1449 queries running on belvedere.
postgres=# select count(*) from pg_stat_activity where query like '%UNNEST(%';
 count 
-------
    64
(1 row)
Sep 13 2022, 10:36 AM
vsellier created P1449 queries running on belvedere.
Sep 13 2022, 10:31 AM

Sep 12 2022

vsellier added a comment to T4459: Deploy swh-indexer > v2.6 on staging then production.

All the indexers were stopped at 20:00 FR because something was consuming all the bandwidth of the VPN between azure and our infra.

root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "puppet agent --disable 'stop indexer to avoid bandwith consumption'"
root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "systemctl stop swh-indexer-journal-client@*"
Sep 12 2022, 8:10 PM · Indexer, System administration
vsellier committed R260:7b43b64925da: test the batch directory add algorithm (authored by vsellier).
test the batch directory add algorithm
Sep 12 2022, 6:55 PM
vsellier committed R260:275a980bb2bc: test the concurrent directory add algorithm (authored by vsellier).
test the concurrent directory add algorithm
Sep 12 2022, 5:00 PM
vsellier closed D8450: icinga: fix a typo on the graphql host declaration.
Sep 12 2022, 4:47 PM
vsellier committed rSPSITEaf3bd6139e3e: icinga: fix a typo on the graphql host declaration (authored by vsellier).
icinga: fix a typo on the graphql host declaration
Sep 12 2022, 4:47 PM
vsellier requested review of D8450: icinga: fix a typo on the graphql host declaration.
Sep 12 2022, 4:34 PM
vsellier added a revision to T4135: staging: Deploy graphql service: D8450: icinga: fix a typo on the graphql host declaration.
Sep 12 2022, 4:34 PM · System administration, GraphQL API
vsellier closed D8449: staging: Change the monitoring profile of db1 to sql.
Sep 12 2022, 4:17 PM
vsellier committed rSPSITE527c69ff3e50: staging: Change the monitoring profile of db1 to sql (authored by vsellier).
staging: Change the monitoring profile of db1 to sql
Sep 12 2022, 4:17 PM
vsellier committed R260:649738df8042: Measure performance for one-by-one directory replayer (authored by vsellier).
Measure performance for one-by-one directory replayer
Sep 12 2022, 3:39 PM
vsellier committed R260:58577860199f: Support specific options per replayer (authored by vsellier).
Support specific options per replayer
Sep 12 2022, 3:39 PM
vsellier committed R260:8c8dc14580fc: Merge remote-tracking branch 'origin/master' into cassandra (authored by vsellier).
Merge remote-tracking branch 'origin/master' into cassandra
Sep 12 2022, 3:17 PM
vsellier requested review of D8449: staging: Change the monitoring profile of db1 to sql.
Sep 12 2022, 3:09 PM

Sep 9 2022

vsellier added a comment to T4458: Test reaper to automate the cassandra repair actions.

Reaper accesses the cassandra server through jmx. The cassandra deployment scripts need to be adapted (in progress) to expose jmx on the public interface.
When jmx is publicly exposed, the cassandra startup scripts force the jmx accesses to be password protected.

Sep 9 2022, 4:41 PM · System administration
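
The local versus remote jmx behaviour described in the T4458 comment above can be sketched as follows. This is loosely modeled on the logic of the stock cassandra-env.sh and is not the SWH deployment script; the function name `jmx_opts`, its arguments, and the password file path are assumptions for illustration.

```shell
# Sketch loosely modeled on the stock cassandra-env.sh logic; the function
# name and the password file path are assumptions for illustration.
jmx_opts() {
    local_jmx="${1:-yes}"
    jmx_port="${2:-7199}"
    if [ "$local_jmx" = "yes" ]; then
        # Local-only access: no authentication is required.
        printf '%s\n' \
            "-Dcassandra.jmx.local.port=$jmx_port" \
            "-Dcom.sun.management.jmxremote.authenticate=false"
    else
        # Remote access (what reaper needs): authentication is switched on.
        printf '%s\n' \
            "-Dcassandra.jmx.remote.port=$jmx_port" \
            "-Dcom.sun.management.jmxremote.authenticate=true" \
            "-Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"
    fi
}
```

Calling `jmx_opts no` prints the JVM options for the remote case, which is where the password protection mentioned in the comment comes from.
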
vsellier accepted D8437: archive-staging: Deploy pubdev ingestion stack.
Sep 9 2022, 2:06 PM
vsellier added a comment to T4458: Test reaper to automate the cassandra repair actions.

A new production node for replayers and generic load was added to the cluster to provide more compute resources for testing the tool.

Sep 9 2022, 11:51 AM · System administration
vsellier changed the status of T4458: Test reaper to automate the cassandra repair actions, a subtask of T4373: [cassandra] Test the new hardware, from Open to Work in Progress.
Sep 9 2022, 11:49 AM · Storage manager, System administration
vsellier changed the status of T4458: Test reaper to automate the cassandra repair actions from Open to Work in Progress.
Sep 9 2022, 11:49 AM · System administration
vsellier committed R260:7daee55ef56d: add an affinity of the replayers to the nodes with swh/replayer=true (authored by vsellier).
add an affinity of the replayers to the nodes with swh/replayer=true
Sep 9 2022, 11:46 AM
vsellier committed rSPRE943c46f87c61: rancher-production: Add a new node on hypervisor3 with 6 cpus (authored by vsellier).
rancher-production: Add a new node on hypervisor3 with 6 cpus
Sep 9 2022, 11:30 AM
vsellier added a comment to T4510: [cassandra] Profile the replayer cpu consumption.

Here is some profiling of a couple of replayers:

Sep 9 2022, 11:27 AM · Storage manager, System administration
vsellier committed R260:a45bdce2eb24: Merge remote-tracking branch 'origin/master' into cassandra (authored by vsellier).
Merge remote-tracking branch 'origin/master' into cassandra
Sep 9 2022, 10:15 AM
vsellier committed R260:f64a9a9829cd: Try to reduce the global cpu consumption to reduce the hypervisor load (authored by vsellier).
Try to reduce the global cpu consumption to reduce the hypervisor load
Sep 9 2022, 10:15 AM

Sep 8 2022

vsellier changed the status of T4510: [cassandra] Profile the replayer cpu consumption from Open to Work in Progress.
Sep 8 2022, 6:29 PM · Storage manager, System administration
vsellier changed the status of T4510: [cassandra] Profile the replayer cpu consumption, a subtask of T4373: [cassandra] Test the new hardware, from Open to Work in Progress.
Sep 8 2022, 6:29 PM · Storage manager, System administration
vsellier closed T4509: [swh-graph] Configure the max_memory to use, a subtask of T4507: Out of memory on granet, as Resolved.
Sep 8 2022, 6:25 PM · System administration, Compressed graph service
vsellier closed T4509: [swh-graph] Configure the max_memory to use as Resolved.

diff landed and deployed, graph restarted

Sep 8 2022, 6:25 PM · System administration, Compressed graph service
vsellier closed D8431: swh-graph: configure the max heap allocated to the java backend.
Sep 8 2022, 6:18 PM
vsellier committed rSPSITE303c48250b95: swh-graph: configure the max heap allocated to the java backend (authored by vsellier).
swh-graph: configure the max heap allocated to the java backend
Sep 8 2022, 6:18 PM
vsellier updated the diff for D8431: swh-graph: configure the max heap allocated to the java backend.

rebase

Sep 8 2022, 6:18 PM
vsellier added a comment to T4509: [swh-graph] Configure the max_memory to use.

I forgot to mention: it seems that, except during some peaks, the used memory is around 350 GB.

Sep 8 2022, 6:16 PM · System administration, Compressed graph service
vsellier triaged T4516: swh-graph: Add jvm monitoring as Normal priority.
Sep 8 2022, 5:57 PM · System administration, Compressed graph service
vsellier requested review of D8431: swh-graph: configure the max heap allocated to the java backend.
Sep 8 2022, 5:48 PM
vsellier added a revision to T4509: [swh-graph] Configure the max_memory to use: D8431: swh-graph: configure the max heap allocated to the java backend.
Sep 8 2022, 5:48 PM · System administration, Compressed graph service
vsellier accepted D8429: Add static check on the staging graphql instance.
Sep 8 2022, 5:30 PM
vsellier committed R260:7f552eb93182: reduce concurrent loaders for origin, increase directory (authored by vsellier).
reduce concurrent loaders for origin, increase directory
Sep 8 2022, 3:39 PM
vsellier added a comment to T4330: Deploy maven stack in production.

\o/ great

Sep 8 2022, 2:41 PM · System administration, Maven loader, Maven lister, GSoC 2019, Archive coverage
vsellier changed the status of T4509: [swh-graph] Configure the max_memory to use, a subtask of T4507: Out of memory on granet, from Open to Work in Progress.
Sep 8 2022, 12:22 PM · System administration, Compressed graph service
vsellier changed the status of T4509: [swh-graph] Configure the max_memory to use from Open to Work in Progress.
Sep 8 2022, 12:22 PM · System administration, Compressed graph service
vsellier closed T4471: swh-graph Add java process port monitoring as Resolved.
Sep 8 2022, 12:19 PM · Compressed graph service, System administration
vsellier committed R260:097e68bf3067: Merge remote-tracking branch 'origin/master' into cassandra (authored by vsellier).
Merge remote-tracking branch 'origin/master' into cassandra
Sep 8 2022, 11:18 AM
vsellier committed R260:25710a5f501e: Adapt replayer dispatching (authored by vsellier).
Adapt replayer dispatching
Sep 8 2022, 11:16 AM
vsellier closed D8415: thanos: Increase the allocated memory to avoid OOM killer.
Sep 8 2022, 11:08 AM
vsellier committed rSPREecec22a1a3a2: thanos: Increase the allocated memory to avoid OOM killer (authored by vsellier).
thanos: Increase the allocated memory to avoid OOM killer
Sep 8 2022, 11:08 AM
vsellier requested review of D8415: thanos: Increase the allocated memory to avoid OOM killer.
Sep 8 2022, 10:50 AM
vsellier triaged T4510: [cassandra] Profile the replayer cpu consumption as Normal priority.
Sep 8 2022, 10:38 AM · Storage manager, System administration
vsellier triaged T4509: [swh-graph] Configure the max_memory to use as High priority.
Sep 8 2022, 10:14 AM · System administration, Compressed graph service
vsellier added a comment to T4507: Out of memory on granet.

@vlorentz I assigned the task to you because if I'm not wrong you are running some experiments on granet.
I don't know what, but you should be more gentle with the server

Sep 8 2022, 9:40 AM · System administration, Compressed graph service
vsellier triaged T4507: Out of memory on granet as High priority.
Sep 8 2022, 9:38 AM · System administration, Compressed graph service
vsellier committed R260:a15da28adfaa: reduce origin comumption as most of the partition are replayed (authored by vsellier).
reduce origin comumption as most of the partition are replayed
Sep 8 2022, 9:12 AM
vsellier committed R260:4391318718a5: speed up origin topic replay (authored by vsellier).
speed up origin topic replay
Sep 8 2022, 6:42 AM

Sep 7 2022

vsellier updated the task description for T4506: Use local hypervisor storage in the loader pods.
Sep 7 2022, 6:21 PM · System administration
vsellier updated the task description for T4506: Use local hypervisor storage in the loader pods.
Sep 7 2022, 6:20 PM · System administration
vsellier triaged T4506: Use local hypervisor storage in the loader pods as High priority.
Sep 7 2022, 6:19 PM · System administration
vsellier committed R260:08c8caf5c557: Merge remote-tracking branch 'origin/master' into cassandra (authored by vsellier).
Merge remote-tracking branch 'origin/master' into cassandra
Sep 7 2022, 5:56 PM
vsellier accepted D8400: archive-staging: Deploy listers in cluster.
Sep 7 2022, 5:54 PM
vsellier committed R260:a2711ca2b6b1: stabilize the number of replayer (authored by vsellier).
stabilize the number of replayer
Sep 7 2022, 12:46 PM
vsellier committed R260:c6f5cb002929: fix comment start (authored by vsellier).
fix comment start
Sep 7 2022, 12:45 PM
vsellier committed R260:4b88963f3360: remove empty deployment as 0 is considered as empty values (authored by vsellier).
remove empty deployment as 0 is considered as empty values
Sep 7 2022, 12:44 PM
vsellier committed R260:fb253d42370a: try to fix 0 replicas deployments (authored by vsellier).
try to fix 0 replicas deployments
Sep 7 2022, 10:53 AM
vsellier committed R260:89c0b457ad2f: prioritize small topics to free resources for bigger ones later (authored by vsellier).
prioritize small topics to free resources for bigger ones later
Sep 7 2022, 10:39 AM

Sep 6 2022

vsellier updated the task description for T4479: uncouple the java grpc server from the python HTTP server.
Sep 6 2022, 5:34 PM · Compressed graph service
vsellier closed T4472: swh-graph: Allow to specify the rpc port as Wontfix.

yes even better

Sep 6 2022, 5:34 PM · Compressed graph service
vsellier closed D8405: money: fix chromium issue with missing sse3 instructions.
Sep 6 2022, 5:29 PM
vsellier committed rSPREc4eec83f84e9: money: fix chromium issue with missing sse3 instructions (authored by vsellier).
money: fix chromium issue with missing sse3 instructions
Sep 6 2022, 5:29 PM
vsellier requested changes to D8400: archive-staging: Deploy listers in cluster.
Sep 6 2022, 5:27 PM
vsellier requested review of D8405: money: fix chromium issue with missing sse3 instructions.
Sep 6 2022, 5:22 PM
vsellier accepted D8397: Deploy maven-exporter production node.
Sep 6 2022, 4:13 PM
vsellier added inline comments to D8397: Deploy maven-exporter production node.
Sep 6 2022, 3:50 PM
vsellier added a comment to T4497: [sentry] Out of disk space.

The root cause is a swh-graph experiment that generated a lot of grpc errors which are huge.

Sep 6 2022, 12:41 PM · Sentry, System administration
vsellier added a comment to T4497: [sentry] Out of disk space.

No consumers seem to have a big lag on these topics, so it should be possible to reduce the lag to unblock the server and check which service is sending the events:

root@riverside:/var/lib/sentry-onpremise# docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list | tr -d '\r' | xargs -t -n1 docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092  --describe --group | grep -e GROUP -e " events "
Creating sentry-self-hosted_kafka_run ... done
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group snuba-consumers
Creating sentry-self-hosted_kafka_run ... done
GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
snuba-consumers events          0          82585390        82587094        1704            -                                            -               -
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group snuba-post-processor:sync:6fa9928e1d6911edac290242ac170014
Creating sentry-self-hosted_kafka_run ... done
GROUP                                                      TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group ingest-consumer
Creating sentry-self-hosted_kafka_run ... done
Sep 6 2022, 11:18 AM · Sentry, System administration
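
To read the lag out of the `kafka-consumer-groups --describe` output shown in the T4497 comment above, a small filter like the one below may help. It is a minimal sketch: `lag_over` is a hypothetical name, and it assumes the column layout visible in the transcript, with LAG as the sixth whitespace-separated field.

```shell
# Hypothetical filter over `kafka-consumer-groups --describe` output read on stdin.
# Prints group, topic, partition and lag for rows whose LAG (6th column)
# is numeric and strictly above the given threshold (default 0).
# Header rows and rows with "-" as lag are skipped.
lag_over() {
    awk -v min="${1:-0}" '$6 ~ /^[0-9]+$/ && $6 + 0 > min + 0 { print $1, $2, $3, $6 }'
}
```

For example, piping the describe output through `lag_over 1000` keeps only the partitions lagging by more than 1000 messages.
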
vsellier added a comment to T4497: [sentry] Out of disk space.

The biggest topics are:

root@riverside:/var/lib/docker/volumes/sentry-kafka/_data# du -sch * | sort -h | tail -n 5
31M	snuba-commit-log-0
291M	outcomes-0
30G	ingest-events-0
43G	events-0
73G	total
Sep 6 2022, 11:11 AM · Sentry, System administration
vsellier changed the status of T4497: [sentry] Out of disk space from Open to Work in Progress.
Sep 6 2022, 11:09 AM · Sentry, System administration