Query: Advanced Search

Rancher seems to create emptyDir volumes in /var/lib/kubelet. Except for the /var/lib/kubelet/pki directory, everything in this directory is ephemeral, so we could easily use a partition backed by a local storage disk.
It would also remove unnecessary pressure on ceph for pod-related data.
The /var/lib/docker directory could also be moved to this local partition, as everything in docker can be lost.
I will try that manually on one staging node to check that it works before changing the terraform / puppet code.
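
Before moving anything, it may help to see what actually lives next to pki. This is only a minimal sketch: the helper name `list_ephemeral_entries` is hypothetical, and it assumes a `find` that supports `-mindepth`/`-maxdepth` (GNU or BSD).

```shell
# List the direct children of the kubelet directory that are NOT pki,
# i.e. the candidates for an ephemeral local partition.
# The directory defaults to /var/lib/kubelet (path taken from the comment above).
list_ephemeral_entries() {
    kubelet_dir="${1:-/var/lib/kubelet}"
    # -mindepth/-maxdepth restrict find to the direct children of the directory
    find "$kubelet_dir" -mindepth 1 -maxdepth 1 ! -name pki -exec basename {} \; | LC_ALL=C sort
}
```

Running `list_ephemeral_entries` without argument on the staging node prints one entry per line; anything unexpected in that list would be worth checking before the move.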

graphql: Declare startup and liveness probes

graphql: change the sentry secret to optional

In order to test the local storage on nodes declared on uffizi, I configured a new scratch storage on this hypervisor.
Following T3707#73522 and https://pve.proxmox.com/wiki/Storage:_LVM_Thin

root@uffizi:~# lvcreate -L200G -n proxmox-scratch vg-louvre
  Logical volume "scratch" created.

I am closing this issue because, after @vlorentz's analysis, it seems there is not much left to improve.

increase origin_visit_replayers

cassandra: redispatch replayers

These are the results of the tests of the different directory_add algorithms (with 20 directory replayers):

one-by-one

postgres=# select count(*) from pg_stat_activity where query like '%UNNEST(%';
 count 
-------
    64
(1 row)

postgres=# select count(*) from pg_stat_activity where query like '%UNNEST(%';
 count 
-------
    64
(1 row)
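
The count above is a single snapshot. A rough way to sample it over time could look like the sketch below; the function names are hypothetical, and it assumes `psql` can connect with its default environment (PGHOST, PGUSER, ...).

```shell
# Print the number of backends whose current query contains "UNNEST(".
# Note: the counting session matches its own query text, so it is included in the result.
count_unnest_queries() {
    # -A: unaligned output, -t: tuples only, so only the bare number is printed
    psql -At -c "select count(*) from pg_stat_activity where query like '%UNNEST(%'"
}

# Take $1 samples, $2 seconds apart, each prefixed with a timestamp.
sample_unnest_queries() {
    samples="$1"
    interval="$2"
    i=0
    while [ "$i" -lt "$samples" ]; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$(count_unnest_queries)"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `sample_unnest_queries 10 5` would print ten samples over roughly 45 seconds, which makes it easier to compare the algorithms on more than one data point.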

All the indexers were stopped at 20:00 FR because something was consuming all the bandwidth of the VPN between azure and our infra.

root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "puppet agent --disable 'stop indexer to avoid bandwith consumption'"
root@pergamon:/etc/clustershell# clush -b -w @indexer-workers "systemctl stop swh-indexer-journal-client@*"

test the batch directory add algorithm

test the concurrent directory add algorithm

icinga: fix a typo on the graphql host declaration

staging: Change the monitoring profile of db1 to sql

Measure performance for one-by-one directory replayer

Support specific options per replayer

Merge remote-tracking branch 'origin/master' into cassandra

Reaper accesses the cassandra server through JMX. The cassandra deployment scripts need to be adapted (in progress) to expose JMX on the public interface.
When JMX is publicly exposed, the cassandra startup scripts force the JMX access to be password-protected.
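
To illustrate that behaviour, here is a simplified paraphrase of the JMX section of the stock cassandra-env.sh as I remember it, not the verbatim file. The wrapper function `jmx_opts` is hypothetical, and the exact options and the password file path should be checked against the file shipped with the installed version.

```shell
# Simplified paraphrase of the JMX logic in cassandra-env.sh (assumption, not verbatim).
# LOCAL_JMX=yes keeps JMX local without authentication; any other value
# exposes it remotely and turns authentication on.
jmx_opts() {
    local_jmx="${LOCAL_JMX:-yes}"
    jmx_port="${JMX_PORT:-7199}"
    if [ "$local_jmx" = "yes" ]; then
        echo "-Dcassandra.jmx.local.port=$jmx_port -Dcom.sun.management.jmxremote.authenticate=false"
    else
        echo "-Dcassandra.jmx.remote.port=$jmx_port -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"
    fi
}
```

With this logic, exposing JMX for reaper means setting `LOCAL_JMX=no`, which in turn requires a password file with the credentials reaper will use.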

A new production node for replayers and generic load was added to the cluster to provide more compute resources for testing the tool.

add an affinity of the replayers to the nodes with swh/replayer=true
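
A sketch of what such an affinity could look like, assuming a hard (required) node affinity; the actual commit may use a preferred affinity or a plain nodeSelector instead. The function names are hypothetical; the label `swh/replayer=true` comes from the commit title above.

```shell
# Label a node so that replayer pods may be scheduled on it.
# $1 is the node name.
label_replayer_node() {
    kubectl label node "$1" swh/replayer=true
}

# Print a pod-spec fragment restricting scheduling to the labelled nodes.
replayer_affinity() {
    cat <<'EOF'
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: swh/replayer
              operator: In
              values:
                - "true"
EOF
}
```

`kubectl get nodes -l swh/replayer=true` then lists the nodes eligible for the replayers.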

rancher-production: Add a new node on hypervisor3 with 6 cpus

Here is some profiling of a couple of replayers:

Merge remote-tracking branch 'origin/master' into cassandra

Try to reduce the global cpu consumption to reduce the hypervisor load

diff landed and deployed, graph restarted

swh-graph: configure the max heap allocated to the java backend

I forgot to mention: it seems that, except during some peaks, the used memory is around 350 GB.

reduce concurrent loaders for origin, increase directory

Merge remote-tracking branch 'origin/master' into cassandra

Adapt replayer dispatching

thanos: Increase the allocated memory to avoid OOM killer

@vlorentz I assigned the task to you because, if I'm not mistaken, you are running some experiments on granet.
I don't know which ones, but you should be more gentle with the server.

reduce origin consumption as most of the partitions are replayed

speed up origin topic replay

Merge remote-tracking branch 'origin/master' into cassandra

stabilize the number of replayers

remove empty deployments as 0 is considered an empty value

try to fix 0 replicas deployments

prioritize small topics to free resources for bigger ones later

money: fix chromium issue with missing sse3 instructions

The root cause is a swh-graph experiment that generated a lot of very large gRPC errors.

No consumer seems to have a big lag on these topics, so it should be possible to reduce the lag to unblock the server and then look at which service is sending the events:

root@riverside:/var/lib/sentry-onpremise# docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list | tr -d '\r' | xargs -t -n1 docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092  --describe --group | grep -e GROUP -e " events "
Creating sentry-self-hosted_kafka_run ... done
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group snuba-consumers
Creating sentry-self-hosted_kafka_run ... done
GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
snuba-consumers events          0          82585390        82587094        1704            -                                            -               -
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group snuba-post-processor:sync:6fa9928e1d6911edac290242ac170014
Creating sentry-self-hosted_kafka_run ... done
GROUP                                                      TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
docker-compose-1.29.2 run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group ingest-consumer
Creating sentry-self-hosted_kafka_run ... done
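
The output above is noisy because of the `Creating ...` lines and the echoed commands. A small filter, sketched below, can reduce it to one lag total per group and topic; the function name `total_lag` is hypothetical, and it assumes LAG is the 6th column, as in the header above.

```shell
# Sum the LAG column (6th field) per "group topic" pair from the output of
# `kafka-consumer-groups --describe` read on stdin.
# Header lines, echoed commands and "Creating ..." lines are skipped because
# their 6th field is missing or not a plain number.
total_lag() {
    awk '$1 != "GROUP" && $6 ~ /^[0-9]+$/ { lag[$1 " " $2] += $6 }
         END { for (k in lag) print k, lag[k] }' | LC_ALL=C sort
}
```

Appending `| total_lag` to the pipeline in place of the final `grep` would print lines such as `snuba-consumers events 1704` for the data shown above.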

The biggest topics are:

root@riverside:/var/lib/docker/volumes/sentry-kafka/_data# du -sch * | sort -h | tail -n 5
31M	snuba-commit-log-0
291M	outcomes-0
30G	ingest-events-0
43G	events-0
73G	total


Sep 14 2022

Sep 13 2022

Sep 12 2022

Sep 9 2022

Sep 8 2022

Sep 7 2022

Sep 6 2022
