thanks
LGTM
Sep 3 2021
- puppet configuration deployed in staging
- read index updated with this script:
#!/bin/bash
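# NOTE: the rest of the script was not preserved in this log. What follows is
# only a hypothetical sketch of the read-index update, based on the aliases
# used in the swh-search configuration (origin-read -> origin-v0.11); it is
# not the script that was actually run.
ES=http://search-esnode0:9200
curl -s -XPOST "$ES/_aliases" -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "remove": { "index": "origin-v0.10.0", "alias": "origin-read" } },
    { "add":    { "index": "origin-v0.11",   "alias": "origin-read" } }
  ]
}'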
The lag recovered in ~12 hours.
The content of the index looks good (I just cherry-picked a couple of origins).
Sep 1 2021
- package python3-swh.search upgraded to version 0.11.4-2; the problem is fixed
- the new index was correctly created:
root@search0:/# curl -s http://search-esnode0:9200/_cat/indices\?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-v0.11                HljzsdD9SmKI7-8ekB_q3Q  80   0          0            0      4.2kb          4.2kb
green  close  origin                      HthJj42xT5uO7w3Aoxzppw  80   0
green  close  origin-v0.9.0               o7FiYJWnTkOViKiAdCXCuA  80   0
green  open   origin-v0.10.0              -fvf4hK9QDeN8qYTJBBlxQ  80   0    1981623       559384      2.3gb          2.3gb
green  close  origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0
green  close  origin-v0.5.0               SGplSaqPR_O9cPYU4ZsmdQ  80   0
- journal clients enabled and restarted
- the journal clients' lag should recover in less than 12h (a lag-check sketch follows after this list)
- waiting some time to estimate the duration with only one journal client per type
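Not taken from this log, but the lag of these consumer groups can be watched with the standard Kafka CLI (group ids as configured in the Sep 1 diffs below); a sketch:
SERVER=journal0.internal.staging.swh.network:9092
# show partitions, current offsets and lag for each v0.11 journal client group
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.search.journal_client-v0.11
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --describe --group swh.search.journal_client.indexed-v0.11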
The problem was fixed by rDSEA68347a5604c74150197f691593cbb05bdd34396f
thanks @olasd
Deployment of version v0.11.4 in staging:
On search0:
- puppet stopped
- stop and disable the journal clients and search backend
- update the swh-search configuration to use the origin-v0.11 index
root@search0:/etc/softwareheritage/search# diff -U2 /tmp/server.yml server.yml
--- /tmp/server.yml	2021-09-01 13:42:29.347951302 +0000
+++ server.yml	2021-09-01 13:42:35.739953523 +0000
@@ -7,5 +7,5 @@
 indexes:
   origin:
-    index: origin-v0.10.0
+    index: origin-v0.11
     read_alias: origin-read
     write_alias: origin-write
- update the journal-clients to use a group id swh.search.journal_client.[indexed|object]-v0.11
root@search0:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_objects.yml journal_client_objects.yml
--- /tmp/journal_client_objects.yml	2021-09-01 13:44:49.843999978 +0000
+++ journal_client_objects.yml	2021-09-01 13:45:03.972004852 +0000
@@ -5,7 +5,7 @@
 journal:
   brokers:
   - journal0.internal.staging.swh.network
-  group_id: swh.search.journal_client-v0.10.0
+  group_id: swh.search.journal_client-v0.11
   prefix: swh.journal.objects
   object_types:
   - origin
root@search0:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_indexed.yml journal_client_indexed.yml
--- /tmp/journal_client_indexed.yml	2021-09-01 13:44:44.847998252 +0000
+++ journal_client_indexed.yml	2021-09-01 13:44:57.020002454 +0000
@@ -5,7 +5,7 @@
 journal:
   brokers:
   - journal0.internal.staging.swh.network
-  group_id: swh.search.journal_client.indexed-v0.10.0
+  group_id: swh.search.journal_client.indexed-v0.11
   prefix: swh.journal.indexed
   object_types:
   - origin_intrinsic_metadata
- perform a system upgrade (a reboot was not required)
- enable and start swh-search backend
- An error occurred after the restart:
Sep 01 14:19:12 search0 python3[4066688]: 2021-09-01 14:19:12 [4066688] root:ERROR command 'cc' failed with exit status 1
Traceback (most recent call last):
  File "/usr/lib/python3.7/distutils/unixccompiler.py", line 118, in _compile
    extra_postargs)
  File "/usr/lib/python3.7/distutils/ccompiler.py", line 909, in spawn
    spawn(cmd, dry_run=self.dry_run)
  File "/usr/lib/python3.7/distutils/spawn.py", line 36, in spawn
    _spawn_posix(cmd, search_path, dry_run=dry_run)
  File "/usr/lib/python3.7/distutils/spawn.py", line 159, in _spawn_posix
    % (cmd, exit_status))
distutils.errors.DistutilsExecError: command 'cc' failed with exit status 1
The build is now fixed and version v0.11.4 is ready to be deployed to the environments.
Test with 10 replayers and the 3 kinds of algorithms:
- first interval: one-by-one
- second interval: concurrent
- third interval: batch
LGTM
LGTM
Aug 31 2021
Aug 30 2021
rebase
Add a test showing the failure without the correction
Aug 27 2021
New cluster state after all the reservations are up:
vsellier@gros-50:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.97.3  1.4 TiB   256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6  1.4 TiB   256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5  1.39 TiB  256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4  1.4 TiB   256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2  1.4 TiB   256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1
- cassandra stopped
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl stop cassandra
- data cleaned
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} "rm -rf /srv/cassandra/*"
- Cassandra restarted
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl start cassandra
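Not shown in the log: a quick way to check that every node came back after the restart could be (a sketch, not commands that were actually recorded):
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl is-active cassandra
vsellier@fnancy:~/cassandra$ ssh root@gros-50 nodetool status   # all nodes should eventually show UN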
Well, after reflection, it will probably be faster to recreate the second DC from scratch now that the configuration is ready.
5 nodes were added to the cluster:
- configuration pushed to g5k, disks reserved for 14 days on the new servers, and a new reservation launched with the new nodes
- each node was started one by one, waiting until its status was UN in the nodetool status output before starting the next one
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load  Tokens  Owns (effective)  Host ID                               Rack
DN  172.16.97.3  ?     256     0.0%              a3ae5fa2-c063-4890-87f1-bddfcf293bde  r1
DN  172.16.97.6  ?     256     0.0%              bfe360f1-8fd2-4f4b-a070-8f267eda1e12  r1
DN  172.16.97.5  ?     256     0.0%              478c36f8-5220-4db7-b5c2-f3876c0c264a  r1
DN  172.16.97.4  ?     256     0.0%              b3105348-66b0-4f82-a5bf-31ef28097a41  r1
DN  172.16.97.2  ?     256     0.0%              de866efd-064c-4e27-965c-f5112393dc8f  r1
10 nodes are not enough; I am adding 5 additional nodes to reduce the volume per node a little.
thanks. I will test that once the monitoring is updated to use the statsd statistics instead of the object_count table content.
The lz4 compression was already activated by default. Changing the algorithm to zstd on the snapshot table was not really significant (initially with lz4: 7 GB, zstd: 12 GB, back to lz4: 9 GB :) )
interesting:
Depending on the data characteristics of the table, compressing its data can result in:
- 25-33% reduction in data size
- 25-35% performance improvement on reads
- 5-10% performance improvement on writes
The replaying is currently stopped as the data disks are now almost full.
I will try to activate the compression on some big tables to see if it can help.
I will probably need to start with small tables to recover some space before being able to compress the biggest tables.
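For reference, switching the compression of a table is a single schema change followed by an SSTable rewrite; a sketch using cqlsh (the table and chunk length are examples, not commands taken from this log):
# example: enable zstd compression on one big table, then rewrite its SSTables
# so the existing data on disk is actually recompressed
cqlsh gros-50 -e "ALTER TABLE swh.directory_entry WITH compression = {'class': 'ZstdCompressor', 'chunk_length_in_kb': 64};"
nodetool upgradesstables -a swh directory_entry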
Aug 26 2021
The patch was tested in a loader and in the replayers.
The difference was not really significant on the loader but I'm not really confident in the tests as the cluster had a pretty high load (running replayers + second datacenter synchronization).
I will retry in a quieter environment to be able to isolate the loader behavior.
These are the steps done to initialize the new cluster [1]:
- add a cassandra-rackdc.properties file on each server with the corresponding DC
gros-50:~$ cat /etc/cassandra/cassandra-rackdc.properties
dc=datacenter2
rack=rack1
- change the value of the endpoint_snitch property from SimpleSnitch to GossipingPropertyFileSnitch [2].
The recommended value for production is GossipingPropertyFileSnitch, so it should have been set to this from the beginning.
- configure the disk_optimization_strategy to ssd on the new datacenter
- update the seed_provider to have one node on each datacenter
- restart the datacenter1 nodes to apply the new configuration
- start the datacenter2 nodes one by one, waiting until the status of each node is UN (Up and Normal) before starting the next one (they can stay in the UJ (joining) state for a couple of minutes)
- when done, update the swh keyspace to declare the replication strategy of the second DC
ALTER KEYSPACE swh WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3, 'datacenter2': 3};
The replication of new changes starts at this point, but the existing table contents still need to be copied.
- rebuild the cluster content:
vsellier@fnancy:~/cassandra$ seq 0 9 | parallel -t ssh gros-5{} nodetool rebuild -ks swh -- datacenter1
The progress can be monitored with the nodetool command:
gros-50:~$ nodetool netstats
Mode: NORMAL
Rebuild e5e64920-0644-11ec-92a6-31a241f39914
    /172.16.97.4
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57339885570 bytes total (38.76%)
            swh/release-4 1082347/1082347 bytes (100%) received from idx:0/172.16.97.4
            swh/content_by_blake2s256-2 3729362955/3729362955 bytes (100%) received from idx:0/172.16.97.4
            swh/release-3 224510803/224510803 bytes (100%) received from idx:0/172.16.97.4
            swh/content_by_blake2s256-1 240283216/240283216 bytes (100%) received from idx:0/172.16.97.4
            swh/content_by_blake2s256-4 29491504/29491504 bytes (100%) received from idx:0/172.16.97.4
            swh/release-2 6409474/6409474 bytes (100%) received from idx:0/172.16.97.4
            ...
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0             23         0
Small messages                  n/a         3      132753939         0
Gossip messages                 n/a         0          43915         0
or to filter only running transfers:
gros-50:~$ nodetool netstats | grep -v 100%
Mode: NORMAL
Rebuild e5e64920-0644-11ec-92a6-31a241f39914
    /172.16.97.4
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57557961160 bytes total (38.91%)
            swh/directory_entry-7 4819168032/4925484261 bytes (97%) received from idx:0/172.16.97.4
    /172.16.97.2
        Receiving 202 files, 111435975646 bytes total. Already received 139 files (68.81%), 60583670773 bytes total (54.37%)
            swh/directory_entry-12 1631210003/2906113367 bytes (56%) received from idx:0/172.16.97.2
    /172.16.97.6
        Receiving 236 files, 186694443984 bytes total. Already received 142 files (60.17%), 58869656747 bytes total (31.53%)
            swh/snapshot_branch-10 4449235102/7845572885 bytes (56%) received from idx:0/172.16.97.6
    /172.16.97.5
        Receiving 221 files, 143384473640 bytes total. Already received 132 files (59.73%), 58300913015 bytes total (40.66%)
            swh/directory_entry-4 982247023/3492851311 bytes (28%) received from idx:0/172.16.97.5
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0             23         0
Small messages                  n/a         2      135087921         0
Gossip messages                 n/a         0          44176         0
The second cassandra cluster is finally up and synchronizing with the first one. The rebuild should be done by the end of the day or tomorrow.
The backfill is also done for production.
It took less than 4h30.
... 2021-08-25T19:25:25 INFO swh.storage.backfill Processing extid range 700000 to 700001
LGTM
Aug 25 2021
It was much faster than expected in staging. The backfilling is already done:
- on production:
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid	PartitionCount: 256	ReplicationFactor: 2	Configs: cleanup.policy=compact,max.message.bytes=104857600
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter --add-config 'cleanup.policy=[compact,delete],retention.ms=86400000' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
In the kafka logs:
...
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Found deletable segments with base offsets [0] due to retention time 86400000ms breach (kafka.log.Log)
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Scheduling segments for deletion LogSegment(baseOffset=0, size=2720767, lastModifiedTime=1629815520833, largestTime=1629815520702) (kafka.log.Log)
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Incremented log start offset to 20623 due to segment deletion (kafka.log.Log)
...
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter --delete-config 'cleanup.policy' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter --delete-config 'retention.ms' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter --add-config 'cleanup.policy=compact' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid	PartitionCount: 256	ReplicationFactor: 2	Configs: cleanup.policy=compact,max.message.bytes=104857600
- the retention policy was restored to compact on staging:
vsellier@journal0 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server journal0.internal.staging.swh.network:9092 --alter --delete-config 'cleanup.policy' --entity-type=topics --entity-name swh.journal.objects.extid
% /opt/kafka/bin/kafka-configs.sh --bootstrap-server journal0.internal.staging.swh.network:9092 --alter --add-config 'cleanup.policy=compact' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
% /opt/kafka/bin/kafka-topics.sh --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid	PartitionCount: 64	ReplicationFactor: 1	Configs: cleanup.policy=compact,max.message.bytes=104857600,min.cleanable.dirty.ratio=0.01
status.io incident closed
Save code now requests rescheduled:
swh-web=> select * from save_origin_request where loading_task_status='scheduled' limit 100;
... <output lost due to the psql pager :( > ...
softwareheritage-scheduler=> select * from task where id in (398244739, 398244740, 398244742, 398244744, 398244745, 398244748, 398095676, 397470401, 397470402, 397470404, 397470399);
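The statement used for the actual rescheduling is not preserved here; assuming the standard swh-scheduler task table (status and next_run columns), a hypothetical sketch could be:
softwareheritage-scheduler=> -- hypothetical sketch, not the statement that was actually run
softwareheritage-scheduler=> update task set status = 'next_run_not_scheduled', next_run = now() where id in (398244739, 398244740, 398244742, 398244744, 398244745, 398244748, 398095676, 397470401, 397470402, 397470404, 397470399);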
A few minutes later:
swh-web=> select * from save_origin_request where loading_task_status='scheduled' limit 100;
 id | request_date | visit_type | origin_url | status | loading_task_id | visit_date | loading_task_status | visit_status | user_ids
----+--------------+------------+------------+--------+-----------------+------------+---------------------+--------------+----------
(0 rows)
- all the workers are restarted
- Several save code now requests look stuck in the scheduled status; currently looking into how to unblock them
D6130 landed and applied on the kafka nodes one at a time
ok roger that :).
I will increase it to 524288 in the diff
All the loaders are restarted on worker01 and worker02; the cluster seems ok.
The open file limit was manually increased to stabilize the cluster:
# puppet agent --disable T3501
# diff -U3 /tmp/kafka.service kafka.service
--- /tmp/kafka.service	2021-08-25 07:32:28.068928972 +0000
+++ kafka.service	2021-08-25 07:32:31.384955246 +0000
@@ -15,7 +15,7 @@
 Environment='LOG_DIR=/var/log/kafka'
 Type=simple
 ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
-LimitNOFILE=65536
+LimitNOFILE=131072
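A daemon-reload and a broker restart are then needed for the new limit to take effect; one way to double-check the applied value (a sketch, not commands taken from this log):
systemctl daemon-reload
systemctl restart kafka            # done one broker at a time
grep "Max open files" /proc/$(systemctl show -p MainPID --value kafka)/limits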
- Incident created on status.io
- Loaders disabled:
root@pergamon:~# clush -b -w @swh-workers 'puppet agent --disable "Kafka incident T3501"; systemctl stop cron; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@loader_*; do systemctl disable $unit; done; systemctl stop "swh-worker@loader_*"'
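The corresponding re-enable step (the loaders were restarted later, as noted above) is not captured in this log; a plausible sketch would be to re-enable puppet and cron and let puppet restore the loader units on its next run:
root@pergamon:~# clush -b -w @swh-workers 'puppet agent --enable; systemctl start cron; puppet agent --test'  # hypothetical: puppet re-enables and restarts the swh-worker@loader_* units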
Aug 24 2021
Some live data from a git loader with a batch size of 1000 for each object type (with D6118 applied):
"object type";"input count";"missing_id duration (s)";"_missing_id count","_add duration(s)" content;1000;0.4928;999;35.3384 content;1000;0.4095;1000;34.1440 content;1000;0.4374;998;35.6249 content;492;0.2960;488;16.7028 directory;1000;0.3978;999;71.2518 directory;1000;0.4484;1000;39.6845 directory;1000;0.4356;1000;54.0077 directory;1000;0.3833;1000;36.1437 directory;1000;0.4319;1000;30.5690 directory;402;0.1718;402;19.2335 revision;1000;0.8671;1000;10.3417 revision;575;0.4639;575;4.0819
The performance is now ok for the read part with a batch size of 1000 for content, directory and revision.
An alert was sent by email on 2021-05-22 at 05:30 AM, so the monitoring correctly detected the issue ;):
This message was generated by the smartd daemon running on:
On hypervisor3 and branly:
- A new LVM volume was created and mounted on /var/lib/vz (40G on hypervisor3, 100G on branly); see the sketch after this list
- the local storage type was activated on Proxmox via the UI (Datacenter / Storage / local, check enable)
- the pushkin and glytotek disks were moved via the UI to the local storage (<vm> / Hardware, click on the disk / Move disk button / target storage 'local')
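As a reference only, the volume creation on one hypervisor could have looked roughly like this (the volume group name vg0 is a made-up example; the size shown is the hypervisor3 one):
# hypothetical sketch -- the actual VG name and options are not in this log
lvcreate -L 40G -n vz vg0            # 100G on branly
mkfs.ext4 /dev/vg0/vz
echo '/dev/vg0/vz /var/lib/vz ext4 defaults 0 2' >> /etc/fstab
mount /var/lib/vz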
LGTM (double checked with @olasd ;) )
Aug 23 2021
It seems the problem is no longer present now (tested with several origins)
root@parasilo-19:~/swh-environment/docker# docker exec -ti docker_swh-loader_1 bash
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
swh@8e68948366b7:/$ swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'uneventful'}
The origin_visit topic was replayed with your diff during the weekend. Let's now test whether the worker behavior is more deterministic.
Aug 20 2021
Aug 19 2021
The gros cluster at Nancy[1] has a lot of nodes (124) with small reservable SSDs of 960 GB. This makes it a good candidate for the second cluster. It will also allow checking the performance with data (and commit logs) on SSDs.
Based on the main cluster, a minimum of 8 nodes is necessary to handle the volume of data (7.3 TB and growing: 7.3 TB / 0.96 TB per node ≈ 7.6, rounded up to 8). Starting with 10 nodes will leave some headroom.
It seems some more precise information can be logged by activating the full query log, without a big performance impact: https://cassandra.apache.org/doc/latest/cassandra/new/fqllogging.html
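Enabling it is a runtime operation through nodetool; a sketch of what it could look like on one node (the log directory is an example, not a path from this log):
# enable the full query log on this node, check its state, and disable it
# again once enough queries have been captured
nodetool enablefullquerylog --path /srv/cassandra/fql
nodetool getfullquerylog
nodetool disablefullquerylog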
Should be fixed by T3482