
Sep 10 2021

vsellier added a comment to D6139: cassandra: Add option to select (hopefully) more efficient batch insertion algos.

Final tests with the last version: everything looks good, with almost the same performance and a better ingestion rate in batch mode.
4 nodes before (batch only):

Sep 10 2021, 8:31 AM

Sep 9 2021

vsellier updated the diff for D6227: Adapt the debian security repository release for bullseye distribution.

ensure it works with stretch and versions >= bullseye

Sep 9 2021, 5:06 PM
vsellier requested review of D6227: Adapt the debian security repository release for bullseye distribution.
Sep 9 2021, 3:17 PM
vsellier requested review of D6226: Prepare the debian 11 vagrant template.
Sep 9 2021, 3:13 PM
vsellier added a comment to D6139: cassandra: Add option to select (hopefully) more efficient batch insertion algos.

Thanks for the last fix, it looks better with a smaller batch size:
5 nodes:


The ingestion rate is ~7500 ops/s in batch mode, compared to ~6500 before.

Sep 9 2021, 9:13 AM
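Expressed as a relative gain, the figures above work out as follows (simple arithmetic on the quoted rates):

```python
# Approximate ingestion rates quoted in the comment above
before_ops, after_ops = 6500, 7500
gain = (after_ops - before_ops) / before_ops
print(f"batch mode is ~{gain:.0%} faster")  # ~15%
```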

Sep 8 2021

vsellier closed T3040: [production] Enable swh-search's journal-client for indexed objects, a subtask of T2590: Finish the indexer -> swh-search pipeline, as Resolved.
Sep 8 2021, 3:24 PM · Journal, Archive search
vsellier closed T3040: [production] Enable swh-search's journal-client for indexed objects as Resolved.

metadata searches are now done in Elasticsearch since the deployment of T3433

Sep 8 2021, 3:24 PM · System administration, Journal, Archive search
vsellier renamed T3433: Deploy swh.search v0.10/v0.11 from Deploy swh.search v0.10/v0.11 on staging to Deploy swh.search v0.10/v0.11.
Sep 8 2021, 3:21 PM · System administration, Archive search
vsellier closed T3433: Deploy swh.search v0.10/v0.11 as Resolved.

Everything is deployed and looks functional.

Sep 8 2021, 3:21 PM · System administration, Archive search
vsellier closed D6206: webapp: support new metadata search backend configuation.
Sep 8 2021, 2:29 PM
vsellier committed rSPSITEd19dc2f55c01: webapp: support new metadata search backend configuation (authored by vsellier).
webapp: support new metadata search backend configuation
Sep 8 2021, 2:29 PM
vsellier accepted D6199: Install graph services as-is.

LGTM

Sep 8 2021, 2:22 PM
vsellier accepted D6200: Add icinga checks around the graph service.

LGTM

Sep 8 2021, 2:18 PM
vsellier added a comment to D6139: cassandra: Add option to select (hopefully) more efficient batch insertion algos.

According to the documentation of the Cassandra concurrent API [1], it seems the concurrency can be specified as an argument of the execute_concurrent_with_args method. The default is 100, but it could be interesting to test higher or lower values.

Sep 8 2021, 10:27 AM
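The three strategies compared in this diff (one-by-one, concurrent, batch) differ mainly in how rows are grouped before being sent to the cluster. A minimal, driver-agnostic sketch of that grouping step — the `chunk` helper is hypothetical, not part of swh.storage or cassandra-driver:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunk(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group an iterable into lists of at most `size` elements,
    e.g. to build BATCH statements or to bound the number of
    parameter sets handed to a concurrent-execution helper."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# one-by-one is effectively chunk(rows, 1);
# batch/concurrent modes group more rows per round-trip.
batches = list(chunk(range(7), 3))
```

Tuning the group size here plays the same role as the batch size and concurrency values discussed in the benchmark comments.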
vsellier added a comment to D6139: cassandra: Add option to select (hopefully) more efficient batch insertion algos.

Here are more results with different numbers of replayers.
Each line represents a server with 20 directory replayers; the ranges are for one-by-one, concurrent, and batch.

  • 1 node

  • 2 nodes
Sep 8 2021, 10:11 AM

Sep 7 2021

vsellier requested review of D6206: webapp: support new metadata search backend configuation.
Sep 7 2021, 4:08 PM
vsellier added a revision to T3433: Deploy swh.search v0.10/v0.11: D6206: webapp: support new metadata search backend configuation.
Sep 7 2021, 4:08 PM · System administration, Archive search
vsellier accepted D6203: Retry on concurrent conflicting updates.

LGTM thanks

Sep 7 2021, 3:24 PM
vsellier closed D6202: explicitly name the metadata search configuration property.
Sep 7 2021, 3:06 PM
vsellier committed rDWAPPSc302b9a5e40d: explicitly name the metadata search configuration property (authored by vsellier).
explicitly name the metadata search configuration property
Sep 7 2021, 3:06 PM
vsellier requested review of D6202: explicitly name the metadata search configuration property.
Sep 7 2021, 2:57 PM
vsellier closed D6197: swh-search: use the consumer group used during the reindexation.
Sep 7 2021, 11:25 AM
vsellier committed rSPSITE6efa928ca146: swh-search: use the consumer group used during the reindexation (authored by vsellier).
swh-search: use the consumer group used during the reindexation
Sep 7 2021, 11:25 AM
vsellier added a revision to T3433: Deploy swh.search v0.10/v0.11: D6197: swh-search: use the consumer group used during the reindexation.
Sep 7 2021, 11:22 AM · System administration, Archive search
vsellier requested review of D6197: swh-search: use the consumer group used during the reindexation.
Sep 7 2021, 11:22 AM
vsellier closed D6183: swh-search: activate metadata search all ES on the main webapp.
Sep 7 2021, 11:02 AM
vsellier committed rSPSITE377c1fa75a27: swh-search: activate metadata search all ES on the main webapp (authored by vsellier).
swh-search: activate metadata search all ES on the main webapp
Sep 7 2021, 11:02 AM
vsellier closed D6182: swh-search: update the configuration for the deployment of v0.11.4.
Sep 7 2021, 11:02 AM
vsellier committed rSPSITE2f4076496bbd: swh-search: update the configuration for the deployment of v0.11.4 (authored by vsellier).
swh-search: update the configuration for the deployment of v0.11.4
Sep 7 2021, 11:02 AM
vsellier edited P1155 Log buffer stats.
Sep 7 2021, 10:10 AM

Sep 6 2021

vsellier triaged T3562: [swh-search] Document version conflict during parallel indexation as Normal priority.
Sep 6 2021, 2:52 PM · Archive search

Sep 3 2021

vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

With the new concurrent replay of the directories, the disk usage grows rapidly:

Sep 3 2021, 5:15 PM · System administration, Storage manager
vsellier added a comment to D6139: cassandra: Add option to select (hopefully) more efficient batch insertion algos.

Some feedback: I had to delay the benchmarks because the servers were almost full and the cluster needed to be expanded to 7 nodes. The cluster is in its stabilization phase (rebuild/repair of the new node and cleanup of the old ones).
When that is done, I will be able to finalize the tests, hopefully at the beginning of next week.

Sep 3 2021, 4:51 PM
vsellier added a comment to T3433: Deploy swh.search v0.10/v0.11.

production deployment:

  • disable puppet
  • stop and disable the journal clients and the search backend
  • update the swh-search configuration to change the index name to origin-v0.11
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/server.yml server.yml
--- /tmp/server.yml	2021-09-03 14:06:07.896137122 +0000
+++ server.yml	2021-09-03 14:05:47.072081879 +0000
@@ -10,7 +10,7 @@
     port: 9200
   indexes:
     origin:
-      index: origin-production
+      index: origin-v0.11
       read_alias: origin-read
       write_alias: origin-write
  • update the journal-clients to use a group id swh.search.journal_client.[indexed|object]-v0.11
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_objects.yml journal_client_objects.yml 
--- /tmp/journal_client_objects.yml	2021-09-03 14:06:52.660255797 +0000
+++ journal_client_objects.yml	2021-09-03 14:07:10.684303568 +0000
@@ -8,7 +8,7 @@
   - kafka2.internal.softwareheritage.org
   - kafka3.internal.softwareheritage.org
   - kafka4.internal.softwareheritage.org
-  group_id: swh.search.journal_client
+  group_id: swh.search.journal_client-v0.11
   prefix: swh.journal.objects
   object_types:
   - origin
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_indexed.yml journal_client_indexed.yml 
--- /tmp/journal_client_indexed.yml	2021-09-03 14:06:52.660255797 +0000
+++ journal_client_indexed.yml	2021-09-03 14:07:25.760343512 +0000
@@ -8,7 +8,7 @@
   - kafka2.internal.softwareheritage.org
   - kafka3.internal.softwareheritage.org
   - kafka4.internal.softwareheritage.org
-  group_id: swh.search.journal_client.indexed
+  group_id: swh.search.journal_client.indexed-v0.11
   prefix: swh.journal.indexed
   object_types:
   - origin_intrinsic_metadata
  • perform a system upgrade
root@search1:/etc/softwareheritage/search# apt dist-upgrade -V
...
The following NEW packages will be installed:
   python3-tree-sitter (0.19.0-1+swh1~bpo10+1)
The following packages will be upgraded:
   libnss-systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libpam-systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libsystemd0 (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libudev1 (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   python3-swh.core (0.14.3-1~swh1~bpo10+1 => 0.14.5-1~swh1~bpo10+1)
   python3-swh.model (2.6.1-1~swh1~bpo10+1 => 2.8.0-1~swh1~bpo10+1)
   python3-swh.scheduler (0.15.0-1~swh1~bpo10+1 => 0.18.0-1~swh1~bpo10+1)
   python3-swh.search (0.9.0-1~swh1~bpo10+1 => 0.11.4-2~swh1~bpo10+1)
   python3-swh.storage (0.30.1-1~swh1~bpo10+1 => 0.36.0-1~swh1~bpo10+1)
   systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   systemd-sysv (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   systemd-timesyncd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   udev (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
13 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
...

There is no need to reboot

  • enable and restart the swh-search backend
  • check the new index creation
root@search1:/etc/softwareheritage/search# curl ${ES_SERVER}/_cat/indices\?v
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-v0.11      XOUR_jKcTtWKjlPk_8EAlA  90   1          0            0     34.3kb         18.2kb
green  open   origin-v0.9.0     TH9xlECuS4CcJTDw0Fqieg  90   1  175001478     36494554      293gb        146.9gb
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1  176722078     56232582      311gb        155.1gb
  • update the write index alias
root@search1:~/T3433# ./update-write-alias.sh 
{"acknowledged":true}{"acknowledged":true}root@search1:~/T3433# 
root@search1:~/T3433# curl ${ES_SERVER}/_cat/aliases\?v
alias               index             filter routing.index routing.search is_write_index
origin-write        origin-v0.11      -      -             -              -
origin-read-v0.9.0  origin-v0.9.0     -      -             -              -
origin-v0.9.0-read  origin-v0.9.0     -      -             -              -
origin-v0.9.0-write origin-v0.9.0     -      -             -              -
origin-write-v0.9.0 origin-v0.9.0     -      -             -              -
origin-read         origin-production -      -             -              -

All the v0.9.0 indexes and aliases will be removed once the migration to v0.11 is done.

  • restart the journal clients
root@search1:~# systemctl enable swh-search-journal-client@objects
Created symlink /etc/systemd/system/multi-user.target.wants/swh-search-journal-client@objects.service → /etc/systemd/system/swh-search-journal-client@.service.
root@search1:~# systemctl enable swh-search-journal-client@indexed
Created symlink /etc/systemd/system/multi-user.target.wants/swh-search-journal-client@indexed.service → /etc/systemd/system/swh-search-journal-client@.service.
root@search1:~# systemctl start swh-search-journal-client@objects
root@search1:~# systemctl start swh-search-journal-client@indexed
  • wait for the lag to recover, create additional journal clients if necessary
  • update the read index alias
  • land D6182, D6183, D6197
  • Update swh-web configuration to support the new way to configure the metadata search backend (D6202)
  • deploy them on webapp1 and moma
Sep 3 2021, 4:03 PM · System administration, Archive search
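The write-alias switch in the steps above can be done in a single atomic `_aliases` call on the Elasticsearch side. A sketch of the request body a script like `update-write-alias.sh` would send — the helper name is hypothetical, and the index names are taken from the listing above:

```python
import json

def write_alias_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases: doing the remove and the add
    in one request makes the alias switch atomic."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = write_alias_actions("origin-write", "origin-production", "origin-v0.11")
payload = json.dumps(body)  # e.g. to pass to curl -d via a heredoc
```

The same payload shape, with `origin-read`, applies to the later read-alias update.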
vsellier updated the summary of D6183: swh-search: activate metadata search all ES on the main webapp.
Sep 3 2021, 3:46 PM
vsellier requested review of D6183: swh-search: activate metadata search all ES on the main webapp.
Sep 3 2021, 3:45 PM
vsellier added a revision to T3040: [production] Enable swh-search's journal-client for indexed objects: D6183: swh-search: activate metadata search all ES on the main webapp.
Sep 3 2021, 3:45 PM · System administration, Journal, Archive search
vsellier requested review of D6182: swh-search: update the configuration for the deployment of v0.11.4.
Sep 3 2021, 3:44 PM
vsellier added a revision to T3433: Deploy swh.search v0.10/v0.11: D6182: swh-search: update the configuration for the deployment of v0.11.4.
Sep 3 2021, 3:44 PM · System administration, Archive search
vsellier accepted D6178: Update netbox to 2.11.12.

thanks
LGTM

Sep 3 2021, 11:59 AM
vsellier added a comment to T3433: Deploy swh.search v0.10/v0.11.
  • puppet configuration deployed in staging
  • read index updated with this script:
#!/bin/bash
Sep 3 2021, 9:57 AM · System administration, Archive search
vsellier closed D6176: swh-search: deploy v0.11.4 in staging.
Sep 3 2021, 9:47 AM
vsellier committed rSPSITEf8bd91737496: swh-search: deploy v0.11.4 in staging (authored by vsellier).
swh-search: deploy v0.11.4 in staging
Sep 3 2021, 9:47 AM
vsellier updated the test plan for D6176: swh-search: deploy v0.11.4 in staging.
Sep 3 2021, 9:45 AM
vsellier requested review of D6176: swh-search: deploy v0.11.4 in staging.
Sep 3 2021, 8:42 AM
vsellier added a revision to T3433: Deploy swh.search v0.10/v0.11: D6176: swh-search: deploy v0.11.4 in staging.
Sep 3 2021, 8:42 AM · System administration, Archive search
vsellier added a comment to T3433: Deploy swh.search v0.10/v0.11.

The lag recovered in ~12 hours.
The content of the index looks good (I just spot-checked a couple of origins).

Sep 3 2021, 8:34 AM · System administration, Archive search

Sep 1 2021

vsellier added a comment to T3433: Deploy swh.search v0.10/v0.11.
  • package python3-swh.search upgraded to version 0.11.4-2, the problem is fixed
  • the new index is correctly created:
root@search0:/# curl -s http://search-esnode0:9200/_cat/indices\?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-v0.11                HljzsdD9SmKI7-8ekB_q3Q  80   0          0            0      4.2kb          4.2kb
green  close  origin                      HthJj42xT5uO7w3Aoxzppw  80   0                                                  
green  close  origin-v0.9.0               o7FiYJWnTkOViKiAdCXCuA  80   0                                                  
green  open   origin-v0.10.0              -fvf4hK9QDeN8qYTJBBlxQ  80   0    1981623       559384      2.3gb          2.3gb
green  close  origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0                                                  
green  close  origin-v0.5.0               SGplSaqPR_O9cPYU4ZsmdQ  80   0
  • journal clients enabled and restarted
  • the journal clients' lag should recover in less than 12h
  • waiting some time to estimate the duration with only one journal client per type
Sep 1 2021, 5:46 PM · System administration, Archive search
vsellier added a comment to T3433: Deploy swh.search v0.10/v0.11.

The problem was fixed by rDSEA68347a5604c74150197f691593cbb05bdd34396f
thanks @olasd

Sep 1 2021, 5:22 PM · System administration, Archive search
vsellier added a comment to T3433: Deploy swh.search v0.10/v0.11.

Deployment of version v0.11.4 in staging:
On search0:

  • puppet stopped
  • stop and disable the journal clients and search backend
  • update the swh-search configuration to use a origin-v0.11 index
root@search0:/etc/softwareheritage/search# diff -U2 /tmp/server.yml server.yml 
--- /tmp/server.yml	2021-09-01 13:42:29.347951302 +0000
+++ server.yml	2021-09-01 13:42:35.739953523 +0000
@@ -7,5 +7,5 @@
   indexes:
     origin:
-      index: origin-v0.10.0
+      index: origin-v0.11
       read_alias: origin-read
       write_alias: origin-write
  • update the journal-clients to use a group id swh.search.journal_client.[indexed|object]-v0.11
root@search0:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_objects.yml journal_client_objects.yml 
--- /tmp/journal_client_objects.yml	2021-09-01 13:44:49.843999978 +0000
+++ journal_client_objects.yml	2021-09-01 13:45:03.972004852 +0000
@@ -5,7 +5,7 @@
 journal:
   brokers:
   - journal0.internal.staging.swh.network
-  group_id: swh.search.journal_client-v0.10.0
+  group_id: swh.search.journal_client-v0.11
   prefix: swh.journal.objects
   object_types:
   - origin
root@search0:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_indexed.yml journal_client_indexed.yml 
--- /tmp/journal_client_indexed.yml	2021-09-01 13:44:44.847998252 +0000
+++ journal_client_indexed.yml	2021-09-01 13:44:57.020002454 +0000
@@ -5,7 +5,7 @@
 journal:
   brokers:
   - journal0.internal.staging.swh.network
-  group_id: swh.search.journal_client.indexed-v0.10.0
+  group_id: swh.search.journal_client.indexed-v0.11
   prefix: swh.journal.indexed
   object_types:
   - origin_intrinsic_metadata
  • perform a system upgrade, a reboot was not required
  • enable and start swh-search backend
  • An error occurred after the restart:
Sep 01 14:19:12 search0 python3[4066688]: 2021-09-01 14:19:12 [4066688] root:ERROR command 'cc' failed with exit status 1
                                          Traceback (most recent call last):
                                            File "/usr/lib/python3.7/distutils/unixccompiler.py", line 118, in _compile
                                              extra_postargs)
                                            File "/usr/lib/python3.7/distutils/ccompiler.py", line 909, in spawn
                                              spawn(cmd, dry_run=self.dry_run)
                                            File "/usr/lib/python3.7/distutils/spawn.py", line 36, in spawn
                                              _spawn_posix(cmd, search_path, dry_run=dry_run)
                                            File "/usr/lib/python3.7/distutils/spawn.py", line 159, in _spawn_posix
                                              % (cmd, exit_status))
                                          distutils.errors.DistutilsExecError: command 'cc' failed with exit status 1
Sep 1 2021, 5:15 PM · System administration, Archive search
vsellier closed T3484: Fix the release builds for swh-search, a subtask of T3433: Deploy swh.search v0.10/v0.11, as Resolved.
Sep 1 2021, 3:21 PM · System administration, Archive search
vsellier closed T3484: Fix the release builds for swh-search as Resolved.

The build is now fixed and version v0.11.4 is ready to be deployed to the environments.

Sep 1 2021, 3:21 PM · System administration, Archive search
vsellier added a comment to D6139: cassandra: Add option to select (hopefully) more efficient batch insertion algos.

Test with 10 replayers and the 3 kinds of algorithms:

  • first interval: one-by-one
  • second interval: concurrent
  • third interval: batch:
Sep 1 2021, 11:37 AM
vsellier committed rDSNIP30b06ccd0294: grid5000/cassandra: fix statsd configuration of gunicorn services (authored by vsellier).
grid5000/cassandra: fix statsd configuration of gunicorn services
Sep 1 2021, 10:33 AM
vsellier accepted D6166: swh-scheduler-journal-client: Delay the restart of failing service.

LGTM

Sep 1 2021, 9:37 AM
vsellier accepted D6163: Ensure icinga alerts are raised if scheduler journal client service is down.

LGTM

Sep 1 2021, 9:33 AM

Aug 31 2021

vsellier closed T3539: snapshot/metadata inversion in origin_visit_status_get_random as Resolved.
Aug 31 2021, 9:19 AM · Storage manager
vsellier closed D6161: postgresql: Fix a column order mismatch between the query and object builder.
Aug 31 2021, 9:18 AM
vsellier committed rDSTO3ad1bec113e8: postgresql: Fix a column order mismatch between the query and object builder (authored by vsellier).
postgresql: Fix a column order mismatch between the query and object builder
Aug 31 2021, 9:18 AM

Aug 30 2021

vsellier closed T3517: [cassandra] decorate the method calls to have statsd metrics , a subtask of T3357: Perform some tests of the cassandra storage on Grid5000, as Resolved.
Aug 30 2021, 6:11 PM · System administration, Storage manager
vsellier closed T3517: [cassandra] decorate the method calls to have statsd metrics as Resolved.
Aug 30 2021, 6:11 PM · System administration, Storage manager
vsellier updated the diff for D6161: postgresql: Fix a column order mismatch between the query and object builder.

rebase

Aug 30 2021, 5:40 PM
vsellier closed D6162: cassandra: generate statsd metrics on method calls.
Aug 30 2021, 5:39 PM
vsellier committed rDSTO999ea6bbd773: cassandra: generate statsd metrics on method calls (authored by vsellier).
cassandra: generate statsd metrics on method calls
Aug 30 2021, 5:39 PM
vsellier added a revision to T3517: [cassandra] decorate the method calls to have statsd metrics : D6162: cassandra: generate statsd metrics on method calls.
Aug 30 2021, 5:28 PM · System administration, Storage manager
vsellier updated the diff for D6161: postgresql: Fix a column order mismatch between the query and object builder.

Add a failure without the correction in the tests

Aug 30 2021, 5:21 PM
vsellier requested review of D6161: postgresql: Fix a column order mismatch between the query and object builder.
Aug 30 2021, 5:12 PM
vsellier added a revision to T3539: snapshot/metadata inversion in origin_visit_status_get_random: D6161: postgresql: Fix a column order mismatch between the query and object builder.
Aug 30 2021, 5:06 PM · Storage manager
vsellier changed the status of T3539: snapshot/metadata inversion in origin_visit_status_get_random from Open to Work in Progress.
Aug 30 2021, 5:01 PM · Storage manager
vsellier committed rDSNIP4e34b320ab69: grid5000/cassandra: add the reservation date on the environment config file (authored by vsellier).
grid5000/cassandra: add the reservation date on the environment config file
Aug 30 2021, 12:50 PM
vsellier committed rDSNIP0594610e87b2: grid5000/cassandra: Add a missing filter on the cluster on the cassandra… (authored by vsellier).
grid5000/cassandra: Add a missing filter on the cluster on the cassandra…
Aug 30 2021, 12:50 PM

Aug 27 2021

vsellier added a comment to T3465: Test multidatacenter replication.

New cluster state after all the reservations are up:

vsellier@gros-50:~$  nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.97.3   1.4 TiB     256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6   1.4 TiB     256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5   1.39 TiB    256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4   1.4 TiB     256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2   1.4 TiB     256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1
Aug 27 2021, 7:35 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.
  • cassandra stopped
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl stop cassandra
  • data cleaned
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} "rm -rf /srv/cassandra/*"
  • Cassandra restarted
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl start cassandra
Aug 27 2021, 6:43 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

Well, after reflection, it will probably be faster to recreate the second DC from scratch now that the configuration is ready.

Aug 27 2021, 6:35 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

5 nodes were added to the cluster:

  • configuration pushed to g5k, disks reserved for 14 days on the new servers, and a new reservation launched with the new nodes
  • the nodes were started one by one, waiting for each to reach UN status in the nodetool status output before starting the next
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
DN  172.16.97.3   ?           256     0.0%              a3ae5fa2-c063-4890-87f1-bddfcf293bde  r1
DN  172.16.97.6   ?           256     0.0%              bfe360f1-8fd2-4f4b-a070-8f267eda1e12  r1
DN  172.16.97.5   ?           256     0.0%              478c36f8-5220-4db7-b5c2-f3876c0c264a  r1
DN  172.16.97.4   ?           256     0.0%              b3105348-66b0-4f82-a5bf-31ef28097a41  r1
DN  172.16.97.2   ?           256     0.0%              de866efd-064c-4e27-965c-f5112393dc8f  r1
Aug 27 2021, 6:30 PM · System administration, Storage manager
vsellier committed rDSNIP7a754f65a4f5: grid5000/cassandra: Add more nodes on the second datacenter (authored by vsellier).
grid5000/cassandra: Add more nodes on the second datacenter
Aug 27 2021, 6:16 PM
vsellier added a comment to T3465: Test multidatacenter replication.

10 nodes are not enough; I am adding 5 more nodes to reduce the volume per node a little.

Aug 27 2021, 5:24 PM · System administration, Storage manager
vsellier changed the status of T3517: [cassandra] decorate the method calls to have statsd metrics from Open to Work in Progress.
Aug 27 2021, 4:48 PM · System administration, Storage manager
vsellier added a comment to D6139: cassandra: Add option to select (hopefully) more efficient batch insertion algos.

thanks. I will test that once the monitoring is updated to use the statsd statistics instead of the object_count table content.

Aug 27 2021, 2:55 PM
vsellier committed rDSNIP237c76e754a8: grid5000/cassandra: fix zfs configuration when only one dataset is used (authored by vsellier).
grid5000/cassandra: fix zfs configuration when only one dataset is used
Aug 27 2021, 2:26 PM
vsellier committed rDSNIP79906e54998c: grid5000/cassandra: fix besteffort nodes deployment (authored by vsellier).
grid5000/cassandra: fix besteffort nodes deployment
Aug 27 2021, 2:26 PM
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The lz4 compression was already activated by default. Changing the algorithm to zstd on the snapshot table was not really significant (initially with lz4: 7 GB; with zstd: 12 GB; back to lz4: 9 GB :) )

Aug 27 2021, 12:10 PM · System administration, Storage manager
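For reference, switching a table's compressor is a single schema change; a sketch of what the experiment above presumably ran (table name from the comment, class names from Cassandra's built-in compressors). Note that the new setting only applies to newly written SSTables until existing ones are rewritten (e.g. via `nodetool upgradesstables`), which may explain why the sizes did not converge immediately:

```sql
-- Switch the snapshot table to zstd...
ALTER TABLE swh.snapshot
  WITH compression = {'class': 'ZstdCompressor'};

-- ...and back to lz4 (the default).
ALTER TABLE swh.snapshot
  WITH compression = {'class': 'LZ4Compressor'};
```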
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

Interesting:

Depending on the data characteristics of the table, compressing its data can result in:

  • 25-33% reduction in data size
  • 25-35% performance improvement on reads
  • 5-10% performance improvement on writes
Aug 27 2021, 10:15 AM · System administration, Storage manager
vsellier added a comment to T3357: Perform some tests of the cassandra storage on Grid5000.

The replaying is currently stopped as the data disks are now almost full.
I will try to activate compression on some big tables to see if it helps.
I will probably need to start with small tables to recover some space before being able to compress the biggest ones.

Aug 27 2021, 10:02 AM · System administration, Storage manager

Aug 26 2021

vsellier added a comment to D6139: cassandra: Add option to select (hopefully) more efficient batch insertion algos.

The patch was tested in a loader and in the replayers.
The difference was not really significant on the loader, but I'm not really confident in those tests as the cluster had a pretty high load (running replayers + second datacenter synchronization).
I will retry in a quieter environment to be able to isolate the loader behavior.

Aug 26 2021, 7:26 PM
vsellier committed rDSNIP4a4eaea90026: grid5000/cassandra Adapt the script to support a multidc deployment (authored by vsellier).
grid5000/cassandra Adapt the script to support a multidc deployment
Aug 26 2021, 12:46 PM
vsellier added a comment to T3465: Test multidatacenter replication.

These are the steps taken to initialize the new cluster [1]:

  • add a cassandra-rackdc.properties file on each server with the corresponding DC
gros-50:~$ cat /etc/cassandra/cassandra-rackdc.properties 
dc=datacenter2
rack=rack1
  • change the value of the endpoint_snitch property from SimpleSnitch to GossipingPropertyFileSnitch [2].

The recommended value for production is GossipingPropertyFileSnitch, so it should have been set this way from the beginning.

  • configure the disk_optimization_strategy to ssd on the new datacenter
  • update the seed_provider to have one node on each datacenter
  • restart the datacenter1 nodes to apply the new configuration
  • start the datacenter2 nodes one by one; wait until the status of each node is UN (Up and Normal) before starting another one (they can stay in the UJ (Joining) state for a couple of minutes)
  • when done, update the swh keyspace to declare the replication strategy of the second DC
ALTER KEYSPACE swh WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3, 'datacenter2': 3};

The replication of new changes starts here, but the existing table contents still need to be copied.

  • rebuild the cluster content:
vsellier@fnancy:~/cassandra$ seq 0 9 | parallel -t ssh gros-5{} nodetool rebuild -ks swh -- datacenter1

The progression can be monitored with nodetool command:

gros-50:~$ nodetool netstats                                                                 
Mode: NORMAL                                                                                           
Rebuild e5e64920-0644-11ec-92a6-31a241f39914                                                            
    /172.16.97.4                                                                                                                                      
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57339885570 bytes total (38.76%)
            swh/release-4 1082347/1082347 bytes (100%) received from idx:0/172.16.97.4                                                                           
            swh/content_by_blake2s256-2 3729362955/3729362955 bytes (100%) received from idx:0/172.16.97.4
            swh/release-3 224510803/224510803 bytes (100%) received from idx:0/172.16.97.4                
            swh/content_by_blake2s256-1 240283216/240283216 bytes (100%) received from idx:0/172.16.97.4
            swh/content_by_blake2s256-4 29491504/29491504 bytes (100%) received from idx:0/172.16.97.4
            swh/release-2 6409474/6409474 bytes (100%) received from idx:0/172.16.97.4                
...
Read Repair Statistics:                                                                                     
Attempted: 0                                                                                          
Mismatch (Blocking): 0                                                                                
Mismatch (Background): 0                                                                            
Pool Name                    Active   Pending      Completed   Dropped                                
Large messages                  n/a         0             23         0                                
Small messages                  n/a         3      132753939         0                          
Gossip messages                 n/a         0          43915         0

or to filter only running transfers:

gros-50:~$ nodetool netstats  | grep -v 100%
Mode: NORMAL
Rebuild e5e64920-0644-11ec-92a6-31a241f39914
    /172.16.97.4
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57557961160 bytes total (38.91%)
            swh/directory_entry-7 4819168032/4925484261 bytes (97%) received from idx:0/172.16.97.4
    /172.16.97.2
        Receiving 202 files, 111435975646 bytes total. Already received 139 files (68.81%), 60583670773 bytes total (54.37%)
            swh/directory_entry-12 1631210003/2906113367 bytes (56%) received from idx:0/172.16.97.2
    /172.16.97.6
        Receiving 236 files, 186694443984 bytes total. Already received 142 files (60.17%), 58869656747 bytes total (31.53%)
            swh/snapshot_branch-10 4449235102/7845572885 bytes (56%) received from idx:0/172.16.97.6
    /172.16.97.5
        Receiving 221 files, 143384473640 bytes total. Already received 132 files (59.73%), 58300913015 bytes total (40.66%)
            swh/directory_entry-4 982247023/3492851311 bytes (28%) received from idx:0/172.16.97.5
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0             23         0
Small messages                  n/a         2      135087921         0
Gossip messages                 n/a         0          44176         0
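As a side note, the two percentages `nodetool netstats` prints per peer (files vs. bytes) can be recomputed from the "Receiving" line itself. A minimal sketch (the regex and variable names are my own, not part of nodetool) parsing one of the lines above:

```python
import re

# One "Receiving" line copied from the nodetool netstats output above.
line = ("Receiving 199 files, 147926499702 bytes total. "
        "Already received 125 files (62.81%), 57557961160 bytes total (38.91%)")

# Capture the total byte count and the bytes received so far.
m = re.search(r"Receiving \d+ files, (\d+) bytes total\. "
              r"Already received \d+ files \([\d.]+%\), (\d+) bytes total", line)
total, received = map(int, m.groups())

# Recompute the byte-progress percentage reported in parentheses.
print(f"{100 * received / total:.2f}%")  # → 38.91%
```

This matches the 38.91% nodetool reports for 172.16.97.4, confirming the second percentage is byte-based rather than file-based.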
Aug 26 2021, 12:41 PM · System administration, Storage manager
vsellier added a comment to T3465: Test multidatacenter replication.

The second cassandra cluster is finally up and synchronizing with the first one. The rebuild should be complete by the end of the day or tomorrow.

Aug 26 2021, 12:05 PM · System administration, Storage manager
vsellier closed T3485: extid topic is misconfigured in staging and production as Resolved.

The backfill is also done for production.
It took less than 4h30

...
2021-08-25T19:25:25 INFO     swh.storage.backfill Processing extid range 700000 to 700001
Aug 26 2021, 11:54 AM · System administration
vsellier accepted D6136: journal_client: Ensure queue position does not overflow.

LGTM

Aug 26 2021, 10:23 AM

Aug 25 2021

vsellier added a comment to T3485: extid topic is misconfigured in staging and production.

It was much faster than expected in staging. The backfilling is already done:

Aug 25 2021, 6:22 PM · System administration
vsellier triaged T3502: Date overflow error in scheduler journal client as High priority.
Aug 25 2021, 6:10 PM · System administration, Scheduling utilities
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.
  • on production:
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid	PartitionCount: 256	ReplicationFactor: 2	Configs: cleanup.policy=compact,max.message.bytes=104857600
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter  --add-config 'cleanup.policy=[compact,delete],retention.ms=86400000' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.

In the kafka logs:

...
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Found deletable segments with base offsets [0] due to retention time 86400000ms breach (kafka.log.Log)
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Scheduling segments for deletion LogSegment(baseOffset=0, size=2720767, lastModifiedTime=1629815520833, largestTime=1629815520702) (kafka.log.Log)
[2021-08-25 14:56:19,495] INFO [Log partition=swh.journal.objects.extid-162, dir=/srv/kafka/logdir] Incremented log start offset to 20623 due to segment deletion (kafka.log.Log)
....
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter  --delete-config 'cleanup.policy' --entity-type=topics --entity-name swh.journal.objects.extid                                                
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter  --delete-config 'retention.ms' --entity-type=topics --entity-name swh.journal.objects.extid 
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server ${SERVER} --alter  --add-config 'cleanup.policy=compact' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
vsellier@kafka1 ~ % /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"                                          
Topic: swh.journal.objects.extid	PartitionCount: 256	ReplicationFactor: 2	Configs: cleanup.policy=compact,max.message.bytes=104857600
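As a quick sanity check on the temporary `retention.ms=86400000` applied (and later removed) above, that value is exactly a 24-hour window, which is why the old segments became deletable the next day:

```python
# retention.ms value passed to kafka-configs.sh above, in milliseconds.
retention_ms = 86_400_000

# Convert to hours: ms -> s -> h.
hours = retention_ms / (1000 * 60 * 60)
print(hours)  # → 24.0
```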
Aug 25 2021, 5:18 PM · System administration
vsellier added a comment to T3485: extid topic is misconfigured in staging and production.
  • the retention policy was restored to compact on staging:
vsellier@journal0 ~ % /opt/kafka/bin/kafka-configs.sh --bootstrap-server journal0.internal.staging.swh.network:9092 --alter  --delete-config 'cleanup.policy' --entity-type=topics --entity-name swh.journal.objects.extid
% /opt/kafka/bin/kafka-configs.sh --bootstrap-server journal0.internal.staging.swh.network:9092 --alter  --add-config 'cleanup.policy=compact' --entity-type=topics --entity-name swh.journal.objects.extid
Completed updating config for topic swh.journal.objects.extid.
% /opt/kafka/bin/kafka-topics.sh  --bootstrap-server $SERVER --describe --topic swh.journal.objects.extid | grep "^Topic"
Topic: swh.journal.objects.extid	PartitionCount: 64	ReplicationFactor: 1	Configs: cleanup.policy=compact,max.message.bytes=104857600,min.cleanable.dirty.ratio=0.01
Aug 25 2021, 4:19 PM · System administration
vsellier added a comment to T3501: Too many open files error on kafka.

status.io incident closed

Aug 25 2021, 11:55 AM · Journal, System administration
vsellier added a comment to T3501: Too many open files error on kafka.

Save code now requests rescheduled:

swh-web=> select * from save_origin_request where loading_task_status='scheduled' limit 100;
...
<output lost due to the psql pager :(
...
softwareheritage-scheduler=> select * from task where id in (398244739, 398244740, 398244742, 398244744, 398244745, 398244748, 398095676, 397470401, 397470402, 397470404, 397470399);

a few minutes later:

swh-web=> select * from save_origin_request where loading_task_status='scheduled' limit 100;
 id | request_date | visit_type | origin_url | status | loading_task_id | visit_date | loading_task_status | visit_status | user_ids 
----+--------------+------------+------------+--------+-----------------+------------+---------------------+--------------+----------
(0 rows)
Aug 25 2021, 11:53 AM · Journal, System administration
vsellier added a comment to T3501: Too many open files error on kafka.
  • all the workers are restarted
  • Several save code now requests look stuck in the "scheduled" status; currently looking into how to unblock them
Aug 25 2021, 11:37 AM · Journal, System administration
vsellier closed T3501: Too many open files error on kafka as Resolved.

D6130 landed and applied on one kafka broker at a time

Aug 25 2021, 11:18 AM · Journal, System administration
vsellier closed D6130: kafka: increase the open file limit.
Aug 25 2021, 10:32 AM
vsellier committed rSPSITEaa2e550eb111: kafka: increase the open file limit (authored by vsellier).
kafka: increase the open file limit
Aug 25 2021, 10:32 AM