
Perform some tests of the cassandra storage on Grid5000
Started, Work in Progress, Normal, Public

Description

In order to test the behavior of a cassandra cluster during normal operations (global performance on bare-metal servers, node maintenance impact, rebalancing, ...), we should run some tests on the grid5000 infrastructure.

The POC will be separated in several phases:

  • Prepare scripts to build the environment and run small iterations to validate it will be possible to run the tests with interruptions
    • Validate the way the data will be kept between 2 cluster restarts
    • Have generic scripts that can configure the cluster according to different hardware (memory / cpu / SSD, SATA or mixed / number of nodes / ...)
    • [in progress] Add a monitoring stack to measure the cluster behavior
  • Import a dataset big enough to be representative of reality (probably during the night or a weekend)
    • define the minimal target to reach to consider the dataset representative
  • [In progress] perform some benchmarks, check the behavior and the performance impacts during normal operations
  • compare ScyllaDb / vanilla cassandra performances
  • authentication to allow r/o access only
  • [option] test backfilling an empty journal from cassandra. The backfill is based on SQL queries and is highly coupled with postgresql

The final goal of the experiment is to:

  • define the minimal cluster size to maintain correct performance during maintenance operations / node failures => 5 nodes is recommended to avoid too much pressure on the remaining nodes in case of an incident with only 3 nodes (run + recovery)
  • possibly test the performance on the different hardware provided by grid5000
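
As a back-of-the-envelope illustration of the sizing argument above (a simplified sketch assuming perfectly balanced data and traffic, not a capacity model): when one node of an N-node cluster fails, the survivors have to absorb its share of the load on top of the recovery streaming.

```python
def surviving_node_load(nodes: int, failed: int = 1) -> float:
    """Relative per-node load after `failed` nodes go down,
    assuming data and traffic are evenly balanced across nodes."""
    if failed >= nodes:
        raise ValueError("cluster is fully down")
    return nodes / (nodes - failed)

# With 3 nodes, losing one pushes each survivor to 1.5x its normal load;
# with 5 nodes the increase is a gentler 1.25x.
print(surviving_node_load(3))  # 1.5
print(surviving_node_load(5))  # 1.25
```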

Related Objects

Event Timeline

vsellier updated the task description.

These are the first tests that will be executed:

  1. baseline configuration: 3 cassandra nodes (parasilo[1]), commitlog on a dedicated SSD, data on 4 HDD. Goal: testing the minimal configuration (a night or a complete weekend if possible)
  2. baseline configuration + 1 cassandra node. Goal: testing the performance impact of having 1 more server (duration long enough to see tendencies)
  3. baseline configuration + 3 cassandra nodes. Goal: testing the performance impact of doubling the cluster size (duration long enough to see tendencies)
  4. baseline configuration but with the commitlog on the data partition. Goal: check the impact of data/commitlog mutualization (duration long enough to see tendencies)
  5. baseline configuration but with 2 HDD. Goal: check the impact of the number of disks + have a reference for the next run (a night)
  6. baseline configuration but with 2 HDD + commitlog on a dedicated HDD. Goal: check the impact of having the commitlog on a slower disk (duration long enough to see tendencies)
  7. baseline configuration but with 2x the default heap allocated to cassandra. Goal: check the impact of the memory configuration ((!) check the gc profile)

All the runs will be executed with 4 servers of the parasilo cluster running the replayers, with 20 processes per object type. This will be adapted if the replayers become the bottleneck at some point.

The main metric will be the number of swh objects inserted into the database by the end of the run.
The other points (CPU / I/Os / memory / network / cassandra metrics) will also be monitored to try to identify the bottleneck, but only for information.

The tests are write oriented, but there are still some reads, as the replayers perform read queries to check the presence of some elements in the database.
This will not be representative of the nominal state of the swh production, but it will give an idea of the behavior under high load.
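
The read-before-write pattern described above can be sketched as follows. This is a toy, in-memory stand-in for the real swh storage API; the `revision_missing` / `revision_add` method names mirror those visible in the tracebacks further down, but the implementation here is purely illustrative.

```python
class InMemoryStorage:
    """Toy model of the presence check the replayers rely on."""

    def __init__(self):
        self._revisions = {}

    def revision_missing(self, ids):
        # The read query: which of these ids are not yet stored?
        return [i for i in ids if i not in self._revisions]

    def revision_add(self, revisions):
        # Only write the objects that are actually missing.
        missing = set(self.revision_missing([r["id"] for r in revisions]))
        added = 0
        for rev in revisions:
            if rev["id"] in missing:
                self._revisions[rev["id"]] = rev
                added += 1
        return {"revision:add": added}

storage = InMemoryStorage()
batch = [{"id": "aaa"}, {"id": "bbb"}]
print(storage.revision_add(batch))  # {'revision:add': 2}
print(storage.revision_add(batch))  # {'revision:add': 0}: duplicates skipped
```

Under high write load, these presence-check reads are what hit the read timeouts reported below.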

Some other tests remain to be defined:

  • impact of a repair (read repair or full) on the cluster behavior under load / compared with the number of nodes
  • impact of a node down compared to the number of nodes in the cluster (hints overhead, impact of the recovery on startup after some time down)
  • Impact of a rebalancing with some data (adding a node, removing a node)
  • check some configuration impacts
  • ...
  • Bonus for the fun: test a PMEM cluster performances :)

[1] Parasilo cluster specs:

Model:	Dell PowerEdge R630
Date of arrival:	2015-01-13
CPU:	Intel Xeon E5-2630 v3 (Haswell, 2.40GHz, 2 CPUs/node, 8 cores/CPU)
Memory:	128 GiB
Storage:	
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sda, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0) (primary disk)
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sdb*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0) (reservable)
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sdc*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:2:0) (reservable)
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sdd*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:3:0) (reservable)
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sde*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:4:0) (reservable)
200 GB SSD SAS Toshiba PX02SSF020 (dev: /dev/sdf*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:5:0) (reservable)
*: the disk block device name /dev/sd? may vary in deployed environments, prefer referring to the by-path identifier
Network:	
eth0/eno1, Ethernet, configured rate: 10 Gbps, model: Intel 82599ES 10-Gigabit SFI/SFP+ Network Connection, driver: ixgbe
eth1/eno2, Ethernet, configured rate: 10 Gbps, model: Intel 82599ES 10-Gigabit SFI/SFP+ Network Connection, driver: ixgbe (multi NICs example)
eth2/eno3, Ethernet, model: Intel I350 Gigabit Network Connection, driver: igb - unavailable for experiment
eth3/eno4, Ethernet, model: Intel I350 Gigabit Network Connection, driver: igb - unavailable for experiment

A first run was launched with 4x20 replayers per object type.
It seems it's too much for 3 cassandra nodes: the default 1s timeout for cassandra reads is often reached.
It also seems the batch size of 1000 is too much for some object types like snapshots.
A new test will be launched tonight with some changes:

  • reduce the number of replayer processes to 4x10 per object type
  • reduce the journal client batch size to 500
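
The batch-size reduction amounts to the journal client handing the replayer smaller slices of messages per call. A standalone chunking helper for illustration (the real swh journal client has its own batching; this hypothetical version only shows the arithmetic):

```python
def chunks(messages, batch_size=500):
    """Yield successive batches of at most `batch_size` messages."""
    for start in range(0, len(messages), batch_size):
        yield messages[start:start + batch_size]

# 1000 messages: one oversized batch before, two manageable ones after.
msgs = list(range(1000))
print(len(list(chunks(msgs, 1000))))  # 1
print(len(list(chunks(msgs, 500))))   # 2
```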

Some logs from the journal clients:

Jun 30 00:00:05 parasilo-20 swh[18453]: INFO:swh.journal.client.rdkafka:REQTMOUT [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator/4: Timed out LeaveGroupRequest in flight (after 5923ms, timeout #0): possibly held back by preceeding blocking JoinGroupRequest with timeout in 290098ms
Jun 30 00:00:05 parasilo-20 swh[18453]: WARNING:swh.journal.client.rdkafka:REQTMOUT [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator/4: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Jun 30 00:00:05 parasilo-20 swh[18453]: ERROR:swh.journal.client.rdkafka:FAIL [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator: broker4.journal.softwareheritage.org:9093: 1 request(s) timed out: disconnect (after 282262ms in state UP)
Jun 30 00:00:05 parasilo-20 swh[18453]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_TIMED_OUT,val=-185,str="GroupCoordinator: broker4.journal.softwareheritage.org:9093: 1 request(s) timed out: disconnect (after 282262ms in state UP)"}
Jun 30 00:00:05 parasilo-20 swh[18495]: INFO:swh.journal.client.rdkafka:REQTMOUT [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator/4: Timed out LeaveGroupRequest in flight (after 5580ms, timeout #0): possibly held back by preceeding blocking JoinGroupRequest with timeout in 296467ms
Jun 30 00:00:05 parasilo-20 swh[18495]: WARNING:swh.journal.client.rdkafka:REQTMOUT [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator/4: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Jun 30 00:00:05 parasilo-20 swh[18495]: ERROR:swh.journal.client.rdkafka:FAIL [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator: broker4.journal.softwareheritage.org:9093: 1 request(s) timed out: disconnect (after 282261ms in state UP)
Jun 30 00:00:05 parasilo-20 swh[18495]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_TIMED_OUT,val=-185,str="GroupCoordinator: broker4.journal.softwareheritage.org:9093: 1 request(s) timed out: disconnect (after 282261ms in state UP)"}
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 564, in revision_add
Jun 30 00:00:13 parasilo-20 swh[24255]:     missing = self.revision_missing([rev.id for rev in to_add])
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 588, in revision_missing
Jun 30 00:00:13 parasilo-20 swh[24255]:     return self._cql_runner.revision_missing(revisions)
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 134, in newf
Jun 30 00:00:13 parasilo-20 swh[24255]:     self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 527, in revision_missing
...
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "cassandra/cluster.py", line 4304, in cassandra.cluster.ResponseFuture.result
Jun 30 00:00:13 parasilo-20 swh[24255]: cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0}
...
Jun 30 00:00:13 parasilo-20 systemd[1]: replayer-revision@15.service: Main process exited, code=exited, status=1/FAILURE
Jun 30 00:00:13 parasilo-20 systemd[1]: replayer-revision@15.service: Failed with result 'exit-code'.
  • unstable creation rate

  • pending compactions, an indicator of a high write load

  • client timeouts

A new run was launched with 4x10 replayers per object type.
A limit is still reached for the revision replayers, which end with read timeouts.
A test with only the revision replayers shows the problems start to appear when the number of parallel replayers is greater than ~20.

It seems the problem comes from the size of the thread pool dedicated to the reads:

Read thread pool:

Another interesting pool is the compaction one, which is always full and can slow down the reads:

The servers have some resources remaining (CPU usage is ~60% and the I/O wait is not very high), so it should be possible to play with some parameters.
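
A back-of-the-envelope way to reason about read-stage saturation, assuming Cassandra's default of 32 concurrent reads per node and a guessed number of in-flight reads per replayer (both are assumptions, not measurements from these runs):

```python
def read_stage_headroom(nodes, concurrent_reads_per_node=32,
                        replayers=20, inflight_reads_per_replayer=1):
    """Crude ratio of offered read load to read-stage capacity.
    Values above 1.0 mean requests queue up and timeouts become likely."""
    capacity = nodes * concurrent_reads_per_node
    offered = replayers * inflight_reads_per_replayer
    return offered / capacity

# 3 nodes, ~20 revision replayers whose batches fan out several reads each:
print(read_stage_headroom(3, replayers=20, inflight_reads_per_replayer=5))
```

With 3 nodes the ratio crosses 1.0 around 20 replayers, which is at least consistent with the threshold observed above; with 8 nodes the same load leaves comfortable headroom.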

Some tests of the repair process were done after the hard stop of the servers by grid5000. It failed each time due to an out-of-memory error with the default jvm configuration (8 GB of heap).
A new section was added on the hedgedoc[1] document and some tests are in progress with 16 GB of heap (in line with the official recommendations [2]).

[1] https://hedgedoc.softwareheritage.org/m2MBUViUQl2r9dwcq3-_Nw?view#repairs
[2] https://cassandra.apache.org/doc/latest/operating/hardware.html#memory

A replay was run for 13 hours the previous night with the current default consistency=ONE. It can be used as the reference for the next test with the LOCAL_QUORUM consistency.
After trying several options to render the results, the simplest was to export the content of a spreadsheet into a hedgedoc document [1].
The data are stored in a prometheus instance on a proxmox VM, so it will always be possible to refine the statistics later [2].

[1] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ
[2] http://192.168.130.165/
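
For reference, the difference between the two consistency levels boils down to how many replica acknowledgements the coordinator waits for. With the usual replication factor of 3, ONE waits for a single replica while LOCAL_QUORUM waits for two; a quick sketch of the quorum arithmetic:

```python
def required_responses(replication_factor, level):
    """Replica acknowledgements a Cassandra coordinator waits for."""
    if level == "ONE":
        return 1
    if level == "LOCAL_QUORUM":
        # quorum = floor(RF / 2) + 1 within the local datacenter
        return replication_factor // 2 + 1
    raise ValueError(f"unhandled consistency level: {level}")

print(required_responses(3, "ONE"))           # 1
print(required_responses(3, "LOCAL_QUORUM"))  # 2
```

This matches the `'required_responses'` fields visible in the driver errors quoted in this task.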

A run was launched with the patched storage allowing the consistency levels to be configured.
The results are on the dedicated hedgedoc document[1].

The impacts look quite limited on most of the object types. The content type is the only one really impacted, with a performance loss of 25%. This needs to be confirmed during other tests.

[1] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?both#run1---consistency-level-local_quorum

The next scheduled run will be with 4 cassandra nodes starting the 2021-07-07 at 7PM EST

During the last run, I discovered there were cassandra logs[1] about oversized mutations, on this run and all the previous ones.
It means some changes were committed but ignored when the commit log is flushed, which is absolutely wrong.

It can be solved by increasing the commitlog_segment_size_in_mb property to twice the max size of a mutation.
In this case, increasing the value from 32 MB to 64 MB did the job.

After checking the current size of the revisions in the postgres database, it looks like this will not be enough:

softwareheritage=> select max(length(message)), max(length(metadata::text)) from revision;
    max    |   max   
-----------+---------
 106088605 | 9143520
(1 row)

We will see if the problem occurs again as the commitlog is compressed.

So run2 was relaunched around midnight and was not as long as the previous ones. It will need to be relaunched (the previous one too, as the configuration update can have an impact on the performances).

[1]

ERROR [MutationStage-87] 2021-07-07 19:24:28,365 AbstractLocalAwareExecutorService.java:166 - Uncaught exception on thread Thread[MutationStage-87,5,main]
org.apache.cassandra.db.MutationExceededMaxSizeException: Encountered an oversized mutation (19722501/16777216) for keyspace: swh. Top keys are: revision.f25261bd45867cc72a61a67b81ddbf134f8661f8
        at org.apache.cassandra.db.Mutation.validateSize(Mutation.java:141)
        at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:268)
        at org.apache.cassandra.db.CassandraKeyspaceWriteHandler.beginWrite(CassandraKeyspaceWriteHandler.java:50)
        at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:630)
        at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:477)
        at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:210)
        at org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:58)
        at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:77)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:93)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:44)
        at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
INFO  [SlabPoolCleaner] 2021-07-07 19:24:28,685 ColumnFamilyStore.java:878 - Enqueuing flush of revision: 255.348MiB (6%) on-heap, 0.000KiB (0%) off-heap

News from the last incomplete run with 4 nodes[1]: it seems it's 25% faster than with 3 nodes, which is great.
The next run will be tonight with 8 cassandra nodes.

[1] https://hedgedoc.softwareheritage.org/#run2---consistency-level-local_quorum---4-nodes

The run with 8 nodes was faster[1] than the previous ones, as expected, but it seems it could have been even faster because the bottleneck is now the 6 replayers, which have a really high load.
The performance gain is between 60% and 100% depending on the object type.

As more revisions were loaded, the oversized mutation problem occurred again:

org.apache.cassandra.db.MutationExceededMaxSizeException: Encountered an oversized mutation (43544814/33554432) for keyspace: swh. Top keys are: revision.8272297416ace08c29e62a691d1ae8c1146389c4
softwareheritage=> select id, date, length(message) as message, length(metadata::text) as metadata from revision where id='\x8272297416ace08c29e62a691d1ae8c1146389c4';
                     id                     |          date          | message  | metadata 
--------------------------------------------+------------------------+----------+----------
 \x8272297416ace08c29e62a691d1ae8c1146389c4 | 2019-06-20 11:26:14+00 | 43544445 |         
(1 row)

The commit log compression is disabled by default and the size of the mutation matches more or less the size of the revision. It's possible to estimate the max commitlog size with the query in the previous comment.
The next runs will be configured with commitlog_segment_size_in_mb: 256 (> 2x 106088605 bytes)
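
The sizing rule applied here follows from Cassandra rejecting any mutation larger than half a commitlog segment, so the segment size must be at least twice the largest expected object. A hypothetical helper for the arithmetic, rounding up to the next power-of-two segment size:

```python
def min_commitlog_segment_mb(max_object_bytes):
    """Smallest commitlog_segment_size_in_mb (rounded up to a power of two
    starting from the 32 MB default) that accepts a mutation of
    `max_object_bytes`: Cassandra caps mutations at half a segment."""
    needed_mb = -(-2 * max_object_bytes // (1024 * 1024))  # ceil division
    size = 32  # the default segment size
    while size < needed_mb:
        size *= 2
    return size

# Largest revision message seen in postgres: 106088605 bytes -> 256 MB segments.
print(min_commitlog_segment_mb(106088605))  # 256
```

This also explains the earlier numbers: the 19722501-byte mutation needed the bump from 32 MB to 64 MB segments, and the 43544814-byte one then blew past the 64 MB limit.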

The good news is the error is not completely ignored as I initially thought:

Jul  9 07:36:35 parasilo-20 swh[17341]: WARNING:cassandra.cluster:Host 172.16.97.6:9042 error: Server error.
Jul  9 07:38:52 parasilo-20 swh[17334]: Traceback (most recent call last):
Jul  9 07:38:52 parasilo-20 swh[17334]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 412, in call
Jul  9 07:38:52 parasilo-20 swh[17334]:     result = fn(*args, **kwargs)
Jul  9 07:38:52 parasilo-20 swh[17334]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 272, in _execute_with_retries
Jul  9 07:38:52 parasilo-20 swh[17334]:     return self._session.execute(statement, args, timeout=1000.0)
Jul  9 07:38:52 parasilo-20 swh[17334]:   File "cassandra/cluster.py", line 2345, in cassandra.cluster.Session.execute
Jul  9 07:38:52 parasilo-20 swh[17334]:   File "cassandra/cluster.py", line 4304, in cassandra.cluster.ResponseFuture.result
Jul  9 07:38:52 parasilo-20 swh[17334]: cassandra.WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed - received 0 responses and 2 failures: UNKNOWN from /172.16.97.7:7000, UNKNOWN from /172.16.97.4:7000" info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 2}
Jul  9 07:38:52 parasilo-20 swh[17334]: The above exception was the direct cause of the following exception:
Jul  9 07:38:52 parasilo-20 swh[17334]: Traceback (most recent call last):

[1] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?view#run3---consistency-level-local_quorum---8-nodes

Some news about the tests running since the beginning of the week:

  • The data retention of the federated prometheus had the default value, so all the data expired after 15 days. A new reference run was performed to be able to compare with the default scenario
  • The first try failed because it was the first time the zfs configuration was adapted, and it was not correctly deployed via the ansible scripts. It was solved by completely cleaning up the zfs configuration and relaunching the deployment. Unfortunately, this needs to be launched manually before launching a test with zfs changes.
  • With the usage of best-effort jobs, it's possible to perform tests during the day without exceeding the quota

Since monday, these are the tests that were executed:

  • a run with the run1 scenario (reference) [1]
  • a run with the run4 scenario (commitlog on the data partition) [2]
  • a run with the run5 scenario (commit log on a dedicated ssd, 2 HDD for the data) [3]
  • a run with the run6 scenario (commit log on a dedicated HDD, 2 HDD for the data), still in progress, but the tendency is already visible

As a reference, this is the I/O behavior for the run1 test:

Without surprise, having the commitlogs on the data partition (run4) does not perform well at all. The difference is between 50% and 75% depending on the objects, and there are a lot of connection and replication timeouts. As expected, the I/Os on the data partition are higher:

The difference for run5 is not so important, with a performance difference of 10% max, which indicates the number of disks for the data is not so important. These are the I/Os during the run:

The run6 in progress indicates that having the commitlog on a classical HDD has quite a high impact; there are a lot of client/replication timeout events at the beginning of the test. The detailed performance will be added at the end of the run (around 18:45 today)

To follow:

  • Tonight: run7 (same as run1 but with more memory allocated to cassandra)
  • tomorrow during the day: Test scylladb
  • starting from tomorrow night*: reset everything and, with 5 cassandra nodes, perform a long run to import the complete archive

\* [edit] the nodes are not available during this weekend so the load will start on monday

During the run, test:

  • the multidatacenter replication if we want to replicate the main cluster on azure for example
  • the backup procedures (cassandra snapshot / zfs ?)

[1] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?view#run1---new-run-to-populate-grafana-dashboards
[2] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?view#run4---commit-log-on-the-data-partition-4-hdd
[3] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?view#run5---commit-on-dedicated-ssd---2-hdd-for-data

vsellier updated the task description.

run6 results - commitlog on a HDD

https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?both#run6---commit-log-on-dedicated-hdd--2-hdd-for-data

The result of the test is really bad, with a performance loss of around 70% for almost all the object types.
It seems the cause is an I/O saturation on the commitlog drive:

Having a fast drive for the commitlog looks quite important.

run7 results - cassandra heap from 16g to 32g

https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?both#run7---run1-configuration-cassandra-heap-32g-instead-of-16g

The heap increase does not seem to be an improvement, as the results are quite similar to the default configuration, even a little slower.
The load on the servers looks equivalent; there is a little more gc time.
There seems to be a small read load on the data partition at the end of the bench, but nothing significant.

scylladb test

installation

  • the scylla_setup command is failing due to an error while checking if the server is an EC2 instance. The --no-ec2-check option has no effect. Patching '/usr/lib/scylla/scylla_util.py' allowed the command to be tested:
vsellier@parasilo-3:~$ diff -U3 /tmp/scylla_util.py /usr/lib/scylla/scylla_util.py
--- /tmp/scylla_util.py 2021-08-06 11:46:12.678080457 +0200
+++ /usr/lib/scylla/scylla_util.py      2021-08-06 11:46:25.550038629 +0200
@@ -550,7 +550,7 @@


 def is_ec2():
-    return aws_instance.is_aws_instance()
+    return False


 def is_gce():

It seems scylla prefers to use XFS and md RAID.

This is the output of the command:

root@parasilo-3:/etc/scylla# scylla_setup --no-ec2-check --disks /dev/sdb,/dev/sdc,/dev/sdd,/dev/sde --nic eno1
WARN  2021-08-06 11:47:11,036 [shard 0] iotune - Available space on filesystem at /var/tmp/mnt: 124 MB: is less than recommended: 10 GB
INFO  2021-08-06 11:47:11,036 [shard 0] iotune - /var/tmp/mnt passed sanity checks
This is a supported kernel version.
 6 Aug 11:47:22 ntpdate[10835]: adjust time server 91.189.89.198 offset -0.000629 sec
/dev/md0 will be used to setup a RAID
Creating RAID0 for scylla using 4 disk(s): /dev/sdb,/dev/sdc,/dev/sdd,/dev/sde
mdadm: partition table exists on /dev/sdb
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdc
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdd
mdadm: partition table exists on /dev/sdd but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sde
mdadm: partition table exists on /dev/sde but will be lost or
       meaningless after creating array
mdadm: creation continuing despite oddities due to --run
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
meta-data=/dev/md0               isize=512    agcount=32, agsize=18310400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=585928704, imaxpct=5
         =                       sunit=256    swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=286208, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
The unit files have no installation config (WantedBy=, RequiredBy=, Also=,        
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled using systemctl.                                                  
                        
Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.
Created symlink /etc/systemd/system/multi-user.target.wants/var-lib-scylla.mount → /etc/systemd/system/var-lib-scylla.mount.
update-initramfs: Generating /boot/initrd.img-4.19.0-17-amd64
W: intel-microcode: initramfs mode not supported, using early initramfs mode
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
  systemd-coredump
0 upgraded, 1 newly installed, 0 to remove and 7 not upgraded.
Need to get 134 kB of archives.
After this operation, 254 kB of additional disk space will be used.
Get:1 http://security.debian.org buster/updates/main amd64 systemd-coredump amd64 241-7~deb10u8 [134 kB]
Fetched 134 kB in 0s (773 kB/s)
Selecting previously unselected package systemd-coredump.
(Reading database ... 202944 files and directories currently installed.)
Preparing to unpack .../systemd-coredump_241-7~deb10u8_amd64.deb ...
Unpacking systemd-coredump (241-7~deb10u8) ...
Setting up systemd-coredump (241-7~deb10u8) ...
Processing triggers for man-db (2.8.5-2) ...
Created symlink /etc/systemd/system/multi-user.target.wants/var-lib-systemd-coredump.mount → /etc/systemd/system/var-lib-systemd-coredump.mount.
kernel.core_pattern = |/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h %e
Generating coredump to test systemd-coredump...

PID: 16495 (bash)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Fri 2021-08-06 11:48:05 CEST (3s ago)
  Command Line: /bin/bash /tmp/tmpt5mvpqxg
    Executable: /usr/bin/bash
 Control Group: /user.slice/user-0.slice/session-86.scope
          Unit: session-86.scope
         Slice: user-0.slice
       Session: 86
     Owner UID: 0 (root)
       Boot ID: f945ab90a168488bb1bc364c3da4a859
    Machine ID: f351035a17a949db97e56945d6f68b79
      Hostname: parasilo-3.rennes.grid5000.fr
       Storage: /var/lib/systemd/coredump/core.bash.0.f945ab90a168488bb1bc364c3da4a859.16495.1628243285000000
       Message: Process 16495 (bash) of user 0 dumped core.

                Stack trace of thread 16495:
                #0  0x00007f40fd5e5a97 kill (libc.so.6)
                #1  0x0000560df79ee3d8 kill_pid (bash)
                #2  0x0000560df7a2e1f7 kill_builtin (bash)
                #3  0x0000560df79d8b63 n/a (bash)
                #4  0x0000560df79db34e n/a (bash)
                #5  0x0000560df79dc75f execute_command_internal (bash)
                #6  0x0000560df79dddf2 execute_command (bash)
                #7  0x0000560df79c5833 reader_loop (bash)
                #8  0x0000560df79c4104 main (bash)
                #9  0x00007f40fd5d209b __libc_start_main (libc.so.6)
                #10 0x0000560df79c465a _start (bash)


systemd-coredump is working finely.
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda/sda3
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda/sda3
tuning /sys/devices/virtual/block/md0
tuning: /sys/devices/virtual/block/md0/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:3/0:0:3:0/block/sdd
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:3/0:0:3:0/block/sdd/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:1/0:0:1:0/block/sdb
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:1/0:0:1:0/block/sdb/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:4/0:0:4:0/block/sde
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:4/0:0:4:0/block/sde/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:2/0:0:2:0/block/sdc
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:2/0:0:2:0/block/sdc/queue/nomerges 2
tuning /sys/devices/virtual/block/md0
tuning /sys/devices/virtual/block/md0
INFO  2021-08-06 11:48:15,064 [shard 0] iotune - /var/lib/scylla/saved_caches passed sanity checks
WARN  2021-08-06 11:48:15,065 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:3/0:0:3:0/block/sdd/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
WARN  2021-08-06 11:48:15,065 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:1/0:0:1:0/block/sdb/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
WARN  2021-08-06 11:48:15,065 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:4/0:0:4:0/block/sde/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
WARN  2021-08-06 11:48:15,065 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:2/0:0:2:0/block/sdc/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
INFO  2021-08-06 11:48:15,066 [shard 0] iotune - Disk parameters: max_iodepth=1024 disks_per_array=4 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 633 MB/s
Measuring sequential read bandwidth: 722 MB/s
Measuring random write IOPS: 1835 IOPS
Measuring random read IOPS: 2184 IOPS
INFO  2021-08-06 11:50:24,836 [shard 0] iotune - /srv/cassandra/commitlogs passed sanity checks
WARN  2021-08-06 11:50:24,836 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
INFO  2021-08-06 11:50:24,836 [shard 0] iotune - Disk parameters: max_iodepth=256 disks_per_array=1 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 181 MB/s
Measuring sequential read bandwidth: 189 MB/s
Measuring random write IOPS: 690 IOPS
Measuring random read IOPS: 926 IOPS
Writing result to /etc/scylla.d/io_properties.yaml
Writing result to /etc/scylla.d/io.conf
Created symlink /etc/systemd/system/multi-user.target.wants/scylla-node-exporter.service → /lib/systemd/system/scylla-node-exporter.service.
cpufrequtils.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable cpufrequtils
Created symlink /etc/systemd/system/timers.target.wants/scylla-fstrim.timer → /lib/systemd/system/scylla-fstrim.timer.
ScyllaDB setup finished.
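The iotune warnings above can be avoided by switching the data disks to the "none" (noop) multiqueue scheduler before running the io setup; a minimal sketch, assuming the device names from the logs above and a blk-mq kernel where "none" is available:

```shell
# Set the noop/none scheduler on the data HDDs so iotune's measurements
# are not skewed by mq-deadline request reordering (run as root)
for dev in sdb sdc sdd sde; do
    echo none > /sys/block/${dev}/queue/scheduler
done
```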

After some hard time configuring and correctly starting the scylla servers (different binding, configuration adaptation), the schema was correctly created (I needed to add SWH_USE_SCYLLADB=1 to the initialisation script).
Compared to cassandra, it seems the nodetool command doesn't correctly return the data repartition on the cluster, because the system keyspaces don't have the same replication factor as the swh one:

vsellier@parasilo-2:~$  nodetool status
Using /etc/scylla/scylla.yaml as the config file
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns    Host ID                               Rack
UN  172.16.97.2  2.36 MB    256          ?       866bbcc4-d496-4ebb-ab3b-12ef4942beaa  rack1
UN  172.16.97.3  3.37 MB    256          ?       21fdd0a9-15cd-473f-814c-c8ac24870aca  rack1
UN  172.16.97.4  3.48 MB    256          ?       1ed61715-01a0-4c15-a4bc-f9972f575437  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

After starting the replayers, there are several errors in the logs:

-- The unit replayer-revision@13.service has entered the 'failed' state with result 'exit-code'.
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]: Traceback (most recent call last):
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/bin/swh", line 11, in <module>
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     load_entry_point('swh.core==0.14.4', 'console_scripts', 'swh')()
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return swh(auto_envvar_prefix="SWH")
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self.main(*args, **kwargs)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     rv = self.invoke(ctx)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return ctx.invoke(self.callback, **ctx.params)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return callback(*args, **kwargs)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return f(get_current_context(), *args, **kwargs)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 194, in replay
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     client.process(worker_fn)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/journal/client.py", line 265, in process
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     batch_processed, at_eof = self.handle_messages(messages, worker_fn)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/journal/client.py", line 292, in handle_messages
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     worker_fn(dict(objects))
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/replay.py", line 62, in process_replay_objects
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     _insert_objects(object_type, objects, storage)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/replay.py", line 163, in _insert_objects
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     for o in objects
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 569, in revision_add
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     missing = self.revision_missing([rev.id for rev in to_add])
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 593, in revision_missing
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self._cql_runner.revision_missing(revisions)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 145, in newf
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 542, in revision_missing
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self._missing(statement, ids)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 305, in _missing
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     rows = self._execute_with_retries(statement, [ids])
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 329, in wrapped_f
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self.call(f, *args, **kw)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 409, in call
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     do = self.iter(retry_state=retry_state)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 356, in iter
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return fut.result()
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 425, in result
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self.__get_result()
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     raise self._exception
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 412, in call
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     result = fn(*args, **kwargs)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 272, in _execute_with_retries
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self._session.execute(statement, args, timeout=1000.0)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "cassandra/cluster.py", line 2345, in cassandra.cluster.Session.execute
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "cassandra/cluster.py", line 4304, in cassandra.cluster.ResponseFuture.result
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]: cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 172.16.97.2:9042 datacenter1>: <Error from server: code=0000 [Server error] message="partition key cartesian product size 250 is greater than maximum 100">, <Host: 172.16.97.3:9042 datacenter1>: <Error from server: code=0000 [Server error] message="partition key cartesian product size 250 is greater than maximum 100">, <Host: 172.16.97.4:9042 datacenter1>: <Error from server: code=0000 [Server error] message="partition key cartesian product size 250 is greater than maximum 100">})                                                                                                                                                                                                            
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr systemd[1]: replayer-revision@16.service: Main process exited, code=exited, status=1/FAILURE

There are also a lot of errors in the scylla logs related to read timeouts (with no activity on the database except the monitoring):

Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 5] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)

Thanks @vlorentz for D6067, I will test the fix once the cluster is more stable.

The db server prometheus configuration needs some adaptation, as scylla comes with its own prometheus node exporter (and removes the default package :()

root@parasilo-2:/opt# apt install scylla-node-exporter
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  prometheus-node-exporter
The following NEW packages will be installed:
  scylla-node-exporter
0 upgraded, 1 newly installed, 1 to remove and 7 not upgraded.
Need to get 0 B/4,076 kB of archives.
After this operation, 3,243 kB of additional disk space will be used.

The previous behavior can be restored by overriding the /etc/default/scylla-node-exporter file with:

SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.textfile.directory=/var/lib/prometheus/node-exporter/"

(add the collector.textfile.directory option)

as scylla is coming with its own prometheus node exporter (and is removing the default packages :()

yeah, scylla's debian package is a bit disruptive... it also removes Cassandra even though their file names don't clash

It seems D6067 solves the issue with the partition key cartesian product size. @vlorentz Do you think a run with cassandra is necessary to evaluate a potential performance impact?

With this configuration (a mix between the previous cassandra one and the automatically tuned one), it seems the performances are not so great.
With the 4 replayer nodes, there are a lot of timeouts (read and write) plus semaphore timeouts:

Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 19] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.content: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                       
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 3] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.origin: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                         
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 18] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.content: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                       
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 0] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.content: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                        
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 16] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.content_by_sha1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                               
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                   
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 3] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                   
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                   
Aug 06 17:25:5

It's almost impossible to read the object_count table:

cqlsh:swh> select dateof(now()) from system.local; select * from object_count;select dateof(now()) from system.local;

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:30:55.280000+0000

(1 rows)

 partition_key | object_type         | count
---------------+---------------------+----------
             0 |             content |  7701907
             0 |           directory |   349048
             0 |     directory_entry |  8609087
             0 |              origin | 11774538
             0 |        origin_visit |  8882767
             0 | origin_visit_status | 19234919
             0 |             release |     2074
             0 |            revision |   266655
             0 |     revision_parent |   273950
             0 |            snapshot |    25820
             0 |     snapshot_branch |   696754

(11 rows)

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:30:55.302000+0000

(1 rows)
cqlsh:swh> select dateof(now()) from system.local; select * from object_count;select dateof(now()) from system.local;

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:30:58.863000+0000

(1 rows)
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for swh.object_count - received only 1 responses from 2 CL=LOCAL_QUORUM." info={'received_responses': 1, 'required_responses': 2, 'consistency': 'LOCAL_QUORUM'}

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:31:08.874000+0000

(1 rows)
cqlsh:swh> select dateof(now()) from system.local; select * from object_count;select dateof(now()) from system.local;

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:31:09.911000+0000

(1 rows)
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for swh.object_count - received only 1 responses from 2 CL=LOCAL_QUORUM." info={'received_responses': 1, 'required_responses': 2, 'consistency': 'LOCAL_QUORUM'}

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:31:19.925000+0000

(1 rows)

I will stop the test here as I initially planned to work on it only this morning...

Do you think a run with cassandra is necessary to evaluate a potential performance impact?

Not necessary, but I'm curious. It may even improve performance.

If you don't mind, I'd also like to know how P1118 performs. It groups the IDs so that all IDs in a given request are handled by the same server node. This way, nodes don't have to coordinate to answer the query. This comes at the cost of more CPU work on the client, though.

The complete import has been running almost continuously on 5 cassandra nodes since Monday.

vsellier@parasilo-2:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.97.3  325.79 GiB  256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6  315 GiB     256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5  306.99 GiB  256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4  296.96 GiB  256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2  369.58 GiB  256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1

The data is not equally distributed across the nodes but the differences are not that big.
I tried to launch some repairs but none of them had time to finish before the end of the g5k reservations. They are very slow when the replayers are running.

With the default incremental repair (nodetool repair -pr), the job was stuck in the preparation step for more than 10h and generated a lot of i/o on the servers; for example, a test this morning with only the repair running:

vsellier@parasilo-2:~$ nodetool repair_admin list
id                                   | state     | last activity | coordinator       | participants                                                | participants_wp
42013da0-fa74-11eb-9016-e3c0e1396b24 | PREPARING | 1767 (s)      | /172.16.97.2:7000 | 172.16.97.5,172.16.97.6,172.16.97.2,172.16.97.3,172.16.97.4 | 172.16.97.5:7000,172.16.97.6:7000,172.16.97.2:7000,172.16.97.3:7000,172.16.97.4:7000

INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,766 CompactionManager.java:843 - [repair #42013da0-fa74-11eb-9016-e3c0e1396b24] SSTable BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-773-big-Data.db') ([-9205927237330816222,9201009158028608965]) will be anticompacted on range (8811349095641234110,8825281880574143858]                                                                                                     
INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,766 CompactionManager.java:843 - [repair #42013da0-fa74-11eb-9016-e3c0e1396b24] SSTable BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-773-big-Data.db') ([-9205927237330816222,9201009158028608965]) will be anticompacted on range (8945057472051614363,8964557645526076280]                                                                                                     
INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,766 CompactionManager.java:843 - [repair #42013da0-fa74-11eb-9016-e3c0e1396b24] SSTable BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-773-big-Data.db') ([-9205927237330816222,9201009158028608965]) will be anticompacted on range (9195484876029867743,9198251923896477770]                                                                                                     
INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,766 CompactionManager.java:1491 - Performing anticompaction on 12 sstables for 42013da0-fa74-11eb-9016-e3c0e1396b24
INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,767 CompactionManager.java:1541 - Anticompacting [BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-754-big-Data.db'), BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-868-big-Data.db')] in swh.content_by_blake2s256 for 42013da0-fa74-11eb-9016-e3c0e1396b24

The full mode also looks very long, but it starts to check some partitions more rapidly.
Even if the cluster looked ok, the full repair fixed a lot of inconsistencies. Unfortunately, the logs were lost after the end of the reservation. I will relaunch a test later and paste some of them.

Some stats about the import:

object              | imported      | archive count (from counters) | percent
--------------------+---------------+-------------------------------+--------
directory           | 24 733 151    | 9 234 863 486                 | 0.2%
content             | 337 504 100   | 11 052 015 459                | 3%
directory_entry     | 564 200 718   |                               |
origin              | 162 382 023   | 163 516 337                   | 100%
origin_visit        | 397 874 175   | 1 096 514 114                 | 36%
origin_visit_status | 1 001 657 300 | 1 584 638 657                 | 63%
release             | 18 931 689    | 18 742 909                    | 100%
revision            | 524 518 750   | 2 337 172 610                 | 22%
revision_parent     | 537 635 368   |                               |
skipped_content     | 155 370       | 153 387                       | 100%
snapshot            | 21 001 128    | 165 707 388                   | 12%
snapshot_branch     | 416 055 156   |                               |

Current import status before this weekend's run:

object              | imported      | archive count (from counters) | percent
--------------------+---------------+-------------------------------+--------
directory           | 24 733 151    | 9 234 863 486                 | 0.2%
content             | 337 504 100   | 11 052 015 459                | 3%
directory_entry     | 564 200 718   |                               |
origin              | 162 396 161   | 163 516 337                   | ~100%
origin_visit        | 831 458 709   | 1 097 664 123                 | 76%
origin_visit_status | 2 142 772 730 | 1 586 012 621                 | ~75%
release             | 18 940 576    | 18 759 533                    | ~100%
revision            | 892 680 868   | 2 338 949 544                 | ~40%
revision_parent     | 926 234 879   |                               |
skipped_content     | 155 401       | 153 583                       | ~100%
snapshot            | 39 619 975    | 165 838 402                   | ~25%
snapshot_branch     | 795 347 015   |                               |
parasilo-2:~$  nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.97.3  486.3 GiB   256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6  488.86 GiB  256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5  481.59 GiB  256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4  471.45 GiB  256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2  509.7 GiB   256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1

The import is still progressing. I focused today's replayers on origin_visit and origin_visit_status; they will be completely imported in a couple of hours.

I had several performance issues this week as I tried to find a way to perform repairs within the g5k reservation intervals (8h30 during the day, 13h during the night).
It was almost impossible to run a complete repair.

It worked finally by running a repair table per table, node per node, for example:

time (seq 2 6 | xargs -n1 -i{} ssh parasilo-{} nodetool repair -pr -os swh content)

It was not yet executed on each table, only on origin, directory and directory_entry.

table           | repair time per node on the first incremental repair
----------------+-----------------------------------------------------
origin          | ~10 min
directory       | ~10 min
directory_entry | ~50 min
content         | ~2 h
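The same per-table / per-node approach can be scripted for the remaining tables; a sketch under the same assumptions as the command above (parasilo-2 to parasilo-6 nodes, swh keyspace, remaining table names taken as an example):

```shell
# Incremental repair of the primary ranges (-pr), table per table and
# node per node, so only one anticompaction runs at a time
for table in content revision revision_parent snapshot snapshot_branch; do
    for node in $(seq 2 6); do
        time ssh parasilo-${node} nodetool repair -pr -os swh ${table}
    done
done
```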

Some things learned during the trial-and-error tests:

  • Incremental repairs consume a compaction thread per table to split the repaired/unrepaired data into different SSTables (in the preparation phase)
    • This can have a big i/o impact when a lot of data was written between two incremental repairs, because a lot of data can be reorganized
    • If the repair is not constrained to specific tables, all the compaction slots can be used by anticompaction processes and the normal compactions blocked [1]
vsellier@parasilo-2:~$ nodetool compactionstats
pending tasks: 15
- swh.content: 1
- swh.revision: 14

id                                   compaction type             keyspace table    completed   total       unit  progress
0e838d90-fc39-11eb-b947-11b81b293809 Compaction                  swh      revision 4592302191  17780154880 bytes 25.83%  
38c322a0-fc39-11eb-b947-11b81b293809 Compaction                  swh      revision 1583439141  1647624647  bytes 96.10%  
dbd14291-fc36-11eb-b947-11b81b293809 Anticompaction after repair swh      content  13336597420 30688260710 bytes 43.46%  
Active compaction remaining time :   0h01m18s

[1] A repair test on one server ran for a day and a night. It blocked the compaction process. As the replayers were running at full speed, the number of pending compactions and SSTables grew rapidly:



This resulted in higher i/o pressure on the node, which started to be flaky (holes visible in the screen captures).

current status:

There is now more than 1 TiB of data per node:

vsellier@parasilo-2:~$  nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.97.3  1.15 TiB  256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6  1.15 TiB  256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5  1.14 TiB  256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4  1.14 TiB  256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2  1.16 TiB  256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1

Replaying status:

object_type         | replaying stats
--------------------+-------------------------------------------------
content             | 1 171 294 907, eta 2 weeks
directory           | 222 769 166, eta several months
directory_entry     | 6 012 115 914, derived from the directory topic
origin              | Done
origin_visit        | Done
origin_visit_status | Done
release             | Done
revision            | Done
revision_parent     | Done
skipped_content     | Done
snapshot            | Done
snapshot_branch     | Done
extid               | just started, eta 1 day

I will check if increasing the number of directory replayers can reduce the delay to complete the backfill without generating too much load on the cassandra nodes.

The replaying is currently stopped as the data disks are now almost full.
I will try to activate compression on some big tables to see if it can help.
I will probably need to start with small tables to recover some space before being able to compress the biggest tables.
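For reference, this is the kind of statement involved; a sketch assuming Cassandra 4.x syntax and the snapshot table as an example (existing SSTables keep the old codec until they are rewritten):

```shell
# Change the table's compression codec to zstd
cqlsh 172.16.97.2 -e "ALTER TABLE swh.snapshot
    WITH compression = {'class': 'ZstdCompressor', 'chunk_length_in_kb': 64};"
# Rewrite the existing SSTables so they use the new codec
nodetool upgradesstables -a swh snapshot
```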

interesting:

Depending on the data characteristics of the table, compressing its data can result in:

25-33% reduction in data size
25-35% performance improvement on reads
5-10% performance improvement on writes

from https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsWhenCompress.html
and https://cassandra.apache.org/doc/latest/cassandra/operating/compression.html

lz4 compression was already activated by default. Changing the algorithm to zstd on the snapshot table was not really significant (initially with lz4: 7 GB, zstd: 12 GB, back to lz4: 9 GB :) )

Another point: there were a lot of snapshots of the system tables, automatically created due to the recommended auto_snapshot option.
Doing a snapshot cleanup restored some free space:

vsellier@parasilo-5:/srv/cassandra/data$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
data/data       2.2T  2.1T   17G 100% /srv/cassandra/data
root@parasilo-5:/srv/cassandra/data/swh# nodetool listsnapshots
Snapshot Details:
Snapshot name                           Keyspace name Column family name True size  Size on disk
truncated-1629184464594-table_estimates system        table_estimates    1.5 MiB    1.5 MiB
truncated-1628756480802-size_estimates  system        size_estimates     759.14 KiB 759.19 KiB
truncated-1629702924175-size_estimates  system        size_estimates     767.4 KiB  767.45 KiB
truncated-1629308811973-table_estimates system        table_estimates    2.27 MiB   2.27 MiB
truncated-1630046110551-size_estimates  system        size_estimates     1.14 MiB   1.14 MiB
truncated-1629998125791-size_estimates  system        size_estimates     386.15 KiB 386.18 KiB
truncated-1629098347272-size_estimates  system        size_estimates     1.12 MiB   1.12 MiB
truncated-1629965105009-size_estimates  system        size_estimates     771.03 KiB 771.09 KiB
truncated-1628874829022-size_estimates  system        size_estimates     1.11 MiB   1.11 MiB
truncated-1629911731701-size_estimates  system        size_estimates     383.43 KiB 383.46 KiB
truncated-1628579736073-table_estimates system        table_estimates    2.98 MiB   2.98 MiB
truncated-1628788825963-size_estimates  system        size_estimates     759.09 KiB 759.14 KiB
truncated-1629134330765-size_estimates  system        size_estimates     1.12 MiB   1.12 MiB
truncated-1629482669863-size_estimates  system        size_estimates     382.75 KiB 382.78 KiB
truncated-1628702276608-size_estimates  system        size_estimates     1.11 MiB   1.11 MiB
truncated-1629134330885-table_estimates system        table_estimates    2.25 MiB   2.25 MiB
truncated-1628579735956-size_estimates  system        size_estimates     1.48 MiB   1.48 MiB
truncated-1629357583808-table_estimates system        table_estimates    3.78 MiB   3.78 MiB
truncated-1629393581755-table_estimates system        table_estimates    2.27 MiB   2.27 MiB
truncated-1628702276697-table_estimates system        table_estimates    2.24 MiB   2.24 MiB
truncated-1629270940154-table_estimates system        table_estimates    1.51 MiB   1.51 MiB
truncated-1628628843100-size_estimates  system        size_estimates     760.32 KiB 760.37 KiB
truncated-1628536372852-size_estimates  system        size_estimates     0 bytes    13 bytes
truncated-1628838933467-size_estimates  system        size_estimates     759.82 KiB 759.88 KiB
truncated-1629702924296-table_estimates system        table_estimates    1.51 MiB   1.51 MiB
truncated-1629789925961-size_estimates  system        size_estimates     383.42 KiB 383.45 KiB
truncated-1629876050761-size_estimates  system        size_estimates     1.5 MiB    1.5 MiB
truncated-1629184464480-size_estimates  system        size_estimates     762.66 KiB 762.71 KiB
truncated-1629220429867-table_estimates system        table_estimates    3 MiB      3 MiB
truncated-1629443995522-table_estimates system        table_estimates    1.51 MiB   1.51 MiB
truncated-1629830707077-table_estimates system        table_estimates    2.27 MiB   2.27 MiB
truncated-1628628843228-table_estimates system        table_estimates    1.5 MiB    1.5 MiB
truncated-1629998125864-table_estimates system        table_estimates    776.69 KiB 776.72 KiB
truncated-1628874829182-table_estimates system        table_estimates    2.25 MiB   2.25 MiB
truncated-1628788826094-table_estimates system        table_estimates    1.49 MiB   1.49 MiB
truncated-1629830706973-size_estimates  system        size_estimates     1.12 MiB   1.12 MiB
truncated-1630048546649-table_estimates system        table_estimates    779.51 KiB 779.54 KiB
truncated-1629876050882-table_estimates system        table_estimates    3.02 MiB   3.02 MiB
truncated-1629911731833-table_estimates system        table_estimates    773.41 KiB 773.44 KiB
truncated-1629479921485-table_estimates system        table_estimates    2.92 MiB   2.92 MiB
truncated-1628624457971-size_estimates  system        size_estimates     1.11 MiB   1.11 MiB
truncated-1629357583696-size_estimates  system        size_estimates     1.87 MiB   1.87 MiB
truncated-1629393581576-size_estimates  system        size_estimates     1.13 MiB   1.13 MiB
truncated-1629482670003-table_estimates system        table_estimates    772.47 KiB 772.5 KiB
truncated-1630046110701-table_estimates system        table_estimates    2.28 MiB   2.28 MiB
truncated-1629965105119-table_estimates system        table_estimates    1.52 MiB   1.52 MiB
truncated-1628615547571-table_estimates system        table_estimates    2.24 MiB   2.24 MiB
truncated-1629479921380-size_estimates  system        size_estimates     1.45 MiB   1.45 MiB
truncated-1628665981321-table_estimates system        table_estimates    2.99 MiB   2.99 MiB
truncated-1629744201597-size_estimates  system        size_estimates     383.71 KiB 383.74 KiB
truncated-1629270940011-size_estimates  system        size_estimates     766.25 KiB 766.3 KiB
truncated-1628536373203-table_estimates system        table_estimates    0 bytes    13 bytes
truncated-1629220429713-size_estimates  system        size_estimates     1.49 MiB   1.49 MiB
truncated-1629789926069-table_estimates system        table_estimates    773.25 KiB 773.28 KiB
truncated-1628756480997-table_estimates system        table_estimates    1.49 MiB   1.49 MiB
truncated-1628624458089-table_estimates system        table_estimates    2.24 MiB   2.24 MiB
truncated-1628838933547-table_estimates system        table_estimates    1.5 MiB    1.5 MiB
truncated-1629098347382-table_estimates system        table_estimates    2.25 MiB   2.25 MiB
truncated-1628615547437-size_estimates  system        size_estimates     1.11 MiB   1.11 MiB
truncated-1628665981168-size_estimates  system        size_estimates     1.48 MiB   1.48 MiB
truncated-1629443995396-size_estimates  system        size_estimates     767.97 KiB 768.02 KiB
truncated-1629308811843-size_estimates  system        size_estimates     1.12 MiB   1.12 MiB
truncated-1630048546485-size_estimates  system        size_estimates     387.88 KiB 387.91 KiB
truncated-1629744201736-table_estimates system        table_estimates    773.97 KiB 774 KiB

Total TrueDiskSpaceUsed: 0 bytes
root@parasilo-5:/srv/cassandra/data/swh# nodetool clearsnapshot --all
Requested clearing snapshot(s) for [all keyspaces] with [all snapshots]
vsellier@parasilo-5:/srv/cassandra/data$ df -h .                                                                                                            
Filesystem      Size  Used Avail Use% Mounted on                                                                                                         
data/data       2.2T  1.4T  729G  67% /srv/cassandra/data
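If the automatic truncate snapshots keep coming back, the auto_snapshot behaviour can be disabled; a sketch assuming the default Debian config path, and at the cost of losing the safety snapshot the documentation recommends keeping:

```shell
# Disable the snapshot automatically taken before every TRUNCATE/DROP.
# Caution: this removes the ability to recover truncated data.
sudo sed -i 's/^auto_snapshot: true/auto_snapshot: false/' /etc/cassandra/cassandra.yaml
sudo systemctl restart cassandra
```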

Let's restart the ingestion and the performance tests.

With the new concurrent replay of the directory topic, the disk usage grew rapidly:

A new node was added to the cluster to reduce the disk usage on the old nodes.
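After the bootstrap, the state of the ring can be checked to confirm the new node has joined; a sketch, assuming the keyspace is named swh:

```shell
# The new node should appear with status UN (Up/Normal); the Load and
# Owns columns show how data rebalances across the ring.
nodetool status swh
```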

The disk space on the new node:

The initial rebuild was interrupted by the end of the reservation.
A repair is now in progress instead, but it is still not finished as it is regularly interrupted by the daily shutdown.
I hope the reservation of this week-end will be enough to stabilize the cluster.
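The kind of repair being retried can be sketched as a primary-range repair of the keyspace on the new node (exact invocation assumed, keyspace name swh):

```shell
# -pr repairs only the token ranges this node is primary for, avoiding
# the same range being repaired from several nodes. Incremental repair
# (the default) lets an interrupted run skip already-repaired SSTables
# on the next attempt.
nodetool repair -pr swh
```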

The second cluster, with 15 nodes and 1 TB SSD disks, is still OK but will probably also need an extension soon: