
Perform some tests of the cassandra storage on Grid5000
Started, Work in Progress, Normal, Public

Description

In order to test the behavior of a cassandra cluster during normal operations (global performance on bare-metal servers, node maintenance impact, rebalancing, ...), we should run some tests on the grid5000 infrastructure.

The POC will be separated in several phases:

  • Prepare scripts to build the environment and run small iterations to validate it will be possible to run the tests with interruptions
    • Validate the way the data will be kept between 2 cluster restarts
    • Have generic scripts that can configure the cluster according to different hardware (memory / cpu / SSD, SATA or mixed / number of nodes / ...)
    • [in progress] Add a monitoring stack to measure the cluster behavior
  • Import a dataset big enough to be representative of reality (probably during the night or a weekend)
    • define the minimal target to reach to consider the dataset representative
  • [In progress] perform some benchmarks, check the behavior and the performance impacts during normal operations
  • compare ScyllaDb / vanilla cassandra performances
  • authentication to allow r/o access only
  • [option] test backfilling an empty journal from cassandra. The backfill is based on SQL queries and is highly coupled with postgresql

The final goal of the experiment is to:

  • define the minimal cluster size to maintain correct performance during maintenance operations / node failures => 5 nodes is recommended to avoid too much pressure on the remaining nodes in case of an incident with only 3 nodes (run + recovery)
  • possibly test the performance on the different hardware provided by grid5000
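
As a back-of-the-envelope illustration of the sizing argument above (a simplified sketch assuming perfectly balanced data and traffic, not a capacity model): when one node of an N-node cluster fails, the survivors have to absorb its share of the load on top of the recovery streaming.

```python
def surviving_node_load(nodes: int, failed: int = 1) -> float:
    """Relative per-node load after `failed` nodes go down,
    assuming data and traffic are evenly balanced across nodes."""
    if failed >= nodes:
        raise ValueError("cluster is fully down")
    return nodes / (nodes - failed)

# With 3 nodes, losing one pushes each survivor to 1.5x its normal load;
# with 5 nodes the increase is a gentler 1.25x.
print(surviving_node_load(3))  # 1.5
print(surviving_node_load(5))  # 1.25
```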

Related Objects

Event Timeline

vsellier updated the task description.

These are the first tests that will be executed:

  1. baseline configuration: 3 cassandra nodes (parasilo[1]), commitlog on a dedicated SSD, data on 4 HDD. Goal: testing the minimal configuration (a night or a complete weekend if possible)
  2. baseline configuration + 1 cassandra node. Goal: testing the performance impact of having 1 more server (duration long enough to see tendencies)
  3. baseline configuration + 3 cassandra nodes. Goal: testing the performance impact of doubling the cluster size (duration long enough to see tendencies)
  4. baseline configuration but with the commitlog on the data partition. Goal: check the impact of data/commitlog mutualization (duration long enough to see tendencies)
  5. baseline configuration but with 2 HDD. Goal: check the impact of the number of disks + have a reference for the next run (a night)
  6. baseline configuration but with 2 HDD + commitlog on a dedicated HDD. Goal: check the impact of having the commitlog on a slower disk (duration long enough to see tendencies)
  7. baseline configuration but with 2x the default heap allocated to cassandra. Goal: check the impact of the memory configuration ((!) check the gc profile)

All the runs will be executed with 4 servers of the parasilo cluster running the replayers, with 20 processes per object type. This will be adapted if the replayers become the bottleneck at some point.

The main metric will be the number of swh objects inserted into the database by the end of the run.
The other points (CPU / I/Os / memory / network / cassandra metrics) will also be monitored to try to identify the bottleneck, but only for information.

The tests are write oriented, but there are still some reads, as the replayers perform read queries to check the presence of some elements in the database.
This will not be representative of the nominal state of the swh production, but it will give an idea of the behavior under high load.
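
The read-before-write pattern described above can be sketched as follows. This is a toy, in-memory stand-in for the real swh storage API; the `revision_missing` / `revision_add` method names mirror those visible in the tracebacks further down, but the implementation here is purely illustrative.

```python
class InMemoryStorage:
    """Toy model of the presence check the replayers rely on."""

    def __init__(self):
        self._revisions = {}

    def revision_missing(self, ids):
        # The read query: which of these ids are not yet stored?
        return [i for i in ids if i not in self._revisions]

    def revision_add(self, revisions):
        # Only write the objects that are actually missing.
        missing = set(self.revision_missing([r["id"] for r in revisions]))
        added = 0
        for rev in revisions:
            if rev["id"] in missing:
                self._revisions[rev["id"]] = rev
                added += 1
        return {"revision:add": added}

storage = InMemoryStorage()
batch = [{"id": "aaa"}, {"id": "bbb"}]
print(storage.revision_add(batch))  # {'revision:add': 2}
print(storage.revision_add(batch))  # {'revision:add': 0}: duplicates skipped
```

Under high write load, these presence-check reads are what hit the read timeouts reported below.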

Some other tests remain to be defined:

  • impact of a repair (read repair or full) on the cluster behavior under load / compared with the number of nodes
  • impact of a node down compared to the number of nodes in the cluster (hints overhead, impact of the recovery on startup after some time down)
  • Impact of a rebalancing with some data (adding a node, removing a node)
  • check some configuration impacts
  • ...
  • Bonus for the fun: test a PMEM cluster performances :)

[1] Parasilo cluster specs:

Model:	Dell PowerEdge R630
Date of arrival:	2015-01-13
CPU:	Intel Xeon E5-2630 v3 (Haswell, 2.40GHz, 2 CPUs/node, 8 cores/CPU)
Memory:	128 GiB
Storage:	
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sda, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0) (primary disk)
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sdb*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:1:0) (reservable)
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sdc*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:2:0) (reservable)
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sdd*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:3:0) (reservable)
600 GB HDD SAS Seagate ST600MM0006 (dev: /dev/sde*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:4:0) (reservable)
200 GB SSD SAS Toshiba PX02SSF020 (dev: /dev/sdf*, by-path: /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:5:0) (reservable)
*: the disk block device name /dev/sd? may vary in deployed environments, prefer referring to the by-path identifier
Network:	
eth0/eno1, Ethernet, configured rate: 10 Gbps, model: Intel 82599ES 10-Gigabit SFI/SFP+ Network Connection, driver: ixgbe
eth1/eno2, Ethernet, configured rate: 10 Gbps, model: Intel 82599ES 10-Gigabit SFI/SFP+ Network Connection, driver: ixgbe (multi NICs example)
eth2/eno3, Ethernet, model: Intel I350 Gigabit Network Connection, driver: igb - unavailable for experiment
eth3/eno4, Ethernet, model: Intel I350 Gigabit Network Connection, driver: igb - unavailable for experiment

A first run was launched with 4x20 replayers per object type.
It seems it's too much for 3 cassandra nodes: the default 1s timeout for cassandra reads is often reached.
It also seems the batch size of 1000 is too much for some object types like snapshots.
A new test will be launched tonight with some changes:

  • reduce the number of replayer processes to 4x10 per object type
  • reduce the journal client batch size to 500
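
The batch-size reduction amounts to the journal client handing the replayer smaller slices of messages per call. A standalone chunking helper for illustration (the real swh journal client has its own batching; this hypothetical version only shows the arithmetic):

```python
def chunks(messages, batch_size=500):
    """Yield successive batches of at most `batch_size` messages."""
    for start in range(0, len(messages), batch_size):
        yield messages[start:start + batch_size]

# 1000 messages: one oversized batch before, two manageable ones after.
msgs = list(range(1000))
print(len(list(chunks(msgs, 1000))))  # 1
print(len(list(chunks(msgs, 500))))   # 2
```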

Some logs from the journal clients:

Jun 30 00:00:05 parasilo-20 swh[18453]: INFO:swh.journal.client.rdkafka:REQTMOUT [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator/4: Timed out LeaveGroupRequest in flight (after 5923ms, timeout #0): possibly held back by preceeding blocking JoinGroupRequest with timeout in 290098ms
Jun 30 00:00:05 parasilo-20 swh[18453]: WARNING:swh.journal.client.rdkafka:REQTMOUT [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator/4: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Jun 30 00:00:05 parasilo-20 swh[18453]: ERROR:swh.journal.client.rdkafka:FAIL [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator: broker4.journal.softwareheritage.org:9093: 1 request(s) timed out: disconnect (after 282262ms in state UP)
Jun 30 00:00:05 parasilo-20 swh[18453]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_TIMED_OUT,val=-185,str="GroupCoordinator: broker4.journal.softwareheritage.org:9093: 1 request(s) timed out: disconnect (after 282262ms in state UP)"}
Jun 30 00:00:05 parasilo-20 swh[18495]: INFO:swh.journal.client.rdkafka:REQTMOUT [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator/4: Timed out LeaveGroupRequest in flight (after 5580ms, timeout #0): possibly held back by preceeding blocking JoinGroupRequest with timeout in 296467ms
Jun 30 00:00:05 parasilo-20 swh[18495]: WARNING:swh.journal.client.rdkafka:REQTMOUT [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator/4: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Jun 30 00:00:05 parasilo-20 swh[18495]: ERROR:swh.journal.client.rdkafka:FAIL [rdkafka#consumer-1] [thrd:GroupCoordinator]: GroupCoordinator: broker4.journal.softwareheritage.org:9093: 1 request(s) timed out: disconnect (after 282261ms in state UP)
Jun 30 00:00:05 parasilo-20 swh[18495]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_TIMED_OUT,val=-185,str="GroupCoordinator: broker4.journal.softwareheritage.org:9093: 1 request(s) timed out: disconnect (after 282261ms in state UP)"}
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 564, in revision_add
Jun 30 00:00:13 parasilo-20 swh[24255]:     missing = self.revision_missing([rev.id for rev in to_add])
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 588, in revision_missing
Jun 30 00:00:13 parasilo-20 swh[24255]:     return self._cql_runner.revision_missing(revisions)
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 134, in newf
Jun 30 00:00:13 parasilo-20 swh[24255]:     self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 527, in revision_missing
...
Jun 30 00:00:13 parasilo-20 swh[24255]:   File "cassandra/cluster.py", line 4304, in cassandra.cluster.ResponseFuture.result
Jun 30 00:00:13 parasilo-20 swh[24255]: cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0}
...
Jun 30 00:00:13 parasilo-20 systemd[1]: replayer-revision@15.service: Main process exited, code=exited, status=1/FAILURE
Jun 30 00:00:13 parasilo-20 systemd[1]: replayer-revision@15.service: Failed with result 'exit-code'.
  • unstable creation rate

  • pending compactions, an indicator of a high write load

  • client timeouts

A new run was launched with 4x10 replayers per object type.
A limit is still reached for the revision replayers, which end with read timeouts.
A test with only the revision replayers shows the problems start to appear when the number of parallel replayers is greater than ~20.

It seems the problem comes from the size of the thread pool dedicated to the reads:

Read thread pool:

Another interesting pool is the compaction one, which is always full and can slow down the reads:

The servers have some resources remaining (CPU usage is ~60% and the I/O wait is not very high), so it should be possible to play with some parameters.
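
A back-of-the-envelope way to reason about read-stage saturation, assuming Cassandra's default of 32 concurrent reads per node and a guessed number of in-flight reads per replayer (both are assumptions, not measurements from these runs):

```python
def read_stage_headroom(nodes, concurrent_reads_per_node=32,
                        replayers=20, inflight_reads_per_replayer=1):
    """Crude ratio of offered read load to read-stage capacity.
    Values above 1.0 mean requests queue up and timeouts become likely."""
    capacity = nodes * concurrent_reads_per_node
    offered = replayers * inflight_reads_per_replayer
    return offered / capacity

# 3 nodes, ~20 revision replayers whose batches fan out several reads each:
print(read_stage_headroom(3, replayers=20, inflight_reads_per_replayer=5))
```

With 3 nodes the ratio crosses 1.0 around 20 replayers, which is at least consistent with the threshold observed above; with 8 nodes the same load leaves comfortable headroom.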

Some tests of the repair process were done after the hard stop of the servers by grid5000. It failed each time due to an out-of-memory error with the default jvm configuration (8 GB of heap).
A new section was added on the hedgedoc[1] document and some tests are in progress with 16 GB of heap (in line with the official recommendations [2]).

[1] https://hedgedoc.softwareheritage.org/m2MBUViUQl2r9dwcq3-_Nw?view#repairs
[2] https://cassandra.apache.org/doc/latest/operating/hardware.html#memory

A replay was run for 13 hours the previous night with the current default consistency=ONE. It can be used as the reference for the next test with the LOCAL_QUORUM consistency.
After trying several options to render the results, the simplest was to export the content of a spreadsheet into a hedgedoc document [1].
The data are stored in a prometheus instance on a proxmox VM, so it will always be possible to refine the statistics later [2].

[1] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ
[2] http://192.168.130.165/
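
For reference, the difference between the two consistency levels boils down to how many replica acknowledgements the coordinator waits for. With the usual replication factor of 3, ONE waits for a single replica while LOCAL_QUORUM waits for two; a quick sketch of the quorum arithmetic:

```python
def required_responses(replication_factor, level):
    """Replica acknowledgements a Cassandra coordinator waits for."""
    if level == "ONE":
        return 1
    if level == "LOCAL_QUORUM":
        # quorum = floor(RF / 2) + 1 within the local datacenter
        return replication_factor // 2 + 1
    raise ValueError(f"unhandled consistency level: {level}")

print(required_responses(3, "ONE"))           # 1
print(required_responses(3, "LOCAL_QUORUM"))  # 2
```

This matches the `'required_responses'` fields visible in the driver errors quoted in this task.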

A run was launched with the patched storage allowing the consistency levels to be configured.
The results are on the dedicated hedgedoc document[1].

The impacts look quite limited on most of the object types. The content type is the only one really impacted, with a performance loss of 25%. This needs to be confirmed during other tests.

[1] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?both#run1---consistency-level-local_quorum

The next scheduled run will be with 4 cassandra nodes starting the 2021-07-07 at 7PM EST

During the last run, I discovered there were cassandra logs[1] about oversized mutations, on this run and all the previous ones.
It means some changes were committed but ignored when the commit log is flushed, which is absolutely wrong.

It can be solved by increasing the commitlog_segment_size_in_mb property to twice the max size of a mutation.
In this case, increasing the value from 32 MB to 64 MB did the job.

After checking the current size of the revisions in the postgres database, it looks like this will not be enough:

softwareheritage=> select max(length(message)), max(length(metadata::text)) from revision;
    max    |   max   
-----------+---------
 106088605 | 9143520
(1 row)

We will see if the problem occurs again as the commitlog is compressed.

So run2 was relaunched around midnight and was not as long as the previous ones. It will need to be relaunched (the previous one too, as the configuration update can have an impact on the performances).

[1]

ERROR [MutationStage-87] 2021-07-07 19:24:28,365 AbstractLocalAwareExecutorService.java:166 - Uncaught exception on thread Thread[MutationStage-87,5,main]
org.apache.cassandra.db.MutationExceededMaxSizeException: Encountered an oversized mutation (19722501/16777216) for keyspace: swh. Top keys are: revision.f25261bd45867cc72a61a67b81ddbf134f8661f8
        at org.apache.cassandra.db.Mutation.validateSize(Mutation.java:141)
        at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:268)
        at org.apache.cassandra.db.CassandraKeyspaceWriteHandler.beginWrite(CassandraKeyspaceWriteHandler.java:50)
        at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:630)
        at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:477)
        at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:210)
        at org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:58)
        at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:77)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:93)
        at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:44)
        at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
INFO  [SlabPoolCleaner] 2021-07-07 19:24:28,685 ColumnFamilyStore.java:878 - Enqueuing flush of revision: 255.348MiB (6%) on-heap, 0.000KiB (0%) off-heap

News from the last incomplete run with 4 nodes[1]: it seems it's 25% faster than with 3 nodes, which is great.
The next run will be tonight with 8 cassandra nodes.

[1] https://hedgedoc.softwareheritage.org/#run2---consistency-level-local_quorum---4-nodes

The run with 8 nodes was faster[1] than the previous ones, as expected, but it seems it could have been even faster because the bottleneck is now the 6 replayers, which have a really high load.
The performance gain is between 60% and 100% depending on the object type.

As more revisions were loaded, the oversized mutation problem occurred again:

org.apache.cassandra.db.MutationExceededMaxSizeException: Encountered an oversized mutation (43544814/33554432) for keyspace: swh. Top keys are: revision.8272297416ace08c29e62a691d1ae8c1146389c4
softwareheritage=> select id, date, length(message) as message, length(metadata::text) as metadata from revision where id='\x8272297416ace08c29e62a691d1ae8c1146389c4';
                     id                     |          date          | message  | metadata 
--------------------------------------------+------------------------+----------+----------
 \x8272297416ace08c29e62a691d1ae8c1146389c4 | 2019-06-20 11:26:14+00 | 43544445 |         
(1 row)

The commit log compression is disabled by default and the size of the mutation matches more or less the size of the revision. It's possible to estimate the max commitlog size with the query in the previous comment.
The next runs will be configured with commitlog_segment_size_in_mb: 256 (> 2x 106088605 bytes)
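
The sizing rule applied here follows from Cassandra rejecting any mutation larger than half a commitlog segment, so the segment size must be at least twice the largest expected object. A hypothetical helper for the arithmetic, rounding up to the next power-of-two segment size:

```python
def min_commitlog_segment_mb(max_object_bytes):
    """Smallest commitlog_segment_size_in_mb (rounded up to a power of two
    starting from the 32 MB default) that accepts a mutation of
    `max_object_bytes`: Cassandra caps mutations at half a segment."""
    needed_mb = -(-2 * max_object_bytes // (1024 * 1024))  # ceil division
    size = 32  # the default segment size
    while size < needed_mb:
        size *= 2
    return size

# Largest revision message seen in postgres: 106088605 bytes -> 256 MB segments.
print(min_commitlog_segment_mb(106088605))  # 256
```

This also explains the earlier numbers: the 19722501-byte mutation needed the bump from 32 MB to 64 MB segments, and the 43544814-byte one then blew past the 64 MB limit.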

The good news is the error is not completely ignored as I initially thought:

Jul  9 07:36:35 parasilo-20 swh[17341]: WARNING:cassandra.cluster:Host 172.16.97.6:9042 error: Server error.
Jul  9 07:38:52 parasilo-20 swh[17334]: Traceback (most recent call last):
Jul  9 07:38:52 parasilo-20 swh[17334]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 412, in call
Jul  9 07:38:52 parasilo-20 swh[17334]:     result = fn(*args, **kwargs)
Jul  9 07:38:52 parasilo-20 swh[17334]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 272, in _execute_with_retries
Jul  9 07:38:52 parasilo-20 swh[17334]:     return self._session.execute(statement, args, timeout=1000.0)
Jul  9 07:38:52 parasilo-20 swh[17334]:   File "cassandra/cluster.py", line 2345, in cassandra.cluster.Session.execute
Jul  9 07:38:52 parasilo-20 swh[17334]:   File "cassandra/cluster.py", line 4304, in cassandra.cluster.ResponseFuture.result
Jul  9 07:38:52 parasilo-20 swh[17334]: cassandra.WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed - received 0 responses and 2 failures: UNKNOWN from /172.16.97.7:7000, UNKNOWN from /172.16.97.4:7000" info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0, 'failures': 2}
Jul  9 07:38:52 parasilo-20 swh[17334]: The above exception was the direct cause of the following exception:
Jul  9 07:38:52 parasilo-20 swh[17334]: Traceback (most recent call last):

[1] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?view#run3---consistency-level-local_quorum---8-nodes

Some news about the tests running since the beginning of the week:

  • The data retention of the federated prometheus had the default value, so all the data expired after 15 days. A new reference run was performed to be able to compare with the default scenario
  • The first try failed because it was the first time the zfs configuration was adapted, and it was not correctly deployed via the ansible scripts. It was solved by completely cleaning up the zfs configuration and relaunching the deployment. Unfortunately, this needs to be launched manually before launching a test with zfs changes.
  • With the usage of best-effort jobs, it's possible to perform tests during the day without exceeding the quota

Since monday, these are the tests that were executed:

  • a run with the run1 scenario (reference) [1]
  • a run with the run4 scenario (commitlog on the data partition) [2]
  • a run with the run5 scenario (commit log on a dedicated ssd, 2 HDD for the data) [3]
  • a run with the run6 scenario (commit log on a dedicated HDD, 2 HDD for the data), still in progress, but the tendency is already visible

As a reference, this is the I/O behavior for the run1 test:

Without surprise, having the commitlogs on the data partition (run4) does not perform well at all. The difference is between 50% and 75% depending on the objects, and there are a lot of connection and replication timeouts. As expected, the I/Os on the data partition are higher:

The difference for run5 is not so important, with a performance difference of 10% max, which indicates the number of disks for the data is not so important. These are the I/Os during the run:

The run6 in progress indicates that having the commitlog on a classical HDD has quite a high impact; there are a lot of client/replication timeout events at the beginning of the test. The detailed performance will be added at the end of the run (around 18:45 today)

To follow:

  • Tonight: run7 (same as run1 but with more memory allocated to cassandra)
  • tomorrow during the day: Test scylladb
  • starting from tomorrow night*: reset everything and, with 5 cassandra nodes, perform a long run to import the complete archive

\* [edit] the nodes are not available during this weekend so the load will start on monday

During the run, test:

  • the multidatacenter replication if we want to replicate the main cluster on azure for example
  • the backup procedures (cassandra snapshot / zfs ?)

[1] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?view#run1---new-run-to-populate-grafana-dashboards
[2] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?view#run4---commit-log-on-the-data-partition-4-hdd
[3] https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?view#run5---commit-on-dedicated-ssd---2-hdd-for-data

vsellier updated the task description.

run6 results - commitlog on a HDD

https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?both#run6---commit-log-on-dedicated-hdd--2-hdd-for-data

The result of the test is really bad, with a performance loss of around 70% for almost all the object types.
It seems the cause is an I/O saturation on the commitlog drive:

Having a fast drive for the commitlog looks quite important.

run7 results - cassandra heap from 16g to 32g

https://hedgedoc.softwareheritage.org/xBOZFpZ-Qr2855cTzbZEbQ?both#run7---run1-configuration-cassandra-heap-32g-instead-of-16g

The heap increase does not seem to be an improvement, as the results are quite similar to the default configuration, even a little slower.
The load on the servers looks equivalent; there is a little more gc time.
There seems to be a small read load on the data partition at the end of the bench, but nothing significant.

scylladb test

installation

  • the scylla_setup command is failing due to an error while checking if the server is an EC2 instance. The --no-ec2-check option has no effect. Patching '/usr/lib/scylla/scylla_util.py' allowed the command to be tested:
vsellier@parasilo-3:~$ diff -U3 /tmp/scylla_util.py /usr/lib/scylla/scylla_util.py
--- /tmp/scylla_util.py 2021-08-06 11:46:12.678080457 +0200
+++ /usr/lib/scylla/scylla_util.py      2021-08-06 11:46:25.550038629 +0200
@@ -550,7 +550,7 @@


 def is_ec2():
-    return aws_instance.is_aws_instance()
+    return False


 def is_gce():

It seems scylla prefers to use XFS and md RAID.

This is the output of the command:

root@parasilo-3:/etc/scylla# scylla_setup --no-ec2-check --disks /dev/sdb,/dev/sdc,/dev/sdd,/dev/sde --nic eno1
WARN  2021-08-06 11:47:11,036 [shard 0] iotune - Available space on filesystem at /var/tmp/mnt: 124 MB: is less than recommended: 10 GB
INFO  2021-08-06 11:47:11,036 [shard 0] iotune - /var/tmp/mnt passed sanity checks
This is a supported kernel version.
 6 Aug 11:47:22 ntpdate[10835]: adjust time server 91.189.89.198 offset -0.000629 sec
/dev/md0 will be used to setup a RAID
Creating RAID0 for scylla using 4 disk(s): /dev/sdb,/dev/sdc,/dev/sdd,/dev/sde
mdadm: partition table exists on /dev/sdb
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdc
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sdd
mdadm: partition table exists on /dev/sdd but will be lost or
       meaningless after creating array
mdadm: partition table exists on /dev/sde
mdadm: partition table exists on /dev/sde but will be lost or
       meaningless after creating array
mdadm: creation continuing despite oddities due to --run
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
meta-data=/dev/md0               isize=512    agcount=32, agsize=18310400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=585928704, imaxpct=5
         =                       sunit=256    swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=286208, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
The unit files have no installation config (WantedBy=, RequiredBy=, Also=,        
Alias= settings in the [Install] section, and DefaultInstance= for template
units). This means they are not meant to be enabled using systemctl.                                                  
                        
Possible reasons for having this kind of units are:
• A unit may be statically enabled by being symlinked from another unit's
  .wants/ or .requires/ directory.
• A unit's purpose may be to act as a helper for some other unit which has
  a requirement dependency on it.
• A unit may be started when needed via activation (socket, path, timer,
  D-Bus, udev, scripted systemctl call, ...).
• In case of template units, the unit is meant to be enabled with some
  instance name specified.
Created symlink /etc/systemd/system/multi-user.target.wants/var-lib-scylla.mount → /etc/systemd/system/var-lib-scylla.mount.
update-initramfs: Generating /boot/initrd.img-4.19.0-17-amd64
W: intel-microcode: initramfs mode not supported, using early initramfs mode
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
  systemd-coredump
0 upgraded, 1 newly installed, 0 to remove and 7 not upgraded.
Need to get 134 kB of archives.
After this operation, 254 kB of additional disk space will be used.
Get:1 http://security.debian.org buster/updates/main amd64 systemd-coredump amd64 241-7~deb10u8 [134 kB]
Fetched 134 kB in 0s (773 kB/s)
Selecting previously unselected package systemd-coredump.
(Reading database ... 202944 files and directories currently installed.)
Preparing to unpack .../systemd-coredump_241-7~deb10u8_amd64.deb ...
Unpacking systemd-coredump (241-7~deb10u8) ...
Setting up systemd-coredump (241-7~deb10u8) ...
Processing triggers for man-db (2.8.5-2) ...
Created symlink /etc/systemd/system/multi-user.target.wants/var-lib-systemd-coredump.mount → /etc/systemd/system/var-lib-systemd-coredump.mount.
kernel.core_pattern = |/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h %e
Generating coredump to test systemd-coredump...

PID: 16495 (bash)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Fri 2021-08-06 11:48:05 CEST (3s ago)
  Command Line: /bin/bash /tmp/tmpt5mvpqxg
    Executable: /usr/bin/bash
 Control Group: /user.slice/user-0.slice/session-86.scope
          Unit: session-86.scope
         Slice: user-0.slice
       Session: 86
     Owner UID: 0 (root)
       Boot ID: f945ab90a168488bb1bc364c3da4a859
    Machine ID: f351035a17a949db97e56945d6f68b79
      Hostname: parasilo-3.rennes.grid5000.fr
       Storage: /var/lib/systemd/coredump/core.bash.0.f945ab90a168488bb1bc364c3da4a859.16495.1628243285000000
       Message: Process 16495 (bash) of user 0 dumped core.

                Stack trace of thread 16495:
                #0  0x00007f40fd5e5a97 kill (libc.so.6)
                #1  0x0000560df79ee3d8 kill_pid (bash)
                #2  0x0000560df7a2e1f7 kill_builtin (bash)
                #3  0x0000560df79d8b63 n/a (bash)
                #4  0x0000560df79db34e n/a (bash)
                #5  0x0000560df79dc75f execute_command_internal (bash)
                #6  0x0000560df79dddf2 execute_command (bash)
                #7  0x0000560df79c5833 reader_loop (bash)
                #8  0x0000560df79c4104 main (bash)
                #9  0x00007f40fd5d209b __libc_start_main (libc.so.6)
                #10 0x0000560df79c465a _start (bash)


systemd-coredump is working finely.
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda/sda3
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda/sda3
tuning /sys/devices/virtual/block/md0
tuning: /sys/devices/virtual/block/md0/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:3/0:0:3:0/block/sdd
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:3/0:0:3:0/block/sdd/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:1/0:0:1:0/block/sdb
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:1/0:0:1:0/block/sdb/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:4/0:0:4:0/block/sde
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:4/0:0:4:0/block/sde/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:2/0:0:2:0/block/sdc
tuning: /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:2/0:0:2:0/block/sdc/queue/nomerges 2
tuning /sys/devices/virtual/block/md0
tuning /sys/devices/virtual/block/md0
INFO  2021-08-06 11:48:15,064 [shard 0] iotune - /var/lib/scylla/saved_caches passed sanity checks
WARN  2021-08-06 11:48:15,065 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:3/0:0:3:0/block/sdd/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
WARN  2021-08-06 11:48:15,065 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:1/0:0:1:0/block/sdb/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
WARN  2021-08-06 11:48:15,065 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:4/0:0:4:0/block/sde/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
WARN  2021-08-06 11:48:15,065 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:2/0:0:2:0/block/sdc/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
INFO  2021-08-06 11:48:15,066 [shard 0] iotune - Disk parameters: max_iodepth=1024 disks_per_array=4 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 633 MB/s
Measuring sequential read bandwidth: 722 MB/s
Measuring random write IOPS: 1835 IOPS
Measuring random read IOPS: 2184 IOPS
INFO  2021-08-06 11:50:24,836 [shard 0] iotune - /srv/cassandra/commitlogs passed sanity checks
WARN  2021-08-06 11:50:24,836 [shard 0] iotune - Scheduler for /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/host0/target0:0:0/0:0:0:0/block/sda/queue/scheduler set to mq-deadline. It is recommend to set it to noop before evaluation so as not to skew the results.
INFO  2021-08-06 11:50:24,836 [shard 0] iotune - Disk parameters: max_iodepth=256 disks_per_array=1 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 181 MB/s
Measuring sequential read bandwidth: 189 MB/s
Measuring random write IOPS: 690 IOPS
Measuring random read IOPS: 926 IOPS
Writing result to /etc/scylla.d/io_properties.yaml
Writing result to /etc/scylla.d/io.conf
Created symlink /etc/systemd/system/multi-user.target.wants/scylla-node-exporter.service → /lib/systemd/system/scylla-node-exporter.service.
cpufrequtils.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable cpufrequtils
Created symlink /etc/systemd/system/timers.target.wants/scylla-fstrim.timer → /lib/systemd/system/scylla-fstrim.timer.
ScyllaDB setup finished.
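The iotune warnings above can be avoided by switching the data disks to the "none" (noop) multiqueue scheduler before running the io setup; a minimal sketch, assuming the device names from the logs above and a blk-mq kernel where "none" is available:

```shell
# Set the noop/none scheduler on the data HDDs so iotune's measurements
# are not skewed by mq-deadline request reordering (run as root)
for dev in sdb sdc sdd sde; do
    echo none > /sys/block/${dev}/queue/scheduler
done
```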

After some hard time configuring and correctly starting the scylla servers (different binding, configuration adaptation), the schema was correctly created (I needed to add SWH_USE_SCYLLADB=1 to the initialisation script).
Compared to cassandra, it seems the nodetool command doesn't correctly return the data repartition on the cluster, because the system keyspaces don't have the same replication factor as the swh one:

vsellier@parasilo-2:~$  nodetool status
Using /etc/scylla/scylla.yaml as the config file
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns    Host ID                               Rack
UN  172.16.97.2  2.36 MB    256          ?       866bbcc4-d496-4ebb-ab3b-12ef4942beaa  rack1
UN  172.16.97.3  3.37 MB    256          ?       21fdd0a9-15cd-473f-814c-c8ac24870aca  rack1
UN  172.16.97.4  3.48 MB    256          ?       1ed61715-01a0-4c15-a4bc-f9972f575437  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

After starting the replayers, there are several errors in the logs:

-- The unit replayer-revision@13.service has entered the 'failed' state with result 'exit-code'.
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]: Traceback (most recent call last):
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/bin/swh", line 11, in <module>
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     load_entry_point('swh.core==0.14.4', 'console_scripts', 'swh')()
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return swh(auto_envvar_prefix="SWH")
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self.main(*args, **kwargs)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     rv = self.invoke(ctx)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return ctx.invoke(self.callback, **ctx.params)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return callback(*args, **kwargs)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return f(get_current_context(), *args, **kwargs)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 194, in replay
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     client.process(worker_fn)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/journal/client.py", line 265, in process
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     batch_processed, at_eof = self.handle_messages(messages, worker_fn)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/journal/client.py", line 292, in handle_messages
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     worker_fn(dict(objects))
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/replay.py", line 62, in process_replay_objects
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     _insert_objects(object_type, objects, storage)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/replay.py", line 163, in _insert_objects
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     for o in objects
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 569, in revision_add
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     missing = self.revision_missing([rev.id for rev in to_add])
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 593, in revision_missing
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self._cql_runner.revision_missing(revisions)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 145, in newf
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 542, in revision_missing
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self._missing(statement, ids)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 305, in _missing
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     rows = self._execute_with_retries(statement, [ids])
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 329, in wrapped_f
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self.call(f, *args, **kw)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 409, in call
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     do = self.iter(retry_state=retry_state)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 356, in iter
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return fut.result()
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 425, in result
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self.__get_result()
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     raise self._exception
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 412, in call
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     result = fn(*args, **kwargs)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 272, in _execute_with_retries
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:     return self._session.execute(statement, args, timeout=1000.0)
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "cassandra/cluster.py", line 2345, in cassandra.cluster.Session.execute
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]:   File "cassandra/cluster.py", line 4304, in cassandra.cluster.ResponseFuture.result
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr swh[18418]: cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 172.16.97.2:9042 datacenter1>: <Error from server: code=0000 [Server error] message="partition key cartesian product size 250 is greater than maximum 100">, <Host: 172.16.97.3:9042 datacenter1>: <Error from server: code=0000 [Server error] message="partition key cartesian product size 250 is greater than maximum 100">, <Host: 172.16.97.4:9042 datacenter1>: <Error from server: code=0000 [Server error] message="partition key cartesian product size 250 is greater than maximum 100">})                                                                                                                                                                                                            
Aug 06 12:56:37 paravance-40.rennes.grid5000.fr systemd[1]: replayer-revision@16.service: Main process exited, code=exited, status=1/FAILURE

There are also a lot of errors in the scylla logs related to read timeouts (with no activity on the database except the monitoring):

Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 5] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
Aug 06 14:52:10 parasilo-4.rennes.grid5000.fr scylla[16488]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.4, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)

Thanks @vlorentz for D6067, I will test the fix once the cluster is more stable.

The db server prometheus configuration needs some adaptation, as scylla comes with its own prometheus node exporter (and removes the default package :()

root@parasilo-2:/opt# apt install scylla-node-exporter
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libio-pty-perl libipc-run-perl moreutils
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
  prometheus-node-exporter
The following NEW packages will be installed:
  scylla-node-exporter
0 upgraded, 1 newly installed, 1 to remove and 7 not upgraded.
Need to get 0 B/4,076 kB of archives.
After this operation, 3,243 kB of additional disk space will be used.

The previous behavior can be restored by overriding the /etc/default/scylla-node-exporter file with:

SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.textfile.directory=/var/lib/prometheus/node-exporter/"

(add the collector.textfile.directory option)

as scylla is coming with its own prometheus node exporter (and is removing the default packages :()

yeah, scylla's debian package is a bit disruptive... it also removes Cassandra even though their file names don't clash

It seems D6067 solves the issue with the partition key cartesian product size. @vlorentz Do you think a run with cassandra is necessary to evaluate a potential performance impact?

With this configuration (a mix between the previous cassandra one and the automatically tuned one), it seems the performances are not so great.
With the 4 replayer nodes, there are a lot of timeouts (read and write) plus semaphore timeouts:

Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 19] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.content: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                       
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 3] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.origin: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                         
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 18] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.content: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                       
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 0] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.content: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                        
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 16] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.content_by_sha1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                               
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                   
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 3] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                   
Aug 06 17:25:54 parasilo-3.rennes.grid5000.fr scylla[8180]:  [shard 6] storage_proxy - Exception when communicating with 172.16.97.3, to read from swh.object_count: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)                                                                                                                                                                                   
Aug 06 17:25:5

It's almost impossible to read the object_count table:

cqlsh:swh> select dateof(now()) from system.local; select * from object_count;select dateof(now()) from system.local;

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:30:55.280000+0000

(1 rows)

 partition_key | object_type         | count
---------------+---------------------+----------
             0 |             content |  7701907
             0 |           directory |   349048
             0 |     directory_entry |  8609087
             0 |              origin | 11774538
             0 |        origin_visit |  8882767
             0 | origin_visit_status | 19234919
             0 |             release |     2074
             0 |            revision |   266655
             0 |     revision_parent |   273950
             0 |            snapshot |    25820
             0 |     snapshot_branch |   696754

(11 rows)

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:30:55.302000+0000

(1 rows)
cqlsh:swh> select dateof(now()) from system.local; select * from object_count;select dateof(now()) from system.local;

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:30:58.863000+0000

(1 rows)
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for swh.object_count - received only 1 responses from 2 CL=LOCAL_QUORUM." info={'received_responses': 1, 'required_responses': 2, 'consistency': 'LOCAL_QUORUM'}

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:31:08.874000+0000

(1 rows)
cqlsh:swh> select dateof(now()) from system.local; select * from object_count;select dateof(now()) from system.local;

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:31:09.911000+0000

(1 rows)
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for swh.object_count - received only 1 responses from 2 CL=LOCAL_QUORUM." info={'received_responses': 1, 'required_responses': 2, 'consistency': 'LOCAL_QUORUM'}

 system.dateof(system.now())
---------------------------------
 2021-08-06 15:31:19.925000+0000

(1 rows)

I will stop the test here as I initially planned to work on it only this morning...

Do you think a run with cassandra is necessary to evaluate a potential performance impact?

Not necessary, but I'm curious. It may even improve performance.

If you don't mind, I'd also like to know how P1118 performs. It groups the IDs so that all IDs in a given request are handled by the same server node. This way, nodes don't have to coordinate to answer the query. This comes at the cost of more CPU work on the client, though.

The complete import has been running almost continuously on 5 cassandra nodes since Monday.

vsellier@parasilo-2:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.97.3  325.79 GiB  256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6  315 GiB     256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5  306.99 GiB  256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4  296.96 GiB  256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2  369.58 GiB  256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1

The data is not equally distributed across the nodes but the differences are not that big.
I tried to launch some repairs but none of them had time to finish before the end of the g5k reservations. They are very slow when the replayers are running.

With the default incremental repair (nodetool repair -pr), the job was stuck in the preparation step for more than 10h and generated a lot of i/o on the servers; for example, a test this morning with only the repair running:

vsellier@parasilo-2:~$ nodetool repair_admin list
id                                   | state     | last activity | coordinator       | participants                                                | participants_wp
42013da0-fa74-11eb-9016-e3c0e1396b24 | PREPARING | 1767 (s)      | /172.16.97.2:7000 | 172.16.97.5,172.16.97.6,172.16.97.2,172.16.97.3,172.16.97.4 | 172.16.97.5:7000,172.16.97.6:7000,172.16.97.2:7000,172.16.97.3:7000,172.16.97.4:7000

INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,766 CompactionManager.java:843 - [repair #42013da0-fa74-11eb-9016-e3c0e1396b24] SSTable BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-773-big-Data.db') ([-9205927237330816222,9201009158028608965]) will be anticompacted on range (8811349095641234110,8825281880574143858]                                                                                                     
INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,766 CompactionManager.java:843 - [repair #42013da0-fa74-11eb-9016-e3c0e1396b24] SSTable BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-773-big-Data.db') ([-9205927237330816222,9201009158028608965]) will be anticompacted on range (8945057472051614363,8964557645526076280]                                                                                                     
INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,766 CompactionManager.java:843 - [repair #42013da0-fa74-11eb-9016-e3c0e1396b24] SSTable BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-773-big-Data.db') ([-9205927237330816222,9201009158028608965]) will be anticompacted on range (9195484876029867743,9198251923896477770]                                                                                                     
INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,766 CompactionManager.java:1491 - Performing anticompaction on 12 sstables for 42013da0-fa74-11eb-9016-e3c0e1396b24
INFO  [CompactionExecutor:2] 2021-08-11 09:18:04,767 CompactionManager.java:1541 - Anticompacting [BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-754-big-Data.db'), BigTableReader(path='/srv/cassandra/data/swh/content_by_blake2s256-0b496b40f94611ebb0d51fb7e01de2ce/nb-868-big-Data.db')] in swh.content_by_blake2s256 for 42013da0-fa74-11eb-9016-e3c0e1396b24

The full mode also looks very long, but it starts to check some partitions more rapidly.
Even if the cluster looked ok, the full repair fixed a lot of inconsistencies. Unfortunately, the logs were lost after the end of the reservation. I will relaunch a test later and paste some of them.

Some stats about the import:

object              | imported      | archive count (from counters) | percent
--------------------+---------------+-------------------------------+--------
directory           | 24 733 151    | 9 234 863 486                 | 0.2%
content             | 337 504 100   | 11 052 015 459                | 3%
directory_entry     | 564 200 718   |                               |
origin              | 162 382 023   | 163 516 337                   | 100%
origin_visit        | 397 874 175   | 1 096 514 114                 | 36%
origin_visit_status | 1 001 657 300 | 1 584 638 657                 | 63%
release             | 18 931 689    | 18 742 909                    | 100%
revision            | 524 518 750   | 2 337 172 610                 | 22%
revision_parent     | 537 635 368   |                               |
skipped_content     | 155 370       | 153 387                       | 100%
snapshot            | 21 001 128    | 165 707 388                   | 12%
snapshot_branch     | 416 055 156   |                               |

Current import status before this weekend's run:

object              | imported      | archive count (from counters) | percent
--------------------+---------------+-------------------------------+--------
directory           | 24 733 151    | 9 234 863 486                 | 0.2%
content             | 337 504 100   | 11 052 015 459                | 3%
directory_entry     | 564 200 718   |                               |
origin              | 162 396 161   | 163 516 337                   | ~100%
origin_visit        | 831 458 709   | 1 097 664 123                 | 76%
origin_visit_status | 2 142 772 730 | 1 586 012 621                 | ~75%
release             | 18 940 576    | 18 759 533                    | ~100%
revision            | 892 680 868   | 2 338 949 544                 | ~40%
revision_parent     | 926 234 879   |                               |
skipped_content     | 155 401       | 153 583                       | ~100%
snapshot            | 39 619 975    | 165 838 402                   | ~25%
snapshot_branch     | 795 347 015   |                               |
parasilo-2:~$  nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.97.3  486.3 GiB   256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6  488.86 GiB  256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5  481.59 GiB  256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4  471.45 GiB  256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2  509.7 GiB   256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1

The import is still progressing. I focused today's replayers on origin_visit and origin_visit_status; they will be completely imported in a couple of hours.

I had several performance issues this week as I tried to find a way to perform repairs within the g5k reservation intervals (8h30 during the day, 13h during the night).
It was almost impossible to run a complete repair.

It worked finally by running a repair table per table, node per node, for example:

time (seq 2 6 | xargs -n1 -i{} ssh parasilo-{} nodetool repair -pr -os swh content)

It was not yet executed on each table, only on origin, directory and directory_entry.

table           | repair time per node on the first incremental repair
----------------+-----------------------------------------------------
origin          | ~10 min
directory       | ~10 min
directory_entry | ~50 min
content         | ~2 h
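The same per-table / per-node approach can be scripted for the remaining tables; a sketch under the same assumptions as the command above (parasilo-2 to parasilo-6 nodes, swh keyspace, remaining table names taken as an example):

```shell
# Incremental repair of the primary ranges (-pr), table per table and
# node per node, so only one anticompaction runs at a time
for table in content revision revision_parent snapshot snapshot_branch; do
    for node in $(seq 2 6); do
        time ssh parasilo-${node} nodetool repair -pr -os swh ${table}
    done
done
```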

Some things learned during the trial-and-error tests:

  • Incremental repairs consume a compaction thread per table to split the repaired/unrepaired data into different SSTables (in the preparation phase)
    • This can have a big i/o impact when a lot of data was written between two incremental repairs, because a lot of data can be reorganized
    • If the repair is not constrained to specific tables, all the compaction slots can be used by anticompaction processes and the normal compactions blocked [1]
vsellier@parasilo-2:~$ nodetool compactionstats
pending tasks: 15
- swh.content: 1
- swh.revision: 14

id                                   compaction type             keyspace table    completed   total       unit  progress
0e838d90-fc39-11eb-b947-11b81b293809 Compaction                  swh      revision 4592302191  17780154880 bytes 25.83%  
38c322a0-fc39-11eb-b947-11b81b293809 Compaction                  swh      revision 1583439141  1647624647  bytes 96.10%  
dbd14291-fc36-11eb-b947-11b81b293809 Anticompaction after repair swh      content  13336597420 30688260710 bytes 43.46%  
Active compaction remaining time :   0h01m18s

[1] A repair test on one server ran for a day and a night. It blocked the compaction process. As the replayers were running at full speed, the number of pending compactions and SSTables grew rapidly:



This resulted in higher i/o pressure on the node, which started to be flaky (holes visible in the screen captures).

current status:

There is now more than 1 TiB of data per node:

vsellier@parasilo-2:~$  nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.97.3  1.15 TiB  256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6  1.15 TiB  256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5  1.14 TiB  256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4  1.14 TiB  256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2  1.16 TiB  256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1

Replaying status:

object_type         | replaying stats
--------------------+-------------------------------------------------
content             | 1 171 294 907, eta 2 weeks
directory           | 222 769 166, eta several months
directory_entry     | 6 012 115 914, derived from the directory topic
origin              | Done
origin_visit        | Done
origin_visit_status | Done
release             | Done
revision            | Done
revision_parent     | Done
skipped_content     | Done
snapshot            | Done
snapshot_branch     | Done
extid               | just started, eta 1 day

I will check if increasing the number of directory replayers can reduce the delay to complete the backfill without generating too much load on the cassandra nodes.

The replaying is currently stopped as the data disks are now almost full.
I will try to activate compression on some big tables to see if it can help.
I will probably need to start with small tables to recover some space before being able to compress the biggest tables.
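For reference, this is the kind of statement involved; a sketch assuming Cassandra 4.x syntax and the snapshot table as an example (existing SSTables keep the old codec until they are rewritten):

```shell
# Change the table's compression codec to zstd
cqlsh 172.16.97.2 -e "ALTER TABLE swh.snapshot
    WITH compression = {'class': 'ZstdCompressor', 'chunk_length_in_kb': 64};"
# Rewrite the existing SSTables so they use the new codec
nodetool upgradesstables -a swh snapshot
```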

interesting:

Depending on the data characteristics of the table, compressing its data can result in:

25-33% reduction in data size
25-35% performance improvement on reads
5-10% performance improvement on writes

from https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsWhenCompress.html
and https://cassandra.apache.org/doc/latest/cassandra/operating/compression.html

lz4 compression was already activated by default. Changing the algorithm to zstd on the snapshot table was not really significant (initially with lz4: 7 GB, zstd: 12 GB, back to lz4: 9 GB :) )

Another point: there were a lot of snapshots of the system tables, automatically created due to the recommended auto_snapshot option.
Doing a snapshot cleanup restored some free space:

vsellier@parasilo-5:/srv/cassandra/data$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
data/data       2.2T  2.1T   17G 100% /srv/cassandra/data
root@parasilo-5:/srv/cassandra/data/swh# nodetool listsnapshots
Snapshot Details:
Snapshot name                           Keyspace name Column family name True size  Size on disk
truncated-1629184464594-table_estimates system        table_estimates    1.5 MiB    1.5 MiB
truncated-1628756480802-size_estimates  system        size_estimates     759.14 KiB 759.19 KiB
truncated-1629702924175-size_estimates  system        size_estimates     767.4 KiB  767.45 KiB
truncated-1629308811973-table_estimates system        table_estimates    2.27 MiB   2.27 MiB
truncated-1630046110551-size_estimates  system        size_estimates     1.14 MiB   1.14 MiB
truncated-1629998125791-size_estimates  system        size_estimates     386.15 KiB 386.18 KiB
truncated-1629098347272-size_estimates  system        size_estimates     1.12 MiB   1.12 MiB
truncated-1629965105009-size_estimates  system        size_estimates     771.03 KiB 771.09 KiB
truncated-1628874829022-size_estimates  system        size_estimates     1.11 MiB   1.11 MiB
truncated-1629911731701-size_estimates  system        size_estimates     383.43 KiB 383.46 KiB
truncated-1628579736073-table_estimates system        table_estimates    2.98 MiB   2.98 MiB
truncated-1628788825963-size_estimates  system        size_estimates     759.09 KiB 759.14 KiB
truncated-1629134330765-size_estimates  system        size_estimates     1.12 MiB   1.12 MiB
truncated-1629482669863-size_estimates  system        size_estimates     382.75 KiB 382.78 KiB
truncated-1628702276608-size_estimates  system        size_estimates     1.11 MiB   1.11 MiB
truncated-1629134330885-table_estimates system        table_estimates    2.25 MiB   2.25 MiB
truncated-1628579735956-size_estimates  system        size_estimates     1.48 MiB   1.48 MiB
truncated-1629357583808-table_estimates system        table_estimates    3.78 MiB   3.78 MiB
truncated-1629393581755-table_estimates system        table_estimates    2.27 MiB   2.27 MiB
truncated-1628702276697-table_estimates system        table_estimates    2.24 MiB   2.24 MiB
truncated-1629270940154-table_estimates system        table_estimates    1.51 MiB   1.51 MiB
truncated-1628628843100-size_estimates  system        size_estimates     760.32 KiB 760.37 KiB
truncated-1628536372852-size_estimates  system        size_estimates     0 bytes    13 bytes
truncated-1628838933467-size_estimates  system        size_estimates     759.82 KiB 759.88 KiB
truncated-1629702924296-table_estimates system        table_estimates    1.51 MiB   1.51 MiB
truncated-1629789925961-size_estimates  system        size_estimates     383.42 KiB 383.45 KiB
truncated-1629876050761-size_estimates  system        size_estimates     1.5 MiB    1.5 MiB
truncated-1629184464480-size_estimates  system        size_estimates     762.66 KiB 762.71 KiB
truncated-1629220429867-table_estimates system        table_estimates    3 MiB      3 MiB
truncated-1629443995522-table_estimates system        table_estimates    1.51 MiB   1.51 MiB
truncated-1629830707077-table_estimates system        table_estimates    2.27 MiB   2.27 MiB
truncated-1628628843228-table_estimates system        table_estimates    1.5 MiB    1.5 MiB
truncated-1629998125864-table_estimates system        table_estimates    776.69 KiB 776.72 KiB
truncated-1628874829182-table_estimates system        table_estimates    2.25 MiB   2.25 MiB
truncated-1628788826094-table_estimates system        table_estimates    1.49 MiB   1.49 MiB
truncated-1629830706973-size_estimates  system        size_estimates     1.12 MiB   1.12 MiB
truncated-1630048546649-table_estimates system        table_estimates    779.51 KiB 779.54 KiB
truncated-1629876050882-table_estimates system        table_estimates    3.02 MiB   3.02 MiB
truncated-1629911731833-table_estimates system        table_estimates    773.41 KiB 773.44 KiB
truncated-1629479921485-table_estimates system        table_estimates    2.92 MiB   2.92 MiB
truncated-1628624457971-size_estimates  system        size_estimates     1.11 MiB   1.11 MiB
truncated-1629357583696-size_estimates  system        size_estimates     1.87 MiB   1.87 MiB
truncated-1629393581576-size_estimates  system        size_estimates     1.13 MiB   1.13 MiB
truncated-1629482670003-table_estimates system        table_estimates    772.47 KiB 772.5 KiB
truncated-1630046110701-table_estimates system        table_estimates    2.28 MiB   2.28 MiB
truncated-1629965105119-table_estimates system        table_estimates    1.52 MiB   1.52 MiB
truncated-1628615547571-table_estimates system        table_estimates    2.24 MiB   2.24 MiB
truncated-1629479921380-size_estimates  system        size_estimates     1.45 MiB   1.45 MiB
truncated-1628665981321-table_estimates system        table_estimates    2.99 MiB   2.99 MiB
truncated-1629744201597-size_estimates  system        size_estimates     383.71 KiB 383.74 KiB
truncated-1629270940011-size_estimates  system        size_estimates     766.25 KiB 766.3 KiB
truncated-1628536373203-table_estimates system        table_estimates    0 bytes    13 bytes
truncated-1629220429713-size_estimates  system        size_estimates     1.49 MiB   1.49 MiB
truncated-1629789926069-table_estimates system        table_estimates    773.25 KiB 773.28 KiB
truncated-1628756480997-table_estimates system        table_estimates    1.49 MiB   1.49 MiB
truncated-1628624458089-table_estimates system        table_estimates    2.24 MiB   2.24 MiB
truncated-1628838933547-table_estimates system        table_estimates    1.5 MiB    1.5 MiB
truncated-1629098347382-table_estimates system        table_estimates    2.25 MiB   2.25 MiB
truncated-1628615547437-size_estimates  system        size_estimates     1.11 MiB   1.11 MiB
truncated-1628665981168-size_estimates  system        size_estimates     1.48 MiB   1.48 MiB
truncated-1629443995396-size_estimates  system        size_estimates     767.97 KiB 768.02 KiB
truncated-1629308811843-size_estimates  system        size_estimates     1.12 MiB   1.12 MiB
truncated-1630048546485-size_estimates  system        size_estimates     387.88 KiB 387.91 KiB
truncated-1629744201736-table_estimates system        table_estimates    773.97 KiB 774 KiB

Total TrueDiskSpaceUsed: 0 bytes
root@parasilo-5:/srv/cassandra/data/swh# nodetool clearsnapshot --all
Requested clearing snapshot(s) for [all keyspaces] with [all snapshots]
vsellier@parasilo-5:/srv/cassandra/data$ df -h .                                                                                                            
Filesystem      Size  Used Avail Use% Mounted on                                                                                                         
data/data       2.2T  1.4T  729G  67% /srv/cassandra/data
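If the automatic truncate snapshots keep coming back, the auto_snapshot behaviour can be disabled; a sketch assuming the default Debian config path, and at the cost of losing the safety snapshot the documentation recommends keeping:

```shell
# Disable the snapshot automatically taken before every TRUNCATE/DROP.
# Caution: this removes the ability to recover truncated data.
sudo sed -i 's/^auto_snapshot: true/auto_snapshot: false/' /etc/cassandra/cassandra.yaml
sudo systemctl restart cassandra
```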

Let's restart the ingestion and the performance tests.

With the new concurrent replay of the directory topic, the disk usage grew rapidly:

A new node was added to the cluster to reduce the disk usage on the old nodes.
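After the bootstrap, the state of the ring can be checked to confirm the new node has joined; a sketch, assuming the keyspace is named swh:

```shell
# The new node should appear with status UN (Up/Normal); the Load and
# Owns columns show how data rebalances across the ring.
nodetool status swh
```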

The disk space on the new node:

The initial rebuild was interrupted by the end of the reservation.
A repair is now in progress instead, but it is still not finished as it is regularly interrupted by the daily shutdown.
I hope the reservation of this week-end will be enough to stabilize the cluster.
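The kind of repair being retried can be sketched as a primary-range repair of the keyspace on the new node (exact invocation assumed, keyspace name swh):

```shell
# -pr repairs only the token ranges this node is primary for, avoiding
# the same range being repaired from several nodes. Incremental repair
# (the default) lets an interrupted run skip already-repaired SSTables
# on the next attempt.
nodetool repair -pr swh
```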

The second cluster, with 15 nodes and 1 TB SSD disks, is still OK but will probably also need an extension soon: