
Perform some tests of the cassandra storage on Grid5000
Started, Work in Progress, Normal, Public

Description

In order to test the behavior of a Cassandra cluster during normal operations (overall performance on bare-metal servers, node maintenance impact, rebalancing, ...), we should run some tests on the Grid5000 infrastructure.

The POC will be split into several phases:

  • Prepare scripts to build the environment and run small iterations to validate that the tests can be run with interruptions
    • Validate how the data will be kept between two cluster restarts
    • Have generic scripts that can configure the cluster for different hardware (memory / CPU / SSD, SATA or mixed / number of nodes / ...) (see the sketch after this list)
  • Import a dataset large enough to be representative of reality (probably during the night or over a weekend)
    • Define the minimal target to reach for the dataset to be considered representative
  • Perform some benchmarks and check the behavior and the performance impact during normal operations
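
As a rough illustration of what a generic, hardware-aware environment definition could look like, here is a minimal Python sketch; the cluster name, disk layout and sizing rules are assumptions for illustration, not the actual scripts:

from dataclasses import dataclass
from typing import List

@dataclass
class Environment:
    cluster: str           # Grid5000 cluster name, e.g. "parasilo" (placeholder)
    nodes: int             # number of Cassandra nodes to reserve
    memory_gb: int         # RAM per node
    data_disks: List[str]  # block devices dedicated to the ZFS pool
    disk_type: str         # "ssd", "sata" or "mixed"

    def cassandra_settings(self) -> dict:
        """Derive a few cassandra.yaml / JVM values from the hardware."""
        heap_gb = min(self.memory_gb // 2, 31)  # common rule of thumb, capped around 31 GiB
        return {
            "max_heap_size": f"{heap_gb}G",
            "concurrent_reads": 64 if self.disk_type == "ssd" else 32,
            "concurrent_writes": 64 if self.disk_type == "ssd" else 32,
            "num_tokens": 256,
        }

ENVIRONMENTS = {
    "parasilo": Environment("parasilo", nodes=4, memory_gb=128,
                            data_disks=["sdb", "sdc", "sdd", "sde"],
                            disk_type="sata"),
}

print(ENVIRONMENTS["parasilo"].cassandra_settings())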

The final goal of the experiment is to:

  • define the minimal cluster size needed to maintain correct performance during maintenance operations / node failures (a sketch of a possible probe follows this list)
  • possibly test the performance on the different hardware provided by Grid5000
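
For the first goal, something along the lines of the following probe could be run repeatedly while a node is stopped or rejoining, to see how latency degrades with cluster size. This is a minimal sketch assuming the cassandra-driver Python package, with a throwaway query as a placeholder, not a real benchmark:

import statistics
import time

from cassandra.cluster import Cluster

# Contact points taken from the test cluster; any subset of live nodes works
cluster = Cluster(["172.16.97.2", "172.16.97.3"])
session = cluster.connect("swh")

latencies = []
for _ in range(1000):
    start = time.monotonic()
    # Placeholder query: a real benchmark would replay representative swh-storage reads
    session.execute("SELECT * FROM object_count LIMIT 1")
    latencies.append((time.monotonic() - start) * 1000)

latencies.sort()
print(f"p50={statistics.median(latencies):.1f}ms "
      f"p99={latencies[int(len(latencies) * 0.99)]:.1f}ms")
cluster.shutdown()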

Event Timeline

vsellier changed the task status from Open to Work in Progress. Wed, Jun 2, 6:25 PM
vsellier triaged this task as Normal priority.
vsellier created this task.

I played with Grid5000 to experiment with how the jobs work and how to initialize the reserved nodes.

After experimenting with the manual way, I tried the Terraform provisioner, which seems to work for the basic tasks (creating a job and installing an OS on the nodes).
The next step is to go further and try to automate the node configuration, probably with a mix of Ansible and shell scripts, as @dachary did for the object storage experiment: https://git.easter-eggs.org/biceps/biceps/-/tree/master
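
For reference, the node reservation itself could also be driven directly through the Grid5000 REST API. The following is a hedged sketch using plain requests; the endpoint and payload shapes follow my reading of the Grid5000 API documentation and should be double-checked, and the credentials, site and environment names are placeholders:

import requests

G5K_API = "https://api.grid5000.fr/stable"
AUTH = ("g5k_login", "g5k_password")  # placeholder; not needed when calling from inside Grid5000

# Submit an OAR job of type "deploy" on 4 nodes of a site (rennes as a placeholder)
job = requests.post(
    f"{G5K_API}/sites/rennes/jobs",
    json={"resources": "nodes=4,walltime=02:00:00",
          "types": ["deploy"],
          "command": "sleep infinity"},
    auth=AUTH,
).json()

# In practice the job has to be polled until it is running and nodes are assigned
job = requests.get(f"{G5K_API}/sites/rennes/jobs/{job['uid']}", auth=AUTH).json()

# Deploy an OS image on the reserved nodes through kadeploy
deployment = requests.post(
    f"{G5K_API}/sites/rennes/deployments",
    json={"nodes": job["assigned_nodes"], "environment": "debian11-x64-base"},
    auth=AUTH,
).json()
print("job", job["uid"], "deployment", deployment["uid"])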

Some status on the automation:

  • Cassandra nodes are OK (OS installation, ZFS configuration according to the defined environment, apart from a problem during the first initialization with new disks, startup, cluster configuration)
  • The swh-storage node is OK (OS installation, gunicorn/swh-storage installation and startup)
  • Cassandra database initialization:
root@parasilo-3:~#  nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.97.3  78.85 KiB   256     31.6%             49d46dd8-4640-45eb-9d4c-b6b16fc954ab  rack1
UN  172.16.97.5  105.45 KiB  256     26.0%             47e99bb4-4846-4e03-a06c-53ea2862172d  rack1
UN  172.16.97.4  98.35 KiB   256     18.1%             e2aeff29-c89a-4c7a-9352-77aaf78e91b3  rack1
UN  172.16.97.2  78.85 KiB   256     24.3%             edd1b72b-4c35-44bd-b7e5-316f41a156c4  rack1
root@parasilo-3:~# cqlsh 172.16.97.3
Connected to swh-storage at 172.16.97.3:9042
[cqlsh 6.0.0 | Cassandra 4.0 | CQL spec 3.4.5 | Native protocol v5]
cqlsh> desc KEYSPACES

swh     system_auth         system_schema  system_views         
system  system_distributed  system_traces  system_virtual_schema
cqlsh:> use swh;
cqlsh:swh> desc tables;

content                metadata_authority      revision_parent              
content_by_blake2s256  metadata_fetcher        skipped_content              
content_by_sha1        object_count            skipped_content_by_blake2s256
content_by_sha1_git    origin                  skipped_content_by_sha1      
content_by_sha256      origin_visit            skipped_content_by_sha1_git  
directory              origin_visit_status     skipped_content_by_sha256    
directory_entry        raw_extrinsic_metadata  snapshot                     
extid                  release                 snapshot_branch              
extid_by_target        revision
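
The same check as the cqlsh session above can be scripted with the Python cassandra-driver, which should come in handy for the automation. A minimal sketch, with the contact point being one of the nodes listed by nodetool status:

from cassandra.cluster import Cluster

cluster = Cluster(["172.16.97.3"])
session = cluster.connect()

# Driver schema metadata mirrors "desc KEYSPACES" / "desc tables"
print("keyspaces:", sorted(cluster.metadata.keyspaces))
print("swh tables:", sorted(cluster.metadata.keyspaces["swh"].tables))

cluster.shutdown()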

The next step is to test a mirror and add some monitoring.