Software Heritage

Test multidatacenter replication
Closed, Resolved, Public

Description

In case we want to replicate the rocq cluster to Azure, multi-datacenter replication must be tested.
The goal of the experiment is to:

  • make a traffic/cost estimation
  • check the viability of replication with heterogeneous clusters (for example, 5 nodes at rocq and 3 nodes in Azure)
  • check the recovery of the cluster after a network interruption (single node / full cluster / ...)
  • ...

Event Timeline

vsellier triaged this task as Normal priority. Aug 5 2021, 12:31 PM
vsellier created this task.

The gros cluster at Nancy[1] has many nodes (124) with small reservable SSDs of 960 GB. This makes it a good candidate for the second cluster. It will also allow checking the performance with data (and commit logs) on SSDs.
Based on the main cluster, a minimum of 8 nodes is necessary to handle the volume of data (7.3 TB and growing). Starting with 10 nodes will leave some headroom.

[1] https://www.grid5000.fr/w/Nancy:Hardware#gros
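The 8-node minimum above can be sketched as a back-of-envelope calculation, assuming the 7.3 TB figure is the already-replicated on-disk volume and each node contributes one 960 GB SSD:

```shell
# Round the node count up: 7.3 TB / 960 GB per node.
total_tb=7.3
disk_gb=960
min_nodes=$(awk -v t="$total_tb" -v d="$disk_gb" \
  'BEGIN { n = t * 1000 / d; if (n > int(n)) n = int(n) + 1; print n }')
echo "$min_nodes"   # -> 8
```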

vsellier changed the task status from Open to Work in Progress. Aug 19 2021, 7:19 PM
vsellier moved this task from Backlog to in-progress on the System administration board.

Starting with 10 nodes will allow to have some remaining space.

IIRC compaction writes the new SSTables before removing the old ones, so we need up to double the space
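Under that worst-case assumption (a major compaction rewriting all of a node's data at once), only about half of each SSD should be counted as usable for steady-state data:

```shell
# A major compaction can temporarily need as much free space as the
# data it rewrites, so budget roughly half of each 960 GB SSD.
disk_gb=960
usable_gb=$((disk_gb / 2))
echo "$usable_gb"   # -> 480
```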

The second Cassandra cluster is finally up and synchronizing with the first one. The rebuild should be done by the end of the day or tomorrow.

This is its topology:

  • 10 nodes (gros-5[0-9].nancy.grid5000.fr)
  • 96 GB of memory per node
  • one 480 GB SSD for the system
  • one 960 GB SSD for Cassandra (commit log and data)
  • same replication factor as the first datacenter
  • DC names:
    • initial DC: datacenter1 (the default name, which can't be changed after the nodes are initialized)
    • new DC: datacenter2
  • one rack defined per DC

Current status:

vsellier@gros-50:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.97.3   1.36 TiB    256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6   1.36 TiB    256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5   1.36 TiB    256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4   1.36 TiB    256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2   1.36 TiB    256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1

Datacenter: datacenter2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  172.16.66.50  255.46 GiB  256     29.6%             69e68604-30f3-4716-b474-dd8365a43817  rack1
UN  172.16.66.54  44.44 GiB   256     33.3%             b06e1181-c38a-4115-962e-a021d566c7c5  rack1
UN  172.16.66.57  39.14 GiB   256     28.3%             dda2235f-dfde-483f-8c63-aaea971bd753  rack1
UN  172.16.66.52  745.06 MiB  256     29.0%             d6667136-8cce-4425-9582-c31e400bae27  rack1
UN  172.16.66.55  26.29 GiB   256     30.0%             820f798b-ad43-4e87-85ad-6b976a302170  rack1
UN  172.16.66.58  51.35 GiB   256     29.4%             168e8243-bc6d-4a42-84e5-4e46fb260197  rack1
UN  172.16.66.56  40.84 GiB   256     30.6%             4820b799-c6ea-4cb1-8e51-0a843baede7a  rack1
UN  172.16.66.59  39.66 GiB   256     29.2%             3a0a64ff-2544-47b5-b11c-1a2d68932bf6  rack1
UN  172.16.66.51  273.55 GiB  256     30.2%             a68b16f3-0575-4689-850a-c474cec5c672  rack1
UN  172.16.66.53  40.81 GiB   256     30.3%             d7c26e16-54e3-4a73-b618-3e5daaf225bc  rack1
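The Owns (effective) column can be sanity-checked: with a replication factor of 3 in both DCs (as configured below), each of the 5 datacenter1 nodes owns roughly 3/5 of the token ranges and each of the 10 datacenter2 nodes roughly 3/10, matching the ~60% and ~30% figures above:

```shell
# Effective ownership is roughly RF / nodes_in_DC.
owns=$(awk 'BEGIN { printf "dc1: %.0f%%  dc2: %.0f%%", 3/5*100, 3/10*100 }')
echo "$owns"   # -> dc1: 60%  dc2: 30%
```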

These are the steps taken to initialize the new cluster [1]:

  • add a cassandra-rackdc.properties file on each server with the corresponding DC
gros-50:~$ cat /etc/cassandra/cassandra-rackdc.properties 
dc=datacenter2
rack=rack1
  • change the endpoint_snitch property from SimpleSnitch to GossipingPropertyFileSnitch [2].

The recommended value for production is GossipingPropertyFileSnitch, so it should have been used from the beginning.

  • configure the disk_optimization_strategy to ssd on the new datacenter
  • update the seed_provider to have one node on each datacenter
  • restart the datacenter1 nodes to apply the new configuration
  • start the datacenter2 nodes one by one; wait until a node's status is UN (Up and Normal) before starting the next one (nodes can stay in the UJ (Up and Joining) state for a couple of minutes)
  • when done, update the swh keyspace to declare the replication strategy of the second DC:
ALTER KEYSPACE swh WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3, 'datacenter2': 3};

Replication of new changes starts here, but the existing table contents still need to be copied:

  • rebuild the cluster content:
vsellier@fnancy:~/cassandra$ seq 0 9 | parallel -t ssh gros-5{} nodetool rebuild -ks swh -- datacenter1

Progress can be monitored with the nodetool command:

gros-50:~$ nodetool netstats                                                                 
Mode: NORMAL                                                                                           
Rebuild e5e64920-0644-11ec-92a6-31a241f39914                                                            
    /172.16.97.4                                                                                                                                      
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57339885570 bytes total (38.76%)
            swh/release-4 1082347/1082347 bytes (100%) received from idx:0/172.16.97.4                                                                           
            swh/content_by_blake2s256-2 3729362955/3729362955 bytes (100%) received from idx:0/172.16.97.4
            swh/release-3 224510803/224510803 bytes (100%) received from idx:0/172.16.97.4                
            swh/content_by_blake2s256-1 240283216/240283216 bytes (100%) received from idx:0/172.16.97.4
            swh/content_by_blake2s256-4 29491504/29491504 bytes (100%) received from idx:0/172.16.97.4
            swh/release-2 6409474/6409474 bytes (100%) received from idx:0/172.16.97.4                
...
Read Repair Statistics:                                                                                     
Attempted: 0                                                                                          
Mismatch (Blocking): 0                                                                                
Mismatch (Background): 0                                                                            
Pool Name                    Active   Pending      Completed   Dropped                                
Large messages                  n/a         0             23         0                                
Small messages                  n/a         3      132753939         0                          
Gossip messages                 n/a         0          43915         0

or, to show only the transfers still running:

gros-50:~$ nodetool netstats  | grep -v 100%
Mode: NORMAL
Rebuild e5e64920-0644-11ec-92a6-31a241f39914
    /172.16.97.4
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57557961160 bytes total (38.91%)
            swh/directory_entry-7 4819168032/4925484261 bytes (97%) received from idx:0/172.16.97.4
    /172.16.97.2
        Receiving 202 files, 111435975646 bytes total. Already received 139 files (68.81%), 60583670773 bytes total (54.37%)
            swh/directory_entry-12 1631210003/2906113367 bytes (56%) received from idx:0/172.16.97.2
    /172.16.97.6
        Receiving 236 files, 186694443984 bytes total. Already received 142 files (60.17%), 58869656747 bytes total (31.53%)
            swh/snapshot_branch-10 4449235102/7845572885 bytes (56%) received from idx:0/172.16.97.6
    /172.16.97.5
        Receiving 221 files, 143384473640 bytes total. Already received 132 files (59.73%), 58300913015 bytes total (40.66%)
            swh/directory_entry-4 982247023/3492851311 bytes (28%) received from idx:0/172.16.97.5
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0             23         0
Small messages                  n/a         2      135087921         0
Gossip messages                 n/a         0          44176         0
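nodetool netstats only reports per-peer progress; a small awk filter (a sketch, fed here with two of the sample lines above instead of the live command output) can aggregate the byte counters into an overall figure:

```shell
# Sum the "bytes total" and "Already received ... bytes" counters
# across all peers to get an overall rebuild progress percentage.
progress=$(awk '
  /Receiving/ { total += $4; received += $12 }
  END { printf "%.1f%%", received / total * 100 }
' <<'EOF'
        Receiving 199 files, 147926499702 bytes total. Already received 125 files (62.81%), 57557961160 bytes total (38.91%)
        Receiving 202 files, 111435975646 bytes total. Already received 139 files (68.81%), 60583670773 bytes total (54.37%)
EOF
)
echo "$progress"   # -> 45.6%
```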

Under the hood, a Prometheus node was also added in the second datacenter. The datacenter1 Prometheus node federates its data, which allows retrieving [3] all the monitoring data by probing only the datacenter1 Prometheus.

[1] we mostly followed this documentation: https://docs.datastax.com/en/dse/6.7/dse-admin/datastax_enterprise/production/multiDCperWorkloadType.html
[2] http://cassandra.apache.org/doc/latest/cassandra/architecture/snitch.html
[3] grafana is available here: http://192.168.130.165

10 nodes are not enough; I am adding 5 additional nodes to reduce the volume per node a little.

vsellier@gros-50:~$ df -h /srv/cassandra
Filesystem      Size  Used Avail Use% Mounted on
data/data       861G  830G   31G  97% /srv/cassandra
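The expected effect of the extension can be estimated with a back-of-envelope calculation, assuming the ~830 GB used per node is spread evenly over 15 nodes (ignoring growth and imperfect balancing):

```shell
# 10 nodes * 830 GB of data, redistributed over 15 nodes.
per_node=$(awk 'BEGIN { printf "%.0f", 10 * 830 / 15 }')
echo "${per_node} GB"   # -> 553 GB, ~64% of the 861 GB filesystem
```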

5 nodes were added to the cluster:

  • configuration pushed on g5k; disks on the new servers reserved for 14 days; a new reservation launched including the new nodes
  • the nodes were started one by one, waiting until each reached the UN status in the nodetool status output
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
DN  172.16.97.3   ?           256     0.0%              a3ae5fa2-c063-4890-87f1-bddfcf293bde  r1
DN  172.16.97.6   ?           256     0.0%              bfe360f1-8fd2-4f4b-a070-8f267eda1e12  r1
DN  172.16.97.5   ?           256     0.0%              478c36f8-5220-4db7-b5c2-f3876c0c264a  r1
DN  172.16.97.4   ?           256     0.0%              b3105348-66b0-4f82-a5bf-31ef28097a41  r1
DN  172.16.97.2   ?           256     0.0%              de866efd-064c-4e27-965c-f5112393dc8f  r1

Datacenter: datacenter2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.66.64  84.18 KiB   256     20.0%             aee607f1-d6bc-40c6-96e1-97c65d022321  rack1
UN  172.16.66.61  79.35 KiB   256     20.0%             d9634ce7-5596-41fc-b6c5-9ccffb8bc6e7  rack1
UN  172.16.66.54  711.51 GiB  256     20.0%             e738dff9-d472-48e4-a42c-0b5d43fc3158  rack1
UN  172.16.66.50  824.76 GiB  256     20.0%             a9528382-06ba-457a-a2c4-c74063acba9e  rack1
UN  172.16.66.60  79.36 KiB   256     20.0%             b2ab3e00-50b1-4690-b586-3113b5eed70d  rack1
UN  172.16.66.57  648.43 GiB  256     20.0%             16a3018d-f881-4d47-95be-fc13094576d7  rack1
UN  172.16.66.52  709.95 GiB  256     20.0%             ddac558f-6a47-4409-9f7a-9d5a88bc2755  rack1
UN  172.16.66.55  662.35 GiB  256     20.0%             d4474cc2-601b-4047-9f7a-df71dc7dcd72  rack1
UN  172.16.66.58  656.24 GiB  256     20.0%             a6cbe299-6b4b-47f3-9db9-cf00ed15d8d4  rack1
UN  172.16.66.63  92.43 MiB   256     20.0%             babb54ae-b1f4-40b3-9642-b0c1c3c52dce  rack1
UN  172.16.66.62  79.36 KiB   256     20.0%             f0ccadad-8803-40b9-a37d-581c3c64ae1a  rack1
UN  172.16.66.59  10.94 GiB   256     20.0%             bd0ee03f-d597-419d-bb2d-da74df90e636  rack1
UN  172.16.66.56  650.2 GiB   256     20.0%             763e305e-ce10-463b-bfc3-9d06fa74ce88  rack1
UN  172.16.66.53  815.88 GiB  256     20.0%             f7db848b-7d43-455f-9cce-b87c7ea78790  rack1
UN  172.16.66.51  599.12 GiB  256     20.0%             359b5423-a43f-4003-8ea7-e5ae4d38b4c8  rack1
  • a repair was launched to initialize the data

Once it is done, a nodetool cleanup command will be launched on the old nodes to remove the rebalanced tokens.
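The planned cleanup could follow the same ssh pattern used for the other fleet-wide commands; this sketch only generates the commands instead of running them (the exact `nodetool cleanup swh` invocation is an assumption, not taken from the task log):

```shell
# Generate one cleanup command per node (gros-50..gros-64) without
# executing anything; nodetool cleanup accepts a keyspace argument.
cmds=$(for n in $(seq 50 64); do
  echo "ssh root@gros-$n nodetool cleanup swh"
done)
echo "$cmds" | head -n 1
echo "$cmds" | wc -l   # one command per node
```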

Well, after reflection, it will probably be faster to recreate the second DC from scratch now that the configuration is ready.

  • Cassandra stopped
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl stop cassandra
  • data cleaned
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} "rm -rf /srv/cassandra/*"
  • Cassandra restarted
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -t ssh root@gros-{} systemctl start cassandra

Current status:

vsellier@gros-50:~$  nodetool status
Datacenter: datacenter2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.66.64  73.44 KiB  256     14.3%             75235030-8cb8-4d5f-a307-13bf34b0285a  rack1
UN  172.16.66.61  73.47 KiB  256     12.9%             3d57cab9-2f70-4eb7-802c-9d44050d59a9  rack1
UN  172.16.66.50  73.46 KiB  256     13.0%             6dbba88c-3a00-4cd3-804c-cf7f1c6ed759  rack1
UN  172.16.66.54  73.45 KiB  256     13.4%             e71378db-fa41-4269-8758-48621ccab1be  rack1
UN  172.16.66.60  73.48 KiB  256     13.1%             196b7f13-0791-4d46-a653-317869215da9  rack1
UN  172.16.66.57  73.46 KiB  256     13.4%             b62c8977-12a4-4363-bbeb-83f1040c97cd  rack1
UN  172.16.66.52  73.44 KiB  256     12.8%             efd4a4d4-729f-4bd3-b9b8-9cf3f9ae711a  rack1
UN  172.16.66.55  73.47 KiB  256     13.0%             53023fec-249a-4199-8354-c0c49bd7c9a2  rack1
UN  172.16.66.58  73.45 KiB  256     13.5%             343290a6-161a-49c8-bab1-bce8aa37e2df  rack1
UN  172.16.66.63  73.47 KiB  256     13.3%             bd76e60c-d1d1-4873-9789-c04b39369b88  rack1
UN  172.16.66.62  73.49 KiB  256     13.3%             6ee0c953-d91b-4ece-9018-2e91b5a945dc  rack1
UN  172.16.66.56  73.44 KiB  256     13.9%             877cb57e-85ce-4488-bee9-31413b047085  rack1
UN  172.16.66.59  73.44 KiB  256     13.3%             a7762aac-9310-4a39-8f2f-117748e53d34  rack1
UN  172.16.66.53  73.44 KiB  256     13.2%             16825b40-22dd-4091-b986-efe3137af5f4  rack1
UN  172.16.66.51  73.45 KiB  256     13.6%             358d2112-6cc9-47b7-b4b7-6ce2c02f5809  rack1

The first datacenter is not yet visible because its nodes are stopped to avoid exceeding the daily g5k quota.

New cluster state after all the reservations are up:

vsellier@gros-50:~$  nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.97.3   1.4 TiB     256     60.1%             a3ae5fa2-c063-4890-87f1-bddfcf293bde  rack1
UN  172.16.97.6   1.4 TiB     256     60.0%             bfe360f1-8fd2-4f4b-a070-8f267eda1e12  rack1
UN  172.16.97.5   1.39 TiB    256     59.9%             478c36f8-5220-4db7-b5c2-f3876c0c264a  rack1
UN  172.16.97.4   1.4 TiB     256     59.9%             b3105348-66b0-4f82-a5bf-31ef28097a41  rack1
UN  172.16.97.2   1.4 TiB     256     60.1%             de866efd-064c-4e27-965c-f5112393dc8f  rack1

Datacenter: datacenter2
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.66.64  525.48 KiB  256     20.0%             75235030-8cb8-4d5f-a307-13bf34b0285a  rack1
UN  172.16.66.61  361.96 KiB  256     20.0%             3d57cab9-2f70-4eb7-802c-9d44050d59a9  rack1
UN  172.16.66.50  414.12 KiB  256     20.0%             6dbba88c-3a00-4cd3-804c-cf7f1c6ed759  rack1
UN  172.16.66.54  260.98 KiB  256     20.0%             85788f82-365b-4786-bba6-ca1e5d0bb396  rack1
UN  172.16.66.60  385.61 KiB  256     20.0%             196b7f13-0791-4d46-a653-317869215da9  rack1
UN  172.16.66.57  423.14 KiB  256     20.0%             b62c8977-12a4-4363-bbeb-83f1040c97cd  rack1
UN  172.16.66.52  349.89 KiB  256     20.0%             efd4a4d4-729f-4bd3-b9b8-9cf3f9ae711a  rack1
UN  172.16.66.55  79.34 KiB   256     20.0%             f19a2e4d-1c39-47c4-95ac-0b2a9a696a57  rack1
UN  172.16.66.58  361.85 KiB  256     20.0%             343290a6-161a-49c8-bab1-bce8aa37e2df  rack1
UN  172.16.66.63  392.57 KiB  256     20.0%             bd76e60c-d1d1-4873-9789-c04b39369b88  rack1
UN  172.16.66.62  392.46 KiB  256     20.0%             6ee0c953-d91b-4ece-9018-2e91b5a945dc  rack1
UN  172.16.66.56  422.85 KiB  256     20.0%             877cb57e-85ce-4488-bee9-31413b047085  rack1
UN  172.16.66.59  483.98 KiB  256     20.0%             a7762aac-9310-4a39-8f2f-117748e53d34  rack1
UN  172.16.66.51  422.91 KiB  256     20.0%             358d2112-6cc9-47b7-b4b7-6ce2c02f5809  rack1
UN  172.16.66.53  222.95 KiB  256     20.0%             0ed6b14d-13dd-45ae-9e85-644f4a16ef29  rack1
  • Node initialization launched with at most 5 nodes in parallel to avoid too much I/O load on the first datacenter:
vsellier@fnancy:~/cassandra$ seq 50 64 | parallel -j5 -t -n1 -i{} ssh root@gros-{} nodetool rebuild  -ks swh -- datacenter1
ssh root@gros-50 nodetool rebuild -ks swh -- datacenter1
ssh root@gros-51 nodetool rebuild -ks swh -- datacenter1
ssh root@gros-52 nodetool rebuild -ks swh -- datacenter1
ssh root@gros-53 nodetool rebuild -ks swh -- datacenter1
ssh root@gros-54 nodetool rebuild -ks swh -- datacenter1

The new datacenter has been active for a couple of weeks.
It allowed us to test:

  • how to declare a new dc and bootstrap it
  • how the data is replicated between the DC
  • how to perform inter/intra DC repairs
  • how to add nodes to a DC and bootstrap them
  • how to remove a datacenter

All the relevant information was added to the hedgedoc document (https://hedgedoc.softwareheritage.org/#multi-dc and various other sections).

The second datacenter will be decommissioned because it is time-consuming to maintain several grid5000 reservations in parallel.

vsellier moved this task from in-progress to done on the System administration board.