
[cassandra] Test the new hardware
Closed, Migrated

Description

In order to prepare the final production deployment, we would like to test the cass-operator from DataStax [1].

Its main features should simplify the operation of the cluster:

  • Proper token ring initialization, with only one node bootstrapping at a time
  • Seed node management - one per rack, or three per datacenter, whichever is more suited
  • Server configuration integrated into the CassandraDatacenter CRD (see the sketch after this list)
  • Rolling reboot nodes by changing the CRD
  • Store data in a rack-safe way - one replica per cloud AZ
  • Scale up racks evenly with new nodes
  • Scale down racks evenly by decommissioning existing nodes
  • Replace dead/unrecoverable nodes
  • Multi DC clusters (limited to one Kubernetes namespace)
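
For illustration, a minimal sketch of a CassandraDatacenter manifest; the cluster name, version, size and storage class are placeholder values for this sketch, not our actual configuration:

    apiVersion: cassandra.datastax.com/v1beta1
    kind: CassandraDatacenter
    metadata:
      name: dc1
    spec:
      clusterName: swh-test            # placeholder cluster name
      serverType: cassandra
      serverVersion: "4.0.4"           # illustrative version
      size: 3                          # nodes are bootstrapped one at a time
      racks:                           # one replica per rack when racks map to AZs
        - name: rack1
        - name: rack2
        - name: rack3
      storageConfig:
        cassandraDataVolumeClaimSpec:
          storageClassName: local-storage   # assumed storage class
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
      config:
        cassandra-yaml:                # server configuration embedded in the CRD
          num_tokens: 16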

We also want to test different cluster topologies. A subtask per topology will be created.

Per topology, the following scenarios will be tested:

  • Configure and bootstrap the cluster; for some topologies, a rack configuration may be needed
  • Import data / measure performance
  • Check recurring jobs (NodeSync / ...)
  • Add a new node / check data rebalancing (with the operator, a spec.size change; see the sketch after this list)
  • Remove a node / check data rebalancing
  • Kill a cassandra instance / check recovery and rebalancing

This last scenario has two steps:

  • Recover the killed instance
  • Replace the killed instance
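
With the cass-operator, the add/remove node scenarios amount to editing spec.size on the CassandraDatacenter. A minimal sketch of the change, assuming the 3-node example above:

    # Scale the datacenter from 3 to 4 nodes by bumping spec.size;
    # the operator bootstraps one new node (in the emptiest rack)
    # and the token ring rebalances. Scaling down decommissions nodes.
    spec:
      size: 4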

[1] https://github.com/k8ssandra/cass-operator

Event Timeline

vsellier triaged this task as Normal priority. Jul 5 2022, 9:36 AM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Jul 5 2022, 5:41 PM
vsellier claimed this task.
vsellier moved this task from Backlog to in-progress on the System administration board.

Unfortunately, the operator test is a failure due to the lack of configuration options:

  • Non-blocker: the init containers are OOMKilled during startup; this can be worked around by editing the cassandra StatefulSet created by the operator to extend the memory limits
  • Blocker: it's not possible to configure commitlog_directory explicitly; it defaults to /var/lib/cassandra/commitlog (see the cassandra.yaml sketch after this list)
    • It's not easy to propagate the host mounts to use two mount points, /srv/cassandra and /srv/cassandra/commitlog, without tweaking the kernel / rancher configuration
    • It's not possible to add a second volume to the pod description created by the operator
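
For reference, a sketch of the cassandra.yaml fragment we would need the operator to express; the paths match the mount points above, and the data sub-directory name is an assumption:

    # Desired layout: data files and commit log on separate mount points.
    data_file_directories:
      - /srv/cassandra/data                       # on the /srv/cassandra mount
    commitlog_directory: /srv/cassandra/commitlog # dedicated mount point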

I will try to fall back to a quick-and-dirty manual configuration of cassandra in the cluster to keep the advantages of kubernetes operations.
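
A sketch of the pod template fragment such a manual configuration allows, and which the operator-generated StatefulSet could not express; volume names and hostPath values are illustrative:

    # Hand-written StatefulSet fragment: a second volume for the commit log.
    containers:
      - name: cassandra
        volumeMounts:
          - name: data
            mountPath: /srv/cassandra
          - name: commitlog
            mountPath: /srv/cassandra/commitlog
    volumes:
      - name: data
        hostPath:
          path: /srv/cassandra              # illustrative host paths
      - name: commitlog
        hostPath:
          path: /srv/cassandra/commitlog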

After spending some time to successfully start a 2-node cassandra cluster with a declarative configuration, these are the observations:

  • A service can't be used to expose the cassandra ports to the cluster; the pod address must be used, because cassandra uses the DNS name it is given as its listen address
  • It should work by setting the listen address to 0.0.0.0, but the documentation strongly recommends against it:

"Setting listen_address to 0.0.0.0 is always wrong."

  • Using internal pod addresses will prevent multi-DC deployments in the future (see the cassandra.yaml sketch below)
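
A sketch of the relevant cassandra.yaml settings under the pod-address approach; the IP is illustrative and would have to be injected at pod startup:

    # listen_address must resolve to this pod's own IP; a Service VIP
    # load-balances across pods and therefore cannot be used here.
    listen_address: 10.42.1.23          # illustrative pod IP
    broadcast_address: 10.42.1.23
    rpc_address: 0.0.0.0                # the client port may bind all interfaces...
    broadcast_rpc_address: 10.42.1.23   # ...if a broadcast address is advertised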

A new version of the k8ssandra operator was also released last week. It now allows configuring the init containers, but still not the commit log directory.

For these reasons, I will fall back to a "classical" puppet installation.

The puppet code is ready for review. It was updated to support multi-instance deployments in anticipation of T4375 (the hypothetical hiera sketch below illustrates the idea).
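
As a purely hypothetical illustration of what a multi-instance hiera layout could look like; the keys and structure are invented for this sketch and are not the actual swh-site puppet code:

    # Hypothetical hiera data: one entry per cassandra instance on a host,
    # each with its own directories and storage port.
    cassandra::instances:
      instance1:
        data_directory: /srv/cassandra/instance1/data
        commitlog_directory: /srv/cassandra/instance1/commitlog
        storage_port: 7000
      instance2:
        data_directory: /srv/cassandra/instance2/data
        commitlog_directory: /srv/cassandra/instance2/commitlog
        storage_port: 7001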

The kubernetes cluster was not removed yet, as it will probably be used for T4391.
While waiting for the review of D8236, I will focus on it.