
[cassandra] Test the new hardware
Closed, Migrated

Description

In order to prepare the final production deployment, we would like to test the cass-operator from DataStax [1].

Its main features should simplify the operation of the cluster:

  • Proper token ring initialization, with only one node bootstrapping at a time
  • Seed node management - one per rack, or three per datacenter, whichever is more suited
  • Server configuration integrated into the CassandraDatacenter CRD (see the sketch after this list)
  • Rolling reboot nodes by changing the CRD
  • Store data in a rack-safe way - one replica per cloud AZ
  • Scale up racks evenly with new nodes
  • Scale down racks evenly by decommissioning existing nodes
  • Replace dead/unrecoverable nodes
  • Multi DC clusters (limited to one Kubernetes namespace)
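
For illustration, a minimal sketch of a CassandraDatacenter manifest; the cluster name, version, size and storage class are placeholder values for this sketch, not our actual configuration:

    apiVersion: cassandra.datastax.com/v1beta1
    kind: CassandraDatacenter
    metadata:
      name: dc1
    spec:
      clusterName: swh-test            # placeholder cluster name
      serverType: cassandra
      serverVersion: "4.0.4"           # illustrative version
      size: 3                          # nodes are bootstrapped one at a time
      racks:                           # one replica per rack when racks map to AZs
        - name: rack1
        - name: rack2
        - name: rack3
      storageConfig:
        cassandraDataVolumeClaimSpec:
          storageClassName: local-storage   # assumed storage class
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 100Gi
      config:
        cassandra-yaml:                # server configuration embedded in the CRD
          num_tokens: 16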

We also want to test different cluster topologies. A subtask per topology will be created.

Per topology, the following scenarios will be tested:

  • Configure and bootstrap the cluster; for some topologies, a rack configuration may be needed
  • Import data / measure performance
  • Check recurring jobs (NodeSync / ...)
  • Add a new node / check data rebalancing (with the operator, a spec.size change; see the sketch after this list)
  • Remove a node / check data rebalancing
  • Kill a cassandra instance / check recovery and rebalancing

This last scenario has two steps:

  • Recover the killed instance
  • Replace the killed instance
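
With the cass-operator, the add/remove node scenarios amount to editing spec.size on the CassandraDatacenter. A minimal sketch of the change, assuming the 3-node example above:

    # Scale the datacenter from 3 to 4 nodes by bumping spec.size;
    # the operator bootstraps one new node (in the emptiest rack)
    # and the token ring rebalances. Scaling down decommissions nodes.
    spec:
      size: 4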

[1] https://github.com/k8ssandra/cass-operator

Event Timeline

vsellier triaged this task as Normal priority. Jul 5 2022, 9:36 AM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Jul 5 2022, 5:41 PM
vsellier claimed this task.
vsellier moved this task from Backlog to in-progress on the System administration board.

Unfortunately, the operator test is a failure due to the lack of configuration options:

  • Non-blocker: the init containers are OOMKilled during startup; this can be worked around by editing the cassandra StatefulSet created by the operator to extend the memory limits
  • Blocker: it's not possible to configure commitlog_directory explicitly; it defaults to /var/lib/cassandra/commitlog (see the cassandra.yaml sketch after this list)
    • It's not easy to propagate the host mounts to use two mount points, /srv/cassandra and /srv/cassandra/commitlog, without tweaking the kernel / rancher configuration
    • It's not possible to add a second volume to the pod description created by the operator
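
For reference, a sketch of the cassandra.yaml fragment we would need the operator to express; the paths match the mount points above, and the data sub-directory name is an assumption:

    # Desired layout: data files and commit log on separate mount points.
    data_file_directories:
      - /srv/cassandra/data                       # on the /srv/cassandra mount
    commitlog_directory: /srv/cassandra/commitlog # dedicated mount point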

I will try to fall back to a quick-and-dirty manual configuration of cassandra in the cluster to keep the advantages of kubernetes operations.
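
A sketch of the pod template fragment such a manual configuration allows, and which the operator-generated StatefulSet could not express; volume names and hostPath values are illustrative:

    # Hand-written StatefulSet fragment: a second volume for the commit log.
    containers:
      - name: cassandra
        volumeMounts:
          - name: data
            mountPath: /srv/cassandra
          - name: commitlog
            mountPath: /srv/cassandra/commitlog
    volumes:
      - name: data
        hostPath:
          path: /srv/cassandra              # illustrative host paths
      - name: commitlog
        hostPath:
          path: /srv/cassandra/commitlog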

After spending some time to successfully start a 2-node cassandra cluster with a declarative configuration, these are the observations:

  • A service can't be used to expose the cassandra ports to the cluster; the pod address must be used, because cassandra uses the DNS name it is given as its listen address
  • It should work by setting the listen address to 0.0.0.0, but the documentation strongly recommends against it:

"Setting listen_address to 0.0.0.0 is always wrong."

  • Using internal pod addresses will prevent multi-DC deployments in the future (see the cassandra.yaml sketch below)
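
A sketch of the relevant cassandra.yaml settings under the pod-address approach; the IP is illustrative and would have to be injected at pod startup:

    # listen_address must resolve to this pod's own IP; a Service VIP
    # load-balances across pods and therefore cannot be used here.
    listen_address: 10.42.1.23          # illustrative pod IP
    broadcast_address: 10.42.1.23
    rpc_address: 0.0.0.0                # the client port may bind all interfaces...
    broadcast_rpc_address: 10.42.1.23   # ...if a broadcast address is advertised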

A new version of the k8ssandra operator was also released last week. It now allows configuring the init containers, but still not the commit log directory.

For these reasons, I will fall back to a "classical" puppet installation.

The puppet code is ready for review. It was updated to support multi-instance deployments in anticipation of T4375 (the hypothetical hiera sketch below illustrates the idea).
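
As a purely hypothetical illustration of what a multi-instance hiera layout could look like; the keys and structure are invented for this sketch and are not the actual swh-site puppet code:

    # Hypothetical hiera data: one entry per cassandra instance on a host,
    # each with its own directories and storage port.
    cassandra::instances:
      instance1:
        data_directory: /srv/cassandra/instance1/data
        commitlog_directory: /srv/cassandra/instance1/commitlog
        storage_port: 7000
      instance2:
        data_directory: /srv/cassandra/instance2/data
        commitlog_directory: /srv/cassandra/instance2/commitlog
        storage_port: 7001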

The kubernetes cluster was not removed yet, as it will probably be used for T4391.
While waiting for the review of D8236, I will focus on it.