
Test reaper to automate the cassandra repair actions
Closed, Migrated

Description

http://cassandra-reaper.io/

Reaper is an open source tool that aims to schedule and orchestrate repairs of Apache Cassandra clusters.
It improves the existing nodetool repair process by:

  • Splitting repair jobs into smaller tunable segments.
  • Handling back-pressure through monitoring running repairs and pending compactions.
  • Adding ability to pause or cancel repairs and track progress precisely.

Reaper ships with a REST API, a command line tool and a web UI.
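To illustrate what "splitting repair jobs into smaller tunable segments" means, here is a minimal sketch that cuts one token range into contiguous sub-ranges. It is not Reaper's actual segment generator: the function name `split_range` and its parameters are hypothetical, and the sketch assumes a non-wrapping range written as (start, end].

```python
from typing import List, Tuple

# A token range (start, end]: start is exclusive, end is inclusive.
TokenRange = Tuple[int, int]


def split_range(start: int, end: int, segment_count: int) -> List[TokenRange]:
    """Split the non-wrapping range (start, end] into contiguous sub-ranges.

    Hypothetical helper for illustration only, not Reaper's implementation.
    """
    if segment_count < 1:
        raise ValueError("segment_count must be >= 1")
    if end <= start:
        raise ValueError("wrap-around ranges are not handled in this sketch")
    width = end - start
    # Never create more segments than there are tokens in the range.
    segment_count = min(segment_count, width)
    # Integer arithmetic keeps the bounds exact for 64-bit tokens.
    bounds = [start + (width * i) // segment_count for i in range(segment_count + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


print(split_range(0, 100, 4))  # [(0, 25), (25, 50), (50, 75), (75, 100)]
```

Smaller segments mean each repair session covers less data, so a failed or paused segment costs less to redo.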

Event Timeline

vsellier triaged this task as Normal priority. Aug 25 2022, 10:54 AM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Sep 9 2022, 11:49 AM
vsellier moved this task from Backlog to in-progress on the System administration board.

A new production node for replayers and generic load was added to the cluster to provide more compute resources for testing the tool.

Reaper accesses the cassandra servers through JMX. The cassandra deployment scripts need to be adapted (in progress) to expose JMX on the public interface.
When JMX is publicly exposed, the cassandra startup scripts force JMX access to be password protected.
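Once the deployment scripts expose JMX, a quick way to check that a node is reachable from the reaper host is a plain TCP connection test. The sketch below assumes the Cassandra default JMX port 7199; the function name `jmx_port_reachable` is hypothetical. It only verifies that the port accepts connections, it does not check the JMX credentials.

```python
import socket


def jmx_port_reachable(host: str, port: int = 7199, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    7199 is the default Cassandra JMX port; adjust it if the deployment differs.
    This does not authenticate against JMX, it only tests reachability.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A call such as `jmx_port_reachable("<cassandra node address>")` run from the reaper host should return True once the public interface is exposed.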

Reaper was manually deployed and is running.
The main functionalities for now are the scheduling of the different repair types and the orchestration of the segments to repair, which avoids repairing the same segment on different replicas at the same time.
Secondary functionalities can be useful too, like repair progress tracking and stop / resume: http://cassandra-reaper.io/docs/concepts/
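The orchestration idea can be sketched as follows: a segment is only started when none of its replicas is already repairing and none has too many pending compactions. This is an illustration of the principle, not Reaper's code; the names `Segment` and `next_runnable_segment` and the threshold value are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Segment:
    segment_id: str
    replicas: FrozenSet[str]  # hosts holding a replica of this token range


def next_runnable_segment(
    candidates: Iterable[Segment],
    busy_hosts: Set[str],
    pending_compactions: Dict[str, int],
    max_pending_compactions: int = 20,  # illustrative threshold, an assumption
) -> Optional[Segment]:
    """Return the first segment whose replicas are all idle, or None."""
    for segment in candidates:
        # Skip the segment if any replica is already involved in a repair.
        if segment.replicas & busy_hosts:
            continue
        # Back-pressure: skip if any replica has too many pending compactions.
        if any(
            pending_compactions.get(host, 0) > max_pending_compactions
            for host in segment.replicas
        ):
            continue
        return segment
    return None  # all nodes are busy, the caller retries later


s1 = Segment("s1", frozenset({"node-a", "node-b", "node-c"}))
s2 = Segment("s2", frozenset({"node-d", "node-e", "node-f"}))
print(next_runnable_segment([s1, s2], {"node-a"}, {"node-d": 5}))
```

In this example `s1` is skipped because `node-a` is busy, so `s2` is selected. When the function returns None, the runner simply waits and tries again.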

There are also a couple of repair metrics exposed that I will try to exploit once the replay has been running for some time.

For example, these lines appeared in the reaper logs when several repairs for different keyspaces were scheduled at the same time:

INFO   [2022-09-16 15:02:34,268] [archive_production:67b1d310-35c5-11ed-8ea7-4b43418aeab2:67b9e963-35c5-11ed-8ea7-4b43418aeab2] i.c.s.SegmentRunner - Repair for segment 67b9e963-35c5-11ed-8ea7-4b43418aeab2 started, status wait will timeout in 1800000 millis
INFO   [2022-09-16 15:02:58,602] [archive_production:9a773740-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Maximum number of concurrent repairs reached. Repair 9a773740-35cf-11ed-8ea7-4b43418aeab2 will resume later.
INFO   [2022-09-16 15:02:58,602] [archive_production:9a773740-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Current active repair runners: [(67b1d310-35c5-11ed-8ea7-4b43418aeab2,1663335748289), (76982be0-35cf-11ed-8ea7-4b43418aeab2,1663340068254), (9a773740-35cf-11ed-8ea7-4b43418aeab2,1663340128436), (9a9cc0a0-35cf-1
INFO   [2022-09-16 15:02:58,787] [archive_production:9a9cc0a0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Maximum number of concurrent repairs reached. Repair 9a9cc0a0-35cf-11ed-8ea7-4b43418aeab2 will resume later.
INFO   [2022-09-16 15:02:58,787] [archive_production:9a9cc0a0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Current active repair runners: [(67b1d310-35c5-11ed-8ea7-4b43418aeab2,1663335748289), (76982be0-35cf-11ed-8ea7-4b43418aeab2,1663340068254), (9a773740-35cf-11ed-8ea7-4b43418aeab2,1663340128436), (9a9cc0a0-35cf-1
INFO   [2022-09-16 15:02:59,336] [archive_production:9aeae0a0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Maximum number of concurrent repairs reached. Repair 9aeae0a0-35cf-11ed-8ea7-4b43418aeab2 will resume later.
INFO   [2022-09-16 15:02:59,336] [archive_production:9aeae0a0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Current active repair runners: [(67b1d310-35c5-11ed-8ea7-4b43418aeab2,1663335748289), (76982be0-35cf-11ed-8ea7-4b43418aeab2,1663340068254), (9a773740-35cf-11ed-8ea7-4b43418aeab2,1663340128436), (9a9cc0a0-35cf-1
INFO   [2022-09-16 15:02:59,555] [archive_production:9b0c7260-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Maximum number of concurrent repairs reached. Repair 9b0c7260-35cf-11ed-8ea7-4b43418aeab2 will resume later.
INFO   [2022-09-16 15:02:59,555] [archive_production:9b0c7260-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Current active repair runners: [(67b1d310-35c5-11ed-8ea7-4b43418aeab2,1663335748289), (76982be0-35cf-11ed-8ea7-4b43418aeab2,1663340068254), (9a773740-35cf-11ed-8ea7-4b43418aeab2,1663340128436), (9a9cc0a0-35cf-1
INFO   [2022-09-16 15:02:59,779] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Attempting to run new segment...
INFO   [2022-09-16 15:02:59,813] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Next segment to run : 76998b71-35cf-11ed-8ea7-4b43418aeab2
INFO   [2022-09-16 15:02:59,849] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2:76998b71-35cf-11ed-8ea7-4b43418aeab2] i.c.j.JmxProxy - Triggering repair of range (-5797115047693728403,-5671075333212739092] for keyspace "reaper_db" on host 192.168.100.182, with repair parallelism dc_parallel, in cluster with Cas
INFO   [2022-09-16 15:02:59,851] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2:76998b71-35cf-11ed-8ea7-4b43418aeab2] i.c.j.JmxProxy - Triggering repair for ranges -5797115047693728403:-5671075333212739092
INFO   [2022-09-16 15:02:59,863] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2:76998b71-35cf-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Triggered repair of segment 76998b71-35cf-11ed-8ea7-4b43418aeab2 via host 192.168.100.182
INFO   [2022-09-16 15:02:59,863] [archive_production:76982be0-35cf-11ed-8ea7-4b43418aeab2:76998b71-35cf-11ed-8ea7-4b43418aeab2] i.c.s.SegmentRunner - Repair for segment 76998b71-35cf-11ed-8ea7-4b43418aeab2 started, status wait will timeout in 1800000 millis
INFO   [2022-09-16 15:03:04,227] [archive_production:67b1d310-35c5-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - Attempting to run new segment...
INFO   [2022-09-16 15:03:04,254] [archive_production:67b1d310-35c5-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments.
INFO   [2022-09-16 15:03:04,262] [archive_production:67b1d310-35c5-11ed-8ea7-4b43418aeab2] i.c.s.RepairRunner - All nodes are busy or have too many pending compactions for the remaining candidate segments.
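
To see at a glance which repair runs were postponed, log lines like the ones above can be parsed with a small script. This is a sketch that assumes the line layout seen in this excerpt (level, timestamp, a `cluster:run_id[:segment_id]` context, logger, message); the names `Entry`, `parse_line` and `postponed_runs` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Layout assumed from the excerpt: LEVEL [timestamp] [context] logger - message
LINE_RE = re.compile(
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<timestamp>[^\]]+)\]\s+"
    r"\[(?P<context>[^\]]+)\]\s+"
    r"(?P<logger>\S+) - (?P<message>.*)"
)


class Entry(NamedTuple):
    timestamp: str
    cluster: str
    run_id: str
    segment_id: Optional[str]
    logger: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line, tolerating the terminal border characters."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    parts = match["context"].split(":")
    if len(parts) < 2:
        return None
    segment_id = parts[2] if len(parts) > 2 else None
    # Drop trailing padding and the right-hand border of the terminal box.
    message = match["message"].rstrip(" │")
    return Entry(match["timestamp"], parts[0], parts[1], segment_id,
                 match["logger"], message)


def postponed_runs(lines: Iterable[str]) -> List[str]:
    """Return the ids of repair runs that hit the concurrency limit, in order."""
    seen: List[str] = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry.message.startswith("Maximum number of concurrent repairs reached"):
            if entry.run_id not in seen:
                seen.append(entry.run_id)
    return seen
```

Applied to the excerpt above, `postponed_runs` would list the four runs that were told to resume later.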