During the rolling restart of the cluster, two disk failures crashed esnode1 and prevented the cluster from recovering.
[Copied from a comment]
Short term plan:
[x] Remove systemlogs indexes older than 1 year to start; we can go down to 3 months if necessary (see the API sketch after this list)
[x] Reactivate shard allocation so that all shards have 1 replica in case of a second node failure
[x] Launch a long smartctl self-test on all the disks of each esnode* server (see the self-test sketch after this list)
[x] Contact DELL support to replace the 2 failing disks (under warranty(?)) [1]
[x] Try to recover the 16 red indexes if possible; if not, delete them as they are not critical
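For reference, a minimal sketch of the Elasticsearch API calls behind the index cleanup, shard allocation, and red-index steps above. The endpoint (localhost:9200) and the date-suffixed `systemlogs-*` index naming are assumptions, not verified against the actual cluster:

```python
#!/usr/bin/env python3
"""Sketch of the Elasticsearch maintenance calls from the short term plan.

Assumptions (not taken from the task): the cluster answers on
localhost:9200 and systemlogs indices are named systemlogs-YYYY.MM.DD.
"""
from datetime import datetime, timedelta

import requests

ES = "http://localhost:9200"  # assumed endpoint

# 1. List the red indices (candidates for recovery or deletion).
red = requests.get(f"{ES}/_cat/indices",
                   params={"health": "red", "format": "json"}).json()
print("red indices:", [i["index"] for i in red])

# 2. Delete systemlogs indices older than the retention cutoff (1 year to start).
cutoff = datetime.utcnow() - timedelta(days=365)
indices = requests.get(f"{ES}/_cat/indices/systemlogs-*",
                       params={"format": "json"}).json()
for idx in indices:
    name = idx["index"]
    try:
        day = datetime.strptime(name.split("-", 1)[1], "%Y.%m.%d")  # assumed naming
    except ValueError:
        continue
    if day < cutoff:
        requests.delete(f"{ES}/{name}")

# 3. Re-enable shard allocation and make sure every index has one replica.
requests.put(f"{ES}/_cluster/settings",
             json={"transient": {"cluster.routing.allocation.enable": "all"}})
requests.put(f"{ES}/_all/_settings",
             json={"index": {"number_of_replicas": 1}})
```

And a sketch of how the long SMART self-test could be launched on every disk of a node; device discovery via /sys/block is an assumption, and the test itself runs inside the drive, with the result visible later in `smartctl -l selftest`:

```python
#!/usr/bin/env python3
"""Sketch: start a long SMART self-test on every sd* disk of a node.

Device discovery via /sys/block is an assumption; adjust to the actual
disk layout of the esnode* servers.
"""
import pathlib
import subprocess

# Pick up sd* block devices (whole disks, not partitions).
disks = sorted(p.name for p in pathlib.Path("/sys/block").iterdir()
               if p.name.startswith("sd"))

for disk in disks:
    device = f"/dev/{disk}"
    # -t long starts the extended self-test; check progress/results later
    # with `smartctl -l selftest <device>`.
    subprocess.run(["smartctl", "-t", "long", device], check=False)
```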
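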
Medium term:
[x] Reconfigure sentry to use its local kafka instance instead of the esnode* kafka cluster (thanks olasd)
[x] D4747, D4757: Clean up the esnode* kafka/zookeeper instances
[] **done for esnode1** Reclaim the 2 TB disk reserved for the journal
[] ~~Add a new datadir to elasticsearch using the newly available disk~~
[] Add smartctl monitoring to detect disk failures as early as possible (see the monitoring sketch below)
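For the smartctl monitoring item, a possible starting point as a sketch; the device list is a placeholder, and a real deployment would feed the result into the existing monitoring stack rather than printing it:

```python
#!/usr/bin/env python3
"""Sketch of a smartctl health check for the esnode* disks.

The device list is a placeholder; the alert hook would be wired into the
existing monitoring stack instead of printing to stdout.
"""
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # placeholder list

def disk_is_healthy(device: str) -> bool:
    """Run `smartctl -H` and report whether the overall health check passed."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    # ATA disks report "... test result: PASSED", SAS disks "SMART Health Status: OK";
    # anything else is treated as a failure here.
    return "PASSED" in result.stdout or "SMART Health Status: OK" in result.stdout

for dev in DEVICES:
    if not disk_is_healthy(dev):
        print(f"ALERT: SMART health check failed for {dev}")
```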
[1] sdb serial: K5GJBLTA / sdc serial: K5GV9REA