
Elastic worker cluster failures to unstuck
Closed, Migrated

Description

The cluster is somehow unresponsive.
Try to analyse it and unstick it if possible.

Event Timeline

ardumont created this task.
ardumont changed the task status from Open to Work in Progress. May 25 2022, 5:20 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

Unfortunately, after several tries, we were unable to restart the cluster due to a problem with the etcd leader election / data on the nodes (probably a wrong manipulation on our part).
We finally destroyed the cluster (we had to follow [1] because the cluster was in an unstable state and rancher refused to remove it).

Once the cluster was removed, it was recreated with terraform. The nodes were manually added with the docker command provided by rancher.
The fourth node was started with only a worker configuration. The terraform configuration will be updated accordingly.
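
For reference, the registration command is the one displayed in the rancher UI when adding nodes to the cluster; a rough sketch of its shape (the agent version, server URL, token and checksum below are placeholders, not the actual values):

# worker-only node: only pass --worker (etcd/controlplane nodes also get --etcd --controlplane)
docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:<version> \
  --server https://<rancher-url> \
  --token <registration-token> \
  --ca-checksum <checksum> \
  --worker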

For the record, terraform doesn't like creating the cluster without nodes, because the applications (monitoring / keda) can't be added until the cluster is in an active state.
We will probably have to move this initial configuration outside terraform later.

[1] https://github.com/rancher/rancher/issues/34650#issuecomment-956555419
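
For the record, the workaround in [1] essentially consists in removing the finalizers that block the cluster deletion from rancher's local cluster; a rough sketch, assuming kubectl points at the rancher local cluster and c-xxxxx stands in for the stuck cluster id:

# find the id of the stuck cluster object
kubectl get clusters.management.cattle.io
# clear the finalizers so rancher can finish deleting it
kubectl patch clusters.management.cattle.io c-xxxxx \
  --type=merge -p '{"metadata":{"finalizers":[]}}'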

vsellier moved this task from in-progress to done on the System administration board.

The cluster is up and running.

Regarding the cpu consumption on the nodes, it seems to be related to the cluster management load.
This seems to be confirmed by [1].
Some interesting leads to dig into in order to reduce the cpu consumption on small clusters: [2]

I tried it on the test cluster of our gitlab infra; it reduced the cpu consumption by ~10%, but I'm not sure it's worth it as it can impact the cluster stability.
Perhaps we should try to have 3 small nodes for the cluster management only, and bigger nodes for the workers.

[1] https://github.com/kubernetes/kubernetes/issues/75565#issuecomment-476407045
[2] https://github.com/kubernetes/minikube/issues/3207#issuecomment-618123466
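
To keep an eye on this, a quick way to see where the cpu actually goes (assuming metrics-server is available on the cluster, e.g. through the monitoring stack):

# per-node consumption
kubectl top nodes
# on small clusters the management pods are usually the main consumers
kubectl top pods -n kube-system --sort-by=cpu
kubectl top pods -n cattle-system --sort-by=cpu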

An interesting lead that could possibly explain what happened on the cluster: https://etcd.io/docs/v3.4/faq/#should-i-add-a-member-before-removing-an-unhealthy-member

The etcd documentation will be worth a read during the next incident ;)
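
For next time, the gist of that FAQ entry: remove the unhealthy member first, then add the replacement, otherwise the quorum can become unreachable. A sketch with etcdctl (endpoints and TLS flags elided; ids, names and urls are placeholders):

# check which member is unhealthy
etcdctl endpoint health --cluster
etcdctl member list
# remove the broken member *before* adding a new one
etcdctl member remove <member-id>
etcdctl member add <new-node-name> --peer-urls=https://<new-node-ip>:2380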

I've started the git loader back on that cluster:

for TYPE in git bzr cvs maven pypi npm svn; do \
  REL=workers-$TYPE; \
  NS=ns-loaders-$TYPE; \
  kubectl create namespace $NS; \
  kubectl apply -f loaders-metadata-fetcher.secret.yaml \
    --namespace $NS; \
  kubectl apply -f amqp-access-credentials.secret.yaml \
    --namespace $NS; \
  kubectl apply -f ./loaders-$TYPE-sentry.secret.yaml \
    --namespace $NS; \
done
namespace/ns-loaders-git created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-git-sentry-secrets created
namespace/ns-loaders-bzr created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-bzr-sentry-secrets created
namespace/ns-loaders-cvs created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-cvs-sentry-secrets created
namespace/ns-loaders-maven created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-maven-sentry-secrets created
namespace/ns-loaders-pypi created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-pypi-sentry-secrets created
namespace/ns-loaders-npm created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-npm-sentry-secrets created
namespace/ns-loaders-svn created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-svn-sentry-secrets created

# Then install the loader
$ for TYPE in git; do REL=workers-$TYPE; helm install -f ./instances/loaders-$TYPE.staging.values.yaml $REL ./; done
NAME: workers-git
LAST DEPLOYED: Thu Jun  2 14:13:07 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
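
To check that the release actually spawned the loaders (the exact resource names depend on the chart templates, so they are to adapt):

# what the release created
helm status workers-git
kubectl get deployments,pods --namespace default
# follow the logs of one of the loader pods
kubectl logs -f <loader-git-pod> --namespace default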