
Elastic worker cluster failures to unstuck
Closed, Migrated

Description

The cluster is somehow unresponsive.
Try to analyse it and unstick it if possible.

Event Timeline

ardumont created this task.
ardumont changed the task status from Open to Work in Progress. May 25 2022, 5:20 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

Unfortunately, after several tries, we were unable to restart the cluster due to a problem with the etcd leader election / data on the nodes (probably a wrong manipulation on our part).
We finally destroyed the cluster (we had to follow [1] because the cluster was in an unstable state and rancher refused to remove it).

Once the cluster was removed, it was recreated with terraform. The nodes were manually added with the docker command provided by rancher.
The fourth node was started with only a worker configuration. The terraform configuration will be updated accordingly.
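
For reference, the registration command is the one displayed in the rancher UI when adding nodes to the cluster; a rough sketch of its shape (the agent version, server URL, token and checksum below are placeholders, not the actual values):

# worker-only node: only pass --worker (etcd/controlplane nodes also get --etcd --controlplane)
docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:<version> \
  --server https://<rancher-url> \
  --token <registration-token> \
  --ca-checksum <checksum> \
  --worker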

For the record, terraform doesn't like creating the cluster without nodes, because the applications (monitoring / keda) can't be added until the cluster is in an active state.
We will probably have to move this initial configuration outside terraform later.

[1] https://github.com/rancher/rancher/issues/34650#issuecomment-956555419
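
For the record, the workaround in [1] essentially consists in removing the finalizers that block the cluster deletion from rancher's local cluster; a rough sketch, assuming kubectl points at the rancher local cluster and c-xxxxx stands in for the stuck cluster id:

# find the id of the stuck cluster object
kubectl get clusters.management.cattle.io
# clear the finalizers so rancher can finish deleting it
kubectl patch clusters.management.cattle.io c-xxxxx \
  --type=merge -p '{"metadata":{"finalizers":[]}}'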

vsellier moved this task from in-progress to done on the System administration board.

The cluster is up and running.

Regarding the cpu consumption on the nodes, it seems to be related to the cluster management load.
This seems to be confirmed by [1].
Some interesting leads to dig into in order to reduce the cpu consumption on small clusters: [2]

I tried it on the test cluster of our gitlab infra; it reduced the cpu consumption by ~10%, but I'm not sure it's worth it as it can impact the cluster stability.
Perhaps we should try to have 3 small nodes for the cluster management only, and bigger nodes for the workers.

[1] https://github.com/kubernetes/kubernetes/issues/75565#issuecomment-476407045
[2] https://github.com/kubernetes/minikube/issues/3207#issuecomment-618123466
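
To keep an eye on this, a quick way to see where the cpu actually goes (assuming metrics-server is available on the cluster, e.g. through the monitoring stack):

# per-node consumption
kubectl top nodes
# on small clusters the management pods are usually the main consumers
kubectl top pods -n kube-system --sort-by=cpu
kubectl top pods -n cattle-system --sort-by=cpu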

An interesting lead that could possibly explain what happened on the cluster: https://etcd.io/docs/v3.4/faq/#should-i-add-a-member-before-removing-an-unhealthy-member

The etcd documentation will be worth a read during the next incident ;)
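
For next time, the gist of that FAQ entry: remove the unhealthy member first, then add the replacement, otherwise the quorum can become unreachable. A sketch with etcdctl (endpoints and TLS flags elided; ids, names and urls are placeholders):

# check which member is unhealthy
etcdctl endpoint health --cluster
etcdctl member list
# remove the broken member *before* adding a new one
etcdctl member remove <member-id>
etcdctl member add <new-node-name> --peer-urls=https://<new-node-ip>:2380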

I've started the git loader back on that cluster:

for TYPE in git bzr cvs maven pypi npm svn; do \
  REL=workers-$TYPE; \
  NS=ns-loaders-$TYPE; \
  kubectl create namespace $NS; \
  kubectl apply -f loaders-metadata-fetcher.secret.yaml \
    --namespace $NS; \
  kubectl apply -f amqp-access-credentials.secret.yaml \
    --namespace $NS; \
  kubectl apply -f ./loaders-$TYPE-sentry.secret.yaml \
    --namespace $NS; \
done
namespace/ns-loaders-git created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-git-sentry-secrets created
namespace/ns-loaders-bzr created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-bzr-sentry-secrets created
namespace/ns-loaders-cvs created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-cvs-sentry-secrets created
namespace/ns-loaders-maven created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-maven-sentry-secrets created
namespace/ns-loaders-pypi created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-pypi-sentry-secrets created
namespace/ns-loaders-npm created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-npm-sentry-secrets created
namespace/ns-loaders-svn created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-svn-sentry-secrets created

# Then install the loader
$ for TYPE in git; do REL=workers-$TYPE; helm install -f ./instances/loaders-$TYPE.staging.values.yaml $REL ./; done
NAME: workers-git
LAST DEPLOYED: Thu Jun  2 14:13:07 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
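
To check that the release actually spawned the loaders (the exact resource names depend on the chart templates, so they are to adapt):

# what the release created
helm status workers-git
kubectl get deployments,pods --namespace default
# follow the logs of one of the loader pods
kubectl logs -f <loader-git-pod> --namespace default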