The cluster is unresponsive for an unknown reason.
Try to analyse the situation and unstick it if possible.
Description
Revisions and Commits
rSPRE sysadm-provisioning
D7925 / rSPREd7d5ff6b4b64: Recreate the staging-worker cluster
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T4523 Dynamic infrastructure
Migrated | gitlab-migration | T4144 Elastic worker infrastructure
Migrated | gitlab-migration | T4278 Elastic worker cluster failures to unstuck
Event Timeline
Unfortunately, after several tries, we were unable to restart the cluster due to a problem with the etcd leader election / the data on the nodes (probably a wrong manipulation on our side).
We finally destroyed the cluster (we had to follow [1] because the cluster was in an unstable state and Rancher refused to remove it).
Once the cluster was removed, it was recreated with terraform. The nodes were manually added with the docker command provided by rancher.
The fourth node was started with a worker-only configuration; the terraform configuration will be updated accordingly.
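For reference, the registration command generated by Rancher looks roughly like the sketch below; the server URL, token, checksum and agent version here are placeholders, not the actual values used:

```
# Control-plane/etcd nodes were registered with all three roles:
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.6.x \
  --server https://<rancher-url> --token <registration-token> --ca-checksum <checksum> \
  --etcd --controlplane --worker

# The fourth node only got the worker role:
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.6.x \
  --server https://<rancher-url> --token <registration-token> --ca-checksum <checksum> \
  --worker
```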
For the record, terraform doesn't handle cluster creation without nodes well, because the applications (monitoring / keda) can't be added until the cluster is in an active state.
We will probably have to move this initial configuration outside terraform later.
[1] https://github.com/rancher/rancher/issues/34650#issuecomment-956555419
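For the record, the manual cleanup of a cluster stuck in the "removing" state boils down to dropping the finalizers on the cluster object; a minimal sketch (not the exact steps from [1], and the cluster id is a placeholder):

```
# Run against the Rancher *local* (management) cluster, not the broken downstream one.
# Find the internal id of the stuck cluster object (c-xxxxx):
kubectl get clusters.management.cattle.io

# Drop the finalizers that keep it stuck in "removing" state, then delete it:
kubectl patch clusters.management.cattle.io <c-xxxxx> \
  --type=merge -p '{"metadata":{"finalizers":[]}}'
kubectl delete clusters.management.cattle.io <c-xxxxx>
```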
The cluster is up and running.
Regarding the cpu consumption on the nodes, it seems to be related to the cluster management load.
This seems to be confirmed by [1].
An interesting lead to dig into in order to reduce the cpu consumption on small clusters: [2]
I tried it on the test cluster of our gitlab infrastructure: it reduced the cpu consumption by ~10%, but I'm not sure it's worth it as it can impact the cluster stability.
Perhaps we should try to have 3 small nodes dedicated to the cluster management only, and bigger nodes for the workers.
[1] https://github.com/kubernetes/kubernetes/issues/75565#issuecomment-476407045
[2] https://github.com/kubernetes/minikube/issues/3207#issuecomment-618123466
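To check that the cpu indeed goes to the cluster-management components rather than the workloads, something along these lines can help (assuming metrics-server is available, which RKE clusters usually ship by default):

```
# Per-node resource usage:
kubectl top nodes

# CPU of the cluster-management components themselves
# (etcd, kube-apiserver, controller-manager, scheduler, ...):
kubectl top pods -n kube-system --sort-by=cpu
```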
FYI: an odd number of nodes is recommended for an etcd cluster:
https://etcd.io/docs/v3.4/faq/#why-an-odd-number-of-cluster-members
An interesting lead that could possibly explain what happened on the cluster: https://etcd.io/docs/v3.4/faq/#should-i-add-a-member-before-removing-an-unhealthy-member
The etcd documentation deserves to be read before the next incident ;)
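Following that FAQ entry, when replacing a broken node the safer order is to remove the unhealthy member first and only then add the replacement. A hedged sketch with etcdctl (assuming the RKE etcd container has its endpoints/certificates preconfigured in its environment, as it usually does; member id, node name and peer URL are placeholders):

```
# On one of the etcd nodes:
docker exec etcd etcdctl endpoint health
docker exec etcd etcdctl member list

# Remove the unhealthy member *before* adding its replacement,
# so quorum is computed on the healthy member set only:
docker exec etcd etcdctl member remove <member-id>
docker exec etcd etcdctl member add <new-node-name> --peer-urls=https://<new-node-ip>:2380
```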
I've started the git loaders back on that cluster:
```
for TYPE in git bzr cvs maven pypi npm svn; do
  REL=workers-$TYPE
  NS=ns-loaders-$TYPE
  kubectl create namespace $NS
  kubectl apply -f loaders-metadata-fetcher.secret.yaml --namespace $NS
  kubectl apply -f amqp-access-credentials.secret.yaml --namespace $NS
  kubectl apply -f ./loaders-$TYPE-sentry.secret.yaml --namespace $NS
done
namespace/ns-loaders-git created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-git-sentry-secrets created
namespace/ns-loaders-bzr created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-bzr-sentry-secrets created
namespace/ns-loaders-cvs created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-cvs-sentry-secrets created
namespace/ns-loaders-maven created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-maven-sentry-secrets created
namespace/ns-loaders-pypi created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-pypi-sentry-secrets created
namespace/ns-loaders-npm created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-npm-sentry-secrets created
namespace/ns-loaders-svn created
secret/metadata-fetcher-credentials created
secret/amqp-access-credentials created
secret/loaders-svn-sentry-secrets created

# Then install the loader
$ for TYPE in git; do
    REL=workers-$TYPE
    helm install -f ./instances/loaders-$TYPE.staging.values.yaml $REL ./
  done
NAME: workers-git
LAST DEPLOYED: Thu Jun 2 14:13:07 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
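To double-check that the release really came up, something like this can be used (the release landed in the "default" namespace, as shown in the output above):

```
# Release status and list of deployed releases:
helm status workers-git
helm list --namespace default

# Pods created by the release:
kubectl get pods --namespace default
```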