
Evaluate MetalLB as inbound loadbalancer
Closed, Migrated (Edits Locked)

Description

MetalLB [1] is a load balancer implementation that makes it possible to use LoadBalancer objects in bare-metal Kubernetes deployments.
It could allow us to expose services with high availability without deploying and managing a separate load-balancing stack.

[1] https://metallb.org/

Event Timeline

vsellier triaged this task as Normal priority. Sep 14 2022, 10:57 AM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Sep 23 2022, 7:38 PM
vsellier claimed this task.
vsellier moved this task from Backlog to in-progress on the System administration board.

With the ingress controller correctly configured and an ingress declared, everything seems to work correctly:

vsellier@pergamon ~ % cat test-ingress.txt
GET /graphql/ HTTP/1.0
Host: archive.softwareheritage.org


vsellier@pergamon ~ % cat test-ingress.txt| nc 192.168.100.119 80 | head -n 20
HTTP/1.1 200 OK
Server: nginx/1.23.1
Date: Tue, 27 Sep 2022 15:36:35 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 1527
Connection: close

<!DOCTYPE html>
<html>

<head>
  <meta charset=utf-8/>
  <meta name="viewport" content="user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, minimal-ui">
  <title>GraphQL Playground</title>
  <link rel="stylesheet" href="//cdn.jsdelivr.net/npm/graphql-playground-react/build/static/css/index.css" />
  <link rel="shortcut icon" href="//cdn.jsdelivr.net/npm/graphql-playground-react/build/favicon.png" />
  <script src="//cdn.jsdelivr.net/npm/graphql-playground-react/build/static/js/middleware.js"></script>
</head>

<body>
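The same check can be scripted instead of piping a file through nc. The following is only a minimal sketch using the Python standard library; the function names `build_request` and `status_line` are illustrative and not part of any existing tooling, and the request uses CRLF line endings rather than the LF endings of the text file above.

```python
import socket


def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request with an explicit Host header."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def status_line(ip: str, host: str, path: str = "/",
                port: int = 80, timeout: float = 5.0) -> str:
    """Send the request to `ip` and return the first line of the response."""
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        data = b""
        # Read until the end of the status line (or until the peer closes).
        while b"\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.split(b"\r\n", 1)[0].decode("latin-1")


# Example call, assuming the load balancer IP used in this task:
# status_line("192.168.100.119", "archive.softwareheritage.org", "/graphql/")
```

Sending the Host header explicitly is what lets the ingress controller route the request even though the connection targets the raw load balancer IP.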

The first recovery test with a failing node was not very conclusive:

vsellier@pergamon ~ % (while true; do date; sleep 2; done) &
[1] 1305858
vsellier@pergamon ~ % sudo arping 192.168.100.119
Tue Sep 27 15:40:44 UTC 2022
Tue Sep 27 15:41:26 UTC 2022
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=42 time=604.637 usec
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=43 time=744.078 usec
Tue Sep 27 15:41:28 UTC 2022
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=44 time=1.562 msec <--- rancher-node-production-worker03 stopped abruptly via the proxmox ui
Tue Sep 27 15:41:30 UTC 2022
Timeout
Timeout
...
Tue Sep 27 15:47:55 UTC 2022
Timeout
Timeout
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=45 time=408.477 msec  <--- balanced to rancher-node-production-worker02
Tue Sep 27 15:47:57 UTC 2022
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=46 time=645.523 usec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=47 time=507.705 msec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=48 time=2.451 msec

The IP stayed unreachable for about six and a half minutes (15:41:30 to 15:47:55) and was not rebalanced until worker03 was restarted.
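To compare failover times across tests without reading the traces by eye, the interleaved `date` / `arping` output can be parsed. This is a rough sketch, assuming the exact line formats shown in this task; the helper name `failovers` is hypothetical. Because the timestamps come from a `date` loop running every 2 seconds, the result is only accurate to about 2 seconds.

```python
import re
from datetime import datetime

# Format of the lines printed by `date` in the traces of this task.
DATE_FMT = "%a %b %d %H:%M:%S %Z %Y"
# An arping reply line: "60 bytes from <mac> (<ip>): index=... time=..."
REPLY_RE = re.compile(r"bytes from ([0-9a-f:]{17}) \(")


def failovers(lines):
    """Return (old_mac, new_mac, seconds) for each change of responding MAC.

    `seconds` is the gap between the last timestamp at which the old MAC
    answered and the timestamp at which the new MAC first answered.
    """
    now = None        # most recent timestamp printed by `date`
    last_mac = None   # MAC address of the most recent reply
    last_seen = None  # timestamp attached to that reply
    changes = []
    for raw in lines:
        line = raw.strip()
        reply = REPLY_RE.search(line)
        if reply:
            mac = reply.group(1)
            if last_mac and mac != last_mac and now and last_seen:
                gap = (now - last_seen).total_seconds()
                changes.append((last_mac, mac, gap))
            last_mac, last_seen = mac, now
            continue
        try:
            now = datetime.strptime(line, DATE_FMT)
        except ValueError:
            pass  # "Timeout", "...", prompts and other noise are ignored
    return changes
```

Applied to the trace above, this reports a single change from 2e:81:20:19:02:4a to 2e:84:a0:44:9e:c9 after 387 seconds (15:41:28 to 15:47:55).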

MetalLB logs:

metallb/metallb-speaker-jh92n[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:44:14Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:44:14Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-mgmt1","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-mgmt1","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-mgmt1","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-mgmt1","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-mgmt1","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-mgmt1","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:45:14Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:45:14Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"node_controller.go:42","controller":"NodeReconciler","level":"info","start reconcile":"/rancher-node-production-worker04","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-worker04","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-worker04","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"speakerlist.go:271","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"node_controller.go:64","controller":"NodeReconciler","end reconcile":"/rancher-node-production-worker04","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-worker04","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-worker04","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-worker04","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:45:37Z"}
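Most of the excerpt above is routine reconcile noise; the lines that matter are the "partial join" errors from the speakers' member discovery. A small sketch like the following could pull them out, assuming each line has the `<pod>[<container>]: <json>` shape shown above. The function name `partial_joins` is illustrative only.

```python
import json
import re
from collections import Counter

# "<namespace>/<pod>[<container>]: {json payload}"
LINE_RE = re.compile(r"^(?P<pod>[^\[\s]+)\[(?P<container>[^\]]+)\]: (?P<payload>\{.*\})$")
# Peer address inside the error text: "Failed to join <ip>:<port>: dial tcp ..."
PEER_RE = re.compile(r"Failed to join (\S+?): ")


def partial_joins(lines):
    """Count 'partial join' errors per (speaker pod, unreachable peer)."""
    counts = Counter()
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        try:
            entry = json.loads(match.group("payload"))
        except json.JSONDecodeError:
            continue  # skip truncated or malformed log lines
        if entry.get("msg") != "partial join":
            continue
        peer = PEER_RE.search(entry.get("error", ""))
        counts[(match.group("pod"), peer.group(1) if peer else None)] += 1
    return counts
```

In the excerpt above, every such error names the same peer, 192.168.100.123:7946, which the three remaining speakers keep failing to reach.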

If a node is drained out of the cluster, the rebalancing occurs in ~10s, which matches what the documentation announces:

Tue Sep 27 16:17:21 UTC 2022
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=1985 time=1.710 msec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=1986 time=1.376 msec
Tue Sep 27 16:17:23 UTC 2022
Timeout
Tue Sep 27 16:17:25 UTC 2022
Timeout
Timeout
Tue Sep 27 16:17:27 UTC 2022
Timeout
Timeout
Tue Sep 27 16:17:29 UTC 2022
Timeout
Timeout
Tue Sep 27 16:17:31 UTC 2022
Timeout
Timeout
Tue Sep 27 16:17:33 UTC 2022
Timeout
Timeout
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=1987 time=669.150 msec
Tue Sep 27 16:17:35 UTC 2022
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=1988 time=959.452 usec

In a new test with a node completely down, the IP seems to recover after roughly 5 to 6 minutes, which looks related to some cache expiry somewhere:

Tue Sep 27 16:31:59 UTC 2022
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=2926 time=1.679 msec
Tue Sep 27 16:32:01 UTC 2022
Timeout
Timeout
Tue Sep 27 16:32:03 UTC 2022
...
Tue Sep 27 16:37:56 UTC 2022
Timeout
Timeout
Tue Sep 27 16:37:58 UTC 2022
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=2927 time=814.409 msec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=2928 time=864.574 usec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=2929 time=973.083 msec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=2930 time=32.151 msec

FWIW, we didn't manage to replicate the timeout issue when manually killing and/or bringing down the network on the node currently responding to the MetalLB IP address... Every time, the failover happened within 10 seconds.

vsellier moved this task from in-progress to done on the System administration board.

Given the last tests, we can start using it to battle-test it in real usage.
Several documentation sources recommend it as the tool for managing load balancing in on-premise Kubernetes deployments, for example: https://kubernetes.github.io/ingress-nginx/deploy/baremetal/#a-pure-software-solution-metallb