MetalLB[1] is a load balancer implementation that makes it possible to use LoadBalancer objects in bare-metal Kubernetes deployments.
It could allow us to expose services without deploying and managing a new load-balancing stack to ensure HA.
Description
Revisions and Commits
rSPSITE puppet-swh-site
D8577 | rSPSITEbee3d6dc026c Disable ping on hosts/ips managed by metallb

Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T4523 Dynamic infrastructure
Migrated | gitlab-migration | T4534 Evaluate MetalLB as inbound loadbalancer
Event Timeline
With the ingress controller correctly configured and an ingress declared, everything seems to work correctly:
vsellier@pergamon ~ % cat test-ingress.txt
GET /graphql/ HTTP/1.0
Host: archive.softwareheritage.org

vsellier@pergamon ~ % cat test-ingress.txt| nc 192.168.100.119 80 | head -n 20
HTTP/1.1 200 OK
Server: nginx/1.23.1
Date: Tue, 27 Sep 2022 15:36:35 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 1527
Connection: close

<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8/>
<meta name="viewport" content="user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, minimal-ui">
<title>GraphQL Playground</title>
<link rel="stylesheet" href="//cdn.jsdelivr.net/npm/graphql-playground-react/build/static/css/index.css" />
<link rel="shortcut icon" href="//cdn.jsdelivr.net/npm/graphql-playground-react/build/favicon.png" />
<script src="//cdn.jsdelivr.net/npm/graphql-playground-react/build/static/js/middleware.js"></script>
</head>
<body>
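The same check can be scripted instead of piping a file through `nc`. The following is a minimal sketch, not the procedure actually used in this task: the function names `build_request` and `probe_ingress` are hypothetical, and the request uses CRLF line endings as HTTP specifies, which may differ from the line endings in `test-ingress.txt`.

```python
import socket


def build_request(host: str, path: str = "/graphql/") -> bytes:
    """Build a minimal HTTP/1.0 request like the one in test-ingress.txt."""
    # The Host header is what lets the ingress controller pick the right ingress.
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def probe_ingress(ip: str, host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Send the request to the load balancer IP and return the status line."""
    with socket.create_connection((ip, port), timeout=timeout) as conn:
        conn.sendall(build_request(host))
        response = b""
        # With HTTP/1.0 the server closes the connection once the response is sent,
        # so reading until recv() returns an empty chunk collects the whole reply.
        while chunk := conn.recv(4096):
            response += chunk
    # Keep only the first line, e.g. the "HTTP/1.1 200 OK" seen in the transcript.
    return response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
```

Calling `probe_ingress("192.168.100.119", "archive.softwareheritage.org")` from a host that can reach the MetalLB address should return the status line of the response.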
The first recovery test with a failing node was not very conclusive:
vsellier@pergamon ~ % (while true; do date; sleep 2; done) &
[1] 1305858
vsellier@pergamon ~ % sudo arping 192.168.100.119
Tue Sep 27 15:40:44 UTC 2022
Tue Sep 27 15:41:26 UTC 2022
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=42 time=604.637 usec
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=43 time=744.078 usec
Tue Sep 27 15:41:28 UTC 2022
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=44 time=1.562 msec <--- rancher-node-production-worker03 stopped abruptly via the proxmox ui
Tue Sep 27 15:41:30 UTC 2022
Timeout
Timeout
...
Tue Sep 27 15:47:55 UTC 2022
Timeout
Timeout
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=45 time=408.477 msec <--- balanced to rancher-node-production-worker02
Tue Sep 27 15:47:57 UTC 2022
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=46 time=645.523 usec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=47 time=507.705 msec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=48 time=2.451 msec

The IP was not rebalanced until worker03 was restarted.
metallb logs:
metallb/metallb-speaker-jh92n[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:44:14Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:44:14Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-mgmt1","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-mgmt1","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-mgmt1","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-mgmt1","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-mgmt1","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-mgmt1","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:44:44Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:45:14Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"speakerlist.go:256","error":"1 error occurred:\n\t* Failed to join 192.168.100.123:7946: dial tcp 192.168.100.123:7946: connect: no route to host\n\n","expected":1,"joined":0,"level":"error","msg":"partial join","op":"memberDiscovery","ts":"2022-09-27T15:45:14Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"node_controller.go:42","controller":"NodeReconciler","level":"info","start reconcile":"/rancher-node-production-worker04","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-worker04","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-worker04","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-llxgq[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"speakerlist.go:271","level":"info","msg":"triggering discovery","op":"memberDiscovery","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"node_controller.go:64","controller":"NodeReconciler","end reconcile":"/rancher-node-production-worker04","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-worker04","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-jh92n[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:48","controller":"ConfigReconciler","level":"info","start reconcile":"/rancher-node-production-worker04","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:136","controller":"ConfigReconciler","event":"force service reload","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:147","controller":"ConfigReconciler","event":"config reloaded","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"config_controller.go:148","controller":"ConfigReconciler","end reconcile":"/rancher-node-production-worker04","level":"info","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"service_controller_reload.go:61","controller":"ServiceReconciler - reprocessAll","level":"info","start reconcile":"metallbreload/reload","ts":"2022-09-27T15:45:37Z"}
metallb/metallb-speaker-8tkq7[speaker]: {"caller":"service_controller_reload.go:103","controller":"ServiceReconciler - reprocessAll","end reconcile":"metallbreload/reload","level":"info","ts":"2022-09-27T15:45:37Z"}
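Most of the speaker log is routine reconcile noise; the interesting entries are the error-level ones. As a rough sketch for sifting through such output, the snippet below counts error entries per speaker pod. It assumes one log entry per line with the `namespace/pod[container]: {json}` prefix seen in the excerpt; the helper names `parse_speaker_line` and `count_errors` are hypothetical.

```python
import json
from collections import Counter


def parse_speaker_line(line: str):
    """Split 'namespace/pod[container]: {json}' into (pod name, parsed record)."""
    # The first ": " separates the prefix from the JSON payload.
    prefix, _, payload = line.partition(": ")
    pod = prefix.split("/", 1)[-1].split("[", 1)[0]
    return pod, json.loads(payload)


def count_errors(lines):
    """Count error-level entries, keyed by (pod name, message)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pod, record = parse_speaker_line(line)
        if record.get("level") == "error":
            counts[(pod, record.get("msg"))] += 1
    return counts
```

On the excerpt above, this would surface only the `partial join` errors, where the speakers fail to reach 192.168.100.123:7946.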
If a node is drained out of the cluster, the rebalancing occurs in ~10s, which is what the documentation announces:
Tue Sep 27 16:17:21 UTC 2022
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=1985 time=1.710 msec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=1986 time=1.376 msec
Tue Sep 27 16:17:23 UTC 2022
Timeout
Tue Sep 27 16:17:25 UTC 2022
Timeout
Timeout
Tue Sep 27 16:17:27 UTC 2022
Timeout
Timeout
Tue Sep 27 16:17:29 UTC 2022
Timeout
Timeout
Tue Sep 27 16:17:31 UTC 2022
Timeout
Timeout
Tue Sep 27 16:17:33 UTC 2022
Timeout
Timeout
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=1987 time=669.150 msec
Tue Sep 27 16:17:35 UTC 2022
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=1988 time=959.452 usec
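Reading failover times off these interleaved `date` and `arping` traces by eye is error-prone, so here is a small sketch that extracts them. It assumes the output has been saved with one entry per line, and the function name `find_outages` is hypothetical. Because the `date` loop ticks every 2 seconds, the result is only accurate to about that interval.

```python
import re
from datetime import datetime

# Format printed by `date` in the traces, e.g. "Tue Sep 27 16:17:21 UTC 2022".
DATE_FORMAT = "%a %b %d %H:%M:%S UTC %Y"
# arping reply line, capturing the MAC address of the answering node.
REPLY = re.compile(r"bytes from ([0-9a-f:]{17}) \(")


def find_outages(lines):
    """Return (start, end, old_mac, new_mac) for each run of timeouts."""
    outages = []
    now = None           # latest timestamp printed by the `date` loop
    last_mac = None      # MAC address of the last node that answered
    outage_start = None  # timestamp of the first timeout of the current run
    for line in lines:
        line = line.strip()
        reply = REPLY.search(line)
        if reply:
            mac = reply.group(1)
            if outage_start is not None:
                # First reply after a run of timeouts closes the outage.
                outages.append((outage_start, now, last_mac, mac))
                outage_start = None
            last_mac = mac
        elif line == "Timeout":
            if outage_start is None:
                outage_start = now
        else:
            try:
                now = datetime.strptime(line, DATE_FORMAT)
            except ValueError:
                pass  # skip prompts, annotations and elided ("...") lines
    return outages
```

For the drain test, this yields a single outage from 16:17:23 to 16:17:33 (10 seconds), with the address moving from `2e:84:a0:44:9e:c9` to `2e:81:20:19:02:4a`.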
In a new test with a node completely down, the address recovered after roughly 5 to 6 minutes, which looks related to a cache expiry somewhere:
Tue Sep 27 16:31:59 UTC 2022
60 bytes from 2e:81:20:19:02:4a (192.168.100.119): index=2926 time=1.679 msec
Tue Sep 27 16:32:01 UTC 2022
Timeout
Timeout
Tue Sep 27 16:32:03 UTC 2022
...
Tue Sep 27 16:37:56 UTC 2022
Timeout
Timeout
Tue Sep 27 16:37:58 UTC 2022
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=2927 time=814.409 msec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=2928 time=864.574 usec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=2929 time=973.083 msec
60 bytes from 2e:84:a0:44:9e:c9 (192.168.100.119): index=2930 time=32.151 msec
FWIW, we didn't manage to replicate the timeout issue when manually killing and/or bringing down the network on the node currently responding to the MetalLB IP address... Every time, the failover happened within 10 seconds.
Given the latest tests, we can start using it in order to battle-test it.
Several documentation sources recommend it as the tool for managing load balancing in on-premise Kubernetes deployments, for example: https://kubernetes.github.io/ingress-nginx/deploy/baremetal/#a-pure-software-solution-metallb