
Test and select a software router
Closed, Migrated

Description

Currently, the network management of the gateways is done manually, with some iptables rules and custom route management.
A software router would help centralize the rules and the network configuration (VPNs included) and simplify the overall setup.
As pfSense is a well-known solution in the network management community, the test will initially target it to check whether it can meet our needs.

list of tasks copied from the first comment

  • partially done (ping issue) Testing if having an interface on the VLAN1300 works, as the hypervisor should now be properly configured
  • Testing the HA possibilities [1]
  • Testing configuration traceability [2] (the plugin is not yet available in the current version)
  • VPN [4]
    • Test ipsec vpn / azure compatibility
    • Test OpenVPN and certificate management
  • Test the monitoring capabilities / Prometheus integration (via an SNMP exporter [5] or NetFlow; there are a lot of resources on the internet about Prometheus / Grafana integration [6])

Event Timeline

vsellier changed the task status from Open to Work in Progress.Oct 13 2020, 9:52 AM
vsellier triaged this task as Normal priority.
vsellier created this task.

PFSense and OPNsense were tested.

Environment:
3 VMs on the same hypervisor, to be able to simulate the VLANs locally as they are not completely configured yet (a possible creation command is sketched after the list):

  • Router :
    • 1 CPU
    • 2 cores
    • 2 GB of memory
    • 32 GB of disk
    • 3 Network interfaces
      • vtnet0: WAN on VLAN400 ip: 192.168.100.198 (louvre as gateway)
      • vtnet1: OPT1 on VLAN442 ip: 192.168.50.1
      • vtnet2: OPT2 on VLAN443 ip: 192.168.129.1
  • 1 server in the admin network
    • 1 CPU
    • 2 cores
    • 4 GB of memory
    • 32 GB of disk
    • 1 network interface
      • eth0: VLAN442 ip: 192.168.50.10
  • 1 server in the staging network
    • 1 CPU
    • 2 cores
    • 4 GB of memory
    • 32 GB of disk
    • 1 network interface
      • eth0: VLAN443 ip: 192.168.129.198
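
For reference, creating the router VM on the Proxmox side could look like the sketch below (the VM id, storage name and bridge name are placeholders; the VLAN tags and resources follow the list above):

# Illustrative only: VM id 9100, storage "local-lvm" and bridge "vmbr0" are assumptions
qm create 9100 --name test-router --sockets 1 --cores 2 --memory 2048 \
  --scsi0 local-lvm:32 \
  --net0 virtio,bridge=vmbr0,tag=400 \
  --net1 virtio,bridge=vmbr0,tag=442 \
  --net2 virtio,bridge=vmbr0,tag=443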

pfSense and OPNsense share the same code base and offer almost the same functionalities.
OPNsense is a fork of pfSense created in 2015, and its development is more active, with regular releases. It also has some advantages over pfSense, especially a better interface, an API, and interesting plugins such as configuration backup in git.

                       | pfSense                                                         | OPNsense
Documentation          | https://docs.netgate.com/pfsense/                               | https://docs.opnsense.org/
Hardware documentation | https://docs.netgate.com/pfsense/en/latest/hardware/index.html  | https://docs.opnsense.org/manual/hardware.html
Puppet support         | No                                                              | No
Terraform support      | No                                                              | In development (early stage)

As expected, since that is what they are made for, the network configuration itself is quite easy [7] on both solutions. With a few rules, the staging network can be isolated from the other networks, the admin network can access the staging servers, and the internet gateway works for all the servers (while preventing the production network from being reached); a rough sketch of this policy is given below.
Diagnosis is easy thanks to the ability to log accepted or denied packets per rule.
The OPNsense interface looks better and offers more options to diagnose the network configuration, especially the inspect view of the interfaces:
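
To give an idea of what those few rules amount to, here is a rough pf-style sketch of the policy (illustrative only; the real rules are managed per interface in the GUI, and the interface names follow the router description above):

# Sketch of the filtering policy in pf syntax, not the ruleset generated by OPNsense
table <rfc1918> { 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 }

block in log all                                                  # default deny
pass in on vtnet1 from 192.168.50.0/24 to 192.168.129.0/24        # admin (VLAN442) may reach staging
pass in on { vtnet1, vtnet2 } from any to ! <rfc1918>             # both networks may reach the internet
block in log on vtnet2 from 192.168.129.0/24 to 192.168.100.0/24  # staging (VLAN443) must not reach production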

Regarding performance, a naive file transfer between the admin network and the staging network gives these results (there are no significant differences between pfSense and OPNsense):

On staging:

root@test-staging2:~# nc -l -p 90 | pv > /dev/null
63.1GiB 0:02:03 [ 521MiB/s] [<=>                            ]

On admin:

root@test-admin:~# cat /dev/zero | pv | nc 192.168.129.198 90
62.2GiB 0:01:59 [ 551MiB/s] [<=>                            ]

I haven't pushed further with a real benchmark using a tool like iperf3, to avoid overloading the hypervisor (branly).

In conclusion, the basic functionalities look OK, although there is no way to automate the configuration. OPNsense wins the match thanks to its better interface, a better release cadence and some interesting plugins not available for pfSense (configuration backup in git).

After sharing this with my fellow colleagues, some other tests to execute emerged:

  • Testing if having an interface on the VLAN1300 works, as the hypervisor should now be properly configured
  • Testing the HA possibilities [1]
  • Testing configuration traceability [2]
  • VPN [4]
    • Test ipsec vpn / azure compatibility
    • Test OpenVPN and certificate management
  • Test the monitoring capabilities / Prometheus integration (via an SNMP exporter [5] or NetFlow; there are a lot of resources on the internet about Prometheus / Grafana integration [6])

Some interesting links:

  • pfSense best practices [3]

[1]: https://docs.opnsense.org/manual/hacarp.html
[2]: https://docs.opnsense.org/manual/git-backup.html
[3]: https://docs.netgate.com/pfsense/en/latest/firewall/best-practices.html?highlight=configuration
[4]: https://docs.opnsense.org/manual/vpnet.html#integrated-vpn-options
[5]: https://github.com/prometheus/snmp_exporter
[6]: https://brooks.sh/2019/11/17/network-flow-analysis-with-prometheus/
[7]: 2020-10-15 edit: some details on the installation:

Having the WAN gateway declared on the VLAN1300 works well.
Changing the default gateway to 128.93.166.62 requires declaring an additional route for the VPN connections (192.168.101.0/24 => gw 192.168.100.1).
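
At the FreeBSD level, this extra route is equivalent to the following (in practice it is declared in the GUI rather than run by hand):

# Keep the VPN subnet reachable through the VLAN440 gateway after the default gateway moved to VLAN1300
route add -net 192.168.101.0/24 192.168.100.1
# Verification
netstat -rn | grep 192.168.101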

root@router:~ # host myip.opendns.com resolver1.opendns.com
Using domain server:
Name: resolver1.opendns.com
Address: 208.67.222.222#53
Aliases: 

myip.opendns.com has address 128.93.166.2
Host myip.opendns.com not found: 3(NXDOMAIN)
root@router:~ # curl -v https://1.1.1.1/ > /dev/null
...
< HTTP/2 200 
...

There is only an issue with the ping.

root@router:~ # ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 0 packets received, 100.0% packet loss

The packets are not filtered: the request is sent and the response is received, but ping does not detect it:

00:00:00.376348 rule 80/0(match): pass out on vtnet3: (tos 0x0, ttl 64, id 14533, offset 0, flags [none], proto ICMP (1), length 84)
   128.93.166.2 > 8.8.8.8: ICMP echo request, id 15907, seq 0, length 64
00:00:00.002222 rule 80/0(match): pass out on vtnet3: (tos 0x0, ttl 116, id 0, offset 0, flags [none], proto ICMP (1), length 84)
   8.8.8.8 > 128.93.166.2: ICMP echo reply, id 15907, seq 0, length 64
root@router:~ # tcpdump -n -vvv -i vtnet3 host 1.1.1.1
tcpdump: listening on vtnet3, link-type EN10MB (Ethernet), capture size 262144 bytes
14:10:20.709961 IP (tos 0x0, ttl 64, id 25281, offset 0, flags [none], proto ICMP (1), length 84)
    128.93.166.2 > 1.1.1.1: ICMP echo request, id 11217, seq 0, length 64
14:10:20.713574 IP (tos 0x0, ttl 58, id 426, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 0, length 64
14:10:20.713597 IP (tos 0x0, ttl 57, id 426, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 0, length 64
14:10:21.767902 IP (tos 0x0, ttl 64, id 51343, offset 0, flags [none], proto ICMP (1), length 84)
    128.93.166.2 > 1.1.1.1: ICMP echo request, id 11217, seq 1, length 64
14:10:21.771453 IP (tos 0x0, ttl 58, id 45932, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 1, length 64
14:10:21.771475 IP (tos 0x0, ttl 57, id 45932, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 1, length 64
14:10:22.781612 IP (tos 0x0, ttl 64, id 20041, offset 0, flags [none], proto ICMP (1), length 84)
    128.93.166.2 > 1.1.1.1: ICMP echo request, id 11217, seq 2, length 64
14:10:22.785218 IP (tos 0x0, ttl 58, id 44859, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 2, length 64
14:10:22.785233 IP (tos 0x0, ttl 57, id 44859, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 2, length 64

I'm trying to figure out why...
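
For reference, a few standard FreeBSD/pf checks that could help narrow this down (a sketch of the diagnosis, not commands that were run as part of this test):

# Check whether pf keeps a state entry for the ICMP flow
pfctl -ss | grep -i icmp
# See which rule, direction and interface pf logs for these packets
tcpdump -n -e -ttt -i pflog0 icmp
# Confirm the echo replies really reach the interface ping listens on
tcpdump -n -i vtnet3 'icmp[icmptype] == icmp-echoreply'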

Well, I'm setting this problem aside for the moment, as there is nothing special configured for the interface on the VLAN1300 and I have no idea what the source of the problem could be. Perhaps the "illumination" will come later...

I will focus now on the HA.

The HA was quite simple to configure with the documentation [1] and an additional blog post that helps with the NAT section, which is not very explicit in the official documentation [2].
It's recommended to have a dedicated network link between the 2 firewalls for the synchronization. In the tests I have done, I configured the sync on the admin network (VLAN442). It works, but it's not the optimal configuration.

Several changes were made to get the HA working:

  • Configure a new OPNsense server (clone the previous one and update its interfaces)
  • Configure a VIP (virtual IP) per network; they will be the new gateway addresses
  • Update the servers to use the VIP as their gateway
  • Configure the configuration synchronization between the firewalls

This is a recap of the IPs used:

type                               | Admin net (VLAN442) | Prod net (VLAN440) | Staging net (VLAN443) | Public net (VLAN1300)
opnsense 1 (previously configured) | 192.168.50.1        | 192.168.100.198    | 192.168.129.200       | 128.93.166.2
VIPs                               | 192.168.50.2        | 192.168.100.197    | 192.168.129.1         | 128.93.166.3
opnsense 2 (new fw)                | 192.168.50.3        | 192.168.100.196    | 192.168.129.3         | 128.93.166.4

Configure the new firewall

  • Clone the previous OPNsense VM in Proxmox
  • Disconnect its interfaces via the Proxmox interface
  • Start the firewall and configure the new IP addresses via the console
  • Reconnect the interfaces

VIPs configuration

  • In Interfaces / Virtual IPs / Settings :

Create one VIP per network on both servers, as explained in the documentation (type CARP / a different VHID per IP / same password / ...).
For example, for the VIP of the admin network:
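
For reference, these GUI settings roughly map to the following at the FreeBSD/ifconfig level (the VHID and password are placeholders):

# Master: the lowest advskew owns the VIP
ifconfig vtnet1 vhid 1 advbase 1 advskew 0 pass examplepass alias 192.168.50.2/24
# Backup: same vhid/password, higher advskew, takes over only when the master stops advertising
ifconfig vtnet1 vhid 1 advbase 1 advskew 100 pass examplepass alias 192.168.50.2/24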

The final result looks like :

The status of the VIP should be different on each firewall :

  • on the active :

  • on the passive :

  • If any, the pre-existing firewall rules using the firewall IPs have to be updated to use the VIP addresses
  • As explained in the documentation, the NAT rules have to be updated to translate the real interface IP to the VIP for the outgoing traffic. As our configuration has 2 gateways (128.93.166.62 for VLAN1300 and 192.168.100.1 for VLAN440), 2 rules must be created per network (a pf-style sketch is given below):
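
In pf terms, the outbound NAT amounts to something like the following (illustrative; the real rules are built in Firewall / NAT / Outbound, and the interface names follow the earlier description):

# Traffic leaving through the public interface (VLAN1300) is translated to the public VIP
nat on vtnet3 from 192.168.0.0/16 to any -> 128.93.166.3
# Traffic leaving through the prod interface (VLAN440) is translated to the prod VIP
nat on vtnet0 from 192.168.0.0/16 to any -> 192.168.100.197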

Update the gw of the servers

  • edit /etc/network/interfaces on each server as usual.

For example for the server in the staging network :

root@test-staging2:~# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The normal eth0
auto eth0
iface eth0 inet static
  address 192.168.129.198
  netmask 24
  gateway 192.168.129.1
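
After editing the file, the change can be applied with ifupdown (or simply by rebooting the VM):

ifdown eth0 && ifup eth0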

Configure the firewall synchronization

  • on each firewall go on System / High Availability / Settings
  • Configure the firewall to sync with its neighbor (adapt the sync ip)

Edit 2020-10-15
The HA / status page uses an XML-RPC call to the other firewall to retrieve the sync status.
The user used in the configuration must be declared on the firewall and added to the admin group. Once done, the page is correctly displayed and the sync can be manually triggered:

Tests

Scheduled maintenance

  • Start a ping between 2 servers in the staging and the admin networks
  • Set the master in maintenance mode via the interface (Interfaces / Virtual IPs / Status => Enter Persistence CARP maintenance mode)
  • Check the VIPs status => The owner of the vip has changed OK
  • Check the ping :
root@test-admin:~# ping -D 192.168.129.198
PING 192.168.129.198 (192.168.129.198) 56(84) bytes of data.
[1602659845.546186] 64 bytes from 192.168.129.198: icmp_seq=1 ttl=63 time=0.414 ms
..
[1602659849.550817] 64 bytes from 192.168.129.198: icmp_seq=5 ttl=63 time=0.418 ms
[1602659850.552411] 64 bytes from 192.168.129.198: icmp_seq=6 ttl=63 time=0.601 ms  <=== VIPs switched here 
[1602659851.553593] 64 bytes from 192.168.129.198: icmp_seq=7 ttl=63 time=0.360 ms
...
[1602659858.561841] 64 bytes from 192.168.129.198: icmp_seq=14 ttl=63 time=0.428 ms
^C
--- 192.168.129.198 ping statistics ---
14 packets transmitted, 14 received, 0% packet loss, time 29ms   <==== No lost packets
rtt min/avg/max/mdev = 0.352/0.420/0.601/0.065 ms

There is nothing visible on the client side: the connection states are well replicated between the firewalls and no packets were lost.

Simulate a crash of the master

  • Launch a ping between 2 servers in the admin and staging networks
  • Perform a reset via the proxmox interface of the firewall currently master
  • Check the status of the VIPs => the owner has changed OK
  • Check the status of the ping
root@test-admin:~# ping -D 192.168.129.198
PING 192.168.129.198 (192.168.129.198) 56(84) bytes of data.
[1602660031.083762] 64 bytes from 192.168.129.198: icmp_seq=1 ttl=63 time=0.398 ms
[1602660032.084749] 64 bytes from 192.168.129.198: icmp_seq=2 ttl=63 time=0.334 ms
[1602660033.103285] 64 bytes from 192.168.129.198: icmp_seq=3 ttl=63 time=0.360 ms
[1602660036.143286] 64 bytes from 192.168.129.198: icmp_seq=6 ttl=63 time=0.376 ms  <=== 4s between this response and the previous
[1602660037.144289] 64 bytes from 192.168.129.198: icmp_seq=7 ttl=63 time=0.313 ms
...
^C
--- 192.168.129.198 ping statistics ---
10 packets transmitted, 8 received, 20% packet loss, time 73ms   <==== few packets were lost
rtt min/avg/max/mdev = 0.313/0.344/0.398/0.036 ms

There are about 4 seconds of downtime, but the traffic automatically resumes once the failure is detected.

Synchronization traffic

This is an example of the synchronization volume on the admin network during a small iperf3 benchmark:

root@test-staging2:~# iperf3 -p 90 -c  192.168.129.3
Connecting to host 192.168.129.3, port 90
[  5] local 192.168.129.198 port 37338 connected to 192.168.129.3 port 90
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   358 MBytes  3.00 Gbits/sec    0   1.60 MBytes       
[  5]   1.00-2.00   sec   391 MBytes  3.28 Gbits/sec    1   1.39 MBytes       
[  5]   2.00-3.00   sec   392 MBytes  3.29 Gbits/sec   32   1.14 MBytes       
[  5]   3.00-4.00   sec   380 MBytes  3.19 Gbits/sec    0   1.37 MBytes       
[  5]   4.00-5.00   sec   389 MBytes  3.26 Gbits/sec   25   1.11 MBytes       
[  5]   5.00-6.00   sec   391 MBytes  3.28 Gbits/sec    0   1.35 MBytes       
[  5]   6.00-7.00   sec   385 MBytes  3.23 Gbits/sec    0   1.54 MBytes       
[  5]   7.00-8.00   sec   392 MBytes  3.29 Gbits/sec    5   1.21 MBytes       
[  5]   8.00-9.00   sec   388 MBytes  3.25 Gbits/sec    0   1.43 MBytes       
[  5]   9.00-10.00  sec   388 MBytes  3.25 Gbits/sec    0   1.61 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  3.76 GBytes  3.23 Gbits/sec   63             sender
[  5]   0.00-10.00  sec  3.76 GBytes  3.23 Gbits/sec                  receiver

iperf Done.

[1]: https://docs.opnsense.org/manual/how-tos/carp.html
[2]: https://www.thomas-krenn.com/en/wiki/OPNsense_HA_Cluster_configuration

I was not able to test the git backup plugin: it seems it is not yet released, and it does not appear in the installable plugin list.
The commit for version 1.0 was done 6 days ago: https://github.com/opnsense/plugins/commit/87c4c96fe1d1dc881f72f91ee67b6a84c9dea42a
I have also tested with the development version of OPNsense, but the plugin does not appear there either.

There are 3 other solutions to back up the configuration:

  • download the config file
  • store the config in Google Drive
  • store the config in a Nextcloud instance

There is also a page in the GUI listing the history of the changes. By default, the last 60 changes are kept:

IPSec / Azure configuration

This is the current configuration for the SWH/Azure ipsec VPN :

conn louvre-to-azure-west-europe
	closeaction=restart
	dpdaction=restart
	ike=aes256-sha1-modp1024
	esp=aes256-sha1
	reauth=no
	keyexchange=ikev2
	mobike=no
	ikelifetime=28800s
	keylife=3600s
	keyingtries=%forever
	authby=secret
	left=128..xxx.xxx.xxx
	leftsubnet=192.168.100.0/23
	leftid=128.xxx.xxx.xxx
	right=13..xxx.xxx.xxx
	rightid=13..xxx.xxx.xxx
	rightsubnet=192.168.200.0/21
	auto=start

These specs are supported by OPNsense. There is also a dedicated section in the documentation on how to set up an IPsec VPN for Azure [1], so IMO we can consider it supported.

[1]: https://docs.opnsense.org/manual/how-tos/ipsec-s2s-route-azure.html

OpenVPN

The OpenVPN configuration supports a certificate authority and the CSR handling that is currently managed manually on louvre.

The VPN configuration itself is a classical VPN setup: routing and firewall rules [1]; a minimal client-side sketch is given below.

[1]: https://docs.opnsense.org/manual/how-tos/sslvpn_client.html
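
For reference, a client configuration would follow the usual OpenVPN layout; all values below are placeholders, the real ones coming from the CA and certificates managed on the firewall:

# Hypothetical client configuration (server address, port and file names are placeholders)
client
dev tun
proto udp
remote 128.93.166.3 1194
remote-cert-tls server
ca ca.crt
cert client.crt
key client.key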

Monitoring

A prometheus exporter is available as an additional plugin.

After its installation, internal metrics are available:

root@router:~ # curl http://192.168.50.1:9100/metrics > /tmp/metrics.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22846    0 22846    0     0  10.8M      0 --:--:-- --:--:-- --:--:-- 21.7M
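
On the Prometheus side, scraping it would be a standard static job (the job name is arbitrary; the target is the address and port used in the curl above):

# prometheus.yml excerpt (sketch)
scrape_configs:
  - job_name: 'opnsense'
    static_configs:
      - targets: ['192.168.50.1:9100']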

vsellier added subscribers: ardumont, olasd.

@olasd / @ardumont, IMO the tests seem to confirm that OPNsense can be a viable solution, or at worst that it deserves to be tested with the staging infrastructure.

The scope it covers on the infrastructure could be progressively extended if the PoC with staging is conclusive.

vsellier reopened this task as Work in Progress.Oct 14 2020, 5:41 PM

Fix the wrong status change embedded in the previous comment.

The test phase is complete. There seems to be a consensus on OPNsense, with no blocking points.
Let's start the real implementation now.