T2691 Test and select a software router

		Status	Assigned	Task
		Migrated	gitlab-migration	T2650 Network refactoring - step 1
		Migrated	gitlab-migration	T2691 Test and select a software router

Event Timeline

vsellier changed the task status from Open to Work in Progress. Oct 13 2020, 9:52 AM

vsellier triaged this task as Normal priority.

vsellier created this task.

pfSense and OPNsense were tested.

Environment:
3 VMs on the same hypervisor, so that the VLANs can be simulated locally as they are not completely configured yet:

Router:
- 1 cpu
- 2 cores
- 2 GB of memory
- 32 GB of disk
- 3 network interfaces
  - vtnet0: WAN on VLAN440 ip: 192.168.100.198 (louvre as gateway)
  - vtnet1: OPT1 on VLAN442 ip: 192.168.50.1
  - vtnet2: OPT2 on VLAN443 ip: 192.168.129.1

1 server in the admin network:
- 1 cpu
- 2 cores
- 4 GB of memory
- 32 GB of disk
- 1 network interface
  - eth0: VLAN442 ip: 192.168.50.10

1 server in the staging network:
- 1 cpu
- 2 cores
- 4 GB of memory
- 32 GB of disk
- 1 network interface
  - eth0: VLAN443 ip: 192.168.129.198

pfSense and OPNsense share the same code base and have almost the same features.
OPNsense is a fork of pfSense made in 2015; its development is more active, with regular releases. Compared to pfSense, it has some interesting extras, in particular a better interface, an API, configuration stored in git, ...

	PFSense	OPNSense
Documentation	https://docs.netgate.com/pfsense/	https://docs.opnsense.org/
Hardware documentation	https://docs.netgate.com/pfsense/en/latest/hardware/index.html	https://docs.opnsense.org/manual/hardware.html
Puppet support	No	No
Terraform support	No	In development (early stage)

As expected, since that is what they are made for, the network configuration itself is quite easy [7] on both solutions. With a few rules, the staging network can be isolated from the other networks, the admin network can access the staging servers, and the internet gateway works for all the servers (while preventing the production network from being reached).
Diagnosis is easy thanks to the ability to enable accept or deny logging per rule.
The OPNsense interface looks better and offers more options to diagnose the network configuration, especially the inspect view of the interfaces.

Regarding performance, a naive file transfer between the admin network and the staging network gives these results (there is no significant difference between pfSense and OPNsense):

On staging:

root@test-staging2:~# nc -l -p 90 | pv > /dev/null
63.1GiB 0:02:03 [ 521MiB/s] [<=>                            ]

On admin:

root@test-admin:~# cat /dev/zero | pv | nc 192.168.129.198 90
62.2GiB 0:01:59 [ 551MiB/s] [<=>                            ]

I haven't pushed further with a real benchmark using a tool like iperf3, to avoid overloading the hypervisor (branly).
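To compare these figures with link speeds, the pv rates can be converted from MiB/s to Gbit/s. This is only a unit conversion sketch; the function name is made up for the illustration.

```python
MIB = 1024 ** 2  # pv reports binary units (MiB)


def mib_per_s_to_gbit_per_s(rate_mib_s: float) -> float:
    """Convert a rate in MiB/s to Gbit/s (decimal gigabits)."""
    return rate_mib_s * MIB * 8 / 1e9


if __name__ == "__main__":
    # Rates read from the two pv outputs above.
    for label, rate in (("staging (receiver)", 521), ("admin (sender)", 551)):
        print(f"{label}: {rate} MiB/s = {mib_per_s_to_gbit_per_s(rate):.2f} Gbit/s")
```

With the values above, this gives roughly 4.37 Gbit/s on the receiving side and 4.62 Gbit/s on the sending side.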

In conclusion, the basic features look fine, although there is no way to automate the configuration. OPNsense wins the match thanks to its better interface, a better release plan and some interesting plugins not available for pfSense (configuration in git).

After sharing this with my colleagues, some other tests to run emerged:

Test whether having an interface on VLAN1300 works, as the hypervisor should be correctly configured
Test the HA possibilities [1]
Test configuration traceability [2]
VPN [4]
- Test ipsec vpn / azure compatibility
- Test OpenVPN and certificate management
Test the monitoring capabilities / prometheus integration (via an snmp exporter [5] or netflow; there are a lot of resources on the internet about prometheus / grafana integration [6])

Some interesting links:

pfSense best practices [3]

[1]: https://docs.opnsense.org/manual/hacarp.html
[2]: https://docs.opnsense.org/manual/git-backup.html
[3]: https://docs.netgate.com/pfsense/en/latest/firewall/best-practices.html?highlight=configuration
[4]: https://docs.opnsense.org/manual/vpnet.html#integrated-vpn-options
[5]: https://github.com/prometheus/snmp_exporter
[6]: https://brooks.sh/2019/11/17/network-flow-analysis-with-prometheus/
[7]: 2020-10-15 edit: some details on the installation:

the initial interfaces setup in the console : https://docs.opnsense.org/manual/install.html#initial-configuration
the firewall rules configuration : https://docs.opnsense.org/manual/firewall.html

Having the WAN gateway declared on VLAN1300 works well.
Changing the default gateway to 128.93.166.62 requires declaring an additional route for the vpn connections (192.168.101.0/24 => gw 192.168.100.1).
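The effect of the additional route can be sketched as a longest-prefix lookup. This is a simplified model, not the router's actual routing table: directly connected networks are left out, and the `ROUTES` list and `next_hop` function are names invented for the example.

```python
import ipaddress

# Routes taken from the text above; the table layout itself is illustrative.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "128.93.166.62"),         # new default gateway
    (ipaddress.ip_network("192.168.101.0/24"), "192.168.100.1"),  # vpn connections
]


def next_hop(destination: str) -> str:
    """Return the gateway of the most specific route matching the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, gw) for net, gw in ROUTES if addr in net]
    _, gateway = max(matches, key=lambda item: item[0].prefixlen)
    return gateway


if __name__ == "__main__":
    print(next_hop("192.168.101.10"))  # a vpn client address (hypothetical host)
    print(next_hop("1.1.1.1"))         # any internet destination
```

Without the second entry, replies to vpn clients would follow the default route through 128.93.166.62 instead of going back through 192.168.100.1.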

root@router:~ # host myip.opendns.com resolver1.opendns.com
Using domain server:
Name: resolver1.opendns.com
Address: 208.67.222.222#53
Aliases: 

myip.opendns.com has address 128.93.166.2
Host myip.opendns.com not found: 3(NXDOMAIN)

root@router:~ # curl -v https://1.1.1.1/ > /dev/null
...
< HTTP/2 200 
...

There is only an issue with the ping.

root@router:~ # ping -c 3 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 0 packets received, 100.0% packet loss

The packets are not filtered: the request is sent correctly and the response is received, but ping does not detect it:

00:00:00.376348 rule 80/0(match): pass out on vtnet3: (tos 0x0, ttl 64, id 14533, offset 0, flags [none], proto ICMP (1), length 84)
   128.93.166.2 > 8.8.8.8: ICMP echo request, id 15907, seq 0, length 64
00:00:00.002222 rule 80/0(match): pass out on vtnet3: (tos 0x0, ttl 116, id 0, offset 0, flags [none], proto ICMP (1), length 84)
   8.8.8.8 > 128.93.166.2: ICMP echo reply, id 15907, seq 0, length 64

root@router:~ # tcpdump -n -vvv -i vtnet3 host 1.1.1.1
tcpdump: listening on vtnet3, link-type EN10MB (Ethernet), capture size 262144 bytes
14:10:20.709961 IP (tos 0x0, ttl 64, id 25281, offset 0, flags [none], proto ICMP (1), length 84)
    128.93.166.2 > 1.1.1.1: ICMP echo request, id 11217, seq 0, length 64
14:10:20.713574 IP (tos 0x0, ttl 58, id 426, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 0, length 64
14:10:20.713597 IP (tos 0x0, ttl 57, id 426, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 0, length 64
14:10:21.767902 IP (tos 0x0, ttl 64, id 51343, offset 0, flags [none], proto ICMP (1), length 84)
    128.93.166.2 > 1.1.1.1: ICMP echo request, id 11217, seq 1, length 64
14:10:21.771453 IP (tos 0x0, ttl 58, id 45932, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 1, length 64
14:10:21.771475 IP (tos 0x0, ttl 57, id 45932, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 1, length 64
14:10:22.781612 IP (tos 0x0, ttl 64, id 20041, offset 0, flags [none], proto ICMP (1), length 84)
    128.93.166.2 > 1.1.1.1: ICMP echo request, id 11217, seq 2, length 64
14:10:22.785218 IP (tos 0x0, ttl 58, id 44859, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 2, length 64
14:10:22.785233 IP (tos 0x0, ttl 57, id 44859, offset 0, flags [none], proto ICMP (1), length 84)
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 2, length 64

I'm trying to figure out why...

Well, I am leaving this problem aside for the moment, as nothing special is configured for the interface on VLAN1300 and I have no idea what the source of the problem could be. Perhaps the "illumination" will come later...
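When coming back to this, one way to look closer at the capture is to count echo requests and replies per sequence number. The sketch below assumes the `tcpdump -n -vvv` text format shown above; the function name is made up. In the capture above, each request is followed by two replies (TTL 58 and 57), which such a count makes visible. Whether this is related to the ping issue is not established here.

```python
import re
from collections import Counter

ICMP_LINE = re.compile(r"ICMP echo (request|reply), id (\d+), seq (\d+)")


def count_icmp(capture: str) -> dict:
    """Count echo requests and replies per (id, seq) in tcpdump text output."""
    counts = {"request": Counter(), "reply": Counter()}
    for line in capture.splitlines():
        match = ICMP_LINE.search(line)
        if match:
            kind, ident, seq = match.group(1), int(match.group(2)), int(match.group(3))
            counts[kind][(ident, seq)] += 1
    return counts


# First exchange of the capture above (detail lines only).
SAMPLE = """\
    128.93.166.2 > 1.1.1.1: ICMP echo request, id 11217, seq 0, length 64
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 0, length 64
    1.1.1.1 > 128.93.166.2: ICMP echo reply, id 11217, seq 0, length 64
"""

if __name__ == "__main__":
    result = count_icmp(SAMPLE)
    for key, requests in result["request"].items():
        print(key, "requests:", requests, "replies:", result["reply"][key])
```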

I will focus now on the HA.

The HA was quite simple to configure with the documentation [1] and an additional blog post [2], which helps with the NAT section that is not very explicit in the official documentation.
It is recommended to have a dedicated network link between the 2 firewalls for the synchronization. In the tests I have done, I configured the sync on the admin network (VLAN442). It works but it is not the optimal configuration.

Several changes were made to get the HA working:

Configure a new OPNsense server (clone the previous one and update its interfaces)
Configure a VIP (virtual IP) per network; they will be the new gateway addresses
Update the servers to use the VIP as their gateway
Configure the configuration synchronization between the firewalls

This is a recap of the IPs used:

type	Admin net (VLAN442)	Prod net (VLAN440)	Staging net (VLAN443)	Public net (VLAN1300)
opnsense 1 (previously configured)	192.168.50.1	192.168.100.198	192.168.129.200	128.93.166.2
VIPs	192.168.50.2	192.168.100.197	192.168.129.1	128.93.166.3
opnsense 2 (new fw)	192.168.50.3	192.168.100.196	192.168.129.3	128.93.166.4
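A quick consistency check of this address plan can be sketched as follows. The addresses are copied from the table; the /24 prefix length is an assumption (only the staging server's netmask is shown in this task), and `PLAN` and `check_plan` are names invented for the example.

```python
import ipaddress

# Addresses copied from the recap table above.
PLAN = {
    "admin (VLAN442)": {"fw1": "192.168.50.1", "vip": "192.168.50.2", "fw2": "192.168.50.3"},
    "prod (VLAN440)": {"fw1": "192.168.100.198", "vip": "192.168.100.197", "fw2": "192.168.100.196"},
    "staging (VLAN443)": {"fw1": "192.168.129.200", "vip": "192.168.129.1", "fw2": "192.168.129.3"},
    "public (VLAN1300)": {"fw1": "128.93.166.2", "vip": "128.93.166.3", "fw2": "128.93.166.4"},
}


def check_plan(plan: dict, prefixlen: int = 24) -> list:
    """Return problems found: duplicate addresses or hosts outside the VIP's subnet."""
    problems = []
    seen = {}
    for net_name, hosts in plan.items():
        # Assumed prefix length; adjust per network if it differs.
        subnet = ipaddress.ip_network(f"{hosts['vip']}/{prefixlen}", strict=False)
        for role, ip in hosts.items():
            if ipaddress.ip_address(ip) not in subnet:
                problems.append(f"{net_name}: {role} {ip} is outside {subnet}")
            if ip in seen:
                problems.append(f"{ip} is used by both {seen[ip]} and {net_name}/{role}")
            seen[ip] = f"{net_name}/{role}"
    return problems


if __name__ == "__main__":
    print(check_plan(PLAN) or "address plan looks consistent")
```

CARP needs the VIP and both real addresses in the same subnet on each network, which is what the check verifies under the /24 assumption.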

Configure the new firewall

Clone the previous opnsense in proxmox
Disconnect its interfaces via the proxmox interface
Start the fw and configure the new ip addresses via the console
Connect the interfaces

VIPs configuration

In Interfaces / Virtual IPs / Settings :

Create one VIP per network on both servers as explained in the documentation (type CARP / a different vhid per IP / same password / ...)
For example for the vip of the admin network :

The final result looks like :

The status of the VIP should be different on each firewall :

on the active :

on the passive :

If there are any, the existing firewall rules using the firewall IP have to be updated to use the VIP addresses.
As explained in the documentation, the NAT rules have to be updated to translate the real interface IP to the VIP for outgoing traffic. As there are 2 gateways in our configuration (128.93.166.62 for VLAN1300 and 192.168.100.1 for VLAN440), 2 rules must be created per network.
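The resulting rule set can be enumerated as one outbound NAT rule per (internal network, egress interface) pair. This is only a sketch of that reading: the /24 prefixes and the choice of admin and staging as the internal networks are assumptions, the VIPs come from the recap table, and the tuple layout does not mirror the OPNsense configuration format.

```python
from itertools import product

# Assumed internal networks (prefix lengths are not stated in the task).
INTERNAL_NETWORKS = {"admin": "192.168.50.0/24", "staging": "192.168.129.0/24"}
# Egress interfaces and the VIP that outgoing traffic is translated to.
EGRESS_VIPS = {"VLAN1300": "128.93.166.3", "VLAN440": "192.168.100.197"}


def outbound_nat_rules() -> list:
    """List (egress interface, source network, translation address) tuples."""
    return [
        (iface, source, EGRESS_VIPS[iface])
        for (_, source), iface in product(INTERNAL_NETWORKS.items(), EGRESS_VIPS)
    ]


if __name__ == "__main__":
    for iface, source, vip in outbound_nat_rules():
        print(f"on {iface}: source {source} -> translate to {vip}")
```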

Update the gw of the servers

Edit /etc/network/interfaces on each server as usual.

For example for the server in the staging network :

root@test-staging2:~# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The normal eth0
auto eth0
iface eth0 inet static
  address 192.168.129.198
  netmask 24
  gateway 192.168.129.1

Configure the firewall synchronization

On each firewall, go to System / High Availability / Settings
Configure the firewall to sync with its neighbor (adapt the sync IP)

Edit 2020-10-15
The HA / status page uses an XML-RPC call to the other firewall to retrieve the sync status.
The user set in the configuration must be declared on the firewall and added to the admin group. Once done, the page is displayed correctly and the sync can be triggered manually.

Tests

Scheduled maintenance

Start a ping between 2 servers in the staging and the admin networks
Put the master in maintenance mode via the interface (Interfaces / Virtual IPs / Status => Enter Persistent CARP Maintenance Mode)
Check the VIPs status => the owner of the VIP has changed: OK
Check the ping:

root@test-admin:~# ping -D 192.168.129.198
PING 192.168.129.198 (192.168.129.198) 56(84) bytes of data.
[1602659845.546186] 64 bytes from 192.168.129.198: icmp_seq=1 ttl=63 time=0.414 ms
..
[1602659849.550817] 64 bytes from 192.168.129.198: icmp_seq=5 ttl=63 time=0.418 ms
[1602659850.552411] 64 bytes from 192.168.129.198: icmp_seq=6 ttl=63 time=0.601 ms  <=== VIPs switched here 
[1602659851.553593] 64 bytes from 192.168.129.198: icmp_seq=7 ttl=63 time=0.360 ms
...
[1602659858.561841] 64 bytes from 192.168.129.198: icmp_seq=14 ttl=63 time=0.428 ms
^C
--- 192.168.129.198 ping statistics ---
14 packets transmitted, 14 received, 0% packet loss, time 29ms   <==== No lost packets
rtt min/avg/max/mdev = 0.352/0.420/0.601/0.065 ms

Nothing is visible on the client side: the connection states are correctly replicated between the firewalls and no packets were lost.

Simulate a crash of the master

Launch a ping between 2 servers in the admin and staging networks
Perform a reset via the proxmox interface of the firewall currently master
Check the status of the VIPs => the owner has changed OK
Check the status of the ping

root@test-admin:~# ping -D 192.168.129.198
PING 192.168.129.198 (192.168.129.198) 56(84) bytes of data.
[1602660031.083762] 64 bytes from 192.168.129.198: icmp_seq=1 ttl=63 time=0.398 ms
[1602660032.084749] 64 bytes from 192.168.129.198: icmp_seq=2 ttl=63 time=0.334 ms
[1602660033.103285] 64 bytes from 192.168.129.198: icmp_seq=3 ttl=63 time=0.360 ms
[1602660036.143286] 64 bytes from 192.168.129.198: icmp_seq=6 ttl=63 time=0.376 ms  <=== about 3s between this response and the previous one
[1602660037.144289] 64 bytes from 192.168.129.198: icmp_seq=7 ttl=63 time=0.313 ms
...
^C
--- 192.168.129.198 ping statistics ---
10 packets transmitted, 8 received, 20% packet loss, time 73ms   <==== a few packets were lost
rtt min/avg/max/mdev = 0.313/0.344/0.398/0.036 ms

There are about 3s of downtime (icmp_seq 4 and 5 were lost), but the traffic resumes automatically once the failure is detected.
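The interruption can be measured from the `ping -D` timestamps rather than by eye. The sketch below parses the reply lines, returns the largest gap between two consecutive replies and the missing sequence numbers; the function name is made up, and the full, unabridged ping output should be used (an elided "..." section would show up as missing replies).

```python
import re

REPLY = re.compile(r"^\[(\d+\.\d+)\] .* icmp_seq=(\d+) ")


def reply_gaps(ping_output: str):
    """Return (largest gap in seconds, missing icmp_seq list) from `ping -D` output."""
    replies = [
        (float(m.group(1)), int(m.group(2)))
        for m in map(REPLY.match, ping_output.splitlines())
        if m
    ]
    gaps = [later[0] - earlier[0] for earlier, later in zip(replies, replies[1:])]
    seen = {seq for _, seq in replies}
    missing = [seq for seq in range(min(seen), max(seen) + 1) if seq not in seen]
    return max(gaps), missing


# Contiguous reply lines from the crash test above.
CRASH_TEST = """\
[1602660031.083762] 64 bytes from 192.168.129.198: icmp_seq=1 ttl=63 time=0.398 ms
[1602660032.084749] 64 bytes from 192.168.129.198: icmp_seq=2 ttl=63 time=0.334 ms
[1602660033.103285] 64 bytes from 192.168.129.198: icmp_seq=3 ttl=63 time=0.360 ms
[1602660036.143286] 64 bytes from 192.168.129.198: icmp_seq=6 ttl=63 time=0.376 ms
[1602660037.144289] 64 bytes from 192.168.129.198: icmp_seq=7 ttl=63 time=0.313 ms
"""

if __name__ == "__main__":
    gap, missing = reply_gaps(CRASH_TEST)
    print(f"largest gap: {gap:.2f}s, missing icmp_seq: {missing}")
```

On this excerpt the largest gap is about 3.04 s, with icmp_seq 4 and 5 missing.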

Synchronization traffic

This is an example of a synchronization volume through the admin network during a small iperf3 benchmark

root@test-staging2:~# iperf3 -p 90 -c  192.168.129.3
Connecting to host 192.168.129.3, port 90
[  5] local 192.168.129.198 port 37338 connected to 192.168.129.3 port 90
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   358 MBytes  3.00 Gbits/sec    0   1.60 MBytes       
[  5]   1.00-2.00   sec   391 MBytes  3.28 Gbits/sec    1   1.39 MBytes       
[  5]   2.00-3.00   sec   392 MBytes  3.29 Gbits/sec   32   1.14 MBytes       
[  5]   3.00-4.00   sec   380 MBytes  3.19 Gbits/sec    0   1.37 MBytes       
[  5]   4.00-5.00   sec   389 MBytes  3.26 Gbits/sec   25   1.11 MBytes       
[  5]   5.00-6.00   sec   391 MBytes  3.28 Gbits/sec    0   1.35 MBytes       
[  5]   6.00-7.00   sec   385 MBytes  3.23 Gbits/sec    0   1.54 MBytes       
[  5]   7.00-8.00   sec   392 MBytes  3.29 Gbits/sec    5   1.21 MBytes       
[  5]   8.00-9.00   sec   388 MBytes  3.25 Gbits/sec    0   1.43 MBytes       
[  5]   9.00-10.00  sec   388 MBytes  3.25 Gbits/sec    0   1.61 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  3.76 GBytes  3.23 Gbits/sec   63             sender
[  5]   0.00-10.00  sec  3.76 GBytes  3.23 Gbits/sec                  receiver

iperf Done.
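The summary line can be cross-checked against the per-interval lines. The sketch below assumes the iperf3 text format shown above, with intervals reported in MBytes and Gbits/sec, and treats MBytes as 1024 * 1024 bytes, which is consistent with the totals in this output. The function name is made up.

```python
import re

INTERVAL = re.compile(
    r"\[\s*\d+\]\s+([\d.]+)-([\d.]+)\s+sec\s+([\d.]+) MBytes\s+([\d.]+) Gbits/sec"
)


def summarize(iperf_output: str):
    """Sum per-interval transfers and derive the average bitrate in Gbit/s."""
    rows = [tuple(map(float, m.groups())) for m in INTERVAL.finditer(iperf_output)]
    total_mbytes = sum(row[2] for row in rows)
    duration = rows[-1][1] - rows[0][0]
    gbit_per_s = total_mbytes * 1024 ** 2 * 8 / duration / 1e9
    return total_mbytes, duration, gbit_per_s


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py < iperf3_output.txt
    total, seconds, rate = summarize(sys.stdin.read())
    print(f"{total:.0f} MBytes in {seconds:.0f}s = {rate:.2f} Gbit/s")
```

Fed with the ten interval lines above, it gives 3854 MBytes over 10 s, about 3.23 Gbit/s, which matches the iperf3 summary (3.76 GBytes, 3.23 Gbits/sec). The GBytes summary lines are ignored by the pattern.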

[1]: https://docs.opnsense.org/manual/how-tos/carp.html
[2]: https://www.thomas-krenn.com/en/wiki/OPNsense_HA_Cluster_configuration

vsellier updated the task description. Oct 14 2020, 11:44 AM

I was not able to test the git backup plugin, as it seems it is not released yet and does not appear in the list of installable plugins.
The commit for version 1.0 was made 6 days ago: https://github.com/opnsense/plugins/commit/87c4c96fe1d1dc881f72f91ee67b6a84c9dea42a
I have also tested with the development version of OPNsense, but the plugin does not appear there either.

There are 3 other ways to back up the configuration:

download the config file
store the config in a google drive
store the config in a nextcloud instance

There is also a page in the GUI listing the history of changes. By default, the last 60 changes are kept.

vsellier updated the task description. Oct 14 2020, 3:00 PM

IPSec / Azure configuration

This is the current configuration for the SWH/Azure ipsec VPN :

conn louvre-to-azure-west-europe
	closeaction=restart
	dpdaction=restart
	ike=aes256-sha1-modp1024
	esp=aes256-sha1
	reauth=no
	keyexchange=ikev2
	mobike=no
	ikelifetime=28800s
	keylife=3600s
	keyingtries=%forever
	authby=secret
	left=128.xxx.xxx.xxx
	leftsubnet=192.168.100.0/23
	leftid=128.xxx.xxx.xxx
	right=13.xxx.xxx.xxx
	rightid=13.xxx.xxx.xxx
	rightsubnet=192.168.200.0/21
	auto=start

These settings are supported by OPNsense. There is also a dedicated section in the documentation on how to set up an ipsec vpn for azure [1], so IMO we can consider it supported.

[1]: https://docs.opnsense.org/manual/how-tos/ipsec-s2s-route-azure.html
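Before re-creating the tunnel on OPNsense, the two traffic selectors can be sanity-checked. The sketch below parses `key=value` lines of the conn section and verifies that the local and remote subnets do not overlap; the helper names are made up and only the two subnet lines are reproduced.

```python
import ipaddress

# Subnet lines copied from the conn section above.
CONN = """\
	leftsubnet=192.168.100.0/23
	rightsubnet=192.168.200.0/21
"""


def parse_conn(text: str) -> dict:
    """Parse key=value lines of an ipsec.conf conn section into a dict."""
    pairs = (line.strip().split("=", 1) for line in text.splitlines() if "=" in line)
    return {key: value for key, value in pairs}


def subnets_overlap(conf: dict) -> bool:
    """Tell whether the local (left) and remote (right) subnets overlap."""
    left = ipaddress.ip_network(conf["leftsubnet"])
    right = ipaddress.ip_network(conf["rightsubnet"])
    return left.overlaps(right)


if __name__ == "__main__":
    conf = parse_conn(CONN)
    print("overlap:", subnets_overlap(conf))
```

Note that the /23 on the local side covers both 192.168.100.0/24 and 192.168.101.0/24.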

vsellier updated the task description. Oct 14 2020, 3:40 PM

OpenVPN

The OpenVPN configuration supports a certificate authority and the CSR handling that is currently managed manually on louvre.

The VPN configuration itself is a classic VPN configuration, with routing and firewall rules [1]

[1]: https://docs.opnsense.org/manual/how-tos/sslvpn_client.html

Monitoring

A prometheus exporter is available as an additional plugin.

After its installation internal metrics are available:

root@router:~ # curl http://192.168.50.1:9100/metrics > /tmp/metrics.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22846    0 22846    0     0  10.8M      0 --:--:-- --:--:-- --:--:-- 21.7M

Attachment: metrics.txt (22 KB)
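The downloaded file uses the Prometheus text exposition format, so it can be inspected with a few lines of code. This is a simplified parser that assumes no trailing timestamps on sample lines; the metric names and values in `SAMPLE` are hypothetical and not taken from the attached file.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {sample name with labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical sample, for illustration only.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.25
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

if __name__ == "__main__":
    for name, value in parse_metrics(SAMPLE).items():
        print(name, value)
```

To run it on the real output, read `/tmp/metrics.txt` and pass its content to `parse_metrics`.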

vsellier updated the task description. Oct 14 2020, 5:20 PM

@olasd / @ardumont, IMO the tests seem to confirm that OPNsense can be a viable solution or, at worst, that it deserves to be tested with the staging infrastructure.

The scope it covers on the infra could be progressively extended if the PoC with staging is conclusive.

fix the wrong status change embedded with the previous comment

ardumont updated the task description. Oct 15 2020, 9:43 AM

The test phase is complete. There seems to be a consensus on OPNsense, with no blocking points.
Let's start the real implementation now.

This task has been migrated to GitLab.
