Servers using the public swh network gateway can't reach inria's ntp servers
Closed, Migrated

Authored By

	vsellier
	Feb 19 2021, 6:07 PM

Description

The servers (staging/admin) behind the firewall can't reach Inria's NTP servers.
Their clocks have all drifted into the future. For example, the scheduler in staging ignored some scheduled tasks (this was fixed manually).

Public servers can be reached without any problem.

For example:

root@worker0:~# ntpdate sesi-ntp1.inria.fr
19 Feb 16:56:50 ntpdate[659277]: no server suitable for synchronization found
root@worker0:~# ntpdate europe.pool.ntp.org
19 Feb 17:00:15 ntpdate[659400]: step time server 145.239.25.75 offset -60.343803 sec
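
When comparing many hosts, the offset reported by ntpdate can be extracted programmatically. The following is a minimal sketch, assuming the output format shown in the transcript above; the function name `parse_ntpdate_line` is made up for illustration. A negative offset means the local clock is ahead of the server.

```python
import re

# Matches the success lines printed by ntpdate, for example:
# "19 Feb 17:00:15 ntpdate[659400]: step time server 145.239.25.75 offset -60.343803 sec"
NTPDATE_RE = re.compile(
    r"ntpdate\[\d+\]: (?:step|adjust) time server (?P<server>\S+) "
    r"offset (?P<offset>[+-]?\d+\.\d+) sec"
)


def parse_ntpdate_line(line):
    """Return (server, offset_in_seconds), or None if the line reports no sync."""
    match = NTPDATE_RE.search(line)
    if match is None:
        return None
    return match.group("server"), float(match.group("offset"))
```

Lines such as "no server suitable for synchronization found" yield `None`, which makes the unreachable case easy to spot.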

Possible causes:

Filtering, at the DSI level, of packets coming from the public swh network (128.93.166.0/26); it seems the NTP servers are not reachable from the outside
A routing issue on the firewall (I didn't find anything after a quick investigation)

Revisions and Commits

rSPSITE puppet-swh-site
	D5126	rSPSITE6c22456c231b Add an icinga check for ntp synchronization on all linux hosts
	D5126	rSPSITE538e422ff5cf Add the public debian ntp pool to the sesi_rocquencourt subnet
	D5126	rSPSITE266a7a5798c2 Drop specific NTP configuration on subnets behind the new firewall

Event Timeline

vsellier triaged this task as High priority. Feb 19 2021, 6:07 PM

vsellier created this task.

Filtering looks like the likely culprit: from a production worker plugged directly into the public swh VLAN, Inria's NTP servers can't be reached either:

vsellier@worker01 ~ % ip route
default via 128.93.166.62 dev ens18 onlink 
128.93.166.0/26 dev ens18 proto kernel scope link src 128.93.166.16 
192.168.100.0/24 dev ens19 proto kernel scope link src 192.168.100.21 
192.168.101.0/24 via 192.168.100.1 dev ens19 
192.168.200.0/21 via 192.168.100.1 dev ens19 
vsellier@worker01 ~ % sudo systemctl stop ntp        
vsellier@worker01 ~ % sudo ntpdate sesi-ntp1.inria.fr
19 Feb 17:30:54 ntpdate[1868740]: no server suitable for synchronization found
vsellier@worker01 ~ % sudo ntpdate europe.pool.ntp.org
19 Feb 17:31:42 ntpdate[1868761]: step time server 185.125.206.73 offset -0.555238 sec
vsellier@worker01 ~ % sudo systemctl start ntp
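
The same reachability test can be run without stopping ntpd, since a one-off SNTP query does not need to bind local port 123. This is a minimal sketch using only the standard library; the function names are made up for illustration, and it sends a plain NTPv3 client request rather than reproducing everything ntpdate does.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_request():
    """48-byte client request: LI=0, VN=3, Mode=3 gives a first byte of 0x1b."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(packet):
    """Extract the server transmit timestamp (bytes 40..47) as Unix time."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def probe(host, timeout=2.0):
    """Return the server time as a Unix timestamp, or None if unreachable."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_request(), (host, 123))
            data, _ = sock.recvfrom(512)
            return parse_transmit_time(data)
    except (OSError, ValueError):
        # Timeouts and DNS failures are OSError subclasses.
        return None
```

Calling `probe("sesi-ntp1.inria.fr")` from an affected host would be expected to return `None`, while `probe("europe.pool.ntp.org")` should return a timestamp.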

vsellier renamed this task from "Servers behind the firewall can't reach the sesi ntp servers" to "Servers using the public swh network gateway can't reach inria's ntp servers". Feb 19 2021, 6:33 PM

In general I think we should migrate machines to the default NTP pool of our distribution ([0-3].debian.pool.ntp.org).

The main issue is that some machines on the sesi_rocquencourt subnet can reach that pool, while others can reach sesi-ntp[12].inria.fr.

Fortunately ntpd handles unavailability of some time servers well, so we can probably just stick both sets of ntp servers in the config, and have a working configuration.
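
As an illustration of what "both sets" could look like, here is a minimal sketch that renders the six time sources as ntp.conf lines. It is not the actual puppet template: the function name, the `iburst` option, and the choice of `pool` for the Debian names versus `server` for the Inria hosts are assumptions.

```python
# Host names taken from the discussion above.
DEBIAN_POOL = [f"{i}.debian.pool.ntp.org" for i in range(4)]
INRIA_SERVERS = [f"sesi-ntp{i}.inria.fr" for i in (1, 2)]


def render_ntp_sources(pools, servers):
    """Render ntp.conf source lines; ntpd skips sources it cannot reach."""
    lines = [f"pool {name} iburst" for name in pools]
    lines += [f"server {name} iburst" for name in servers]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_ntp_sources(DEBIAN_POOL, INRIA_SERVERS), end="")
```

With such a configuration, a host that can only reach one of the two sets should still synchronize against the reachable one.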

I'm trying to test that setup (on worker01) but the kernel seems to be stuck in a bad state: when restarting ntpd with all the servers configured, the following message is printed:

Feb 22 11:51:51 worker01 ntpd[2095713]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized

Even forcing the sync with ntpd -gq doesn't seem to reset the "unsynchronized" state.
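
The hexadecimal status word in that message can be decoded against the STA_* flags defined in the Linux `<linux/timex.h>` header. A minimal sketch follows; the helper name `decode_timex_status` is made up for illustration.

```python
# Status bits from <linux/timex.h>.
STA_FLAGS = {
    0x0001: "STA_PLL",
    0x0002: "STA_PPSFREQ",
    0x0004: "STA_PPSTIME",
    0x0008: "STA_FLL",
    0x0010: "STA_INS",
    0x0020: "STA_DEL",
    0x0040: "STA_UNSYNC",
    0x0080: "STA_FREQHOLD",
    0x0100: "STA_PPSSIGNAL",
    0x0200: "STA_PPSJITTER",
    0x0400: "STA_PPSWANDER",
    0x0800: "STA_PPSERROR",
    0x1000: "STA_CLOCKERR",
    0x2000: "STA_NANO",
    0x4000: "STA_MODE",
    0x8000: "STA_CLK",
}


def decode_timex_status(status):
    """Return the names of the flags set in a timex status word, lowest bit first."""
    return [name for bit, name in sorted(STA_FLAGS.items()) if status & bit]


if __name__ == "__main__":
    print(decode_timex_status(0x2041))
```

For 0x2041 this gives STA_PLL, STA_UNSYNC and STA_NANO: the only abnormal bit is STA_UNSYNC, which the kernel keeps set until ntpd has disciplined the clock.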

Let's see what happens after a reboot...

I've added an NTP dashboard to Grafana, built from the data already collected by the Prometheus node exporter.

I tested changes to the NTP config on worker01 around 9:45 UTC:

https://grafana.softwareheritage.org/d/38sNyU7iz/ntp?orgId=1&refresh=5m&from=1613969540991&to=1614012740991&var-environment=production&var-instance=worker01.internal.softwareheritage.org

Looks like things were working fine with all six servers in the ntp config, and the kernel report is a red herring.
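
The data behind such a dashboard can also be queried directly. The sketch below assumes the node exporter's timex collector is enabled (it exposes `node_timex_offset_seconds`) and uses the Prometheus HTTP API's instant-query response shape; the base URL, function names and the 0.1 s threshold are hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def unsynced_instances(payload, max_offset=0.1):
    """Map instance name to offset for samples whose |offset| exceeds max_offset.

    `payload` is the decoded JSON body of an instant query, whose samples
    look like {"metric": {...labels...}, "value": [timestamp, "string"]}.
    """
    bad = {}
    for sample in payload["data"]["result"]:
        offset = float(sample["value"][1])
        if abs(offset) > max_offset:
            bad[sample["metric"].get("instance", "?")] = offset
    return bad
```

Fetching `build_query_url(base, "node_timex_offset_seconds")` with any HTTP client and passing the decoded JSON to `unsynced_instances` would list the hosts whose clocks are still off.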

olasd added a revision: D5126: Improve NTP handling. Feb 22 2021, 6:17 PM

olasd added a commit: rSPSITE266a7a5798c2: Drop specific NTP configuration on subnets behind the new firewall. Feb 22 2021, 6:19 PM

olasd added a commit: rSPSITE538e422ff5cf: Add the public debian ntp pool to the sesi_rocquencourt subnet.

olasd added a commit: rSPSITE6c22456c231b: Add an icinga check for ntp synchronization on all linux hosts.

(yes, I cheated and Just Pushed the diff to get the changes to show up here...)

Looks like all servers (in icinga) have recovered their time sync, except for worker{0,1,2} in staging (on which cron seems to be off?)

After checking with @ardumont I turned cron back on on worker[0-2].staging, and ran puppet on them.

NTP is now green on all hosts: https://icinga.softwareheritage.org/monitoring/list/services?service=ntp&limit=100
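
For reference, the logic of an NTP synchronization check boils down to comparing the measured offset with two thresholds and returning a Nagios-style state, which Icinga also uses. This is a minimal sketch and not the check added in D5126; the thresholds and the function name are hypothetical.

```python
# Plugin exit codes following the Nagios/Icinga convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_offset(offset, warn=0.5, crit=1.0):
    """Return (state, message) for a clock offset in seconds (None = no data)."""
    if offset is None:
        return UNKNOWN, "NTP UNKNOWN: no offset available"
    magnitude = abs(offset)
    if magnitude >= crit:
        return CRITICAL, f"NTP CRITICAL: offset {offset:+.6f} s"
    if magnitude >= warn:
        return WARNING, f"NTP WARNING: offset {offset:+.6f} s"
    return OK, f"NTP OK: offset {offset:+.6f} s"
```

With these example thresholds, the 60 s offset seen on worker0 would be CRITICAL and the 0.55 s offset seen on worker01 would be WARNING.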

This task has been migrated to GitLab.
