
Servers using the public swh network gateway can't reach inria's ntp servers
Closed, Migrated

Description

The servers (staging/admin) behind the firewall can't reach Inria's ntp servers.
Their clocks are all running ahead. For example, the scheduler in staging ignores some scheduled tasks (this was fixed manually).

Public servers can be reached without any problem.

For example:

root@worker0:~# ntpdate sesi-ntp1.inria.fr
19 Feb 16:56:50 ntpdate[659277]: no server suitable for synchronization found
root@worker0:~# ntpdate europe.pool.ntp.org
19 Feb 17:00:15 ntpdate[659400]: step time server 145.239.25.75 offset -60.343803 sec

Possible causes can be:

  • Filtering of packets coming from the public swh network (128.93.166.0/26) at the dsi level; it seems the servers are not reachable from the outside
  • Some routing issue on the firewall (I didn't find anything after a quick investigation)
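
To compare hosts quickly, the manual checks above can be wrapped in a small helper. This is only a sketch: it assumes ntpdate is installed and exits non-zero when no server answers, and the function name and the NTPDATE override variable are made up for illustration. The -q flag makes ntpdate query the server without setting the clock.

```shell
# Query each server given as argument (without setting the clock) and
# report whether it answered. NTPDATE is a hypothetical override that
# lets another command stand in for ntpdate.
check_ntp_servers() {
    for server in "$@"; do
        if ${NTPDATE:-ntpdate} -q "$server" >/dev/null 2>&1; then
            printf '%s reachable\n' "$server"
        else
            printf '%s unreachable\n' "$server"
        fi
    done
}

# Example call, to be run on the host under test:
#   check_ntp_servers sesi-ntp1.inria.fr europe.pool.ntp.org
```

Running the same call from a host behind the firewall and from a host directly on the public vlan should help tell a firewall routing problem apart from upstream filtering.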

Event Timeline

vsellier created this task.

It seems filtering is a good culprit: from a production worker plugged directly into the public swh vlan, inria's ntp servers can't be reached either:

vsellier@worker01 ~ % ip route
default via 128.93.166.62 dev ens18 onlink 
128.93.166.0/26 dev ens18 proto kernel scope link src 128.93.166.16 
192.168.100.0/24 dev ens19 proto kernel scope link src 192.168.100.21 
192.168.101.0/24 via 192.168.100.1 dev ens19 
192.168.200.0/21 via 192.168.100.1 dev ens19 
vsellier@worker01 ~ % sudo systemctl stop ntp        
vsellier@worker01 ~ % sudo ntpdate sesi-ntp1.inria.fr
19 Feb 17:30:54 ntpdate[1868740]: no server suitable for synchronization found
vsellier@worker01 ~ % sudo ntpdate europe.pool.ntp.org
19 Feb 17:31:42 ntpdate[1868761]: step time server 185.125.206.73 offset -0.555238 sec
vsellier@worker01 ~ % sudo systemctl start ntp
vsellier renamed this task from Servers behind the firewall can't reach the sesi ntp servers to Servers using the public swh network gateway can't reach inria's ntp servers. Feb 19 2021, 6:33 PM

In general I think we should migrate machines to the default ntp pool of our distributor ([0-3].debian.pool.ntp.org).

The main issue with that is that some machines on the sesi_rocquencourt subnet have access to that pool, while others have access to sesi-ntp[12].inria.fr.

Fortunately ntpd handles unavailability of some time servers well, so we can probably just stick both sets of ntp servers in the config, and have a working configuration.
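
As a rough illustration of "both sets in the config", the snippet below prints the six server lines such a configuration could contain. The real configuration is managed by puppet and is not shown in this task, so the function name, the choice of pool versus server directives, and the iburst option are assumptions.

```shell
# Print an ntp.conf fragment listing both sets of time servers.
# Directive choice and "iburst" are assumptions, not the deployed config.
render_ntp_servers() {
    # Default pool of the distribution
    for i in 0 1 2 3; do
        printf 'pool %s.debian.pool.ntp.org iburst\n' "$i"
    done
    # inria servers, reachable only from some machines
    for i in 1 2; do
        printf 'server sesi-ntp%s.inria.fr iburst\n' "$i"
    done
}

render_ntp_servers
```

Servers that never answer should simply not be selected by ntpd, so listing both sets should be harmless on hosts that can only reach one of them.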

I'm trying to test that setup (on worker01) but the kernel seems to be stuck in a bad state: when restarting ntpd with all the servers configured, the following messages are printed:

Feb 22 11:51:51 worker01 ntpd[2095713]: kernel reports TIME_ERROR: 0x2041: Clock Unsynchronized

Even forcing the sync with ntpd -gq doesn't seem to reset the "unsynchronized" state.
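
For reference, the 0x2041 value in that message is the kernel's timex status word. A minimal sketch for decoding it, using the STA_* bit values from linux/timex.h (the function name is made up for illustration):

```shell
# Decode a kernel timex status word into STA_* flag names.
# Flags are listed in bit order (bit 0 first), as in <linux/timex.h>.
decode_timex_status() {
    status=$(( $1 ))   # accepts hex input such as 0x2041
    bit=0
    out=""
    for name in STA_PLL STA_PPSFREQ STA_PPSTIME STA_FLL STA_INS STA_DEL \
                STA_UNSYNC STA_FREQHOLD STA_PPSSIGNAL STA_PPSJITTER \
                STA_PPSWANDER STA_PPSERROR STA_CLOCKERR STA_NANO \
                STA_MODE STA_CLK; do
        if [ $(( status & (1 << bit) )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$out"
}

decode_timex_status 0x2041   # prints: STA_PLL STA_UNSYNC STA_NANO
```

So the message only says that the STA_UNSYNC bit is still set in the kernel; none of the error bits such as STA_CLOCKERR are set.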

Let's see what happens after a reboot...

I've added a NTP dashboard to grafana from the data already collected by the prometheus node exporter.

I've tested changes to the ntp config on worker01 around 9:45 UTC:

https://grafana.softwareheritage.org/d/38sNyU7iz/ntp?orgId=1&refresh=5m&from=1613969540991&to=1614012740991&var-environment=production&var-instance=worker01.internal.softwareheritage.org

Looks like things were working fine with all six servers in the ntp config, and the kernel report is a red herring.

(yes, I cheated and Just Pushed the diff to get the changes to show up here...)

Looks like all servers (in icinga) have recovered their time sync, except for worker{0,1,2} in staging (on which cron seems to be off?)

olasd claimed this task.
olasd added a subscriber: ardumont.

After checking with @ardumont I turned cron back on on worker[0-2].staging, and ran puppet on them.

NTP is now green on all hosts: https://icinga.softwareheritage.org/monitoring/list/services?service=ntp&limit=100