Migrate sentry to admin vlan
Closed, Migrated

Description

Impacts after migration:

  • the sentry service [1] stays reachable as before
  • the machine will be reachable at riverside.internal.admin.swh.network (ssh).

Note:
The node exposing the sentry service is riverside.internal.softwareheritage.org [2].

[1] https://sentry.softwareheritage.org

[2] https://inventory.internal.admin.swh.network/virtualization/virtual-machines/12/

Step-by-step plan:

  • Gandi: Reduce the sentry.s.o CNAME TTL early (days before the migration starts, e.g. ~300s) [7]
  • Inventory:
    • Reserve a new IP in VLAN 442
    • Deprecate the IP from VLAN 440
  • D7045: Puppet manifest adaptations for moving the node to the admin VLAN [4]
  • Firewall: Open a rule to allow access from pergamon to riverside:9000 [8]
  • On {pergamon, riverside, rp1} [5]
    • Stop puppet agent
  • On pergamon:
    • Deploy the new puppet manifest change (last time we forgot ¯\_(ツ)_/¯)
  • On riverside:
    • Update the IP to the new VLAN 442 address (192.168.50.70) [9]
      • Connect through ssh and adapt /etc/network/interfaces with the new IP
      • Modify directly through the Proxmox UI (not terraform-ed yet)
      • Adapt the hardware network entry (Proxmox UI) to switch the bridge from vmbr0 to vmbr442
    • Update the hostname to riverside.i.a.s.n [10]
    • Remove the puppet certificates: rm -rf /var/lib/puppet/ssl (agent node)
    • Update the deployment and subnet facts in /etc/facter/facts.d/ to admin [6]
    • Reboot the machine (poweroff, then start)
    • Run puppet with puppet agent --test --fqdn riverside.internal.admin.swh.network
    • Install the facts needed to stop cloud-init from tampering with /etc/hosts
  • On pergamon:
    • Run puppet agent
    • Decommission the riverside.i.s.o certificate [11]
  • On rp1:
    • Run puppet agent
  • Gandi: Change sentry.s.o CNAME value from pergamon to swh-rproxy3.inria.fr. (to target the admin reverse proxy)
  • Inventory:
    • Change the reserved IP status to active
    • Update the sentry node with its new IP [1]
  • Clean up the no-longer-necessary sentry reverse proxy on pergamon
  • Gandi: Bump the sentry.s.o CNAME TTL back to its "standard" value of 1800 (like the others)
  • Terraform:
    • Reference the riverside node in the sysadm terraform admin manifest [3]. The node is diverging too much and the risk/benefit seems off, so we do not do it.

[3] https://forge.softwareheritage.org/source/swh-sysadmin-provisioning/browse/master/proxmox/terraform/admin/admin.tf

[4] Check the diff description/code for more details

[5]

$ clush -b -w pergamon -w riverside -w rp1.internal.admin.swh.network "puppet agent --disable T3891"

[6]

root@riverside:~# cat /etc/facter/facts.d/deployment.txt
deployment=admin
root@riverside:~# cat /etc/facter/facts.d/subnet.txt
subnet=sesi_rocquencourt_admin
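
[7] A hedged sanity check for the Gandi CNAME/TTL steps, runnable from any machine (the answer depends on the resolver and on when it is run):

$ dig +noall +answer sentry.softwareheritage.org CNAME
# after the switch, the answer should point to swh-rproxy3.inria.fr. with a TTL
# close to the configured value (~300s during the migration, 1800 afterwards)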
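
[8] Once the firewall rule is open, a minimal connectivity check from pergamon (a sketch; it assumes netcat is installed and that the service on riverside listens on port 9000):

root@pergamon:~# nc -zv 192.168.50.70 9000
# -z only checks that the TCP connection can be established, -v prints the result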
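
[9] A sketch of the network change on riverside. The stanza below is illustrative only: the interface name, prefix length and gateway are assumptions and must be checked against the inventory before applying.

# /etc/network/interfaces (relevant stanza only; interface name, prefix and
# gateway are assumptions)
auto ens18
iface ens18 inet static
    address 192.168.50.70/24
    gateway 192.168.50.1

On the hypervisor, the bridge change done in the Proxmox UI is roughly equivalent to the following (VM id and MAC are placeholders; keeping the existing MAC avoids the guest seeing a new interface):

root@<hypervisor>:~# qm set <vmid> --net0 virtio=<existing-mac>,bridge=vmbr442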
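
[10] The hostname/certificate/facts sequence on riverside, spelled out as commands (a sketch that follows the plan above and footnote [6]; the site-specific fact for cloud-init is not spelled out here):

root@riverside:~# hostnamectl set-hostname riverside.internal.admin.swh.network
root@riverside:~# rm -rf /var/lib/puppet/ssl
root@riverside:~# echo deployment=admin > /etc/facter/facts.d/deployment.txt
root@riverside:~# echo subnet=sesi_rocquencourt_admin > /etc/facter/facts.d/subnet.txt
root@riverside:~# poweroff
# start the VM again from the Proxmox UI, then:
root@riverside:~# puppet agent --test --fqdn riverside.internal.admin.swh.network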
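
[11] Decommissioning the old agent certificate on the puppet master (pergamon). Which form applies depends on the puppet version in use:

root@pergamon:~# puppet cert clean riverside.internal.softwareheritage.org
# or, with puppetserver 6 and later:
root@pergamon:~# puppetserver ca clean --certname riverside.internal.softwareheritage.org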

Event Timeline

ardumont triaged this task as Normal priority. Jan 26 2022, 2:42 PM
ardumont created this task.
ardumont updated the task description.
ardumont changed the task status from Open to Work in Progress. Jan 28 2022, 3:34 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

Looks like at least some parts of staging don't have access to sentry anymore; on storage1, for instance, https connections to sentry.softwareheritage.org just hang.

It's been fixed [1].

22:22 <+olasd> I see some of these on staging storage1: Feb 02 21:21:50 storage1 python3[3935968]: 2022-02-02 21:21:50 [3935968] urllib3.connectionpool:WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f06505ea978>: Failed to establish a new connection: [Errno 110]
22:22 <+olasd> Connection timed out')': /api/5/store/
22:23 <+olasd> looks like bits of staging doesn't have access to sentry anymore? curl https://sentry.softwareheritage.org just hangs
01:04 <+vsellier> looks like it was larger than this. any node without an ip in the VLAN 1300 was not able to reach the admin vlan
01:05 <+vsellier> I fixed it by adding all the interfaces on the NAT/port forward rules
01:05 <+vsellier> (https://forge.softwareheritage.org/R228:25dd204fe564636321487a3bdc8db21c2dd8b695)
01:08 <+vsellier> s/admin vlan/admin reverse proxy
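
A quick, hedged way to confirm the fix from an affected staging node (the exact HTTP status returned to an unauthenticated request does not matter, only that the connection no longer hangs):

root@storage1:~# curl -sS -m 10 -o /dev/null -w '%{http_code}\n' https://sentry.softwareheritage.org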
ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.