
upgrade all machines to Debian Stretch
Closed, Migrated · Edits Locked

Description

Catch-all task to track the progress of upgrading Software Heritage machines to Debian Stretch.

machines:

  • banco
  • beaubourg
  • getty
  • grand-palais
  • louvre
  • moma
  • pergamon
  • petit-palais
  • prado
  • saatchi
  • somerset
  • tate
  • uffizi
  • worker{01-16}
  • worker{01-08}.euwest.azure

Event Timeline

zack changed the task status from Open to Work in Progress. Sep 14 2017, 10:03 PM
zack created this task.
zack renamed this task from upgrade all machines to debian stretch to upgrade all machines to Debian Stretch. Oct 9 2017, 11:37 AM
olasd added a subscriber: olasd.

tate still has PHP 5 installed as that's what our current version of mediawiki supports (pending T434).

In T761#14583, @olasd wrote:

tate still has PHP 5 installed as that's what our current version of mediawiki supports (pending T434).

Fixed now.

somerset has been upgraded and migrated to PostgreSQL 10 (via pg_upgrade). However, pglogical fails to resume replication; more investigation is needed.
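
For reference, a minimal sketch of the kind of pg_upgrade invocation this involves on Debian, assuming the old cluster was 9.6 and using the stock bin/data/config paths (all of those are assumptions, not the exact values used on somerset); Debian keeps the configuration under /etc/postgresql/, hence the explicit -o/-O options:

# run as the postgres user, with both clusters stopped
/usr/lib/postgresql/10/bin/pg_upgrade \
  -b /usr/lib/postgresql/9.6/bin  -B /usr/lib/postgresql/10/bin \
  -d /var/lib/postgresql/9.6/main -D /var/lib/postgresql/10/main \
  -o '-c config_file=/etc/postgresql/9.6/main/postgresql.conf' \
  -O '-c config_file=/etc/postgresql/10/main/postgresql.conf'

Note that pg_upgrade does not carry replication slots over to the new cluster, which is consistent with pglogical needing to be reset afterwards.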

Only the workers are left to be upgraded.

Multipath on the MD3260 was flapping on louvre while beaubourg wasn't upgraded, with the following messages:

[ 4393.957050] device-mapper: multipath: Reinstating path 8:224.
[ 4393.957556] device-mapper: multipath: Reinstating path 8:240.
[ 4393.970855] sd 1:0:1:1: rdac: array SoftwareHeritage1, ctlr 0, queueing MODE_SELECT command
[ 4393.970997] sd 1:0:1:1: rdac: array SoftwareHeritage1, ctlr 0, MODE_SELECT returned with sense 05/24/00
[ 4393.970999] device-mapper: multipath: Failing path 8:224.
[ 4393.971567] sd 1:0:1:2: rdac: array SoftwareHeritage1, ctlr 0, queueing MODE_SELECT command
[ 4393.971700] sd 1:0:1:2: rdac: array SoftwareHeritage1, ctlr 0, MODE_SELECT returned with sense 05/24/00
[ 4393.971703] device-mapper: multipath: Failing path 8:240.

The flapping subsided once beaubourg was upgraded. I'm guessing there are some interaction issues when the array is connected to servers running different kernel versions.
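
For the record, nothing SWH-specific is needed to keep an eye on this; a quick sketch of how the path state can be watched with the standard multipath tooling:

# per-path status of the multipath devices backed by the MD3260
multipath -ll
# same information, straight from the running daemon
multipathd -k'show paths'
# follow the kernel messages while the paths flap
dmesg -w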

When both servers were upgraded, DRBD wouldn't recognize the disks (drbd-overview would show the replica pair as Diskless/Diskless, and LVM wouldn't come back up). Trying to bring the disks back up with drbdadm attach r0 would fail with a "no meta data found" error. After poking around for a while, I rebooted beaubourg (which was Primary before its upgrade) on the old kernel, which let it come back up as Primary for the drbd pair.

After the reboot of beaubourg on the old kernel, I did a drbdadm create-md r0 on louvre, to try to re-create the replica. It turns out that /that/ recognized the metadata on disk:

You want me to create a v08 style flexible-size internal meta data block.
There appears to be a v09 flexible-size internal meta data block
already in place on /dev/vg-louvre/drbd at byte offset 300647706624

Valid v09 meta-data found, convert to v08?
[need to type 'yes' to confirm] yes

md_offset 300647706624
al_offset 300647673856
bm_offset 300638498816

Found LVM2 physical volume signature
   293588992 kB data area apparently used
   293592284 kB left usable by current configuration

Even though it looks like this would place the new meta data into
unused space, you still need to confirm, as this is only a guess.

Do you want to proceed?
[need to type 'yes' to confirm] yes

Writing meta data...
New drbd meta data block successfully created.

After a drbdadm attach r0 on louvre, the sync completed and louvre was a proper secondary again.

I then stopped all services on beaubourg and manually set the drbd device to secondary (ensuring no further changes, and that louvre's copy was UpToDate). I rebooted under the new kernel, and the drbd device came back up as Primary... with a Diskless status ???

I did the drbdadm create-md r0 dance on beaubourg as well, and after drbdadm attach r0 the replica pair is back to healthy status.
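
To recap, the recovery sequence described above boils down to the following commands (r0 being the existing resource name; this is not a general-purpose recipe, it only worked because the on-disk data was intact):

# on louvre (the secondary): re-create the metadata, then re-attach the backing device
drbdadm create-md r0
drbdadm attach r0
drbd-overview              # wait for the resync to finish

# on beaubourg: stop the services using the volume, demote it, reboot on the new kernel
drbdadm secondary r0
# ... reboot ...
# then the same create-md / attach dance as on louvre
drbdadm create-md r0
drbdadm attach r0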

Considering those issues, as well as the stability issues we had when using drbd in Primary/Primary mode in the past, I'll migrate the remaining machines away from this drbd volume and get rid of it once and for all. We'll be able to move VM storage to Ceph (which has better integration with Proxmox) soon enough.

Following the woes with drbd, I have now removed all traces of it from our hypervisors.

I've now upgraded PostgreSQL to version 10 everywhere. Replication needs to be reset.
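
Resetting the replication essentially means dropping the stale pglogical subscription on each replica and re-creating it against the upgraded provider; a minimal sketch, where the database name, subscription name and DSN are placeholders rather than our actual values:

# on each subscriber (names and DSN are placeholders)
psql -d softwareheritage -c "SELECT pglogical.drop_subscription('swh_subscription');"
psql -d softwareheritage -c "SELECT pglogical.create_subscription(
    subscription_name := 'swh_subscription',
    provider_dsn := 'host=provider.internal dbname=softwareheritage');"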

I've started doing the work to reinstall the workers under Stretch. worker01 is back up and running, but there are two issues (see the sketch after this list):

  • NFS mounts fail on boot
  • the new "persistent" interface naming is weird (interfaces come up as ens18/19), and it confuses our configuration management
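
A sketch of one way to deal with both issues, assuming we choose to revert to the classic eth0-style names rather than teach the configuration management about the new scheme (server names and paths below are placeholders):

# 1. NFS mounts at boot: mark them as network mounts in /etc/fstab so systemd
#    orders (or automounts) them after the network is up, e.g.:
#    nfs-server:/srv/storage  /srv/storage  nfs  _netdev,x-systemd.automount  0  0

# 2. Interface naming: disable the "predictable" names on the kernel command line,
#    in /etc/default/grub:
#    GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"
# then regenerate the grub configuration and reboot
update-grub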

For info: it looks like the fix for T755, which is now deployed on pergamon, requires the version of monitoring-plugins-basic shipped in Debian Stretch; earlier versions do not have the --only-critical flag.
So, until the workers are upgraded, the pending-package-upgrades check on all of them reports status "unknown" (which isn't a big deal).
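
For context, the flag belongs to the check_apt plugin shipped by monitoring-plugins-basic; on a Stretch host the check amounts to (standard Debian plugin path):

# report only pending upgrades matching the critical (security) list
/usr/lib/nagios/plugins/check_apt --only-critical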

After some thorough massaging, I've finished updating our preseeding configuration for Debian Stretch, and re-created the 16 local workers.
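
For illustration, the kind of directives such a preseed touches looks like this (a generic debian-installer fragment with illustrative values, not our actual configuration):

# debian-installer preseed fragment for a stretch install (values are illustrative)
d-i debian-installer/locale string en_US.UTF-8
d-i mirror/http/hostname string deb.debian.org
d-i mirror/suite string stretch
d-i pkgsel/include string openssh-server puppet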

worker08.euwest.azure has been migrated (scratched and recreated).

Wiki documentation about it has been updated.

The only gotcha I hit was the /etc/facter/facts.d/location.txt file, which we need to install ourselves for Puppet to be satisfied.
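
For reference, that file is just a facter "external fact", one key=value pair per line; a sketch with a placeholder value:

# /etc/facter/facts.d/location.txt  (value is a placeholder)
location=azure_euwest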

ardumont updated the task description.