
Use local disks for worker scratch space
Closed, Migrated (edits locked)

Description

We currently use a temporary directory on the main filesystem of our workers as spill space when memory overflows.

This means we're replicating temporary content three times through ceph, introducing latency and churn, and using up a noticeable amount of SSD endurance in the process.

We should migrate to local storage that is not replicated, and that is cleared at each stop/start of the virtual machines used as workers.

To do so, we need to:

  • remove one SSD from each hypervisor's ceph pool to use as scratch space
  • set up the extracted SSD as local storage usable by proxmox
  • write a hook script for proxmox to create and attach the storage when the VM starts, and to detach and drop it when the VM stops (sketched below)
  • update the machines' boot process to create partitions on these new temporary disks and mount them as /tmp
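As a rough illustration of the hook script idea, here is a minimal sketch using proxmox's guest hookscript mechanism; the script path, the "scratch" storage name, the scsi1 slot and the 32G size are assumptions, and (as noted in the update below) this approach was eventually abandoned:

#!/bin/bash
# Hypothetical hookscript, attached with e.g.:
#   qm set <vmid> --hookscript local:snippets/scratch-hook.sh
# proxmox calls it with the VM id and a phase argument.
vmid="$1"
phase="$2"

case "$phase" in
  pre-start)
    # Allocate a volume on the local "scratch" thin pool and attach it
    pvesm alloc scratch "$vmid" "vm-$vmid-scratch" 32G
    qm set "$vmid" --scsi1 "scratch:vm-$vmid-scratch"
    ;;
  post-stop)
    # Detach and drop the scratch volume once the VM has stopped
    qm set "$vmid" --delete scsi1
    pvesm free "scratch:vm-$vmid-scratch"
    ;;
esac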

The hook script idea failed, so we're now simply creating local storage on each worker for a swap partition, and mounting a tmpfs the size of that partition on /tmp (see the sketch below).
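On the worker side, a minimal sketch of the resulting configuration, assuming the new virtual disk shows up as /dev/sdb and is 32G (both illustrative values):

# /etc/fstab on a worker (sketch): swap on the local virtual disk
# (after an initial mkswap /dev/sdb), and a tmpfs of the same size on /tmp
/dev/sdb   none   swap    sw                   0 0
tmpfs      /tmp   tmpfs   size=32G,mode=1777   0 0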

Event Timeline

olasd triaged this task as Normal priority.Nov 5 2021, 2:20 PM
olasd created this task.

I've kicked an SSD out of branly's and hypervisor3's ceph allocation; ceph is currently rebalancing.

On the three hypervisors currently running workers (hypervisor3, branly, pompidou), I've created a thin LVM pool for scratch data, using the following commands:

# Check which disk is free
lsblk

# Create new volume group
vgcreate scratch /dev/sdf

# Create a new logical volume using 80% of the free extents
lvcreate -l 80%FREE -n data scratch

# Set it as a thin pool
lvconvert --type thin-pool scratch/data

I've registered the thin pool as scratch storage using the proxmox UI.
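For reference, the equivalent registration could presumably also be done on the command line, restricting the storage to the local node; the content types listed here are assumptions:

# Register the thin pool as a storage named "scratch" on this node only
pvesm add lvmthin scratch --vgname scratch --thinpool data --content images,rootdir --nodes $(hostname)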

ardumont changed the task status from Open to Work in Progress.Nov 10 2021, 3:36 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

So, it turns out that proxmox unconditionally locks the VM configuration file while it's starting, so the hook can't attach new storage.

We'll go the "hardcoded" route, which will make migrating worker VMs more annoying (as they'll have to transfer useless data from a temporary disk), but that shouldn't be too much of an issue, as we pretty much never migrate worker VMs.

Today, we've implemented the following plan:

  • add a virtual disk, backed by the hypervisor's local storage, for a swap partition
  • mount a tmpfs, sized to match the swap partition, on /tmp on the worker (example commands below)
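Concretely, the steps amount to something like the following; the VM id (142), the scsi1 slot, the /dev/sdb device name and the 32G size are illustrative:

# On the hypervisor: allocate a 32G disk on the local "scratch" thin pool
# and attach it to the worker VM
qm set 142 --scsi1 scratch:32

# On the worker: turn the new disk into swap, then mount a tmpfs of the
# same size on /tmp (made persistent through /etc/fstab)
mkswap /dev/sdb
swapon /dev/sdb
mount -t tmpfs -o size=32G,mode=1777 tmpfs /tmp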

After a week of trial on worker2.staging, this has been done on worker[09-12] in production, and will be generalised at the beginning of next week.

All worker nodes (in staging and production, in Rocquencourt) are now using a tmpfs, backed by a partition allocated on a hypervisor-local SSD disk.