When a worker runs out of memory, it currently spills to a temporary directory on its main filesystem.
Because that filesystem lives on ceph, the temporary content is replicated three times, which adds latency and churn and causes notable SSD wear for data we never need to keep.
We should migrate to local storage that is not replicated and that is cleared each time the virtual machines used as workers are stopped and started.
To do so, we need to:
- remove one SSD from each hypervisor's ceph pool to use as scratch space
- reset the extracted SSD as local storage that can be used by proxmox
- write a hook script for proxmox to create and add the storage when the VM starts, and to remove and drop the storage when the VM stops
- update the machine's boot process to create partitions on these new temporary disks, and mount them as /tmp

The hook script approach failed, so instead we now create local storage on each worker to hold a swap partition, and we mount a tmpfs the size of that partition on /tmp.
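
As a rough sketch of the sizing step, the snippet below reads the swap partition's size from `/proc/swaps`-style content and builds a matching fstab entry for /tmp. The device name `/dev/sdb1`, the helper names, and the mount options (`rw,nosuid,nodev,mode=1777`) are assumptions for illustration, not what is deployed on the workers. It relies on the `Size` column of `/proc/swaps` being in KiB and on tmpfs accepting a `k` suffix for `size=`.

```python
def swap_size_kib(proc_swaps: str, device: str) -> int:
    """Return the size in KiB of `device` as listed in /proc/swaps content."""
    for line in proc_swaps.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns are: Filename, Type, Size (KiB), Used, Priority
        if len(fields) >= 3 and fields[0] == device:
            return int(fields[2])
    raise LookupError(f"{device} is not an active swap device")


def tmpfs_fstab_line(size_kib: int, mount_point: str = "/tmp") -> str:
    """Build an fstab entry for a tmpfs capped at `size_kib` KiB."""
    # Mount options here are an assumption; adjust to local policy.
    options = f"rw,nosuid,nodev,mode=1777,size={size_kib}k"
    return f"tmpfs {mount_point} tmpfs {options} 0 0"


if __name__ == "__main__":
    # Sample content with a hypothetical 8 GiB swap partition on /dev/sdb1.
    # On a real worker this would come from open("/proc/swaps").read().
    sample = (
        "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"
        "/dev/sdb1                               partition\t8388604\t\t0\t\t-2\n"
    )
    print(tmpfs_fstab_line(swap_size_kib(sample, "/dev/sdb1")))
```

Sizing the tmpfs to the swap partition means a full /tmp can be paged out to the local, unreplicated disk without drawing on RAM that the worker needs for its own processes.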