Dar backups fill up disk space on client machines
Closed, Migrated, Edits Locked

Description

I had to fix a disk full issue on tate:/ this morning.

It was caused by the presence of dar files in /srv/backups/

Event Timeline

ftigeot triaged this task as High priority. Aug 9 2018, 9:35 AM
ftigeot created this task.

Our backups are stored on a remote filesystem provided by SESI (filer-backup nfs mount on louvre).

The way our setup for dar works currently is:

  • A cronjob runs a backup locally, on each host, at a random minute between midnight and 01:00 UTC
    • this backup is stored in a dar file in /srv/backups
    • when the backup is done, a flag file is created
  • On louvre, one cronjob per host runs, every 10 minutes between midnight and 04:00 UTC; this cronjob:
    • checks whether the backup has completed (by looking for the flag file)
    • if the backup has completed, it copies it from the host to the remote filer (mounted on louvre)
    • once the copy is done, the backup is removed from the host
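
The flag-file handshake described above can be sketched roughly as follows. This is only an illustration, not the actual scripts: the function name `collect_backup`, the flag file name `backup.done`, the `*.dar` glob, and the use of two plain directories in place of the remote host and the filer mount are all assumptions.

```python
import shutil
from pathlib import Path


def collect_backup(host_dir: Path, filer_dir: Path,
                   flag_name: str = "backup.done") -> bool:
    """Copy finished dar files from host_dir to filer_dir, then remove them.

    Returns False if the flag file is absent (backup not finished yet),
    True once a completed backup has been collected.
    All names here are hypothetical.
    """
    flag = host_dir / flag_name
    if not flag.exists():
        return False  # backup still running; retry on the next cron run

    filer_dir.mkdir(parents=True, exist_ok=True)
    dar_files = sorted(host_dir.glob("*.dar"))
    for dar_file in dar_files:
        # If this raises (for instance the filer is unreachable), the
        # cleanup below never runs and the dar files stay on the host.
        shutil.copy2(dar_file, filer_dir / dar_file.name)

    # Only clean up once every file has been copied successfully.
    for dar_file in dar_files:
        dar_file.unlink()
    flag.unlink()
    return True
```

Note that in this sketch a failed copy leaves both the dar files and the flag in place, so nothing ever cleans them up unless a later run succeeds.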

I think the issue is that if the remote copy fails (for instance if DNS resolution is fubar), the old backups will accumulate on each host.

A quick fix for this issue would be to adapt the local backup script to remove old backups before starting again (and warn by mail that a backup wasn't cleaned up, which means the copy failed somehow).
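
A minimal sketch of that quick fix, under the assumption that leftover backups can be recognised by a `*.dar` glob in the backup directory. The function name `remove_stale_backups` and the `warn` callback are hypothetical and not part of the existing backup script.

```python
from pathlib import Path
from typing import Callable, List


def remove_stale_backups(backup_dir: Path,
                         warn: Callable[[str], None] = print) -> List[str]:
    """Delete leftover dar files before a new backup run.

    Returns the names of the removed files. If anything was removed,
    warn() is called once, since a leftover backup means the previous
    copy to the filer did not complete.
    """
    removed = []
    for stale in sorted(backup_dir.glob("*.dar")):
        stale.unlink()
        removed.append(stale.name)
    if removed:
        warn("stale backups removed from %s (previous copy failed?): %s"
             % (backup_dir, ", ".join(removed)))
    return removed
```

When run from cron, the default `print` is usually enough to get the mail warning, because cron mails any job output to the configured recipient.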

olasd claimed this task.

dar backups have now been replaced with a setup around borg-backup and borgmatic, which only needs a small cache on the machines that are being backed up. borg is fast enough and its dedup is efficient enough that we're able to run backups every hour now.

The admin documentation in https://intranet.softwareheritage.org/wiki/Backups has been updated to reflect the new setup.

All the dar setup, crontabs, etc. have been cleaned up from all hosts.