Dar backups fill up disk space on client machines
Open, High, Public

Description

I had to fix a disk full issue on tate:/ this morning.

It was caused by the presence of dar files in /srv/backups/

Event Timeline

ftigeot created this task. Aug 9 2018, 9:35 AM
ftigeot triaged this task as High priority.
olasd added a subscriber: olasd. Aug 17 2018, 5:03 PM

Our backups are stored on a remote filesystem provided by SESI (filer-backup nfs mount on louvre).

Our dar setup currently works as follows:

  • A cronjob runs a backup locally, on each host, at a random minute between midnight and 01:00 UTC
    • this backup is stored in a dar file in /srv/backups
    • when the backup is done, a flag file is created
  • On louvre, one cronjob per host runs, every 10 minutes between midnight and 04:00 UTC; this cronjob:
    • checks whether the backup has completed (by looking for the flag file)
    • if the backup has completed, it copies it from the host to the remote filer (through the nfs mount on louvre)
    • once the copy is done, the backup is removed from the host
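
The collection step on louvre can be sketched roughly as below. This is a minimal illustration, not the actual script: the function name `collect_backup`, the flag file name `backup.done` and the `*.dar` glob are all assumptions, and both directories are treated as local paths so the sketch stays runnable (the real job fetches the files from each host).

```python
import shutil
from pathlib import Path


def collect_backup(source_dir, dest_dir, flag_name="backup.done"):
    """Copy a finished backup to dest_dir, then delete it from source_dir.

    Returns False when the flag file is absent (backup not finished yet).
    If a copy raises, nothing is deleted, so the files stay in source_dir.
    flag_name and the "*.dar" pattern are hypothetical, for illustration only.
    """
    source, dest = Path(source_dir), Path(dest_dir)
    flag = source / flag_name
    if not flag.exists():
        return False

    archives = sorted(source.glob("*.dar"))
    for archive in archives:
        shutil.copy2(archive, dest / archive.name)

    # Reached only when every copy succeeded.
    for archive in archives:
        archive.unlink()
    flag.unlink()
    return True
```

Note that the removal only happens after a successful copy: any exception during the copy leaves the dar files and the flag file in place on the host.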

I think the issue is that if the remote copy fails (for instance if DNS resolution is fubar), the old backups will accumulate on each host.

A quick fix for this issue would be to adapt the local backup script to remove old backups before starting again (and warn by mail that a backup wasn't cleaned up, which means the copy failed somehow).
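
A minimal sketch of that quick fix, to be called at the start of the local backup script. The function names and the `*.dar` pattern are hypothetical; the flag file could be added to `patterns` once its real name is known.

```python
from pathlib import Path


def remove_stale_backups(backup_dir, patterns=("*.dar",)):
    """Delete leftover backup files and return the paths that were removed.

    A non-empty result means the previous backup was never collected,
    which suggests the remote copy failed. The patterns are an assumption.
    """
    removed = []
    for pattern in patterns:
        for path in sorted(Path(backup_dir).glob(pattern)):
            if path.is_file():
                path.unlink()
                removed.append(path)
    return removed


def format_warning(hostname, removed):
    """Build the warning text; return None when nothing was stale."""
    if not removed:
        return None
    lines = [f"{hostname}: {len(removed)} stale backup file(s) removed before the new run:"]
    lines += [f"  {path}" for path in removed]
    lines.append("The previous remote copy probably failed.")
    return "\n".join(lines)
```

Assuming the backup script runs from cron with mail delivery configured, printing the result of `format_warning` when it is not None should be enough to get the warning mailed, since cron mails any output a job produces.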

ftigeot changed the status of subtask T1165: Fix lack of disk space on louvre:/ from Open to Work in Progress. Aug 23 2018, 2:18 PM