Dar backups fill up disk space on client machines
Closed, Migrated, Edits Locked

Authored By

	ftigeot
	Aug 9 2018, 9:35 AM

Description

I had to fix a disk full issue on tate:/ this morning.

It was caused by the presence of dar files in /srv/backups/

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T1282 Revisit backups
Migrated	gitlab-migration	T1164 Dar backups fill up disk space on client machines
Migrated	gitlab-migration	T1165 Fix lack of disk space on louvre:/

Event Timeline

ftigeot triaged this task as High priority (Aug 9 2018, 9:35 AM).

ftigeot created this task.

Our backups are stored on a remote filesystem provided by SESI (the filer-backup NFS mount on louvre).

Our dar setup currently works as follows:

On each host, a cronjob runs a local backup at a random minute between midnight and 01:00 UTC:
- the backup is stored in a dar file in /srv/backups
- when the backup is done, a flag file is created
On louvre, one cronjob per host runs every 10 minutes between midnight and 04:00 UTC; this cronjob:
- checks whether the backup has completed (by looking for the flag file)
- if the backup has completed, copies it to the remote filer through the local NFS mount
- once the copy is done, removes the backup from the host
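The louvre-side steps above (check the flag, copy, then remove) can be sketched as follows. This is only an illustration, not the actual script: the flag file name `backup.done`, the function name, and the use of two plain directories in place of the remote host and the filer mount are all assumptions.

```python
import shutil
from pathlib import Path

FLAG_NAME = "backup.done"  # hypothetical flag file name, not from the real setup


def collect_backup(src_dir: Path, dest_dir: Path) -> bool:
    """Copy finished dar files from src_dir to dest_dir, then remove them.

    Returns False and touches nothing if the flag file is absent.
    """
    flag = src_dir / FLAG_NAME
    if not flag.exists():
        return False  # backup not finished yet; the next cron run retries

    dest_dir.mkdir(parents=True, exist_ok=True)
    archives = sorted(src_dir.glob("*.dar"))
    for archive in archives:
        shutil.copy2(archive, dest_dir / archive.name)

    # Sources are removed only after every copy succeeded. If a copy raises,
    # the archives and the flag stay in place on the source side.
    for archive in archives:
        archive.unlink()
    flag.unlink()
    return True
```

Note that in this sketch a failed copy leaves the dar files on the source side, which matches the accumulation described in this task.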

I think the issue is that if the remote copy fails (for instance if DNS resolution is fubar), the old backups will accumulate on each host.

A quick fix for this issue would be to adapt the local backup script to remove old backups before starting again (and warn by mail that a backup wasn't cleaned up, which means the copy failed somehow).
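A minimal sketch of that quick fix, assuming leftover archives match `*.dar` in the backup directory and that the flag file is called `backup.done` (both names are hypothetical). How the warning is mailed is left out; under cron, printing the message is commonly enough for it to be mailed to the job owner.

```python
from pathlib import Path
from typing import List, Optional

FLAG_NAME = "backup.done"  # hypothetical flag file name


def remove_stale_backups(backup_dir: Path) -> List[str]:
    """Delete leftover dar files (and a stale flag) and return the dar names."""
    removed = []
    for leftover in sorted(backup_dir.glob("*.dar")):
        leftover.unlink()
        removed.append(leftover.name)
    # A stale flag would make the collector pick up a half-written backup.
    flag = backup_dir / FLAG_NAME
    if flag.exists():
        flag.unlink()
    return removed


def format_warning(host: str, removed: List[str]) -> Optional[str]:
    """Build the warning text, or return None when nothing was left behind."""
    if not removed:
        return None
    names = ", ".join(removed)
    return (
        f"{host}: removed stale backup files before starting: {names}; "
        "the previous copy to the filer probably failed"
    )
```

The backup script would call `remove_stale_backups` before invoking dar and print the result of `format_warning` when it is not `None`.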

ftigeot changed the status of subtask T1165: Fix lack of disk space on louvre:/ from Open to Work in Progress (Aug 23 2018, 2:18 PM).

ftigeot closed subtask T1165: Fix lack of disk space on louvre:/ as Resolved (Aug 24 2018, 2:37 PM).

zack added a project: System administration (Aug 25 2018, 4:25 PM).

ftigeot added a parent task: T1282: Revisit backups (Oct 22 2018, 2:21 PM).

dar backups have now been replaced with a setup built around borg-backup and borgmatic, which only needs a small cache on the machines that are being backed up. borg is fast enough, and its dedup efficient enough, that we're now able to run backups every hour.

The admin documentation in https://intranet.softwareheritage.org/wiki/Backups has been updated to reflect the new setup.

All the dar setup, crontabs, etc. have been cleaned up from all hosts.

This task has been migrated to GitLab.

gitlab-migration changed the status of subtask T1165: Fix lack of disk space on louvre:/ from Resolved to Migrated (Oct 19 2022, 5:54 PM).
