
borg issues on multiple nodes
Closed, Migrated

Description

Issue received by email.

borg@banco.internal.softwareheritage.org:/srv/borg/repositories/storage1.internal.staging.swh.network: Error running actions for repository
Command 'borg prune --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prefix storage1.internal.staging.swh.network- borg@banco.internal.softwareheritage.org:/srv/borg/repositories/storage1.internal.staging.swh.network' returned non-zero exit status 2.
/etc/borgmatic/config.yaml: Error running configuration file

summary:
/etc/borgmatic/config.yaml: Error running configuration file
borg@banco.internal.softwareheritage.org:/srv/borg/repositories/storage1.internal.staging.swh.network: Error running actions for repository
Failed to create/acquire the lock /srv/borg/repositories/storage1.internal.staging.swh.network/lock.exclusive (timeout).
Command 'borg prune --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prefix storage1.internal.staging.swh.network- borg@banco.internal.softwareheritage.org:/srv/borg/repositories/storage1.internal.staging.swh.network' returned non-zero exit status 2.
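For reference, the lock timeout is only a symptom here (the repository was still held by another borg operation). If a lock is ever left behind by a crashed process, borg provides a break-lock subcommand to clear it; a sketch, to be run only when no borg process is using the repository:

root@banco:~# sudo -u borg borg break-lock /srv/borg/repositories/storage1.internal.staging.swh.network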

Event Timeline

ardumont triaged this task as Normal priority. Feb 16 2022, 12:37 PM
ardumont created this task.
ardumont renamed this task from borg issue on storage1.staging to borg issues on multiple nodes. Feb 17 2022, 9:10 AM

It's probably linked to the new replication and sync deployed on storage1 (staging).
A new (large) partition, not yet excluded in the manifest, ended up also being backed up by banco.
Banco then ran out of disk space.
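For reference, the borgmatic-side fix for this kind of issue is to exclude the path in the manifest; a minimal sketch, assuming the standard exclude_patterns option of borgmatic's location section and the /data/sync path from the icinga alerts below:

# /etc/borgmatic/config.yaml (sketch only; exact layout depends on the deployed borgmatic version)
location:
    source_directories:
        - /
    exclude_patterns:
        - /data/sync   # replicated data, no need to back it up again on banco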

@vsellier worked on it to make some space.

20:25 <+swhbot> icinga PROBLEM: service disk /data/sync on storage1.internal.staging.swh.network is CRITICAL: DISK CRITICAL - /data/sync is not accessible: No such file or directory
20:25 <+swhbot> icinga PROBLEM: service disk /data/sync/db1 on storage1.internal.staging.swh.network is CRITICAL: DISK CRITICAL - /data/sync/db1 is not accessible: No such file or directory
20:25 <+swhbot> icinga PROBLEM: service disk /data/sync/db1/postgresql-main-12 on storage1.internal.staging.swh.network is CRITICAL: DISK CRITICAL - /data/sync/db1/postgresql-main-12 is not accessible: No such file or directory
20:42 <vsellier> ^ made some cleanup to fix the backup issues
20:43 <vsellier> *did
21:01 <+swhbot> icinga PROBLEM: service disk /srv/borg on banco.softwareheritage.org is WARNING: DISK WARNING - free space: /srv/borg 399933 MB (11% inode=99%);

Yes, my bad, it's due to T3911.

Creating a zfs dataset without an explicit mountpoint led to an automatic mount on /data/<dataset>.

I removed the automatic mounts and I will ensure the same does not happen when the db1 -> storage replication is deployed.

root@storage1:~# zfs set canmount=noauto data/sync
root@storage1:~# zfs set canmount=noauto data/sync/db1
root@storage1:~# zfs set canmount=noauto data/sync/db1/postgresql-main-12
root@storage1:~# zfs set mountpoint=none data/sync
root@storage1:~# zfs set mountpoint=none data/sync/db1
root@storage1:~# zfs set mountpoint=none data/sync/db1/postgresql-main-12
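For future datasets, the same result can be obtained at creation time instead of being fixed afterwards; a sketch, assuming the same data/sync hierarchy (-o sets the properties atomically with the create, and zfs get verifies them):

# sketch only, not from the actual session
root@storage1:~# zfs create -o canmount=noauto -o mountpoint=none data/sync
root@storage1:~# zfs get -r canmount,mountpoint data/sync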

And cleaned up the backups made since the beginning of the replication on banco (since 2022-02-14):

root@banco:/srv/borg/repositories# sudo -u borg borg list storage1.internal.staging.swh.network | grep 2022-02-1[456] | cut -f1 -d" " | xargs -t -r sudo -u borg borg delete storage1.internal.staging.swh.network
Enter passphrase for key /srv/borg/repositories/storage1.internal.staging.swh.network: 
Enter passphrase for key /srv/borg/repositories/storage1.internal.staging.swh.network:

The passphrase can be found in the private data repo or in the /etc/borgmatic/config.yaml file of storage1.
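For a safer rerun of that kind of cleanup, borg 1.x can also preview the deletions first; a sketch using borg list --short (archive names only, so the cut is not needed) and borg delete --dry-run:

# sketch only: preview which archives would be deleted, then rerun without --dry-run
root@banco:/srv/borg/repositories# sudo -u borg borg list --short storage1.internal.staging.swh.network | grep '2022-02-1[456]' | xargs -t -r sudo -u borg borg delete --dry-run storage1.internal.staging.swh.network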

ardumont claimed this task.

Thanks!