To transfer data between two hosts, syncoid uses ssh with a control socket for connection sharing.
Because the socket name is based on the current time (with one-second precision), two synchronizations starting in the same second share the same connection. The first sync to finish closes the socket and makes the other one fail:
https://github.com/jimsalterjrs/sanoid/issues/532
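For illustration, here is a minimal sketch (in Python; syncoid itself is written in Perl) of why a second-precision timestamp collides. The naming scheme is an assumption reconstructed from the socket path visible in the logs below:

```python
import time

def control_socket_path(local_user: str, remote: str) -> str:
    # Assumed naming scheme, matching the path seen in the logs:
    # the suffix (e.g. 1645457263) is a Unix timestamp in whole seconds.
    return f"/tmp/syncoid-{local_user}-{remote}-{int(time.time())}"

# Two syncoid runs launched in the same second compute the same path,
# so the second ssh attaches to the first one's control socket.
remote = "root@storage1.internal.staging.swh.network"
print(control_socket_path("root", remote))
print(control_socket_path("root", remote))  # identical when called within the same second
```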
Feb 21 15:27:43 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 21 15:27:44 db1 syncoid[4136471]: ControlSocket /tmp/syncoid-root-root@storage1.internal.staging.swh.network-1645457263 already exists, disabling multiplexing
Feb 21 15:27:45 db1 syncoid[4136469]: Sending incremental data/objects@syncoid_db1_2022-02-21:15:21:59 ... syncoid_db1_2022-02-21:15:27:44 (~ 301.4 MB):
Feb 21 15:27:49 db1 syncoid[4136796]: lzop: Inappropriate ioctl for device: <stdin>
Feb 21 15:27:50 db1 syncoid[4136792]: cannot receive incremental stream: checksum mismatch or incomplete stream.
Feb 21 15:27:50 db1 syncoid[4136792]: Partially received snapshot is saved.
Feb 21 15:27:50 db1 syncoid[4136792]: A resuming stream can be generated on the sending system by running:
Feb 21 15:27:50 db1 syncoid[4136792]: zfs send -t 1-10db178eb8-100-789c636064000310a501c49c50360710a715e5e7a69766a630404183d9521fcfe7ebdf2800d9ec48eaf293b252934b1818a2de0481d561c8a7a515a79630c001489e0d493ea9b224b518481f78b49f079bfe927c882b7cde7cddbe7676e42c0f24794eb07c5e626e2a0343>
Feb 21 15:27:50 db1 syncoid[4136469]: CRITICAL ERROR: ssh -i /root/.ssh/id_ed25519.syncoid_db1 -S /tmp/syncoid-root-root@storage1.internal.staging.swh.network-1645457263 root@storage1.internal.staging.swh.network ' zfs send -I '"'"'data/objects'"'"'@'"'"'syncoid_db1_2022-02-21:15:>
Feb 21 15:27:50 db1 systemd[1]: syncoid-storage1-objects.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 21 15:27:50 db1 systemd[1]: syncoid-storage1-objects.service: Failed with result 'exit-code'.
Feb 21 15:27:50 db1 systemd[1]: Failed to start ZFS dataset synchronization of.
Feb 21 15:27:50 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 2.132s CPU time.
Meanwhile, the concurrent sync of data/kafka, started in the same second, finished first and closed the shared socket:

Feb 21 15:27:43 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 21 15:27:45 db1 syncoid[4136468]: Sending incremental data/kafka@syncoid_db1_2022-02-21:15:21:55 ... syncoid_db1_2022-02-21:15:27:44 (~ 192.1 MB):
Feb 21 15:27:48 db1 systemd[1]: syncoid-storage1-kafka.service: Succeeded.
Feb 21 15:27:48 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 21 15:27:48 db1 systemd[1]: syncoid-storage1-kafka.service: Consumed 2.408s CPU time.
Fortunately, syncoid is resilient to this kind of error and stabilizes itself on the next run:
Feb 21 15:33:29 db1 systemd[1]: Starting ZFS dataset synchronization of...
Feb 21 15:33:30 db1 syncoid[4167164]: Resuming interrupted zfs send/receive from data/objects to data/sync/storage1/objects (~ 97.4 MB remaining):
Feb 21 15:33:38 db1 syncoid[4167164]: Sending incremental data/objects@syncoid_db1_2022-02-21:15:27:44 ... syncoid_db1_2022-02-21:15:33:36 (~ 269.3 MB):
Feb 21 15:33:51 db1 systemd[1]: syncoid-storage1-objects.service: Succeeded.
Feb 21 15:33:51 db1 systemd[1]: Finished ZFS dataset synchronization of.
Feb 21 15:33:51 db1 systemd[1]: syncoid-storage1-objects.service: Consumed 7.327s CPU time.
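The upstream issue discusses making the socket name unique per invocation. A minimal sketch of that idea (hypothetical Python, not syncoid's actual fix) mixes the process id into the name, so two syncs starting in the same second no longer collide:

```python
import os
import time

def control_socket_path(local_user: str, remote: str) -> str:
    # Hypothetical variant of the naming scheme sketched above: the pid
    # keeps the path unique even when two syncoid invocations start
    # within the same wall-clock second.
    return f"/tmp/syncoid-{local_user}-{remote}-{int(time.time())}-{os.getpid()}"
```

Any per-process discriminator (pid, random suffix) would do; the point is that wall-clock seconds alone are not unique across concurrent systemd units.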