content archiver
Closed, Migrated, Edits Locked

Assigned To

Authored By

	zack
	Dec 14 2015, 6:17 PM

Description

software component(s) that run periodically and archive unarchived objects from the master object storage to one (or several) slave object storage(s)

Revisions and Commits

rDSTO Storage manager
	Abandoned		D81 Refactor and optimize the archiver
	Closed		D23 Content archiver
		D30	rDSTO6941140c59ea Add a way to launch archiver director from cl

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T239 preserve at least 2 copies of each content object
Migrated	gitlab-migration	T240 content archiver
Migrated	gitlab-migration	T381 HTTP client/server version of swh.storage.objstorage
Migrated	gitlab-migration	T402 Dev: Improve starting server routine
Migrated	gitlab-migration	T405 Deploying objstorage API server
Migrated	gitlab-migration	T400 Content archiver synchronous version
Migrated	gitlab-migration	T401 Content archiver - Asynchronous version
Migrated	gitlab-migration	T404 Debian packaging update
Migrated	gitlab-migration	T403 Improve content archiver testing coverage
Migrated	gitlab-migration	T406 Make the archiver asynchronous mode optional
Migrated	gitlab-migration	T512 Make archiver have a more symmetrical behavior treating storages as potential sources & destinations at the same time
Migrated	gitlab-migration	T513 Add methods to easily instantiate ObjStorage of different types
Migrated	gitlab-migration	T482 First swh-storage-archiver run to catch up uffizi
Migrated	gitlab-migration	T481 Deploying archiver on uffizi
Migrated	gitlab-migration	T412 Bootstrap archiver's database
Migrated	gitlab-migration	T484 List banco's current sha1s for injection in archiver db
Migrated	gitlab-migration	T485 Synchronously catchup backup from uffizi to banco
Migrated	gitlab-migration	T487 Moving the log DB from the SSD-based DB to the spinning-drive one
Migrated	gitlab-migration	T494 swh-journal: archiver-client: Keep archiver table in sync with new contents
Migrated	gitlab-migration	T499 Change the swh.storage of the archiver to a swh.objstorage
Migrated	gitlab-migration	T502 Update archiver's profile manifest
Migrated	gitlab-migration	T523 Figure out what to do with corrupted copies detected by the archiver

Event Timeline

Draft blueprint of the content archiver is available at https://intranet.softwareheritage.org/index.php?title=Content_archiver_blueprint

Current version copied below for future memory.

Software Heritage Archiver

The Software Heritage (SWH) Archiver is responsible for backing up SWH objects
so as to reduce the risk of losing them.

Currently, the archiver only deals with content objects (i.e., those referenced
by the content table in the DB and stored in the SWH object storage). The
database itself is replicated live by other means.

Requirements

Master/slave architecture

There is 1 master copy and 1 or more slave copies. There is a retention policy that defines the minimum number of needed copies of each object to stay on the safe side.

Append-only archival

The archiver treats the master as read-only storage. The archiver writes to slave storages append-only, never deleting any previously archived object. If removals are needed, in either master or slave, they will be dealt with by means other than the archiver.

Asynchronous archival

Periodically (e.g., via cron), the archiver kicks in, produces a list of the objects that need to be copied from master to slaves, and starts copying objects as needed. Very likely, other objects in need of replication will be added during any given archival run. Replicating them is *not* the responsibility of that run, but of future runs.

Integrity at archival time

Before copying objects from master to slaves, the archiver performs integrity checks on the objects that are in need of replication. For instance, content objects are verified to ensure that they can be decompressed and that their content matches their (checksum-based) unique identifiers. Corrupt objects will not be archived, and suitable errors reporting the corruption will be emitted.

Note that archival-time integrity checks are not meant to replace periodic integrity checks on all master/slaves; those are still needed!
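
As an illustration of the check described above, here is a minimal Python sketch. It assumes that contents are stored gzip-compressed and identified by the hex SHA-1 of their uncompressed bytes; both are assumptions made for the example, and check_content is a hypothetical helper, not the actual swh.storage API.

```python
import gzip
import hashlib
import zlib


def check_content(obj_id: str, raw: bytes) -> bool:
    """Return True if `raw` decompresses and its SHA-1 matches `obj_id`.

    Assumption: objects are gzip-compressed and `obj_id` is the hex SHA-1
    of the uncompressed content.
    """
    try:
        data = gzip.decompress(raw)
    except (OSError, EOFError, zlib.error):
        # Not a gzip stream, truncated, or otherwise corrupt.
        return False
    return hashlib.sha1(data).hexdigest() == obj_id
```

A caller would skip any object for which this returns False and report it, rather than copying it to the slaves.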

Parallel archival

Once the list of objects to be archived in a given run has been identified, it SHOULD be possible to archive them in parallel w.r.t. one another.

Persistent archival status

The archiver maintains a mapping between objects and the locations where they are stored. Locations are the set {master, slave_1, ..., slave_n}.

Each object is also associated with the following information:
- status: 3-state: missing (copy not present at destination), ongoing (copy to destination ongoing), present (copy present at destination)
- mtime: timestamp of last status change. In practice this is either the destination archival time (status=present), or the timestamp of the last archival request (status=ongoing)
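
The per-object, per-location record described above could be modelled roughly as follows. This is only a sketch: the names ArchiveStatus, ArchiveEntry and set_status are hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ArchiveStatus(Enum):
    MISSING = "missing"   # copy not present at destination
    ONGOING = "ongoing"   # copy to destination ongoing
    PRESENT = "present"   # copy present at destination


@dataclass
class ArchiveEntry:
    content_id: str                       # object identifier
    archive_id: str                       # one of master, slave_1, ..., slave_n
    status: ArchiveStatus = ArchiveStatus.MISSING
    mtime: Optional[datetime] = None      # timestamp of last status change

    def set_status(self, status: ArchiveStatus,
                   now: Optional[datetime] = None) -> None:
        """Change the status and record when the change happened."""
        self.status = status
        self.mtime = now or datetime.now(timezone.utc)
```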

Architecture

The archiver comprises the following software components:

archiver director
archiver worker
archiver copier

Archiver director

The archiver director is run periodically, e.g., via cron.

Runtime parameters:

execution periodicity (external)
retention policy
archival max age
archival batch size

At each execution the director:

for each object: retrieve its archival status
for each object that is in the master storage but has fewer copies than requested by the retention policy:
  1. if status=ongoing and mtime is not older than archival max age then continue to next object
  2. check object integrity (e.g., with swh.storage.ObjStorage.check(obj_id))
  3. mark object as needing archival
group objects in need of archival in batches of archival batch size
for each batch:
  1. set status=ongoing and mtime=now() on each object in the batch
  2. spawn an archive worker on the whole batch (e.g., submitting the relevant celery task)

Note that if an archiver worker task takes a long time (t > archival max age)
to complete it is possible for another task to be scheduled on the same batch,
or an overlapping one.
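
The selection and batching logic of the director can be sketched in Python as below. This is an illustration under assumptions, not the actual implementation: needs_archival and make_batches are hypothetical names, the archival status of one object is passed as a plain dict, and every location with status "present" (master included) is counted as one copy. The integrity check and the status update to "ongoing" are left out.

```python
from datetime import datetime, timedelta, timezone
from itertools import islice


def needs_archival(copies, retention, max_age, now):
    """Decide whether one object must be scheduled for archival.

    `copies` maps a location id to a (status, mtime) pair for this object.
    """
    present = sum(1 for status, _ in copies.values() if status == "present")
    if present >= retention:
        return False
    # A recent 'ongoing' copy is assumed to be handled by a running worker.
    for status, mtime in copies.values():
        if status == "ongoing" and now - mtime <= max_age:
            return False
    return True


def make_batches(object_ids, batch_size):
    """Yield successive lists of at most `batch_size` object ids."""
    it = iter(object_ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each batch produced this way would then be marked as ongoing and handed to a worker.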

Archiver worker

The archiver worker is executed on demand (e.g., by a celery worker) to archive a
given set of objects.

Runtime parameters:

objects to archive

At each execution a worker:

create empty map { destinations -> objects that need to be copied there }
for each object to archive:
  1. retrieve current archive status
  2. update the map noting where the object needs to be copied
for each destination:
  1. look up in the map objects that need to be copied there
  2. copy all objects to destination using the copier
  3. set status=present and mtime=now() on all copied objects

Note that:

In case multiple jobs were tasked to archive the same or overlapping sets of objects, step (2.2) might decide that some or all objects of this batch no longer need to be archived to some or all destinations.

Due to parallelism, it is also possible that the same objects will be copied over at the same time by multiple workers.
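
The worker steps above can be sketched as follows. All names here (plan_copies, run_worker and the get_status, copy and mark_present callables) are hypothetical and injected as parameters so that the sketch stays independent of any real storage or copier API.

```python
from collections import defaultdict


def plan_copies(object_ids, get_status, destinations):
    """Build the map { destination -> objects that need to be copied there }.

    `get_status(obj_id)` is assumed to return a dict mapping a location id
    to one of 'missing', 'ongoing' or 'present'.
    """
    plan = defaultdict(list)
    for obj_id in object_ids:
        status = get_status(obj_id)
        for dest in destinations:
            # Anything not already present still needs to be copied.
            if status.get(dest, "missing") != "present":
                plan[dest].append(obj_id)
    return dict(plan)


def run_worker(object_ids, get_status, destinations, copy, mark_present):
    """Copy each object where it is missing, then record the new status."""
    for dest, objs in plan_copies(object_ids, get_status, destinations).items():
        copy(dest, objs)           # delegate the transfer to the copier
        mark_present(dest, objs)   # set status=present and mtime=now()
```

Because the status is re-read when the worker starts, objects that another worker has already archived are dropped from the plan, which is the situation described in the first note above.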

Archiver copier

The copier is run on demand by archiver workers, to transfer a bunch of files
from the master to a given destination.

The copier transfers all files together with a single network connection. The
copying process is atomic at the file granularity (i.e., individual files might
be visible on the destination before *all* files have been transferred) and
ensures that concurrent transfer of the same files by multiple copier instances
does not result in corrupted files. Note that, due to this and the fact that
timestamps are updated by the director, all files copied in the same batch will
have the same mtime even though the actual file creation times on a given
destination might differ.

As a first approximation, the copier can be implemented using rsync, but a
dedicated protocol can be devised in the future. In the case of rsync, we
should use --files-from to list the files to be copied. We observe that rsync
atomically renames files one-by-one during transfer; so as long as --inplace is
*not* used, concurrent rsync of the same files should not be a problem.
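
A rough Python sketch of an rsync-based copier along these lines is given below. The function names and all paths are hypothetical; only the rsync options --archive and --files-from are taken as given, and --inplace is deliberately left out so that rsync keeps writing to a temporary file and renaming it into place.

```python
import subprocess
import tempfile


def build_rsync_command(files_from, source_root, destination):
    """Return the rsync argument list for one batch (no --inplace)."""
    return [
        "rsync",
        "--archive",
        "--files-from=" + files_from,  # paths are relative to source_root
        source_root,
        destination,
    ]


def copy_batch(paths, source_root, destination):
    """Copy `paths` (relative to `source_root`) in a single rsync run."""
    with tempfile.NamedTemporaryFile("w", suffix=".list") as listing:
        listing.write("\n".join(paths) + "\n")
        listing.flush()
        command = build_rsync_command(listing.name, source_root, destination)
        # check=True raises CalledProcessError if rsync exits non-zero.
        subprocess.run(command, check=True)
```

Sending the whole batch through one rsync invocation matches the single-connection requirement stated above.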

DB structure

PostgreSQL definitions for the archival status:

    CREATE DOMAIN archive_id AS TEXT;

    CREATE TABLE archives (
      id   archive_id PRIMARY KEY,
      url  TEXT
    );

    CREATE TYPE archive_status AS ENUM (
      'missing',
      'ongoing',
      'present'
    );

    CREATE TABLE content_archive (
      content_id  sha1 REFERENCES content(sha1),
      archive_id  archive_id REFERENCES archives(id),
      status      archive_status,
      mtime       timestamptz,
      PRIMARY KEY (content_id, archive_id)
    );

zack mentioned this in T7: backup: object storage — 2nd copy after first large batch import. Jan 18 2016, 3:25 PM

zack added a project: Storage manager. Feb 4 2016, 3:00 PM

zack removed projects: Developers, Staff. Mar 10 2016, 5:51 PM

zack added a project: Restricted Project. Apr 27 2016, 10:28 AM

zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.

zack assigned this task to qcampos. Apr 27 2016, 9:13 PM

zack created subtask T381: HTTP client/server version of swh.storage.objstorage.

qcampos closed subtask T381: HTTP client/server version of swh.storage.objstorage as Resolved. May 4 2016, 3:36 PM

qcampos added a revision: D23: Content archiver. May 9 2016, 4:56 PM

qcampos mentioned this in T400: Content archiver synchronous version. May 12 2016, 12:05 PM

qcampos created subtask T400: Content archiver synchronous version.

qcampos mentioned this in T401: Content archiver - Asynchronous version.

qcampos created subtask T401: Content archiver - Asynchronous version.

qcampos created subtask T402: Dev: Improve starting server routine. May 12 2016, 12:07 PM

qcampos created subtask T403: Improve content archiver testing coverage.

qcampos created subtask T404: Debian packaging update. May 12 2016, 12:10 PM

qcampos created subtask T405: Deploying objstorage API server.

qcampos created subtask T406: Make the archiver asynchronous mode optional. May 12 2016, 1:01 PM

qcampos closed subtask T400: Content archiver synchronous version as Resolved. May 12 2016, 1:31 PM

qcampos closed subtask T401: Content archiver - Asynchronous version as Resolved.

qcampos closed subtask T406: Make the archiver asynchronous mode optional as Resolved.

qcampos closed subtask T402: Dev: Improve starting server routine as Resolved. May 13 2016, 10:34 AM

olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:07 PM

qcampos closed subtask T403: Improve content archiver testing coverage as Resolved. May 17 2016, 12:43 PM

qcampos closed subtask T404: Debian packaging update as Resolved. May 20 2016, 9:07 AM

qcampos added a revision: D30: Command line launch improvements. May 20 2016, 12:07 PM

qcampos added a commit: rDSTO6941140c59ea: Add a way to launch archiver director from cl. May 20 2016, 1:48 PM

qcampos created subtask T412: Bootstrap archiver's database. May 23 2016, 11:52 AM

zack removed a project: Restricted Project. Jun 1 2016, 7:19 PM

ardumont added a parent task: T481: Deploying archiver on uffizi. Jul 7 2016, 12:29 PM

ardumont removed a parent task: T481: Deploying archiver on uffizi.

ardumont added a subtask: T481: Deploying archiver on uffizi.

ardumont created subtask T482: First swh-storage-archiver run to catch up uffizi. Jul 7 2016, 4:18 PM

ardumont closed subtask T405: Deploying objstorage API server as Resolved. Jul 8 2016, 4:07 PM

ardumont closed subtask T481: Deploying archiver on uffizi as Resolved.

ardumont created subtask T485: Synchronously catchup backup from uffizi to banco. Jul 12 2016, 1:06 PM

ardumont created subtask T486: DB storage cleanup. Jul 12 2016, 6:42 PM

zack mentioned this in T486: DB storage cleanup. Jul 13 2016, 9:01 AM

ardumont edited subtasks, added: T487: Moving the log DB from the SSD-based DB to the spinning-drive one; removed: T486: DB storage cleanup. Jul 13 2016, 10:55 AM

ardumont changed the status of subtask T487: Moving the log DB from the SSD-based DB to the spinning-drive one from Open to Work in Progress. Jul 13 2016, 11:36 AM

ardumont closed subtask T487: Moving the log DB from the SSD-based DB to the spinning-drive one as Resolved. Jul 16 2016, 10:09 AM

ardumont changed the status of subtask T412: Bootstrap archiver's database from Open to Work in Progress. Jul 16 2016, 10:21 AM

qcampos created subtask T494: swh-journal: archiver-client: Keep archiver table in sync with new contents. Jul 18 2016, 4:11 PM

ardumont mentioned this in T494: swh-journal: archiver-client: Keep archiver table in sync with new contents. Jul 19 2016, 2:26 PM

qcampos created subtask T499: Change the swh.storage of the archiver to a swh.objstorage. Jul 20 2016, 3:02 PM

ardumont closed subtask T499: Change the swh.storage of the archiver to a swh.objstorage as Resolved. Jul 20 2016, 7:04 PM

ardumont closed subtask T412: Bootstrap archiver's database as Resolved. Jul 20 2016, 7:47 PM

ardumont closed subtask T485: Synchronously catchup backup from uffizi to banco as Invalid.

ardumont changed the status of subtask T485: Synchronously catchup backup from uffizi to banco from Invalid to Resolved. Jul 21 2016, 11:10 AM

ardumont changed the status of subtask T485: Synchronously catchup backup from uffizi to banco from Resolved to Invalid. Jul 21 2016, 11:12 AM

ardumont reopened subtask T412: Bootstrap archiver's database as Work in Progress. Jul 21 2016, 1:10 PM

qcampos removed a subtask: T412: Bootstrap archiver's database. Jul 22 2016, 12:29 PM

qcampos removed a subtask: T401: Content archiver - Asynchronous version.

olasd changed the status of subtask T482: First swh-storage-archiver run to catch up uffizi from Open to Work in Progress. Jul 25 2016, 12:53 PM

qcampos added a revision: D81: Refactor and optimize the archiver. Jul 26 2016, 4:42 PM

qcampos removed a subtask: T404: Debian packaging update. Jul 27 2016, 1:07 PM

qcampos removed a subtask: T405: Deploying objstorage API server.

qcampos removed a subtask: T402: Dev: Improve starting server routine.

qcampos removed a subtask: T403: Improve content archiver testing coverage.

qcampos removed a subtask: T406: Make the archiver asynchronous mode optional.

qcampos added a revision: D86: Complete refactoring of the archiver. Aug 2 2016, 1:13 PM

qcampos removed a revision: D86: Complete refactoring of the archiver. Aug 2 2016, 2:05 PM

olasd created subtask T523: Figure out what to do with corrupted copies detected by the archiver. Aug 3 2016, 3:31 PM

This has actually been in production for quite a while. Some related tasks are still pending, but there is no need to keep the general "we need a content archiver" task open anymore.

olasd closed subtask T482: First swh-storage-archiver run to catch up uffizi as Resolved. Oct 9 2017, 11:51 AM

ardumont closed subtask T494: swh-journal: archiver-client: Keep archiver table in sync with new contents as Invalid. Jan 13 2019, 12:33 PM

zack closed subtask T523: Figure out what to do with corrupted copies detected by the archiver as Invalid. May 25 2019, 4:58 PM