
content archiver
Closed, Migrated (Edits Locked)

Description

Software component(s) that run periodically and archive unarchived objects from the master object storage to one (or several) slave object storage(s).

Related Objects

All related tasks Migrated (assigned: gitlab-migration).

Event Timeline

zack raised the priority of this task from to Normal.
zack updated the task description.
zack added projects: Staff, Developers.
zack added a subscriber: zack.

Draft blueprint of the content archiver is available at https://intranet.softwareheritage.org/index.php?title=Content_archiver_blueprint

Current version copied below for future reference.


Software Heritage Archiver

The Software Heritage (SWH) Archiver is responsible for backing up SWH objects
so as to reduce the risk of losing them.

Currently, the archiver only deals with content objects (i.e., those referenced
by the content table in the DB and stored in the SWH object storage). The
database itself is replicated live by other means.

Requirements

  • Master/slave architecture

    There is 1 master copy and 1 or more slave copies. A retention policy defines the minimum number of copies of each object needed to stay on the safe side.
  • Append-only archival

    The archiver treats the master as a read-only storage. The archiver writes to slave storages in append-only mode, never deleting any previously archived object. If removals are needed, on either the master or a slave, they will be dealt with by means other than the archiver.
  • Asynchronous archival

    Periodically (e.g., via cron), the archiver kicks in, produces a list of the objects that need to be copied from master to slaves, and starts copying objects as needed. Very likely, other objects in need of replication will be added during any given archival run; replicating them will *not* be up to that run, but to future runs.
  • Integrity at archival time

    Before copying objects from master to slaves, the archiver performs integrity checks on the objects that are in need of replication. For instance, content objects are verified to ensure that they can be decompressed and that their content matches their (checksum-based) unique identifiers. Corrupt objects will not be archived, and suitable error reports about the corruption will be emitted.

    Note that archival-time integrity checks are not meant to replace periodic integrity-checks on all master/slaves; they are still needed!
  • Parallel archival

    Once the list of objects to be archived in a given run has been identified, it SHOULD be possible to archive them in parallel w.r.t. one another.
  • Persistent archival status

    The archiver maintains a mapping between objects and the locations where they are stored. Locations are the set {master, slave_1, ..., slave_n}.

    Each object is also associated with the following information (see the example after this list):
    • status: 3-state: missing (copy not present at destination), ongoing (copy to destination ongoing), present (copy present at destination)
    • mtime: timestamp of last status change. In practice this is either the destination archival time (status=present), or the timestamp of the last archival request (status=ongoing)
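
For illustration, the archival status of a single content object might then look as follows; this is just a hypothetical Python rendering of the mapping, with made-up timestamps:

    # Hypothetical archival status of one content object: a mapping from
    # storage location to (status, mtime of the last status change).
    status = {
        'master':  ('present', '2016-04-27 10:28:00+00'),
        'slave_1': ('present', '2016-04-27 10:28:00+00'),
        'slave_2': ('ongoing', '2016-05-13 17:07:00+00'),
    }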

Architecture

The archiver is composed of the following software components:

  • archiver director
  • archiver worker
  • archiver copier

Archiver director

The archiver director is run periodically, e.g., via cron.

Runtime parameters:

  • execution periodicity (external)
  • retention policy
  • archival max age
  • archival batch size

At each execution the director:

  1. for each object: retrieve its archival status
  2. for each object that is in the master storage but has fewer copies than required by the retention policy:
    1. if status=ongoing and mtime is not older than archival max age, then continue to the next object
    2. check object integrity (e.g., with swh.storage.ObjStorage.check(obj_id))
    3. mark object as needing archival
  3. group objects in need of archival in batches of archival batch size
  4. for each batch:
    1. set status=ongoing and mtime=now() on each object in the batch
    2. spawn an archive worker on the whole batch (e.g., submitting the relevant celery task)

Note that if an archiver worker task takes a long time (t > archival max age)
to complete, it is possible for another task to be scheduled on the same
batch, or on an overlapping one.
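
For concreteness, here is a minimal sketch of the director loop in Python. All helper names (db.iter_archival_status, db.set_status, objstorage, spawn_worker) are hypothetical placeholders rather than actual swh APIs, and check() is assumed to raise an exception on corruption:

    import logging
    from datetime import datetime, timedelta

    logger = logging.getLogger(__name__)

    # Assumed values for the runtime parameters described above.
    RETENTION_POLICY = 2                   # minimum number of copies
    ARCHIVAL_MAX_AGE = timedelta(hours=1)  # retry 'ongoing' copies older than this
    ARCHIVAL_BATCH_SIZE = 1000

    def run_director(db, objstorage, spawn_worker):
        now = datetime.utcnow()
        needs_archival = []
        # steps 1-2: walk over the archival status of every object
        for obj_id, num_copies, status, mtime in db.iter_archival_status():
            if num_copies >= RETENTION_POLICY:
                continue
            # step 2.1: a recent ongoing copy is presumably still running
            if status == 'ongoing' and now - mtime < ARCHIVAL_MAX_AGE:
                continue
            # step 2.2: integrity check (assumed to raise on corruption)
            try:
                objstorage.check(obj_id)
            except Exception:
                logger.error('object %s is corrupt, not archiving it', obj_id)
                continue
            # step 2.3: mark the object as needing archival
            needs_archival.append(obj_id)
        # steps 3-4: group in batches and hand each batch to a worker
        for i in range(0, len(needs_archival), ARCHIVAL_BATCH_SIZE):
            batch = needs_archival[i:i + ARCHIVAL_BATCH_SIZE]
            db.set_status(batch, status='ongoing', mtime=now)
            spawn_worker(batch)  # e.g., submit the relevant celery task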

Archiver worker

The archiver worker is executed on demand (e.g., by a celery worker) to archive a
given set of objects.

Runtime parameters:

  • objects to archive

At each execution a worker:

  1. create an empty map { destination -> objects that need to be copied there }
  2. for each object to archive:
    1. retrieve its current archival status
    2. update the map, noting where the object needs to be copied
  3. for each destination:
    1. look up in the map objects that need to be copied there
    2. copy all objects to destination using the copier
    3. set status=present and mtime=now() on all copied objects

Note that:

  • In case multiple jobs were tasked to archive the same or overlapping objects, step (2.2) might decide that some/all objects of this batch no longer need to be archived to some/all destinations.
  • Due to parallelism, it is also possible that the same objects will be copied over at the same time by multiple workers.
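
A matching sketch of a worker run, under the same assumptions (db, copier, and the list of destinations are hypothetical stand-ins):

    from collections import defaultdict
    from datetime import datetime

    def run_worker(db, copier, destinations, obj_ids):
        # steps 1-2: map each destination to the objects still missing there
        to_copy = defaultdict(list)
        for obj_id in obj_ids:
            # assumed to return {destination: (status, mtime)}
            archival_status = db.get_archival_status(obj_id)
            for dst in destinations:
                status, _mtime = archival_status.get(dst, ('missing', None))
                if status != 'present':
                    to_copy[dst].append(obj_id)
        # step 3: copy destination by destination, then record success
        for dst, objs in to_copy.items():
            copier.copy(objs, dst)
            db.set_status_for(dst, objs, status='present',
                              mtime=datetime.utcnow())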

Archiver copier

The copier is run on demand by archiver workers, to transfer a bunch of files
from the master to a given destination.

The copier transfers all files together, using a single network connection. The
copying process is atomic at the file granularity (i.e., individual files might
be visible on the destination before *all* files have been transferred) and
ensures that the concurrent transfer of the same files by multiple copier
instances does not result in corrupted files. Note that, due to this and to the
fact that timestamps are updated by the director, all files copied in the same
batch will have the same mtime even though the actual file creation times on a
given destination might differ.

As a first approximation, the copier can be implemented using rsync, but a
dedicated protocol can be devised in the future. In the case of rsync, we
should use --files-from to list the files to be copied. We observe that rsync
atomically renames files one by one during transfer; so, as long as --inplace
is *not* used, concurrent rsync runs over the same files should not be a
problem.
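
A rough sketch of such an rsync-based copier; the object layout (paths relative to a master root directory) and the destination URL format are assumptions:

    import subprocess
    import tempfile

    def rsync_batch(relative_paths, master_root, destination):
        # relative_paths: object file paths, relative to master_root
        # destination: e.g. 'slave1.example.org:/srv/objects/' (made-up URL)
        with tempfile.NamedTemporaryFile(mode='w') as file_list:
            file_list.write('\n'.join(relative_paths) + '\n')
            file_list.flush()
            # --files-from reads the list of files to transfer; --inplace is
            # deliberately *not* used, so a partially transferred file is
            # never visible under its final name.
            subprocess.run(
                ['rsync', '--files-from=' + file_list.name,
                 master_root, destination],
                check=True)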

DB structure

Postgres SQL definitions for the archival status:

    CREATE DOMAIN archive_id AS TEXT;

    CREATE TABLE archives (
      id   archive_id PRIMARY KEY,
      url  TEXT
    );

    CREATE TYPE archive_status AS ENUM (
      'missing',
      'ongoing',
      'present'
    );

    CREATE TABLE content_archive (
      content_id  sha1 REFERENCES content(sha1),
      archive_id  archive_id REFERENCES archives(id),
      status      archive_status,
      mtime       timestamptz,
      PRIMARY KEY (content_id, archive_id)
    );
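
For illustration, a query along these lines (PostgreSQL 9.4+ syntax; the threshold of 2 is an assumed retention policy) could list the contents with fewer than the required number of present copies; contents lacking any row in content_archive would need separate handling:

    SELECT content_id
    FROM content_archive
    GROUP BY content_id
    HAVING count(*) FILTER (WHERE status = 'present') < 2;
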
zack added a project: Restricted Project. (Apr 27 2016, 10:28 AM)
zack moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.
olasd changed the visibility from "All Users" to "Public (No Login Required)". (May 13 2016, 5:07 PM)
zack removed a project: Restricted Project. (Jun 1 2016, 7:19 PM)
ardumont changed the status of subtask T412: Bootstrap archiver's database from Open to Work in Progress. (Jul 16 2016, 10:21 AM)

This has actually been in production for quite a while. Some related tasks are still pending, but there is no need to keep the general "we need a content archiver" task open anymore.

gitlab-migration changed the status of subtask T400: Content archiver synchronous version from Resolved to Migrated.
gitlab-migration changed the status of subtask T481: Deploying archiver on uffizi from Resolved to Migrated.
gitlab-migration changed the status of subtask T482: First swh-storage-archiver run to catch up uffizi from Resolved to Migrated.
gitlab-migration changed the status of subtask T487: Moving the log DB from the SSD-based DB to the spinning-drive one from Resolved to Migrated.
gitlab-migration changed the status of subtask T499: Change the swh.storage of the archiver to a swh.objstorage from Resolved to Migrated.