software component(s) that run periodically and archive unarchived objects from the master object storage to one (or several) slave object storage(s)
Description
A draft blueprint of the content archiver is available at https://intranet.softwareheritage.org/index.php?title=Content_archiver_blueprint
The current version is copied below for future reference.
Software Heritage Archiver
The Software Heritage (SWH) Archiver is responsible for backing up SWH objects
so as to reduce the risk of losing them.
Currently, the archiver only deals with content objects (i.e., those referenced
by the content table in the DB and stored in the SWH object storage). The
database itself is live-replicated by other means.
Requirements
- Master/slave architecture
There is 1 master copy and 1 or more slave copies. A retention policy defines the minimum number of copies of each object needed to stay on the safe side.
- Append-only archival
The archiver treats the master as a read-only storage. The archiver writes to slave storages in append-only mode, never deleting any previously archived object. If removals are needed, in either the master or a slave, they will be dealt with by means other than the archiver.
- Asynchronous archival
Periodically (e.g., via cron), the archiver kicks in, produces a list of the objects that need to be copied from master to slaves, and starts copying objects as needed. Very likely, during any given archival run, other objects in need of replication will be added. It will *not* be up to that archival run to replicate them, but to future runs.
- Integrity at archival time
Before copying objects from master to slaves, the archiver performs integrity checks on the objects in need of replication (see the sketch after this list). For instance, content objects are verified to ensure that they can be decompressed and that their contents match their (checksum-based) unique identifiers. Corrupt objects will not be archived, and suitable error reports about the corruption will be emitted.
Note that archival-time integrity checks are not meant to replace periodic integrity checks on all master/slave copies; those are still needed!
- Parallel archival
Once the list of objects to be archived in a given run has been identified, it SHOULD be possible to archive them in parallel w.r.t. one another.
- Persistent archival status
The archiver maintains a mapping between objects and the locations where they are stored. Locations are the set {master, slave_1, ..., slave_n}.
Each object is also associated with the following information:
- status: 3-state: missing (copy not present at destination), ongoing (copy to destination ongoing), present (copy present at destination)
- mtime: timestamp of the last status change. In practice this is either the destination archival time (status=present) or the timestamp of the last archival request (status=ongoing)
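For illustration, here is a minimal Python sketch of the archival-time integrity check mentioned above, assuming (purely as an example of storage layout) that content objects are stored gzip-compressed in files named after their SHA1 identifier; in practice this check would go through swh.storage.ObjStorage.check:

  # Sketch only: assumes gzip-compressed objects named by their SHA1 checksum.
  import gzip
  import hashlib

  def content_is_sound(path, obj_id):
      """Check that the object at `path` decompresses cleanly and that its
      uncompressed content hashes back to its identifier `obj_id`."""
      try:
          with gzip.open(path, 'rb') as f:
              data = f.read()  # raises an error on a corrupt gzip stream
      except (OSError, EOFError):
          return False
      return hashlib.sha1(data).hexdigest() == obj_id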
Architecture
The archiver is composed of the following software components:
- archiver director
- archiver worker
- archiver copier
Archiver director
The archiver director is run periodically, e.g., via cron.
Runtime parameters:
- execution periodicity (external)
- retention policy
- archival max age
- archival batch size
At each execution the director:
- for each object: retrieve its archival status
- for each object that is in the master storage but has fewer copies than required by the retention policy:
  - if status=ongoing and mtime is not older than the archival max age, continue to the next object
  - check object integrity (e.g., with swh.storage.ObjStorage.check(obj_id))
  - mark the object as needing archival
- group objects in need of archival in batches of archival batch size
- for each batch:
  - set status=ongoing and mtime=now() on each object in the batch
  - spawn an archiver worker on the whole batch (e.g., by submitting the relevant celery task)
Note that if an archiver worker task takes a long time (t > archival max age)
to complete, it is possible for another task to be scheduled on the same batch,
or an overlapping one.
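To make the above concrete, here is a minimal Python sketch of one director run; the database and task helpers (iter_archival_status, mark_ongoing, report_corruption, archive_worker_task) are hypothetical names, not the actual SWH API:

  # Hypothetical sketch of one archiver director run.
  from datetime import datetime, timedelta

  RETENTION = 2                # retention policy: minimum number of copies
  MAX_AGE = timedelta(days=1)  # archival max age
  BATCH_SIZE = 1000            # archival batch size

  def run_director(db, master):
      to_archive = []
      for obj_id, copies in db.iter_archival_status():  # hypothetical helper
          if sum(c.status == 'present' for c in copies) >= RETENTION:
              continue  # enough copies already
          if any(c.status == 'ongoing'
                 and datetime.utcnow() - c.mtime <= MAX_AGE for c in copies):
              continue  # a recent archival request is still in flight
          if not master.check(obj_id):   # integrity check before archival
              report_corruption(obj_id)  # hypothetical error reporting
              continue
          to_archive.append(obj_id)
      for i in range(0, len(to_archive), BATCH_SIZE):
          batch = to_archive[i:i + BATCH_SIZE]
          db.mark_ongoing(batch, mtime=datetime.utcnow())  # status=ongoing
          archive_worker_task.delay(batch)  # e.g., submit a celery task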
Archiver worker
The archiver worker is executed on demand (e.g., by a celery worker) to archive
a given set of objects.
Runtime parameters:
- objects to archive
At each execution a worker:
- create an empty map { destination -> objects that need to be copied there }
- for each object to archive:
  - retrieve its current archival status
  - update the map, noting where the object needs to be copied
- for each destination:
  - look up in the map the objects that need to be copied there
  - copy all those objects to the destination using the copier
  - set status=present and mtime=now() on all copied objects
Note that:
- In case multiple jobs were tasked to archive the same or overlapping objects, the archival status lookup above might determine that some/all objects of this batch no longer need to be archived to some/all destinations.
- Due to parallelism, it is also possible that the same objects will be copied over at the same time by multiple workers.
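As an equally non-authoritative sketch, a worker run could look as follows (get_archival_status, mark_present, and the copier interface are hypothetical):

  # Hypothetical sketch of one archiver worker run over a batch of objects.
  from collections import defaultdict
  from datetime import datetime

  def run_worker(db, copier, batch, destinations):
      # map { destination -> objects that still need to be copied there }
      needed = defaultdict(list)
      for obj_id in batch:
          status = db.get_archival_status(obj_id)  # hypothetical helper
          for dest in destinations:
              if status.get(dest) != 'present':
                  needed[dest].append(obj_id)
      for dest, obj_ids in needed.items():
          copier.copy(obj_ids, dest)  # one connection per destination
          db.mark_present(dest, obj_ids, mtime=datetime.utcnow())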
Archiver copier
The copier is run on demand by archiver workers, to transfer a bunch of files
from the master to a given destination.
The copier transfers all files together over a single network connection. The
copying process is atomic at the file granularity (i.e., individual files might
be visible on the destination before *all* files have been transferred) and
ensures that concurrent transfers of the same files by multiple copier instances
do not result in corrupted files. Note that, due to this and the fact that
timestamps are updated by the director, all files copied in the same batch will
have the same mtime even though the actual file creation times on a given
destination might differ.
As a first approximation, the copier can be implemented using rsync, but a
dedicated protocol can be devised in the future. In the case of rsync, we
should use --files-from to list the files to be copied. We observe that rsync
atomically renames files one-by-one during transfer; so as long as --inplace is
*not* used, concurrent rsync runs over the same files should not be a problem.
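For instance, a minimal wrapper around such an rsync invocation could look like this in Python (the object root layout and destination syntax are assumptions):

  # Sketch of an rsync-based copier. --inplace is deliberately NOT used, so
  # rsync writes each file to a temporary name and renames it into place
  # atomically, keeping concurrent transfers of the same files safe.
  import subprocess
  import tempfile

  def rsync_batch(paths, src_root, dest):
      """Copy `paths` (relative to src_root) to `dest` over one connection."""
      with tempfile.NamedTemporaryFile(mode='w') as files_from:
          files_from.write('\n'.join(paths))
          files_from.flush()
          subprocess.run(
              ['rsync', '-a', '--files-from=' + files_from.name,
               src_root, dest],  # dest may be remote, e.g. host:/srv/objects
              check=True)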
DB structure
Postgres SQL definitions for the archival status:
CREATE DOMAIN archive_id AS TEXT;

CREATE TABLE archives (
  id   archive_id PRIMARY KEY,
  url  TEXT
);

CREATE TYPE archive_status AS ENUM (
  'missing',
  'ongoing',
  'present'
);

CREATE TABLE content_archive (
  content_id  sha1 REFERENCES content(sha1),
  archive_id  archive_id REFERENCES archives(id),
  status      archive_status,
  mtime       timestamptz,
  PRIMARY KEY (content_id, archive_id)
);
This has actually been in production for quite a while. Some related tasks are still pending, but there is no need to keep the general "we need a content archiver" task open anymore.