diff --git a/doc/archiver-blueprint.md b/doc/archiver-blueprint.md
new file mode 100644
index 00000000..52c5e252
--- /dev/null
+++ b/doc/archiver-blueprint.md
@@ -0,0 +1,181 @@
+Software Heritage Archiver
+==========================
+
+The Software Heritage (SWH) Archiver is responsible for backing up SWH objects
+so as to reduce the risk of losing them.
+
+Currently, the archiver only deals with content objects (i.e., those referenced
+by the content table in the DB and stored in the SWH object storage). The
+database itself is replicated live by other means.
+
+
+Requirements
+------------
+
+* **Master/slave architecture**
+
+  There is 1 master copy and 1 or more slave copies of each object. A retention
+  policy specifies the minimum number of copies that are required to be "safe".
+
+* **Append-only archival**
+
+  The archiver treats master as read-only storage and slaves as append-only
+  storages. The archiver never deletes any object. If removals are needed, in
+  either master or slaves, they will be dealt with by other means.
+
+* **Asynchronous archival**
+
+  Periodically (e.g., via cron), the archiver runs, produces a list of objects
+  that need to be copied from master to slaves, and starts copying them over.
+  Very likely, during any given archival run other new objects will be added to
+  master; it will be the responsibility of *future* archiver runs, and not the
+  current one, to copy new objects over.
+
+* **Integrity at archival time**
+
+  Before archiving objects, the archiver performs suitable integrity checks on
+  them. For instance, content objects are verified to ensure that they can be
+  decompressed and that their content matches their (checksum-based) unique
+  identifiers. Corrupt objects will not be archived and suitable error
+  reports about the corruption will be emitted.
+
+  Note that archival-time integrity checks are *not meant to replace periodic
+  integrity checks* on both master and slave copies.
+
+* **Parallel archival**
+
+  Once the list of objects to be archived has been identified, it SHOULD be
+  possible to archive objects in parallel w.r.t. one another.
+
+* **Persistent archival status**
+
+  The archiver maintains a mapping between objects and the locations where they
+  are stored. Locations are the set {master, slave_1, ..., slave_n}.
+
+  Each (object, location) pair is also associated with the following
+  information:
+
+  * **status**: 3-state: *missing* (copy not present at destination), *ongoing*
+    (copy to destination ongoing), *present* (copy present at destination)
+  * **mtime**: timestamp of last status change. This is either the destination
+    archival time (when status=present), or the timestamp of the last archival
+    request (status=ongoing); the timestamp is undefined when status=missing.
+
+
+Architecture
+------------
+
+The archiver consists of the following software components:
+
+* archiver director
+* archiver worker
+* archiver copier
+
+
+### Archiver director
+
+The archiver director is run periodically, e.g., via cron.
+
+Runtime parameters:
+
+* execution periodicity (external)
+* retention policy
+* archival max age
+* archival batch size
+
+At each execution the director:
+
+1. for each object: retrieve its archival status
+2. for each object that is in the master storage but has fewer copies than
+   those requested by the retention policy:
+   1. if status=ongoing and mtime is not older than archival max age
+      then continue to next object
+   2. check object integrity (e.g., with swh.storage.ObjStorage.check(obj_id))
+   3. mark object as needing archival
+3. group objects in need of archival in batches of archival batch size
+4. for each batch:
+   1. set status=ongoing and mtime=now() for each object in the batch
+   2.
spawn an archiver worker on the whole batch (e.g., submitting the relevant
+      celery task)
+
+Note that if an archiver worker task takes a long time (t > archival max age)
+to complete, it is possible for another task to be scheduled on the same batch,
+or an overlapping one.
+
+
+### Archiver worker
+
+The archiver worker is executed on demand (e.g., by a celery worker) to archive
+a given set of objects.
+
+Runtime parameters:
+
+* objects to archive
+
+At each execution a worker:
+
+1. create empty map { destinations -> objects that need to be copied there }
+2. for each object to archive:
+   1. retrieve current archive status for all destinations
+   2. update the map noting where the object needs to be copied
+3. for each destination:
+   1. look up in the map objects that need to be copied there
+   2. copy all objects to destination using the copier
+   3. set status=present and mtime=now() for each copied object
+
+Note that:
+
+* In case multiple jobs were tasked to archive the same or overlapping
+  objects, step (2.2) might decide that some/all objects of this batch no
+  longer need to be archived to some/all destinations.
+
+* Due to parallelism, it is also possible that the same objects will be copied
+  over at the same time by multiple workers.
+
+
+### Archiver copier
+
+The copier is run on demand by archiver workers, to transfer file batches from
+master to a given destination.
+
+The copier transfers all files together with a single network connection. The
+copying process is atomic at the file granularity (i.e., individual files might
+be visible on the destination before *all* files have been transferred) and
+ensures that *concurrent transfer of the same files by multiple copier
+instances does not result in corrupted files*. Note that, due to this and the
+fact that timestamps are updated by the worker once a batch has been copied
+(step 3.3 above), all files copied in the same batch will have the same mtime
+even though the actual file creation times on a given destination might differ.
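
The per-file atomicity described above can be illustrated with a minimal local sketch. It is not the archiver's actual copier: it assumes the destination is a directory on a local filesystem, and the names `copy_file_atomically` and `copy_batch` are hypothetical. Each file is written under a unique temporary name and then renamed into place, so a reader never sees a partial file and two concurrent copies of the same object cannot interleave their writes.

```python
import os
import shutil
import tempfile


def copy_file_atomically(src_path, dest_dir):
    """Copy src_path into dest_dir so that no partial file is ever visible.

    Hypothetical helper: assumes dest_dir is a local directory.
    """
    os.makedirs(dest_dir, exist_ok=True)
    final_path = os.path.join(dest_dir, os.path.basename(src_path))
    # Each copier writes to its own uniquely named temporary file, so
    # concurrent copies of the same object never share a write target.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src_path, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Renaming within one filesystem is atomic: the final name points
        # either at the old complete file or at the new complete file.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


def copy_batch(src_paths, dest_dir):
    """Copy a batch file by file; atomicity holds per file, not per batch."""
    return [copy_file_atomically(path, dest_dir) for path in src_paths]
```

Because atomicity is per file, a batch that fails halfway leaves some complete files at the destination and no corrupt ones, which matches the append-only requirement.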
+
+As a first approximation, the copier can be implemented using rsync, but a
+dedicated protocol can be devised later. In the case of rsync, one should use
+--files-from to list the files to be copied. Rsync atomically renames files
+one-by-one during transfer; so as long as --inplace is *not* used, concurrent
+rsync of the same files should not be a problem.
+
+
+DB structure
+------------
+
+PostgreSQL definitions for the archival status:
+
+    CREATE DOMAIN archive_id AS TEXT;
+
+    CREATE TABLE archives (
+      id   archive_id PRIMARY KEY,
+      url  TEXT
+    );
+
+    CREATE TYPE archive_status AS ENUM (
+      'missing',
+      'ongoing',
+      'present'
+    );
+
+    CREATE TABLE content_archive (
+      content_id  sha1 REFERENCES content(sha1),
+      archive_id  archive_id REFERENCES archives(id),
+      status      archive_status,
+      mtime       timestamptz,
+      PRIMARY KEY (content_id, archive_id)
+    );
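
As a rough illustration of how the director's selection and batching steps could operate on rows shaped like `content_archive`, here is a minimal in-memory sketch. It is not the project's implementation. The function name `select_batches`, the `is_intact` callback (standing in for the integrity check), and the dictionary layout are hypothetical. It also makes two assumptions the blueprint leaves open: every row with status `present`, including the master's, counts towards the retention policy, and an object is skipped if *any* of its destinations has a recent `ongoing` status.

```python
from datetime import datetime, timedelta, timezone


def select_batches(statuses, retention, max_age, batch_size, now,
                   is_intact=lambda obj_id: True):
    """Return batches of object ids that need archival.

    statuses maps object id -> {archive_id: (status, mtime)}, mirroring the
    (content_id, archive_id, status, mtime) rows of content_archive.
    """
    to_archive = []
    for obj_id, copies in statuses.items():
        present = sum(1 for status, _ in copies.values() if status == "present")
        if present >= retention:
            continue  # enough copies already exist
        # Assumption: a recent 'ongoing' copy anywhere means a worker is
        # probably still handling this object, so leave it alone for now.
        if any(status == "ongoing" and now - mtime <= max_age
               for status, mtime in copies.values()):
            continue
        if not is_intact(obj_id):
            continue  # corrupt objects are never archived
        to_archive.append(obj_id)
    # Group the selected objects into batches of at most batch_size.
    return [to_archive[i:i + batch_size]
            for i in range(0, len(to_archive), batch_size)]


now = datetime(2016, 1, 1, tzinfo=timezone.utc)
statuses = {
    "a": {"master": ("present", now), "slave_1": ("present", now)},
    "b": {"master": ("present", now), "slave_1": ("missing", None)},
    "c": {"master": ("present", now),
          "slave_1": ("ongoing", now - timedelta(minutes=5))},
    "d": {"master": ("present", now),
          "slave_1": ("ongoing", now - timedelta(hours=5))},
    "e": {"master": ("present", now), "slave_1": ("missing", None)},
}
batches = select_batches(statuses, retention=2, max_age=timedelta(hours=1),
                         batch_size=2, now=now)
```

In this example, object `d` is selected again because its `ongoing` status is older than the max age, which is exactly the case where the same object may end up scheduled on two workers.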