diff --git a/docs/archiver-blueprint.md b/docs/archiver-blueprint.md
--- a/docs/archiver-blueprint.md
+++ b/docs/archiver-blueprint.md
@@ -12,24 +12,32 @@
 Requirements
 ------------
 
-* **Master/slave architecture**
+* **Peer-to-peer topology**
 
-  There is 1 master copy and 1 or more slave copies of each object. A retention
-  policy specifies the minimum number of copies that are required to be "safe".
+  Every storage involved in the archival process can be used as a source or a
+  destination for the archival, depending on the blobs it contains. A
+  retention policy specifies the minimum number of copies that are required
+  to be "safe".
+
+  Although the servers are completely equal, the coordination of which
+  content should be copied, and from where to where, is centralized.
 
 * **Append-only archival**
 
-  The archiver treats master as read-only storage and slaves as append-only
-  storages. The archiver never deletes any object. If removals are needed, in
-  either master or slaves, they will be dealt with by other means.
+  The archiver treats involved storages as append-only storages. The archiver
+  never deletes any object. If removals are needed, they will be dealt with
+  by other means.
 
 * **Asynchronous archival.**
 
   Periodically (e.g., via cron), the archiver runs, produces a list of objects
-  that need to be copied from master to slaves, and starts copying them over.
-  Very likely, during any given archival run other new objects will be added to
-  master; it will be the responsibility of *future* archiver runs, and not the
-  current one, to copy new objects over.
+  that need to have more copies, and starts copying them over. The decision of
+  which storages are chosen to be sources and destinations is not up to the
+  storages themselves.
+
+  Very likely, during any given archival run, other new objects will be added
+  to storages; it will be the responsibility of *future* archiver runs, and
+  not the current one, to copy new objects over if needed.
 
 * **Integrity at archival time.**
 
@@ -40,7 +48,7 @@
   reporting about the corruption will be emitted.
 
   Note that archival-time integrity checks are *not meant to replace periodic
-  integrity checks* on both master and slave copies.
+  integrity checks*.
 
 * **Parallel archival**
 
@@ -56,7 +64,8 @@
   information:
 
-  * **status**: 3-state: *missing* (copy not present at destination), *ongoing*
-    (copy to destination ongoing), *present* (copy present at destination)
+  * **status**: 4-state: *missing* (copy not present at destination), *ongoing*
+    (copy to destination ongoing), *present* (copy present at destination),
+    *corrupted* (content detected as corrupted during an archival).
   * **mtime**: timestamp of last status change. This is either the destination
     archival time (when status=present), or the timestamp of the last archival
     request (status=ongoing); the timestamp is undefined when status=missing.
@@ -86,22 +95,14 @@
 At each execution the director:
 
 1. for each object: retrieve its archival status
-2. for each object that is in the master storage but has fewer copies than
-   those requested by the retention policy:
-   1. if status=ongoing and mtime is not older than archival max age
-      then continue to next object
-   2. check object integrity (e.g., with swh.storage.ObjStorage.check(obj_id))
-   3. mark object as needing archival
-3. group objects in need of archival in batches of archival batch size
+2. for each object that has fewer copies than those requested by the
+   retention policy:
+   1. mark object as needing archival
+3. group objects in need of archival in batches of `archival batch size`
 4. for each batch:
-   1. set status=ongoing and mtime=now() for each object in the batch
-   2. spawn an archive worker on the whole batch (e.g., submitting the relevant
+   1. spawn an archive worker on the whole batch (e.g., submitting the relevant
       celery task)
 
-Note that if an archiver worker task takes a long time (t > archival max age)
-to complete, it is possible for another task to be scheduled on the same batch,
-or an overlapping one.
-
 
 ### Archiver worker
 
@@ -111,47 +112,66 @@
 Runtime parameters:
 
 * objects to archive
+* archival policies (retention & archival max age)
 
 At each execution a worker:
 
-1. create empty map { destinations -> objects that need to be copied there }
+1. for each object in the batch:
+   1. check that the object still needs to be archived
+      (#present copies < retention policy)
+   2. if an object has status=ongoing but the elapsed time from task
+      submission is less than the *archival max age*, it counts as a present
+      copy, as we assume that it will be copied in the future; once that
+      delay has elapsed, it counts as a missing copy
 2. for each object to archive:
    1. retrieve current archive status for all destinations
-   2. update the map noting where the object needs to be copied
-3. for each destination:
-   1. look up in the map objects that need to be copied there
-   2. copy all objects to destination using the copier
-   3. set status=present and mtime=now() for each copied object
+   2. create a map noting where the object is present and where it can be
+      copied
+   3. randomly choose pairs (source, destination), with all destinations
+      distinct, to make enough copies
+3. join the resulting (content, source, destination) triples by key
+   (source, destination) to obtain a map {(source, destination) -> [contents]}
+4. for each transfer pair, use a copier to make the copies
 
 Note that:
 
-* In case multiple jobs where tasked to archive the same of overlapping
-  objects, step (2.2) might decide that some/all objects of this batch no
-  longer need to be archived to some/all destinations.
+* In case multiple jobs were tasked to archive the same or overlapping
+  objects, step (1) might decide that some/all objects of this batch no
+  longer need to be archived.
 
-* Due to parallelism, it is also possible that the same objects will be copied
-  over at the same time by multiple workers.
+* Due to parallelism, it is possible that the same objects will be copied
+  over at the same time by multiple workers. Also, the same object could end
+  up having more copies than the minimum number required.
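+
+As an illustration of steps 1-3, the transfer planning could look like the
+following Python sketch. The helper names, the in-memory status map, and the
+parameter values are assumptions of this sketch, not part of the blueprint:
+
+    import random
+    import time
+
+    RETENTION_POLICY = 2     # minimum number of "safe" copies (assumed value)
+    ARCHIVAL_MAX_AGE = 3600  # seconds before an ongoing copy counts as stale
+
+    def plan_transfers(batch, copies_status, now=None):
+        """Compute a {(source, destination): [contents]} transfer map.
+
+        `copies_status` maps content_id -> {server: (status, mtime)}.
+        """
+        now = now or time.time()
+        transfers = {}
+        for content_id in batch:
+            copies = copies_status[content_id]
+            present = [srv for srv, (st, _) in copies.items()
+                       if st == 'present']
+            # fresh ongoing copies count as present; stale ones as missing
+            pending = [srv for srv, (st, mtime) in copies.items()
+                       if st == 'ongoing' and now - mtime < ARCHIVAL_MAX_AGE]
+            missing = [srv for srv, (st, _) in copies.items()
+                       if st == 'missing']
+            needed = RETENTION_POLICY - len(present) - len(pending)
+            if needed <= 0 or not present:
+                continue  # already safe, or no valid source to copy from
+            # pick distinct destinations at random, each paired with a
+            # randomly chosen source holding a valid copy
+            for dest in random.sample(missing, min(needed, len(missing))):
+                source = random.choice(present)
+                transfers.setdefault((source, dest), []).append(content_id)
+        return transfers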
 
 ### Archiver copier
 
 The copier is run on demand by archiver workers, to transfer file batches from
-master to a given destination.
+a given source to a given destination.
+
+The copier transfers files one by one. The copying process is atomic at the
+file granularity (i.e., individual files might be visible on the destination
+before *all* files have been transferred) and ensures that *concurrent
+transfers of the same files by multiple copier instances do not result in
+corrupted files*. Note that, due to this and the fact that timestamps are
+updated by the worker, all files copied in the same batch will have the same
+mtime even though the actual file creation times on a given destination might
+differ.
 
-The copier transfers all files together with a single network connection. The
-copying process is atomic at the file granularity (i.e., individual files might
-be visible on the destination before *all* files have been transferred) and
-ensures that *concurrent transfer of the same files by multiple copier
-instances do not result in corrupted files*. Note that, due to this and the
-fact that timestamps are updated by the director, all files copied in the same
-batch will have the same mtime even though the actual file creation times on a
-given destination might differ.
+The copier is implemented using the ObjStorage API for the sources and
+destinations.
 
-As a first approximation, the copier can be implemented using rsync, but a
-dedicated protocol can be devised later. In the case of rsync, one should use
---files-from to list the file to be copied. Rsync atomically renames files
-one-by-one during transfer; so as long as --inplace is *not* used, concurrent
-rsync of the same files should not be a problem.
+At each execution of the copier:
+
+1. for each object in the batch:
+   1. check the object on the source storage
+      * if the object is corrupted or missing:
+        * update its status in the database
+        * remove it from the current batch
+   2. start the copy of the content from the source storage to the
+      destination storage
+      * if an error occurs on a content that should have been valid,
+        consider the whole batch as a failure
+   3. set status=present and mtime=now() for each successfully copied object
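+
+A possible shape for this loop, as a Python sketch. The ObjStorage-like
+interface (`check()`, `get()`, `add()`), the exception handling, the server
+`name` attribute, and the `db` helper are assumptions of this sketch:
+
+    def copy_batch(contents, source, destination, db):
+        """Copy `contents` from `source` to `destination`, one by one."""
+        valid = []
+        for obj_id in contents:
+            try:
+                source.check(obj_id)  # integrity check on the source copy
+            except Exception:
+                # corrupted (or missing) on the source: record it and drop
+                # the object from the current batch
+                db.update_status(obj_id, source.name, 'corrupted')
+                continue
+            valid.append(obj_id)
+
+        try:
+            for obj_id in valid:
+                destination.add(source.get(obj_id), obj_id)
+        except Exception:
+            # a content that passed the check failed to copy: the whole
+            # batch is considered a failure
+            return False
+
+        for obj_id in valid:
+            # sets status=present and mtime=now() in content_archive
+            db.update_status(obj_id, destination.name, 'present')
+        return True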
 
 
 DB structure
 ------------
 
@@ -159,9 +179,13 @@
 Postgres SQL definitions for the archival status:
 
-    CREATE DOMAIN archive_id AS TEXT;
+    -- these names are samples of archive server names
+    CREATE TYPE archive_id AS ENUM (
+      'uffizi',
+      'banco'
+    );
 
-    CREATE TABLE archives (
+    CREATE TABLE archive (
       id  archive_id PRIMARY KEY,
       url TEXT
     );
 
@@ -169,13 +193,39 @@
     CREATE TYPE archive_status AS ENUM (
       'missing',
       'ongoing',
-      'present'
+      'present',
+      'corrupted'
     );
 
     CREATE TABLE content_archive (
-      content_id sha1 REFERENCES content(sha1),
-      archive_id archive_id REFERENCES archives(id),
-      status archive_status,
-      mtime timestamptz,
-      PRIMARY KEY (content_id, archive_id)
+      content_id sha1 UNIQUE,
+      copies jsonb
     );
+
+
+The content_archive.copies field is of type jsonb and contains data about
+which storages hold (or do not hold) the content identified by the sha1:
+
+    {
+      "$schema": "http://json-schema.org/schema#",
+      "title": "Copies data",
+      "description": "Data about the presence of a content in the storages",
+      "type": "object",
+      "patternProperties": {
+        "^[a-zA-Z0-9]+$": {
+          "description": "archival status for the server named by the key",
+          "type": "object",
+          "properties": {
+            "status": {
+              "description": "archival status on this copy, one of the archive_status enum values",
+              "type": "string"
+            },
+            "mtime": {
+              "description": "timestamp of the last status update",
+              "type": "number"
+            }
+          }
+        }
+      },
+      "additionalProperties": false
+    }
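+
+For illustration, a possible way to update and query this field from Python.
+The driver choice, the database name, and the example identifier are
+assumptions of this sketch, and a retention policy of 2 copies is assumed:
+
+    import time
+
+    import psycopg2
+    from psycopg2.extras import Json
+
+    # example copies document: the content is safe on 'uffizi', and a copy
+    # to 'banco' has been requested but not confirmed yet
+    copies = {
+        "uffizi": {"status": "present", "mtime": 1428332927.0},
+        "banco": {"status": "ongoing", "mtime": time.time()},
+    }
+
+    content_id = bytes.fromhex("34973274ccef6ab4dfaaf86599792fa9c3fe4689")
+
+    with psycopg2.connect("dbname=softwareheritage-archiver") as db:
+        with db.cursor() as cur:
+            # record the new status map for this content
+            cur.execute("UPDATE content_archive SET copies = %s"
+                        " WHERE content_id = %s",
+                        (Json(copies), content_id))
+            # director-style query: contents with fewer 'present' copies
+            # than the retention policy requires
+            cur.execute("""
+                SELECT content_id FROM content_archive
+                WHERE (SELECT count(*)
+                       FROM jsonb_each(copies) AS c(name, props)
+                       WHERE props ->> 'status' = 'present') < 2
+            """)
+            under_replicated = [row[0] for row in cur.fetchall()]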