Software Heritage Archiver
==========================

The Software Heritage (SWH) Archiver is responsible for backing up SWH objects
so as to reduce the risk of losing them.

Currently, the archiver only deals with content objects (i.e., those referenced
by the content table in the DB and stored in the SWH object storage). The
database itself is replicated live by other means.
Requirements
------------

* **Peer-to-peer topology**

  Every storage involved in the archival process can be used as a source or a
  destination for the archival, depending on the blobs it contains. A
  retention policy specifies the minimum number of copies that are required
  to be "safe".

  Although the servers are considered equal, the coordination of which
  content should be copied, and from where to where, is currently
  centralized.
* **Append-only archival**

  The archiver treats the involved storages as append-only storages. The
  archiver never deletes any object. If removals are needed, they will be
  dealt with by other means.
* **Asynchronous archival.**

  Periodically (e.g., via cron), the archiver runs, produces a list of objects
  that need to have more copies, and starts copying them over. The decision of
  which storages are chosen as sources and destinations is not up to the
  storages themselves.

  Very likely, during any given archival run, other new objects will be added
  to storages; it will be the responsibility of *future* archiver runs, and
  not the current one, to copy new objects over if needed.
* **Integrity at archival time.**

  Before archiving objects, the archiver performs suitable integrity checks on
  them. For instance, content objects are verified to ensure that they can be
  decompressed and that their content matches their (checksum-based) unique
  identifiers. Corrupt objects will not be archived and suitable errors
  reporting the corruption will be emitted.

  Note that archival-time integrity checks are *not meant to replace periodic
  integrity checks*.
* **Parallel archival**

  Once the list of objects to be archived has been identified, it SHOULD be
  possible to archive objects in parallel w.r.t. one another.
* **Persistent archival status**

  The archiver maintains a mapping between objects and the locations where
  they are stored. Locations are the storages involved in the archival
  process.

  Each pair <object, destination> is also associated to the following
  information:

  * **status**: 4-state: *missing* (copy not present at destination),
    *ongoing* (copy to destination ongoing), *present* (copy present at
    destination), *corrupted* (content detected as corrupted during an
    archival).

  * **mtime**: timestamp of last status change. This is either the
    destination archival time (when status=present), or the timestamp of the
    last archival request (status=ongoing); the timestamp is undefined when
    status=missing.
Architecture
------------

### Archiver director

The director runs periodically to select the objects that need more copies
and to dispatch them to archiver workers.

Runtime parameters:
* execution periodicity (external)
* retention policy
* archival max age
* archival batch size
At each execution the director:

1. for each object: retrieve its archival status
2. for each object that has fewer copies than those requested by the
   retention policy: mark the object as needing archival
3. group objects in need of archival in batches of `archival batch size`
4. for each batch: spawn an archive worker on the whole batch (e.g., by
   submitting the relevant celery task)

Note that if an archiver worker task takes a long time (t > archival max age)
to complete, it is possible for another task to be scheduled on the same
batch, or an overlapping one.
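The director loop above can be sketched as follows. This is a minimal
illustration, not the actual swh.archiver code: the names
(`run_director`, `spawn_worker`) and the in-memory status map are
hypothetical stand-ins for the database-backed status and the celery
task submission.

```python
# Hypothetical sketch of one director execution (steps 1-4 above).

RETENTION_POLICY = 2      # minimum number of "safe" copies per object
ARCHIVAL_BATCH_SIZE = 3   # objects handed to each worker task

def objects_needing_archival(archival_status):
    """Step 2: select objects whose number of present copies is below
    the retention policy."""
    for obj_id, copies in archival_status.items():
        present = sum(1 for status in copies.values() if status == 'present')
        if present < RETENTION_POLICY:
            yield obj_id

def batches(objects, size):
    """Step 3: group objects in batches of `archival batch size`."""
    objects = list(objects)
    for i in range(0, len(objects), size):
        yield objects[i:i + size]

def run_director(archival_status, spawn_worker):
    """One director execution over a {obj_id -> {storage -> status}} map."""
    needing = objects_needing_archival(archival_status)
    for batch in batches(needing, ARCHIVAL_BATCH_SIZE):
        spawn_worker(batch)  # step 4, e.g., submit a celery task
```

Note that the director only counts present copies; filtering out
still-ongoing copies is left to the workers, which keeps the director's
per-object work to a single status lookup.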
### Archiver worker

The archiver worker is executed on demand (e.g., by a celery worker) to
archive a given set of objects.

Runtime parameters:

* objects to archive
* archival policies (retention & archival max age)
At each execution a worker:

1. for each object in the batch:
   1. check that the object still needs to be archived
      (#present copies < retention policy)
   2. if an object has status=ongoing and the elapsed time since the last
      archival request (mtime) is less than the *archival max age*, count
      that copy as present, assuming it will eventually be completed;
      otherwise, count it as missing
2. for each object still to archive:
   1. retrieve the current archival status for all storages
   2. create a map noting where the object is present and where it can be
      copied
   3. randomly choose (source, destination) couples, with all-different
      destinations, until enough copies are planned
3. group the resulting (content, source, destination) tuples by key
   (source, destination), to obtain a map
   {(source, destination) -> [contents]}
4. for each (source, destination) -> contents entry:
   1. for each content, check its integrity on the source storage:
      * if the object is corrupted or missing:
        * update its status in the database
        * remove it from the current contents list
5. start the copy of the batches by launching a copier for each transfer
   tuple
   * if an error occurred on a content that should have been valid,
     consider the whole batch as a failure
6. set status=present and mtime=now() for each successfully copied object
Note that:

* In case multiple jobs were tasked to archive the same or overlapping
  objects, step (1) might decide that some/all objects of this batch no
  longer need to be archived.

* Due to parallelism, it is possible that the same objects will be copied
  over at the same time by multiple workers. Also, the same object could
  end up having more copies than the minimal number required.
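The worker's copy-counting and planning rules (steps 1-2) can be sketched
as below. This is an illustration under assumptions: the function names,
the `{storage -> {'status', 'mtime'}}` layout, and the parameter values
are hypothetical, not the actual swh.archiver implementation.

```python
# Hypothetical sketch of the worker's copy counting and (source,
# destination) planning.
import random

ARCHIVAL_MAX_AGE = 3600  # seconds; illustrative value

def effective_copies(copies, now):
    """Count copies that can be relied upon: present copies, plus
    ongoing copies whose archival request (mtime) is not older than the
    archival max age (step 1.2)."""
    count = 0
    for info in copies.values():
        if info['status'] == 'present':
            count += 1
        elif (info['status'] == 'ongoing'
              and now - info['mtime'] < ARCHIVAL_MAX_AGE):
            count += 1
    return count

def plan_copies(copies, retention, now, rng=random):
    """Step 2.3: randomly choose (source, destination) couples, with
    all-different destinations, until enough copies are planned."""
    sources = [s for s, i in copies.items() if i['status'] == 'present']
    candidates = [s for s, i in copies.items() if i['status'] == 'missing']
    missing = retention - effective_copies(copies, now)
    if missing <= 0:
        return []
    destinations = rng.sample(candidates, min(missing, len(candidates)))
    return [(rng.choice(sources), d) for d in destinations]
```

An ongoing copy older than the max age counts as missing, so a stalled
or crashed earlier worker cannot permanently block re-archival.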
### Archiver copier

The copier is run on demand by archiver workers, to transfer file batches
from a given source to a given destination.

The copier transfers files one by one. The copying process is atomic at the
file granularity (i.e., individual files might be visible on the destination
before *all* files have been transferred) and ensures that *concurrent
transfers of the same files by multiple copier instances do not result in
corrupted files*. Note that, due to this and the fact that timestamps are
updated by the worker, all files copied in the same batch will have the same
mtime even though the actual file creation times on a given destination might
differ.

The copier is implemented using the ObjStorage API for the sources and
destinations.
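A file-by-file copy over an ObjStorage-like interface can be sketched as
follows. The `get`/`add` method names mirror the object-storage style of
access, but their exact signatures here, and the in-memory
`DictObjStorage` stand-in, are assumptions of this sketch rather than the
real swh ObjStorage API.

```python
# Hypothetical sketch of the copier over an ObjStorage-like interface.

class DictObjStorage:
    """In-memory stand-in for an object storage, for illustration."""

    def __init__(self):
        self._objects = {}

    def get(self, obj_id):
        return self._objects[obj_id]

    def add(self, content, obj_id):
        # Writes are keyed by obj_id, so concurrent copies of the same
        # file overwrite each other with identical bytes and cannot
        # produce a corrupted object.
        self._objects[obj_id] = content

    def __contains__(self, obj_id):
        return obj_id in self._objects

def copy_batch(source, destination, obj_ids):
    """Transfer the batch one file at a time: atomic at the file
    granularity, so earlier files become visible on the destination
    before later ones are transferred."""
    copied = []
    for obj_id in obj_ids:
        destination.add(source.get(obj_id), obj_id)
        copied.append(obj_id)
    return copied
```
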
DB structure
------------

Postgres SQL definitions for the archival status:

    -- those names are samples of archive server names
    CREATE TYPE archive_id AS ENUM (
      'uffizi',
      'banco'
    );

    CREATE TABLE archive (
      id  archive_id PRIMARY KEY,
      url TEXT
    );

    CREATE TYPE archive_status AS ENUM (
      'missing',
      'ongoing',
      'present',
      'corrupted'
    );

    CREATE TABLE content_archive (
      content_id sha1 UNIQUE,
      copies     jsonb
    );
Where the content_archive.copies field is of type jsonb and contains data
about the storages that contain (or not) the content identified by the sha1:

    {
      "$schema": "http://json-schema.org/schema#",
      "title": "Copies data",
      "description": "Data about the presence of a content in the storages",
      "type": "object",
      "patternProperties": {
        "^[a-zA-Z1-9]+$": {
          "description": "archival status for the server named by the key",
          "type": "object",
          "properties": {
            "status": {
              "description": "archival status on this copy",
              "type": "string",
              "enum": ["missing", "ongoing", "present", "corrupted"]
            },
            "mtime": {
              "description": "last time of the status update",
              "type": "number"
            }
          }
        }
      },
      "additionalProperties": false
    }
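A `copies` document conforming to the schema above, and a minimal
hand-rolled check of its constraints (server-name pattern, status enum,
numeric mtime), might look as follows; a real deployment would use a
proper JSON Schema validator instead, and the storage names are the
sample ones from the SQL above.

```python
# Illustrative copies document and a minimal consistency check.
import json
import re

VALID_STATUS = {"missing", "ongoing", "present", "corrupted"}
NAME_RE = re.compile(r"^[a-zA-Z1-9]+$")  # pattern from the schema

def check_copies(doc):
    """Return True iff the JSON document satisfies the schema's
    constraints: valid server names, enum statuses, numeric mtimes."""
    data = json.loads(doc)
    for name, copy in data.items():
        if not NAME_RE.match(name):
            return False
        if copy.get("status") not in VALID_STATUS:
            return False
        if not isinstance(copy.get("mtime"), (int, float)):
            return False
    return True

copies = json.dumps({
    "uffizi": {"status": "present", "mtime": 1466067200.0},
    "banco": {"status": "ongoing", "mtime": 1466070800.0},
})
```
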