diff --git a/docs/mirror.rst b/docs/mirror.rst
index 145ab6c..b2fd491 100644
--- a/docs/mirror.rst
+++ b/docs/mirror.rst
@@ -1,138 +1,132 @@
-.. highlight:: bash
-
 .. _mirror:

-Software Heritage Mirror
-========================
+
+Mirroring
+=========
+

 Description
 -----------

-A mirror is a full copy of the |swh| Archive. A minimal copy consists in 2
-parts:
+A mirror is a full copy of the |swh| archive, operated independently from the
+Software Heritage initiative. A minimal mirror consists of two parts:

 - the graph storage (typically an instance of :ref:`swh.storage `),
-- the object storage (typically an instance of :ref:`swh.objstorage `).
+  which contains the Merkle DAG structure of the archive, *except* the
+  actual content of source code files (AKA blobs),
+
+- the object storage (typically an instance of :ref:`swh.objstorage `),
+  which contains all the blobs corresponding to archived source code files.

-However, a usable mirror needs also to be accessible. As such, a proper mirror
-should also allow to:
+However, a usable mirror also needs to be accessible by others. As such, a
+proper mirror should also make it possible to:

-- navigate the copy of the archive using a web browser (typically using the
-  :ref:`the web application `),
+- navigate the archive copy using a Web browser and/or the Web API (typically
+  using :ref:`the web application `),

 - retrieve data from the copy of the archive (typically using the
   :ref:`the vault service `)

-A mirror is filled consuming data from the |swh| Kafka-based :ref:`journal
-` and retrieving the blob objects (file content) from the |swh|
-:ref:`object storage `.
+A mirror is initially populated and kept up to date by consuming data
+from the |swh| Kafka-based :ref:`journal ` and retrieving the
+blob objects (file content) from the |swh| :ref:`object storage `.

-.. note:: A mirror of the |swh| Archive is not necessarly implemented using the
-   |swh| software stack.
-   In this documentation however we will describe the
-   case of a mirror using the |swh| software stack.
+.. note:: A mirror does not have to be deployed using the |swh| software
+   stack. Other technologies, including different storage methods, can be
+   used. This documentation, however, focuses on the case of a mirror
+   deployed using the |swh| software stack.

 .. thumbnail:: images/mirror-architecture.svg

    General view of the |swh| mirroring architecture.

-In this documentation, we will focus only on replication mechanisms using the
-software stack provided by |swh|. Setting up web services or other storage
-methods will not be covered here.
+Mirroring the Graph Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Replicating the Graph Storage
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The replication of the graph is based on a journal using Kafka as event
+The replication of the graph is based on a journal using Kafka_ as its event
 streaming platform.

-On the main Software Heritage side, every addition made to the graph consist in
-the insertion of a :ref:`data-model` object. This added object is also
-serialized as a msgpack_ bytestring which is used as value of a Kafka message
-in a topic dedicated to the object type.
+On the Software Heritage side, every addition made to the archive consists of
+the insertion of a :ref:`data-model` object. The new object is also serialized
+as a msgpack_ bytestring which is used as the value of a message added to a
+Kafka topic dedicated to the object type.

-Topics for the main part of the |swh| :ref:`data-model` are:
+The main Kafka topics for the |swh| :ref:`data-model` are:

 - `swh.journal.objects.content`
-- `swh.journal.objects.skipped_content`
 - `swh.journal.objects.directory`
-- `swh.journal.objects.revision`
-- `swh.journal.objects.release`
-- `swh.journal.objects.snapshot`
-- `swh.journal.objects.origin`
-- `swh.journal.objects.origin_visit`
-- `swh.journal.objects.origin_visit_status`
-
-In addition to these are a few topics for :ref:`extrinsic metadata
-`:
-
 - `swh.journal.objects.metadata_authority`
 - `swh.journal.objects.metadata_fetcher`
+- `swh.journal.objects.origin_visit_status`
+- `swh.journal.objects.origin_visit`
+- `swh.journal.objects.origin`
 - `swh.journal.objects.raw_extrinsic_metadata`
+- `swh.journal.objects.release`
+- `swh.journal.objects.revision`
+- `swh.journal.objects.skipped_content`
+- `swh.journal.objects.snapshot`

-
-In order to set up a mirror of the graph, one need to deploy a stack capable of
-retrieving all these topics and store their content relialably. For example a
-kafka cluster configured as a replica of the main kafka broker hoste by |swh|
+In order to set up a mirror of the graph, one needs to deploy a stack capable
+of retrieving all these topics and storing their content reliably. For example,
+a Kafka cluster configured as a replica of the main Kafka broker hosted by |swh|
 would do the job (albeit not in a very useful manner by itself).

-A more usable mirror can be set up using the :ref:`Storage `
+A more useful mirror can be set up using the :ref:`storage `
 component with the help of the special service named `replayer` provided by
 the :doc:`apidoc/swh.storage.replay` module.

+
 .. TODO: replace this previous link by a link to the 'swh storage replay'
    command once available, and ideally once
-   https://github.com/sphinx-doc/sphinx/issues/880 is fixed, but humm...
+   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
+

-Replicating the Object Storage
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Mirroring the Object Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-File content (blobs) are **not** embedded in messages of the
-`swh.journal.objects.content` Kafka topic. As these messages do not include the
-file content, another component must be in charge of replicating blob objects
-from the original Software Heritage Archive and inserted in the local object
-storage instance.
+File contents (blobs) are *not* directly stored in messages of the
+`swh.journal.objects.content` Kafka topic, which only contains metadata about
+them, such as various kinds of cryptographic hashes. A separate component is in
+charge of replicating blob objects from the archive and storing them in the
+local object storage instance.

-The idea for this component is to have another `swh-journal` client that
-subscribe to the `swh.journal.objects.content` topic to get the stream of blob
-objects identifiers, then retrieve the blob object from Software Heritage's
-object storage and insert it in the local object storage.
+A separate `swh-journal` client should subscribe to the
+`swh.journal.objects.content` topic to get the stream of blob object
+identifiers, then retrieve the corresponding blobs from the main Software
+Heritage object storage, and store them in the local object storage.

-The proposed implementation for this component is called the :ref:`content
-replayer `.
+A reference implementation for this component is available in the
+:ref:`content replayer `.


 Installation
 ------------

-If using the |swh| software stack to deploy a mirror, a number of
-|swh| software components must be installed.
-
-As shown in the architecture diagram above, one needs to have:
+When using the |swh| software stack to deploy a mirror, a number of |swh|
+software components must be installed (cf.
+architecture diagram above):

-- a database to store the graph of the |swh| Archive,
+- a database to store the graph of the |swh| archive,
 - the :ref:`swh-storage` component,
-- an object storage solution (can be cloud based or on local filesystem like
+- an object storage solution (can be cloud-based or on a local filesystem like
   ZFS pools),
 - the :ref:`swh-objstorage` component,
 - the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage`
   package)
 - the :ref:`swh.objstorage.replayer.replay` service (from the
   :ref:`swh-objstorage-replayer` package).

-As this can be quite complex to set up properly, we provide a `docker-swarm
-`_ based deployment which is provided as
-a working example of the mirror stack:
+A `docker-swarm `_ based deployment
+solution is provided as a working example of the mirror stack:

    https://forge.softwareheritage.org/source/swh-docker

 It is strongly recommended to start from there before planning a
 production-like deployment.

-See the `README
-`_
-file of the `swh-docker
-`_ repository for more
-detailed explanations.
+See the `README `_
+file of the `swh-docker `_
+repository for details.

+.. _kafka: https://kafka.apache.org/
 .. _msgpack: https://msgpack.org
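
The blob replication flow described in the "Mirroring the Object Storage" section (read blob identifiers from the content topic, fetch each blob from the main object storage, store it locally) can be sketched as follows. This is a minimal sketch under stated assumptions, not the swh-objstorage-replayer implementation: the function names `topic_for` and `replay_contents` are hypothetical, plain dictionaries stand in for both object storages, a list of already-decoded messages stands in for the Kafka consumer, and the `sha1` message key is assumed only because the text says content messages carry cryptographic hashes.

```python
import hashlib
from typing import Dict, Iterable, List

# Prefix shared by all the topic names listed in the patch.
TOPIC_PREFIX = "swh.journal.objects"


def topic_for(object_type: str) -> str:
    """Build the journal topic name for a data-model object type."""
    return f"{TOPIC_PREFIX}.{object_type}"


def replay_contents(
    messages: Iterable[dict],
    source: Dict[bytes, bytes],
    destination: Dict[bytes, bytes],
) -> List[bytes]:
    """Copy the blobs referenced by content messages into the local storage.

    `messages` stands in for decoded messages of the content topic,
    `source` for the main object storage and `destination` for the
    mirror's object storage (all hypothetical stand-ins).
    Returns the identifiers that were not found in `source`.
    """
    missing: List[bytes] = []
    for message in messages:
        blob_id = message["sha1"]  # assumed key name, see lead-in
        if blob_id in destination:
            continue  # already mirrored, nothing to do
        blob = source.get(blob_id)
        if blob is None:
            missing.append(blob_id)  # keep track of it for a later retry
            continue
        destination[blob_id] = blob
    return missing


if __name__ == "__main__":
    blob = b"print('hello')\n"
    blob_id = hashlib.sha1(blob).digest()
    main_objstorage = {blob_id: blob}
    local_objstorage: Dict[bytes, bytes] = {}
    not_found = replay_contents(
        [{"sha1": blob_id}], main_objstorage, local_objstorage
    )
    print(topic_for("content"), len(local_objstorage), len(not_found))
```

Skipping identifiers already present locally makes the loop safe to run again over the same messages, which matters when a journal client restarts and re-reads part of a topic.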