diff --git a/docs/images/mirror-architecture.svg b/docs/images/mirror-architecture.svg
new file mode 100644
--- /dev/null
+++ b/docs/images/mirror-architecture.svg
@@ -0,0 +1,2312 @@
[SVG markup omitted. Text labels in the figure: Web App; Storage API; Object Storage; ObjStorage API; GraReplayer; ObjReplayer; Service; Journal; "content", *; Mirror; Software Heritage. Legend: Insertion HTTP request; Get HTTP request; SELECT SQL request; INSERT SQL request; Service; Kafka topics.]
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,17 +8,20 @@
 Getting started
 ---------------
 
-* :ref:`getting-started` ← start here to get your own Software Heritage
-  platform running in less than 5 minutes, or
-* :ref:`developer-setup` ← here to hack on the Software Heritage software
-  stack
+* :ref:`getting-started` → deploy a local copy of the Software Heritage
+  software stack in less than 5 minutes, or
+* :ref:`developer-setup` → get a working development setup for hacking
+  on the Software Heritage software stack
 
 
 Architecture
 ------------
 
-* :ref:`architecture` ← go there to have a glimpse on the Software Heritage software
+* :ref:`architecture` → get a glimpse of the Software Heritage software
   architecture
+* :ref:`mirror` → learn what a Software Heritage mirror is and how to set
+  one up
+
 
 
 Components
@@ -153,6 +156,7 @@
    getting-started
    developer-setup
    journal
+   mirror
    API documentation
    swh.core
    swh.dataset
diff --git a/docs/mirror.rst b/docs/mirror.rst
new file mode 100644
--- /dev/null
+++ b/docs/mirror.rst
@@ -0,0 +1,132 @@
+.. _mirror:
+
+
+Mirroring
+=========
+
+
+Description
+-----------
+
+A mirror is a full copy of the |swh| archive, operated independently from the
+Software Heritage initiative. A minimal mirror consists of two parts:
+
+- the graph storage (typically an instance of :ref:`swh.storage `),
+  which contains the Merkle DAG structure of the archive, *except* the
+  actual content of source code files (AKA blobs),
+
+- the object storage (typically an instance of :ref:`swh.objstorage `),
+  which contains all the blobs corresponding to archived source code files.
+
+However, a usable mirror also needs to be accessible by others. As such, a
+proper mirror should also allow users to:
+
+- navigate the archive copy using a Web browser and/or the Web API (typically
+  using :ref:`the web application `),
+- retrieve data from the copy of the archive (typically using :ref:`the
+  vault service `).
+
+A mirror is initially populated and then kept up to date by consuming data
+from the |swh| Kafka-based :ref:`journal ` and retrieving the
+blob objects (file content) from the |swh| :ref:`object storage `.
+
+.. note:: A mirror is not required to be deployed using the |swh| software
+   stack. Other technologies, including different storage methods, can be
+   used. This documentation, however, focuses on deploying a mirror using
+   the |swh| software stack.
+
+
+.. thumbnail:: images/mirror-architecture.svg
+
+   General view of the |swh| mirroring architecture.
+
+
+Mirroring the Graph Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The replication of the graph is based on a journal using Kafka_ as the event
+streaming platform.
+
+On the Software Heritage side, every addition made to the archive consists of
+the addition of a :ref:`data-model` object. The new object is also serialized
+as a msgpack_ bytestring, which is used as the value of a message added to a
+Kafka topic dedicated to the object type.
+
+The main Kafka topics for the |swh| :ref:`data-model` are:
+
+- `swh.journal.objects.content`
+- `swh.journal.objects.directory`
+- `swh.journal.objects.metadata_authority`
+- `swh.journal.objects.metadata_fetcher`
+- `swh.journal.objects.origin_visit_status`
+- `swh.journal.objects.origin_visit`
+- `swh.journal.objects.origin`
+- `swh.journal.objects.raw_extrinsic_metadata`
+- `swh.journal.objects.release`
+- `swh.journal.objects.revision`
+- `swh.journal.objects.skipped_content`
+- `swh.journal.objects.snapshot`
+
+In order to set up a mirror of the graph, one needs to deploy a stack capable
+of retrieving all these topics and storing their content reliably. For example, a
+Kafka cluster configured as a replica of the main Kafka broker hosted by |swh|
+would do the job (albeit not in a very useful manner by itself).
+
+A more useful mirror can be set up using the :ref:`storage `
+component with the help of the special service named `replayer` provided by the
+:doc:`apidoc/swh.storage.replay` module.
+
+.. TODO: replace this previous link by a link to the 'swh storage replay'
+   command once available, and ideally once
+   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
+
+
+Mirroring the Object Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+File contents (blobs) are *not* directly stored in messages of the
+`swh.journal.objects.content` Kafka topic, which only contains metadata about
+them, such as various kinds of cryptographic hashes. A separate component is in
+charge of replicating blob objects from the archive and storing them in the
+local object storage instance.
+
+A separate `swh-journal` client should subscribe to the
+`swh.journal.objects.content` topic to get the stream of blob object
+identifiers, then retrieve the corresponding blobs from the main Software Heritage
+object storage, and store them in the local object storage.
+
+A reference implementation for this component is available in the
+:ref:`content replayer `.
+
+
+Installation
+------------
+
+When using the |swh| software stack to deploy a mirror, a number of |swh|
+software components must be installed (see the architecture diagram above):
+
+- a database to store the graph of the |swh| archive,
+- the :ref:`swh-storage` component,
+- an object storage solution (can be cloud-based or on a local filesystem such as
+  ZFS pools),
+- the :ref:`swh-objstorage` component,
+- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage`
+  package),
+- the :ref:`swh.objstorage.replayer.replay` service (from the
+  :ref:`swh-objstorage-replayer` package).
+
+A `docker-swarm `_ based deployment
+solution is provided as a working example of the mirror stack:
+
+   https://forge.softwareheritage.org/source/swh-docker
+
+It is strongly recommended to start from there before planning a
+production-like deployment.
+
+See the `README `_
+file of the `swh-docker `_
+repository for details.
+
+
+.. _kafka: https://kafka.apache.org/
+.. _msgpack: https://msgpack.org