diff --git a/docs/images/mirror-architecture.svg b/docs/images/mirror-architecture.svg new file mode 100644 index 0000000..1cabe68 --- /dev/null +++ b/docs/images/mirror-architecture.svg @@ -0,0 +1,2312 @@ + [SVG markup omitted; diagram text labels: Web App, Storage API, Object Storage, ObjStorage API, GraReplayer, ObjReplayer, Journal, "content", Mirror, Software Heritage; legend: Insertion HTTP request, Get HTTP request, SELECT SQL request, INSERT SQL request, Service, Kafka topics] diff --git a/docs/index.rst b/docs/index.rst index b78fea5..8e6b906 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,174 +1,178 @@ .. 
_swh-docs: Software Heritage - Development Documentation ============================================= Getting started --------------- * :ref:`getting-started` ← start here to get your own Software Heritage platform running in less than 5 minutes, or * :ref:`developer-setup` ← here to hack on the Software Heritage software stack Architecture ------------ * :ref:`architecture` ← go there to have a glimpse on the Software Heritage software architecture +* :ref:`mirror` ← go there to learn what a Software Heritage mirror is and + how to set one up + Components ---------- Here is brief overview of the most relevant software components in the Software Heritage stack. Each component name is linked to the development documentation of the corresponding Python module. :ref:`swh.core ` low-level utilities and helpers used by almost all other modules in the stack :ref:`swh.dataset ` public datasets and periodic data dumps of the archive released by Software Heritage :ref:`swh.deposit ` push-based deposit of software artifacts to the archive swh.docs developer documentation (used to generate this doc you are reading) :ref:`swh.fuse ` Virtual file system to browse the Software Heritage archive, based on `FUSE `_ :ref:`swh.graph ` Fast, compressed, in-memory representation of the archive, with tooling to generate and query it. :ref:`swh.indexer ` tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it :ref:`swh.journal ` persistent logger of changes to the archive, with publish-subscribe support :ref:`swh.lister ` collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.) 
:ref:`swh.loader-core ` low-level loading utilities and helpers used by all other loaders :ref:`swh.loader-git ` loader for `Git `_ repositories :ref:`swh.loader-mercurial ` loader for `Mercurial `_ repositories :ref:`swh.loader-svn ` loader for `Subversion `_ repositories :ref:`swh.model ` implementation of the :ref:`data-model` to archive source code artifacts :ref:`swh.objstorage ` content-addressable object storage :ref:`swh.objstorage.replayer ` Object storage replication tool :ref:`swh.scanner ` source code scanner to analyze code bases and compare them with source code artifacts archived by Software Heritage :ref:`swh.scheduler ` task manager for asynchronous/delayed tasks, used for recurrent (e.g., listing a forge, loading new stuff from a Git repository) and one-off activities (e.g., loading a specific version of a source package) :ref:`swh.search ` search engine for the archive :ref:`swh.storage ` abstraction layer over the archive, allowing to access all stored source code artifacts as well as their metadata :ref:`swh.vault ` implementation of the vault service, allowing to retrieve parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.) :ref:`swh.web ` Web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use :ref:`swh.web.client ` Python client for :ref:`swh.web ` Dependencies ------------ The dependency relationships among the various modules are depicted below. .. _py-deps-swh: .. figure:: images/py-deps-swh.svg :width: 1024px :align: center Dependencies among top-level Python modules (click to zoom). Archive ------- * :ref:`Archive ChangeLog `: notable changes to the archive over time Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * `URLs index `_ * :ref:`search` * :ref:`glossary` .. ensure sphinx does not complain about index files not being included .. 
toctree:: :maxdepth: 2 :caption: Contents: :titlesonly: :hidden: architecture getting-started developer-setup journal + mirror API documentation swh.core swh.dataset swh.deposit swh.fuse swh.graph swh.indexer swh.journal swh.lister swh.loader swh.model swh.objstorage swh.scanner swh.scheduler swh.search swh.storage swh.vault swh.web swh.web.client diff --git a/docs/mirror.rst b/docs/mirror.rst new file mode 100644 index 0000000..145ab6c --- /dev/null +++ b/docs/mirror.rst @@ -0,0 +1,138 @@ +.. highlight:: bash + +.. _mirror: + +Software Heritage Mirror +======================== + +Description +----------- + +A mirror is a full copy of the |swh| Archive. A minimal copy consists of two +parts: + +- the graph storage (typically an instance of :ref:`swh.storage `), +- the object storage (typically an instance of :ref:`swh.objstorage `). + +However, a usable mirror also needs to be accessible. As such, a proper mirror +should also allow users to: + +- navigate the copy of the archive using a web browser (typically using + :ref:`the web application `), +- retrieve data from the copy of the archive (typically using :ref:`the + vault service `) + +A mirror is filled by consuming data from the |swh| Kafka-based :ref:`journal +` and retrieving the blob objects (file content) from the |swh| +:ref:`object storage `. + +.. note:: A mirror of the |swh| Archive is not necessarily implemented using the + |swh| software stack. In this documentation, however, we will describe the + case of a mirror using the |swh| software stack. + + +.. thumbnail:: images/mirror-architecture.svg + + General view of the |swh| mirroring architecture. + +In this documentation, we will focus only on replication mechanisms using the +software stack provided by |swh|. Setting up web services or other storage +methods will not be covered here. + + +Replicating the Graph Storage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The replication of the graph is based on a journal using Kafka as its event +streaming platform. 
+ +On the main Software Heritage side, every addition made to the graph consists of +the insertion of a :ref:`data-model` object. This added object is also +serialized as a msgpack_ bytestring which is used as the value of a Kafka message +in a topic dedicated to the object type. + +Topics for the main part of the |swh| :ref:`data-model` are: + +- `swh.journal.objects.content` +- `swh.journal.objects.skipped_content` +- `swh.journal.objects.directory` +- `swh.journal.objects.revision` +- `swh.journal.objects.release` +- `swh.journal.objects.snapshot` +- `swh.journal.objects.origin` +- `swh.journal.objects.origin_visit` +- `swh.journal.objects.origin_visit_status` + +In addition to these, there are a few topics for :ref:`extrinsic metadata +`: + +- `swh.journal.objects.metadata_authority` +- `swh.journal.objects.metadata_fetcher` +- `swh.journal.objects.raw_extrinsic_metadata` + + +In order to set up a mirror of the graph, one needs to deploy a stack capable of +retrieving all these topics and storing their content reliably. For example, a +Kafka cluster configured as a replica of the main Kafka broker hosted by |swh| +would do the job (albeit not in a very useful manner by itself). + +A more usable mirror can be set up using the :ref:`Storage ` +component with the help of the special service named `replayer` provided by the +:doc:`apidoc/swh.storage.replay` module. +.. TODO: replace this previous link by a link to the 'swh storage replay' + command once available, and ideally once + https://github.com/sphinx-doc/sphinx/issues/880 is fixed, but humm... + +Replicating the Object Storage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +File contents (blobs) are **not** embedded in messages of the +`swh.journal.objects.content` Kafka topic. As these messages do not include the +file content, another component must be in charge of replicating blob objects +from the original Software Heritage Archive and inserting them into the local object +storage instance. 
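The replication step this paragraph calls for can be sketched as follows. This is only an illustration under loud assumptions: plain Python dicts stand in for the remote and local object storages, a list stands in for the stream of `swh.journal.objects.content` messages, the `sha1` message key is assumed, and `replay_contents` is a hypothetical helper rather than part of any |swh| package.

```python
import hashlib
from typing import Dict, Iterable, List


def replay_contents(
    messages: Iterable[dict],
    source: Dict[bytes, bytes],
    destination: Dict[bytes, bytes],
) -> List[bytes]:
    """Copy every blob referenced by `messages` from `source` to `destination`.

    Hypothetical helper: returns the identifiers not found in `source`.
    """
    missing = []
    for message in messages:
        # The message carries an identifier only, never the blob itself.
        obj_id = message["sha1"]  # assumed key name, for illustration
        if obj_id in destination:
            continue  # already replicated, nothing to do
        blob = source.get(obj_id)
        if blob is None:
            missing.append(obj_id)  # report it instead of failing the loop
            continue
        destination[obj_id] = blob
    return missing


# Made-up data standing in for the main archive and the journal topic.
blobs = [b"hello", b"world"]
remote = {hashlib.sha1(b).digest(): b for b in blobs}
unknown_id = hashlib.sha1(b"absent").digest()
journal = [{"sha1": obj_id} for obj_id in remote] + [{"sha1": unknown_id}]

local: Dict[bytes, bytes] = {}
not_found = replay_contents(journal, remote, local)
```

Skipping identifiers already present locally makes the loop safe to run again over the same messages, which matters because a journal consumer may see a message more than once.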
+ +The idea for this component is to have another `swh-journal` client that +subscribes to the `swh.journal.objects.content` topic to get the stream of blob +object identifiers, then retrieves each blob object from Software Heritage's +object storage and inserts it in the local object storage. + +The proposed implementation for this component is called the :ref:`content +replayer `. + + +Installation +------------ + +If using the |swh| software stack to deploy a mirror, a number of +|swh| software components must be installed. + +As shown in the architecture diagram above, one needs to have: + +- a database to store the graph of the |swh| Archive, +- the :ref:`swh-storage` component, +- an object storage solution (can be cloud-based or on a local filesystem such as + ZFS pools), +- the :ref:`swh-objstorage` component, +- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage` + package) +- the :ref:`swh.objstorage.replayer.replay` service (from the + :ref:`swh-objstorage-replayer` package). + +As this can be quite complex to set up properly, we provide a `docker-swarm +`_ based deployment as +a working example of the mirror stack: + + https://forge.softwareheritage.org/source/swh-docker + +It is strongly recommended to start from there before planning a +production-like deployment. + +See the `README +`_ +file of the `swh-docker +`_ repository for more +detailed explanations. + + +.. _msgpack: https://msgpack.org