diff --git a/docs/architecture.rst b/docs/architecture.rst
--- a/docs/architecture.rst
+++ b/docs/architecture.rst
@@ -1,92 +1,3 @@
-.. _architecture:
+:orphan:
 
-Software Architecture
-=====================
-
-From an end-user point of view, the |swh| platform consists in the
-:term:`archive`, which can be accessed using the web interface or its REST API.
-Behind the scene (and the web app) are several components that expose
-different aspects of the |swh| :term:`archive` as internal RPC APIs.
-
-Each of these internal APIs have a dedicated (Postgresql) database.
-
-A global (and incomplete) view of this architecture looks like:
-
-.. thumbnail:: images/general-architecture.svg
-
-   General view of the |swh| architecture.
-
-The front API components are:
-
-- :ref:`Storage API ` (including the Metadata Storage)
-- :ref:`Deposit API `
-- :ref:`Vault API `
-- :ref:`Indexer API `
-- :ref:`Scheduler API `
-
-On the back stage of this show, a celery_ based game of tasks and workers
-occurs to perform all the required work to fill, maintain and update the |swh|
-:term:`archive`.
-
-The main components involved in this choreography are:
-
-- :term:`Listers `: a lister is a type of task aiming at scraping a
-  web site, a forge, etc. to gather all the source code repositories it can
-  find. For each found source code repository, a :term:`loader` task is
-  created.
-
-- :term:`Loaders `: a loader is a type of task aiming at importing or
-  updating a source code repository. It is the one that inserts :term:`blob`
-  objects in the :term:`object storage`, and inserts nodes and edges in the
-  :ref:`graph `.
-
-- :term:`Indexers `: an indexer is a type of task aiming at crawling
-  the content of the :term:`archive` to extract derived information (mimetype,
-  etc.)
-
-- :term:`Vault `: this type of celery task is responsible for cooking a
-  compressed archive (zip or tgz) of an archived object (typically a directory
-  or a repository). Since this can be a rather long process, it is delegated to
-  an asynchronous (celery) task.
-
-
-Tasks
------
-
-Listers
-+++++++
-
-The following sequence diagram shows the interactions between these components
-when a new forge needs to be archived. This example depicts the case of a
-gitlab_ forge, but any other supported source type would be very similar.
-
-.. thumbnail:: images/tasks-lister.svg
-
-As one might observe in this diagram, it does two things:
-
-- it asks the forge (a gitlab_ instance in this case) the list of known
-  repositories, and
-
-- it insert one :term:`loader` task for each source code repository that will
-  be in charge of importing the content of that repository.
-
-Note that most listers usually work in incremental mode, meaning they store in a
-dedicated database the current state of the listing of the forge. Then, on a subsequent
-execution of the lister, it will ask only for new repositories.
-
-Also note that if the lister inserts a new loading task for a repository for which a
-loading task already exists, the existing task will be updated (if needed) instead of
-creating a new task.
-
-Loaders
-+++++++
-
-The sequence diagram below describe this second step of importing the content
-of a repository. Once again, we take the example of a git repository, but any
-other type of repository would be very similar.
-
-.. thumbnail:: images/tasks-git-loader.svg
-
-
-.. _celery: https://www.celeryproject.org
-.. _gitlab: https://gitlab.com
+This page was moved to: :ref:`architecture-overview`.
diff --git a/docs/architecture/index.rst b/docs/architecture/index.rst
--- a/docs/architecture/index.rst
+++ b/docs/architecture/index.rst
@@ -1,10 +1,13 @@
-Architecture
-============
+.. _architecture:
+
+Software Architecture
+=====================
+
 
 .. toctree::
    :maxdepth: 2
    :titlesonly:
 
-   ../architecture
-   ../mirror
+   overview
+   mirror
    ../keycloak/index
diff --git a/docs/mirror.rst b/docs/architecture/mirror.rst
copy from docs/mirror.rst
copy to docs/architecture/mirror.rst
--- a/docs/mirror.rst
+++ b/docs/architecture/mirror.rst
@@ -130,3 +130,4 @@
 
 .. _Kafka: https://kafka.apache.org/
 .. _msgpack: https://msgpack.org
+
diff --git a/docs/architecture.rst b/docs/architecture/overview.rst
copy from docs/architecture.rst
copy to docs/architecture/overview.rst
--- a/docs/architecture.rst
+++ b/docs/architecture/overview.rst
@@ -1,7 +1,8 @@
-.. _architecture:
+.. _architecture-overview:
+
+Software Architecture Overview
+==============================
 
-Software Architecture
-=====================
 
 From an end-user point of view, the |swh| platform consists in the
 :term:`archive`, which can be accessed using the web interface or its REST API.
@@ -90,3 +91,4 @@
 
 .. _celery: https://www.celeryproject.org
 .. _gitlab: https://gitlab.com
+
diff --git a/docs/mirror.rst b/docs/mirror.rst
--- a/docs/mirror.rst
+++ b/docs/mirror.rst
@@ -1,132 +1,3 @@
-.. _mirror:
+:orphan:
 
-
-Mirroring
-=========
-
-
-Description
------------
-
-A mirror is a full copy of the |swh| archive, operated independently from the
-Software Heritage initiative. A minimal mirror consists of two parts:
-
-- the graph storage (typically an instance of :ref:`swh.storage `),
-  which contains the Merkle DAG structure of the archive, *except* the
-  actual content of source code files (AKA blobs),
-
-- the object storage (typically an instance of :ref:`swh.objstorage `),
-  which contains all the blobs corresponding to archived source code files.
-
-However, a usable mirror needs also to be accessible by others. As such, a
-proper mirror should also allow to:
-
-- navigate the archive copy using a Web browser and/or the Web API (typically
-  using the :ref:`the web application `),
-- retrieve data from the copy of the archive (typically using the :ref:`the
-  vault service `)
-
-A mirror is initially populated and maintained up-to-date by consuming data
-from the |swh| Kafka-based :ref:`journal ` and retrieving the
-blob objects (file content) from the |swh| :ref:`object storage `.
-
-.. note:: It is not required that a mirror is deployed using the |swh| software
-   stack. Other technologies, including different storage methods, can be
-   used. But we will focus in this documentation to the case of mirror
-   deployment using the |swh| software stack.
-
-
-.. thumbnail:: images/mirror-architecture.svg
-
-   General view of the |swh| mirroring architecture.
-
-
-Mirroring the Graph Storage
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The replication of the graph is based on a journal using Kafka_ as event
-streaming platform.
-
-On the Software Heritage side, every addition made to the archive consist of
-the addition of a :ref:`data-model` object. The new object is also serialized
-as a msgpack_ bytestring which is used as the value of a message added to a
-Kafka topic dedicated to the object type.
-
-The main Kafka topics for the |swh| :ref:`data-model` are:
-
-- `swh.journal.objects.content`
-- `swh.journal.objects.directory`
-- `swh.journal.objects.metadata_authority`
-- `swh.journal.objects.metadata_fetcher`
-- `swh.journal.objects.origin_visit_status`
-- `swh.journal.objects.origin_visit`
-- `swh.journal.objects.origin`
-- `swh.journal.objects.raw_extrinsic_metadata`
-- `swh.journal.objects.release`
-- `swh.journal.objects.revision`
-- `swh.journal.objects.skipped_content`
-- `swh.journal.objects.snapshot`
-
-In order to set up a mirror of the graph, one needs to deploy a stack capable
-of retrieving all these topics and store their content reliably. For example a
-Kafka cluster configured as a replica of the main Kafka broker hosted by |swh|
-would do the job (albeit not in a very useful manner by itself).
-
-A more useful mirror can be set up using the :ref:`storage `
-component with the help of the special service named `replayer` provided by the
-:doc:`apidoc/swh.storage.replay` module.
-
-.. TODO: replace this previous link by a link to the 'swh storage replay'
-   command once available, and ideally once
-   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
-
-
-Mirroring the Object Storage
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-File content (blobs) are *not* directly stored in messages of the
-`swh.journal.objects.content` Kafka topic, which only contains metadata about
-them, such as various kinds of cryptographic hashes. A separate component is in
-charge of replicating blob objects from the archive and stored them in the
-local object storage instance.
-
-A separate `swh-journal` client should subscribe to the
-`swh.journal.objects.content` topic to get the stream of blob objects
-identifiers, then retrieve corresponding blobs from the main Software Heritage
-object storage, and store them in the local object storage.
-
-A reference implementation for this component is available in
-:ref:`content replayer `.
-
-
-Installation
-------------
-
-When using the |swh| software stack to deploy a mirror, a number of |swh|
-software components must be installed (cf. architecture diagram above):
-
-- a database to store the graph of the |swh| archive,
-- the :ref:`swh-storage` component,
-- an object storage solution (can be cloud-based or on local filesystem like
-  ZFS pools),
-- the :ref:`swh-objstorage` component,
-- the :ref:`swh.storage.replay` service (part of the :ref:`swh-storage`
-  package)
-- the :ref:`swh.objstorage.replayer.replay` service (from the
-  :ref:`swh-objstorage-replayer` package).
-
-A `docker-swarm `_ based deployment
-solution is provided as a working example of the mirror stack:
-
-   https://forge.softwareheritage.org/source/swh-docker
-
-It is strongly recommended to start from there before planning a
-production-like deployment.
-
-See the `README `_
-file of the `swh-docker `_
-repository for details.
-
-
-.. _Kafka: https://kafka.apache.org/
-.. _msgpack: https://msgpack.org
+This page was moved to: :ref:`mirror`.
diff --git a/swh/docs/sphinx/conf.py b/swh/docs/sphinx/conf.py
--- a/swh/docs/sphinx/conf.py
+++ b/swh/docs/sphinx/conf.py
@@ -133,6 +133,8 @@
     "swh-deposit/metadata": "api/metadata.html",
     "swh-deposit/specs/blueprint": "../api/use-cases.html",
     "swh-deposit/user-manual": "api/user-manual.html",
+    "architecture": "architecture/overview.html",
+    "mirror": "architecture/mirror.html",
 }