diff --git a/docs/architecture/index.rst b/docs/architecture/index.rst
--- a/docs/architecture/index.rst
+++ b/docs/architecture/index.rst
@@ -9,5 +9,4 @@
    :titlesonly:
 
    overview
-   mirror
    metadata
diff --git a/docs/architecture/mirror.rst b/docs/architecture/mirror.rst
deleted file mode 100644
--- a/docs/architecture/mirror.rst
+++ /dev/null
@@ -1,134 +0,0 @@
-.. _mirror:
-
-
-Mirroring
-=========
-
-
-Description
------------
-
-A mirror is a full copy of the |swh| archive, operated independently from the
-Software Heritage initiative. A minimal mirror consists of two parts:
-
-- the graph storage (typically an instance of :ref:`swh.storage `),
-  which contains the Merkle DAG structure of the archive, *except* the
-  actual content of source code files (AKA blobs),
-
-- the object storage (typically an instance of :ref:`swh.objstorage `),
-  which contains all the blobs corresponding to archived source code files.
-
-However, a usable mirror needs also to be accessible by others. As such, a
-proper mirror should also allow to:
-
-- navigate the archive copy using a Web browser and/or the Web API (typically
-  using the :ref:`the web application `),
-- retrieve data from the copy of the archive (typically using the :ref:`the
-  vault service `)
-
-A mirror is initially populated and maintained up-to-date by consuming data
-from the |swh| Kafka-based :ref:`journal ` and retrieving the
-blob objects (file content) from the |swh| :ref:`object storage `.
-
-.. note:: It is not required that a mirror is deployed using the |swh| software
-   stack. Other technologies, including different storage methods, can be
-   used. But we will focus in this documentation to the case of mirror
-   deployment using the |swh| software stack.
-
-
-.. thumbnail:: ../images/mirror-architecture.svg
-
-   General view of the |swh| mirroring architecture.
-
-
-Mirroring the Graph Storage
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The replication of the graph is based on a journal using Kafka_ as event
-streaming platform.
-
-On the Software Heritage side, every addition made to the archive consist of
-the addition of a :ref:`data-model` object. The new object is also serialized
-as a msgpack_ bytestring which is used as the value of a message added to a
-Kafka topic dedicated to the object type.
-
-The main Kafka topics for the |swh| :ref:`data-model` are:
-
-- `swh.journal.objects.content`
-- `swh.journal.objects.directory`
-- `swh.journal.objects.extid`
-- `swh.journal.objects.metadata_authority`
-- `swh.journal.objects.metadata_fetcher`
-- `swh.journal.objects.origin_visit_status`
-- `swh.journal.objects.origin_visit`
-- `swh.journal.objects.origin`
-- `swh.journal.objects.raw_extrinsic_metadata`
-- `swh.journal.objects.release`
-- `swh.journal.objects.revision`
-- `swh.journal.objects.skipped_content`
-- `swh.journal.objects.snapshot`
-
-In order to set up a mirror of the graph, one needs to deploy a stack capable
-of retrieving all these topics and store their content reliably. For example a
-Kafka cluster configured as a replica of the main Kafka broker hosted by |swh|
-would do the job (albeit not in a very useful manner by itself).
-
-A more useful mirror can be set up using the :ref:`storage `
-component with the help of the special service named `replayer` provided by the
-:mod:`swh.storage.replay` module.
-
-.. TODO: replace this previous link by a link to the 'swh storage replay'
-   command once available, and ideally once
-   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
-
-
-Mirroring the Object Storage
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-File content (blobs) are *not* directly stored in messages of the
-`swh.journal.objects.content` Kafka topic, which only contains metadata about
-them, such as various kinds of cryptographic hashes. A separate component is in
-charge of replicating blob objects from the archive and stored them in the
-local object storage instance.
-
-A separate `swh-journal` client should subscribe to the
-`swh.journal.objects.content` topic to get the stream of blob objects
-identifiers, then retrieve corresponding blobs from the main Software Heritage
-object storage, and store them in the local object storage.
-
-A reference implementation for this component is available in
-:ref:`content replayer `.
-
-
-Installation
-------------
-
-When using the |swh| software stack to deploy a mirror, a number of |swh|
-software components must be installed (cf. architecture diagram above):
-
-- a database to store the graph of the |swh| archive,
-- the :ref:`swh-storage` component,
-- an object storage solution (can be cloud-based or on local filesystem like
-  ZFS pools),
-- the :ref:`swh-objstorage` component,
-- the :mod:`swh.storage.replay` service (part of the :ref:`swh-storage`
-  package)
-- the :mod:`swh.objstorage.replayer.replay` service (from the
-  :ref:`swh-objstorage-replayer` package).
-
-A `docker-swarm `_ based deployment
-solution is provided as a working example of the mirror stack:
-
-  https://forge.softwareheritage.org/source/swh-docker
-
-It is strongly recommended to start from there before planning a
-production-like deployment.
-
-See the `README `_
-file of the `swh-docker `_
-repository for details.
-
-
-.. _Kafka: https://kafka.apache.org/
-.. _msgpack: https://msgpack.org
-
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -26,12 +26,8 @@
 * :ref:`architecture-overview` → get a glimpse of the Software Heritage software
   architecture
-* :ref:`mirror` → learn what a Software Heritage mirror is and how to set up
-  one
 * :ref:`Metadata workflow ` → learn how Software Heritage stores and handles metadata
-* :ref:`Keycloak ` → learn how to use Keycloak,
-  the authentication system used by |swh|'s web interface and public APIs
 
 Data Model and Specifications
 -----------------------------
diff --git a/docs/images/mirror-architecture.svg b/sysadm/images/mirror-architecture.svg
rename from docs/images/mirror-architecture.svg
rename to sysadm/images/mirror-architecture.svg
diff --git a/sysadm/mirror-operations/content-replayer.rst b/sysadm/mirror-operations/content-replayer.rst
deleted file mode 100644
--- a/sysadm/mirror-operations/content-replayer.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-.. _content_replayer:
-
-Content Replayer Service
-========================
-
-.. todo::
-   This page is a work in progress.
diff --git a/sysadm/mirror-operations/deploy.rst b/sysadm/mirror-operations/deploy.rst
--- a/sysadm/mirror-operations/deploy.rst
+++ b/sysadm/mirror-operations/deploy.rst
@@ -9,20 +9,26 @@
 A mirror deployment will consists in running several components of the |swh|
 stack:
 
-- an instance of the storage (swh-storage) with its backend storage (PostgreSQL
-  or Cassandra),
-- an instance of the object storage (swh-objstorage) with its backend storage
-  solution (in-house with the `pathslicer` backend, or cloud based)
-- an instance of the front page (swh-web)
-- an instance of the search engine (swh-search)
-- the vault service and its support tooling,
-- the replayer services.
+- An instance of the storage (:ref:`swh-storage`);
+- A backend database (PostgreSQL or Cassandra) for the storage;
+- An instance of the object storage (:ref:`swh-objstorage`);
+- A large storage system (zfs or cloud storage) as the objstorage backend;
+- An instance of the frontend (:ref:`swh-web`);
+- [Optional] An instance of the search engine backend (:ref:`swh-search`);
+- [Optional] An elasticsearch instance as swh-search backend;
+- [Optional] The vault service and its support tooling (RabbitMQ,
+  :ref:`swh-scheduler`, :ref:`swh-vault`, ...);
+- The replayer services:
+
+  - :mod:`swh.storage.replay` service (part of the :ref:`swh-storage`
+    package)
+  - :mod:`swh.objstorage.replayer.replay` service (from the
+    :ref:`swh-objstorage-replayer` package)
 
 Each service consists in an HTTP-based RPC served by a `gunicorn `_ `WSGI `_ server.
 
-
 Docker-based deployment
 -----------------------
@@ -36,55 +42,8 @@
 It is strongly recommended to :ref:`start from there ` in a test environment before planning a production-like deployment.
-
-Step by step deployment of a mirror
------------------------------------
-
-When using the |swh| software stack to deploy a mirror, a number of |swh|
-software components must be installed and configured to interact woth each other:
-
-#. :ref:`How to deploy the objstorage `: the objstorage
-   consists in an object storage solution (can be cloud-based or on local
-   filesystem like ZFS pools) and the :ref:`swh-objstorage` service,
-
-#. :ref:`How to deploy graph replayer services `:
-   :mod:`swh-devel:swh.objstorage.replayer.replay` service is responsible for
-   consuming the ``content`` topic from the |swh| kafka broker and filling the mirror
-   objstorage, retrieving blob objects from a |swh| objstarage,
-
-#. :ref:`How to deploy the storage `: the storage consists in a
-   database to store the graph of the |swh| archive (PostgreSQL or Cassandra)
-   and the :ref:`swh-devel:swh-storage` service,
-
-#. :ref:`How to deploy graph replayer services `:
-   :mod:`swh-devel:swh.storage.replay` service is responsible for consuming from
-   the |swh| kafka broker and fill the mirror storage,
-
-#. :ref:`How to deploy the frontend `: the :ref:`frontend
-   ` consists in a `django `_
-   based application serving both the web API and the main UI for browsing the
-   Archive.
-
-#. :ref:`How to deploy the search engine `: the :ref:`search engine
-   ` consists in a `ElasticSearch `_
-   based application used by the frontend.
-
-#. :ref:`How to deploy the vault service `: the :ref:`vault
-   service ` consists in a backend asynchronous service
-   allowing the user to ask for a zip archive of a given repository or git
-   history.
-
-
-
 .. toctree::
    :titlesonly:
    :hidden:
 
    docker
-   objstorage
-   storage
-   content-replayer
-   graph-replayer
-   frontend
-   search
-   vault
diff --git a/sysadm/mirror-operations/frontend.rst b/sysadm/mirror-operations/frontend.rst
deleted file mode 100644
--- a/sysadm/mirror-operations/frontend.rst
+++ /dev/null
@@ -1,8 +0,0 @@
-.. _mirror_frontend:
-
-Frontend Services
-=================
-
-
-.. todo::
-   This page is a work in progress.
diff --git a/sysadm/mirror-operations/graph-replayer.rst b/sysadm/mirror-operations/graph-replayer.rst
deleted file mode 100644
--- a/sysadm/mirror-operations/graph-replayer.rst
+++ /dev/null
@@ -1,8 +0,0 @@
-.. _mirror_graph_replayer:
-
-Graph Replayer Service
-======================
-
-
-.. todo::
-   This page is a work in progress.
diff --git a/sysadm/mirror-operations/index.rst b/sysadm/mirror-operations/index.rst
--- a/sysadm/mirror-operations/index.rst
+++ b/sysadm/mirror-operations/index.rst
@@ -1,31 +1,122 @@
 .. _mirror_operations:
+
 
 Mirror Operations
 =================
 
+Description
+-----------
+
 A mirror is a full copy of the |swh| archive, operated independently from the
-Software Heritage initiative.
+Software Heritage initiative. A minimal mirror consists of two parts:
+
+- the graph storage (typically an instance of :ref:`swh.storage `),
+  which contains the Merkle DAG structure of the archive, *except* the
+  actual content of source code files (AKA blobs),
+
+- the object storage (typically an instance of :ref:`swh.objstorage `),
+  which contains all the blobs corresponding to archived source code files.
+
+However, a usable mirror needs also to be accessible by others. As such, a
+proper mirror should also allow to:
+
+- navigate the archive copy using a Web browser and/or the Web API (typically
+  using the :ref:`the web application `),
+- retrieve data from the copy of the archive (typically using the :ref:`the
+  vault service `)
+
+A mirror is initially populated and maintained up-to-date by consuming data
+from the |swh| Kafka-based :ref:`journal ` and retrieving the
+blob objects (file content) from the |swh| :ref:`object storage `.
+
+.. note:: It is not required that a mirror is deployed using the |swh| software
+   stack. Other technologies, including different storage methods, can be
+   used. But we will focus in this documentation to the case of mirror
+   deployment using the |swh| software stack.
+
+
+.. thumbnail:: ../images/mirror-architecture.svg
+
+   General view of the |swh| mirroring architecture.
+
+Mirroring the Graph Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The replication of the graph is based on a journal using Kafka_ as event
+streaming platform.
+
+On the Software Heritage side, every addition made to the archive consist of
+the addition of a :ref:`data-model` object. The new object is also serialized
+as a msgpack_ bytestring which is used as the value of a message added to a
+Kafka topic dedicated to the object type.
+
+The main Kafka topics for the |swh| :ref:`data-model` are:
+
+- `swh.journal.objects.content`
+- `swh.journal.objects.directory`
+- `swh.journal.objects.extid`
+- `swh.journal.objects.metadata_authority`
+- `swh.journal.objects.metadata_fetcher`
+- `swh.journal.objects.origin_visit_status`
+- `swh.journal.objects.origin_visit`
+- `swh.journal.objects.origin`
+- `swh.journal.objects.raw_extrinsic_metadata`
+- `swh.journal.objects.release`
+- `swh.journal.objects.revision`
+- `swh.journal.objects.skipped_content`
+- `swh.journal.objects.snapshot`
+
+In order to set up a mirror of the graph, one needs to deploy a stack capable
+of retrieving all these topics and store their content reliably. For example a
+Kafka cluster configured as a replica of the main Kafka broker hosted by |swh|
+would do the job (albeit not in a very useful manner by itself).
+
+A more useful mirror can be set up using the :ref:`storage `
+component with the help of the special service named `replayer` provided by the
+:mod:`swh.storage.replay` module.
+
+.. TODO: replace this previous link by a link to the 'swh storage replay'
+   command once available, and ideally once
+   https://github.com/sphinx-doc/sphinx/issues/880 is fixed
+
+
+Mirroring the Object Storage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+File content (blobs) are *not* directly stored in messages of the
+`swh.journal.objects.content` Kafka topic, which only contains metadata about
+them, such as various kinds of cryptographic hashes. A separate component is in
+charge of replicating blob objects from the archive and stored them in the
+local object storage instance.
+
+A separate `swh-journal` client should subscribe to the
+`swh.journal.objects.content` topic to get the stream of blob objects
+identifiers, then retrieve corresponding blobs from the main Software Heritage
+object storage, and store them in the local object storage.
+
+A reference implementation for this component is available in
+:ref:`content replayer `.
 
-A mirror should be able to:
 
-- store a full copy of the archive,
+Installation
+------------
 
-- serve the data using the web UI,
+When using the |swh| software stack to deploy a mirror, a number of |swh|
+software components must be installed (cf. architecture diagram above).
 
-- search the archive using the web UI,
+A `docker-swarm `_ based deployment
+solution is provided as a working example of the mirror stack,
+see :ref:`mirror_deploy`.
 
-- serve the data using the public API,
+It is strongly recommended to start from there before planning a
+production-like deployment.
 
-- allow users to retrieve content from the archive using the :ref:`Vault
-  ` service.
+.. _Kafka: https://kafka.apache.org/
+.. _msgpack: https://msgpack.org
 
-See the :ref:`swh-devel:mirror` for a complete description of the mirror
-architecture.
 
-You may want to read:
+You may also want to read:
 
-- :ref:`mirror_deploy` if you want to deploy a mirror of the |swh| archive on
-  your infrastructure.
 - :ref:`mirror_monitor` to learn how to monitor your mirror and how to report
   its health back the |swh|.
 - :ref:`mirror_onboard` for the |swh| side view of adding a new mirror.
diff --git a/sysadm/mirror-operations/objstorage.rst b/sysadm/mirror-operations/objstorage.rst
deleted file mode 100644
--- a/sysadm/mirror-operations/objstorage.rst
+++ /dev/null
@@ -1,8 +0,0 @@
-.. _mirror_objstorage:
-
-Objstorage Service
-==================
-
-
-.. todo::
-   This page is a work in progress.
diff --git a/sysadm/mirror-operations/search.rst b/sysadm/mirror-operations/search.rst
deleted file mode 100644
--- a/sysadm/mirror-operations/search.rst
+++ /dev/null
@@ -1,8 +0,0 @@
-.. _mirror_search:
-
-Search Services
-===============
-
-
-.. todo::
-   This page is a work in progress.
diff --git a/sysadm/mirror-operations/storage.rst b/sysadm/mirror-operations/storage.rst
deleted file mode 100644
--- a/sysadm/mirror-operations/storage.rst
+++ /dev/null
@@ -1,8 +0,0 @@
-.. _mirror_storage:
-
-Storage Services
-================
-
-
-.. todo::
-   This page is a work in progress.
diff --git a/sysadm/mirror-operations/vault.rst b/sysadm/mirror-operations/vault.rst
deleted file mode 100644
--- a/sysadm/mirror-operations/vault.rst
+++ /dev/null
@@ -1,8 +0,0 @@
-.. _mirror_vault:
-
-Vault Services
-==============
-
-
-.. todo::
-   This page is a work in progress.