diff --git a/docs/architecture/index.rst b/docs/architecture/index.rst index 3664ce2..291dbeb 100644 --- a/docs/architecture/index.rst +++ b/docs/architecture/index.rst @@ -1,13 +1,12 @@ .. _architecture: Software Architecture ===================== .. toctree:: :maxdepth: 2 :titlesonly: overview - mirror metadata diff --git a/docs/architecture/mirror.rst b/docs/architecture/mirror.rst deleted file mode 100644 index 7dc88f3..0000000 --- a/docs/architecture/mirror.rst +++ /dev/null @@ -1,134 +0,0 @@ -.. _mirror: - - -Mirroring -========= - - -Description ------------ - -A mirror is a full copy of the |swh| archive, operated independently from the -Software Heritage initiative. A minimal mirror consists of two parts: - -- the graph storage (typically an instance of :ref:`swh.storage `), - which contains the Merkle DAG structure of the archive, *except* the - actual content of source code files (AKA blobs), - -- the object storage (typically an instance of :ref:`swh.objstorage `), - which contains all the blobs corresponding to archived source code files. - -However, a usable mirror needs also to be accessible by others. As such, a -proper mirror should also allow to: - -- navigate the archive copy using a Web browser and/or the Web API (typically - using the :ref:`the web application `), -- retrieve data from the copy of the archive (typically using the :ref:`the - vault service `) - -A mirror is initially populated and maintained up-to-date by consuming data -from the |swh| Kafka-based :ref:`journal ` and retrieving the -blob objects (file content) from the |swh| :ref:`object storage `. - -.. note:: It is not required that a mirror is deployed using the |swh| software - stack. Other technologies, including different storage methods, can be - used. But we will focus in this documentation to the case of mirror - deployment using the |swh| software stack. - - -.. thumbnail:: ../images/mirror-architecture.svg - - General view of the |swh| mirroring architecture. 
- - -Mirroring the Graph Storage -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The replication of the graph is based on a journal using Kafka_ as event -streaming platform. - -On the Software Heritage side, every addition made to the archive consist of -the addition of a :ref:`data-model` object. The new object is also serialized -as a msgpack_ bytestring which is used as the value of a message added to a -Kafka topic dedicated to the object type. - -The main Kafka topics for the |swh| :ref:`data-model` are: - -- `swh.journal.objects.content` -- `swh.journal.objects.directory` -- `swh.journal.objects.extid` -- `swh.journal.objects.metadata_authority` -- `swh.journal.objects.metadata_fetcher` -- `swh.journal.objects.origin_visit_status` -- `swh.journal.objects.origin_visit` -- `swh.journal.objects.origin` -- `swh.journal.objects.raw_extrinsic_metadata` -- `swh.journal.objects.release` -- `swh.journal.objects.revision` -- `swh.journal.objects.skipped_content` -- `swh.journal.objects.snapshot` - -In order to set up a mirror of the graph, one needs to deploy a stack capable -of retrieving all these topics and store their content reliably. For example a -Kafka cluster configured as a replica of the main Kafka broker hosted by |swh| -would do the job (albeit not in a very useful manner by itself). - -A more useful mirror can be set up using the :ref:`storage ` -component with the help of the special service named `replayer` provided by the -:mod:`swh.storage.replay` module. - -.. TODO: replace this previous link by a link to the 'swh storage replay' - command once available, and ideally once - https://github.com/sphinx-doc/sphinx/issues/880 is fixed - - -Mirroring the Object Storage -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -File content (blobs) are *not* directly stored in messages of the -`swh.journal.objects.content` Kafka topic, which only contains metadata about -them, such as various kinds of cryptographic hashes. 
A separate component is in -charge of replicating blob objects from the archive and stored them in the -local object storage instance. - -A separate `swh-journal` client should subscribe to the -`swh.journal.objects.content` topic to get the stream of blob objects -identifiers, then retrieve corresponding blobs from the main Software Heritage -object storage, and store them in the local object storage. - -A reference implementation for this component is available in -:ref:`content replayer `. - - -Installation ------------- - -When using the |swh| software stack to deploy a mirror, a number of |swh| -software components must be installed (cf. architecture diagram above): - -- a database to store the graph of the |swh| archive, -- the :ref:`swh-storage` component, -- an object storage solution (can be cloud-based or on local filesystem like - ZFS pools), -- the :ref:`swh-objstorage` component, -- the :mod:`swh.storage.replay` service (part of the :ref:`swh-storage` - package) -- the :mod:`swh.objstorage.replayer.replay` service (from the - :ref:`swh-objstorage-replayer` package). - -A `docker-swarm `_ based deployment -solution is provided as a working example of the mirror stack: - - https://forge.softwareheritage.org/source/swh-docker - -It is strongly recommended to start from there before planning a -production-like deployment. - -See the `README `_ -file of the `swh-docker `_ -repository for details. - - -.. _Kafka: https://kafka.apache.org/ -.. _msgpack: https://msgpack.org - diff --git a/docs/index.rst b/docs/index.rst index c76b9d5..40326a1 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,218 +1,214 @@ .. 
_swh-docs: Software Heritage - Development Documentation ============================================= Getting started --------------- * :ref:`getting-started` → deploy a local copy of the Software Heritage software stack in less than 5 minutes, or * :ref:`developer-setup` → get a working development setup that allows to hack on the Software Heritage software stack * :ref:`faq` Contributing ------------ * :ref:`patch-submission` → learn how to submit your patches to the Software Heritage codebase * :ref:`code-review` → rules and guidelines to review code in Software Heritage * :ref:`python-style-guide` → how to format the Python code you write Architecture ------------ * :ref:`architecture-overview` → get a glimpse of the Software Heritage software architecture -* :ref:`mirror` → learn what a Software Heritage mirror is and how to set up - one * :ref:`Metadata workflow ` → learn how Software Heritage stores and handles metadata -* :ref:`Keycloak ` → learn how to use Keycloak, - the authentication system used by |swh|'s web interface and public APIs Data Model and Specifications ----------------------------- * :ref:`persistent-identifiers` Specifications of the SoftWare Heritage persistent IDentifiers (SWHID). * :ref:`data-model` Documentation of the main |swh| archive data model. * :ref:`journal-specs` Documentation of the Kafka journal of the |swh| archive. Tutorials --------- * :ref:`testing-guide` * :doc:`/tutorials/issue-debugging-monitoring` * :ref:`Listing the content of your favorite forge ` and :ref:`running a lister in Docker ` * :ref:`Add a new swh package ` * :ref:`doc-contribution` Roadmap ------- * :ref:`roadmap-2021` Engineering ----------- * :ref:`Infrastructure ` Components ---------- Here is brief overview of the most relevant software components in the Software Heritage stack, in alphabetical order. For a better introduction to the architecture, see the :ref:`architecture-overview`, which presents each of them in a didactical order. 
Each component name is linked to the development documentation of the corresponding Python module. :ref:`swh.auth ` low-level library used by modules needing keycloak authentication :ref:`swh.core ` low-level utilities and helpers used by almost all other modules in the stack :ref:`swh.counters ` service providing efficient estimates of the number of objects in the SWH archive, using Redis's Hyperloglog :ref:`swh.dataset ` public datasets and periodic data dumps of the archive released by Software Heritage :ref:`swh.deposit ` push-based deposit of software artifacts to the archive swh.docs developer documentation (used to generate this doc you are reading) :ref:`swh.fuse ` Virtual file system to browse the Software Heritage archive, based on `FUSE `_ :ref:`swh.graph ` Fast, compressed, in-memory representation of the archive, with tooling to generate and query it. :ref:`swh.indexer ` tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it :ref:`swh.journal ` persistent logger of changes to the archive, with publish-subscribe support :ref:`swh.lister ` collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.) 
:ref:`swh.loader-core ` low-level loading utilities and helpers used by all other loaders :ref:`swh.loader-git ` loader for `Git `_ repositories :ref:`swh.loader-mercurial ` loader for `Mercurial `_ repositories :ref:`swh.loader-svn ` loader for `Subversion `_ repositories :ref:`swh.loader-cvs ` loader for `CVS `_ repositories :ref:`swh.model ` implementation of the :ref:`data-model` to archive source code artifacts :ref:`swh.objstorage ` content-addressable object storage :ref:`swh.objstorage.replayer ` Object storage replication tool :ref:`swh.perfecthash ` Low level management for read-only content-addressable object storage indexed with a perfect hash table :ref:`swh.scanner ` source code scanner to analyze code bases and compare them with source code artifacts archived by Software Heritage :ref:`swh.scheduler ` task manager for asynchronous/delayed tasks, used for recurrent (e.g., listing a forge, loading new stuff from a Git repository) and one-off activities (e.g., loading a specific version of a source package) :ref:`swh.search ` search engine for the archive :ref:`swh.storage ` abstraction layer over the archive, allowing to access all stored source code artifacts as well as their metadata :ref:`swh.vault ` implementation of the vault service, allowing to retrieve parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.) :ref:`swh.web ` Web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use :ref:`swh.web.client ` Python client for :ref:`swh.web ` Dependencies ------------ The dependency relationships among the various modules are depicted below. .. _py-deps-swh: .. figure:: images/py-deps-swh.svg :width: 1024px :align: center Dependencies among top-level Python modules (click to zoom). 
Archive ------- * :ref:`Archive ChangeLog `: notable changes to the archive over time Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * `URLs index `_ * :ref:`search` * :ref:`glossary` .. ensure sphinx does not complain about index files not being included .. toctree:: :maxdepth: 2 :caption: Contents: :titlesonly: :hidden: getting-started/index architecture/index contributing/index tutorials/index faq/index roadmap/roadmap-2021 api-reference archive-changelog journal Python modules autodocumentation diff --git a/docs/images/mirror-architecture.svg b/sysadm/images/mirror-architecture.svg similarity index 100% rename from docs/images/mirror-architecture.svg rename to sysadm/images/mirror-architecture.svg diff --git a/sysadm/mirror-operations/content-replayer.rst b/sysadm/mirror-operations/content-replayer.rst deleted file mode 100644 index d6ed295..0000000 --- a/sysadm/mirror-operations/content-replayer.rst +++ /dev/null @@ -1,7 +0,0 @@ -.. _content_replayer: - -Content Replayer Service -======================== - -.. todo:: - This page is a work in progress. diff --git a/sysadm/mirror-operations/deploy.rst b/sysadm/mirror-operations/deploy.rst index 24f6889..6aedb82 100644 --- a/sysadm/mirror-operations/deploy.rst +++ b/sysadm/mirror-operations/deploy.rst @@ -1,90 +1,49 @@ .. _mirror_deploy: How to deploy a mirror ====================== This section describes how to deploy a mirror using the software stack provided by |swh|. A mirror deployment will consists in running several components of the |swh| stack: -- an instance of the storage (swh-storage) with its backend storage (PostgreSQL - or Cassandra), -- an instance of the object storage (swh-objstorage) with its backend storage - solution (in-house with the `pathslicer` backend, or cloud based) -- an instance of the front page (swh-web) -- an instance of the search engine (swh-search) -- the vault service and its support tooling, -- the replayer services. 
+- An instance of the storage (:ref:`swh-storage`); +- A backend database (PostgreSQL or Cassandra) for the storage; +- An instance of the object storage (:ref:`swh-objstorage`); +- A large storage system (ZFS or cloud storage) as the objstorage backend; +- An instance of the frontend (:ref:`swh-web`); +- [Optional] An instance of the search engine backend (:ref:`swh-search`); +- [Optional] An Elasticsearch instance as the swh-search backend; +- [Optional] The vault service and its support tooling (RabbitMQ, + :ref:`swh-scheduler`, :ref:`swh-vault`, ...); +- The replayer services: + + - :mod:`swh.storage.replay` service (part of the :ref:`swh-storage` + package) + - :mod:`swh.objstorage.replayer.replay` service (from the + :ref:`swh-objstorage-replayer` package) Each service consists in an HTTP-based RPC served by a `gunicorn `_ `WSGI `_ server. - Docker-based deployment ----------------------- This represents a lot of services to configure and orchestrate. In order to help to start the configuration of a mirror, a `docker-swarm `_ based deployment solution is provided as a working example of the mirror stack: https://forge.softwareheritage.org/source/swh-docker It is strongly recommended to :ref:`start from there ` in a test environment before planning a production-like deployment. - -Step by step deployment of a mirror ------------------------------------ - -When using the |swh| software stack to deploy a mirror, a number of |swh| -software components must be installed and configured to interact woth each other: - -#. :ref:`How to deploy the objstorage `: the objstorage - consists in an object storage solution (can be cloud-based or on local - filesystem like ZFS pools) and the :ref:`swh-objstorage` service, - -#. 
:ref:`How to deploy graph replayer services `: - :mod:`swh-devel:swh.objstorage.replayer.replay` service is responsible for - consuming the ``content`` topic from the |swh| kafka broker and filling the mirror - objstorage, retrieving blob objects from a |swh| objstarage, - -#. :ref:`How to deploy the storage `: the storage consists in a - database to store the graph of the |swh| archive (PostgreSQL or Cassandra) - and the :ref:`swh-devel:swh-storage` service, - -#. :ref:`How to deploy graph replayer services `: - :mod:`swh-devel:swh.storage.replay` service is responsible for consuming from - the |swh| kafka broker and fill the mirror storage, - -#. :ref:`How to deploy the frontend `: the :ref:`frontend - ` consists in a `django `_ - based application serving both the web API and the main UI for browsing the - Archive. - -#. :ref:`How to deploy the search engine `: the :ref:`search engine - ` consists in a `ElasticSearch `_ - based application used by the frontend. - -#. :ref:`How to deploy the vault service `: the :ref:`vault - service ` consists in a backend asynchronous service - allowing the user to ask for a zip archive of a given repository or git - history. - - - .. toctree:: :titlesonly: :hidden: docker - objstorage - storage - content-replayer - graph-replayer - frontend - search - vault diff --git a/sysadm/mirror-operations/frontend.rst b/sysadm/mirror-operations/frontend.rst deleted file mode 100644 index a0ae8c0..0000000 --- a/sysadm/mirror-operations/frontend.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. _mirror_frontend: - -Frontend Services -================= - - -.. todo:: - This page is a work in progress. diff --git a/sysadm/mirror-operations/graph-replayer.rst b/sysadm/mirror-operations/graph-replayer.rst deleted file mode 100644 index 8a73a62..0000000 --- a/sysadm/mirror-operations/graph-replayer.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. _mirror_graph_replayer: - -Graph Replayer Service -====================== - - -.. 
todo:: - This page is a work in progress. diff --git a/sysadm/mirror-operations/index.rst b/sysadm/mirror-operations/index.rst index 2e7fc49..62ee5a4 100644 --- a/sysadm/mirror-operations/index.rst +++ b/sysadm/mirror-operations/index.rst @@ -1,39 +1,130 @@ .. _mirror_operations: + Mirror Operations ================= +Description +----------- + A mirror is a full copy of the |swh| archive, operated independently from the -Software Heritage initiative. +Software Heritage initiative. A minimal mirror consists of two parts: + +- the graph storage (typically an instance of :ref:`swh.storage `), + which contains the Merkle DAG structure of the archive, *except* the + actual content of source code files (AKA blobs), + +- the object storage (typically an instance of :ref:`swh.objstorage `), + which contains all the blobs corresponding to archived source code files. + +However, a usable mirror also needs to be accessible by others. As such, a +proper mirror should also allow users to: + +- navigate the archive copy using a Web browser and/or the Web API (typically + using :ref:`the web application `), +- retrieve data from the copy of the archive (typically using :ref:`the + vault service `) + +A mirror is initially populated and kept up to date by consuming data +from the |swh| Kafka-based :ref:`journal ` and retrieving the +blob objects (file content) from the |swh| :ref:`object storage `. + +.. note:: It is not required that a mirror be deployed using the |swh| software + stack. Other technologies, including different storage methods, can be + used. This documentation, however, focuses on the case of a mirror + deployment using the |swh| software stack. + + +.. thumbnail:: ../images/mirror-architecture.svg + + General view of the |swh| mirroring architecture. + +Mirroring the Graph Storage +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The replication of the graph is based on a journal using Kafka_ as the event +streaming platform. 
+ +On the Software Heritage side, every addition made to the archive consists of +the addition of a :ref:`data-model` object. The new object is also serialized +as a msgpack_ bytestring which is used as the value of a message added to a +Kafka topic dedicated to the object type. + +The main Kafka topics for the |swh| :ref:`data-model` are: + +- `swh.journal.objects.content` +- `swh.journal.objects.directory` +- `swh.journal.objects.extid` +- `swh.journal.objects.metadata_authority` +- `swh.journal.objects.metadata_fetcher` +- `swh.journal.objects.origin_visit_status` +- `swh.journal.objects.origin_visit` +- `swh.journal.objects.origin` +- `swh.journal.objects.raw_extrinsic_metadata` +- `swh.journal.objects.release` +- `swh.journal.objects.revision` +- `swh.journal.objects.skipped_content` +- `swh.journal.objects.snapshot` + +In order to set up a mirror of the graph, one needs to deploy a stack capable +of retrieving all these topics and storing their content reliably. For example, a +Kafka cluster configured as a replica of the main Kafka broker hosted by |swh| +would do the job (albeit not in a very useful manner by itself). + +A more useful mirror can be set up using the :ref:`storage ` +component with the help of the special service named `replayer` provided by the +:mod:`swh.storage.replay` module. + +.. TODO: replace this previous link by a link to the 'swh storage replay' + command once available, and ideally once + https://github.com/sphinx-doc/sphinx/issues/880 is fixed + + +Mirroring the Object Storage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +File contents (blobs) are *not* directly stored in messages of the +`swh.journal.objects.content` Kafka topic, which only contains metadata about +them, such as various kinds of cryptographic hashes. A separate component is in +charge of replicating blob objects from the archive and storing them in the +local object storage instance. 
+ +A separate `swh-journal` client should subscribe to the +`swh.journal.objects.content` topic to get the stream of blob object +identifiers, then retrieve the corresponding blobs from the main Software Heritage +object storage, and store them in the local object storage. + +A reference implementation for this component is available in the +:ref:`content replayer `. -A mirror should be able to: -- store a full copy of the archive, +Installation +------------ -- serve the data using the web UI, +When using the |swh| software stack to deploy a mirror, a number of |swh| +software components must be installed (cf. architecture diagram above). -- search the archive using the web UI, +A `docker-swarm `_ based deployment +solution is provided as a working example of the mirror stack; +see :ref:`mirror_deploy`. -- serve the data using the public API, +It is strongly recommended to start from there before planning a +production-like deployment. -- allow users to retrieve content from the archive using the :ref:`Vault - ` service. +.. _Kafka: https://kafka.apache.org/ +.. _msgpack: https://msgpack.org -See the :ref:`swh-devel:mirror` for a complete description of the mirror -architecture. -You may want to read: +You may also want to read: -- :ref:`mirror_deploy` if you want to deploy a mirror of the |swh| archive on - your infrastructure. - :ref:`mirror_monitor` to learn how to monitor your mirror and how to report its health back the |swh|. - :ref:`mirror_onboard` for the |swh| side view of adding a new mirror. .. toctree:: :hidden: deploy onboard monitor diff --git a/sysadm/mirror-operations/objstorage.rst b/sysadm/mirror-operations/objstorage.rst deleted file mode 100644 index 092afb3..0000000 --- a/sysadm/mirror-operations/objstorage.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. _mirror_objstorage: - -Objstorage Service -================== - - -.. todo:: - This page is a work in progress. 
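Note on the "Mirroring the Object Storage" text moved into sysadm/mirror-operations/index.rst: the replication loop it describes (read blob identifiers from the content topic, fetch each blob from the main object storage, write it to the local one) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual `swh.objstorage.replayer` code: the function name `replay_content`, the `"sha1"` message key, and the use of plain dictionaries in place of a Kafka consumer and of the object storages are all hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Mapping


def replay_content(
    messages: Iterable[Mapping[str, bytes]],
    source: Mapping[bytes, bytes],
    destination: Dict[bytes, bytes],
) -> int:
    """Copy the blobs referenced by journal messages into a local store.

    Hypothetical stand-ins (assumptions, not the swh API):
    - `messages` plays the role of decoded messages from the content topic,
      each carrying only metadata, here a "sha1" digest;
    - `source` plays the role of the main object storage;
    - `destination` plays the role of the mirror's local object storage.

    Returns the number of blobs actually copied.
    """
    copied = 0
    for message in messages:
        obj_id = message["sha1"]
        # Skip blobs the mirror already holds, so replaying is idempotent.
        if obj_id in destination:
            continue
        blob = source[obj_id]
        # Check the blob against the hash announced in the journal message.
        if hashlib.sha1(blob).digest() != obj_id:
            raise ValueError(f"hash mismatch for object {obj_id.hex()}")
        destination[obj_id] = blob
        copied += 1
    return copied
```

Calling the function twice with the same messages copies nothing the second time. That idempotence is a design choice of this sketch, made because a journal consumer may see the same message again after a restart.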
diff --git a/sysadm/mirror-operations/search.rst b/sysadm/mirror-operations/search.rst deleted file mode 100644 index d8e780d..0000000 --- a/sysadm/mirror-operations/search.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. _mirror_search: - -Search Services -=============== - - -.. todo:: - This page is a work in progress. diff --git a/sysadm/mirror-operations/storage.rst b/sysadm/mirror-operations/storage.rst deleted file mode 100644 index 47a70e3..0000000 --- a/sysadm/mirror-operations/storage.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. _mirror_storage: - -Storage Services -================ - - -.. todo:: - This page is a work in progress. diff --git a/sysadm/mirror-operations/vault.rst b/sysadm/mirror-operations/vault.rst deleted file mode 100644 index 0e05630..0000000 --- a/sysadm/mirror-operations/vault.rst +++ /dev/null @@ -1,8 +0,0 @@ -.. _mirror_vault: - -Vault Services -============== - - -.. todo:: - This page is a work in progress.