diff --git a/docs/architecture/overview.rst b/docs/architecture/overview.rst
--- a/docs/architecture/overview.rst
+++ b/docs/architecture/overview.rst
@@ -17,45 +17,41 @@
 
 General view of the |swh| architecture.
 
-The front API components are:
+.. _architecture-tier-1:
 
-- :ref:`Storage API ` (including the Metadata Storage)
-- :ref:`Deposit API `
-- :ref:`Vault API `
-- :ref:`Indexer API `
-- :ref:`Scheduler API `
+Core components
+---------------
 
-On the back stage of this show, a celery_ based game of tasks and workers
-occurs to perform all the required work to fill, maintain and update the |swh|
-:term:`archive`.
+The following components are the foundation of the entire |swh| architecture,
+as they fetch data, store it, and make it available to every other service.
 
-The main components involved in this choreography are:
+Data storage
+^^^^^^^^^^^^
 
-- :term:`Listers `: a lister is a type of task aiming at scraping a
-  web site, a forge, etc. to gather all the source code repositories it can
-  find. For each found source code repository, a :term:`loader` task is
-  created.
+The :ref:`Storage ` provides an API to store and retrieve
+elements of the :ref:`graph `, such as directory structure,
+revision history, and their respective metadata.
+It relies on the :ref:`Object Storage ` service to store
+the content of the source code files themselves.
 
-- :term:`Loaders `: a loader is a type of task aiming at importing or
-  updating a source code repository. It is the one that inserts :term:`blob`
-  objects in the :term:`object storage`, and inserts nodes and edges in the
-  :ref:`graph `.
+Task management
+^^^^^^^^^^^^^^^
 
-- :term:`Indexers `: an indexer is a type of task aiming at crawling
-  the content of the :term:`archive` to extract derived information (mimetype,
-  etc.)
+The :ref:`Scheduler ` manages the entire choreography of jobs/tasks
+in |swh|, from detecting and ingesting repositories, to extracting metadata from them,
+to repackaging repositories into small downloadable archives.
 
-- :term:`Vault `: this type of celery task is responsible for cooking a
-  compressed archive (zip or tgz) of an archived object (typically a directory
-  or a repository). Since this can be a rather long process, it is delegated to
-  an asynchronous (celery) task.
-
-
-Tasks
------
+It does this by managing its own database of tasks that need to run
+(either periodically or only once),
+and passing them to celery_ for execution on dedicated workers.
 
 Listers
-+++++++
+^^^^^^^
+
+:term:`Listers ` are a type of task, run by the Scheduler, aiming at scraping a
+web site, a forge, etc. to gather all the source code repositories they can
+find, also known as :term:`origins `.
+For each found source code repository, a :term:`loader` task is created.
 
 The following sequence diagram shows the interactions between these components
 when a new forge needs to be archived. This example depicts the case of a
@@ -80,7 +76,13 @@
 creating a new task.
 
 Loaders
-+++++++
+^^^^^^^
+
+:term:`Loaders ` are also a type of task, but aim at importing or
+updating a source code repository. They are the ones that insert :term:`blob`
+objects in the :term:`object storage`, and insert nodes and edges in the
+:ref:`graph `.
+
 
 The sequence diagram below describe this second step of importing the content
 of a repository. Once again, we take the example of a git repository, but any
@@ -89,6 +91,114 @@
 .. thumbnail:: images/tasks-git-loader.svg
 
 
+Journal
+^^^^^^^
+
+The last core component is the :term:`Journal `, which is a persistent logger
+of every change in the archive, with publish-subscribe_ support, using Kafka.
+
+The Storage writes to it every time a new object is added to the archive,
+and many components read from it to be notified of these changes.
+For example, it allows the Scheduler to know how often software repositories are
+updated by their developers, to decide when next to visit these repositories.
+
+It is also the foundation of the :ref:`mirror` infrastructure, as it allows
+mirrors to stay up to date.
+
+
+.. _architecture-tier-2:
+
+Other major components
+----------------------
+
+All the components we saw above are critical to the |swh| archive as they are
+in charge of archiving source code.
+But they are not enough to provide another important feature of |swh|: making
+this archive accessible and searchable by anyone.
+
+
+Archive website and API
+^^^^^^^^^^^^^^^^^^^^^^^
+
+First of all, the archive website and API, also known as :ref:`swh-web `,
+is the main entry point of the archive.
+
+This is the component that serves https://archive.softwareheritage.org/,
+which is the window into the entire archive, as it provides access to it
+through a web browser or the HTTP API.
+
+It does so by querying most of the internal APIs of |swh|:
+the Data Storage (to display source code repositories and their content),
+the Scheduler (to allow manual scheduling of loader tasks through the
+`Save Code Now `_ feature),
+and many of the other services we will see below.
+
+Internal data mining
+^^^^^^^^^^^^^^^^^^^^
+
+:term:`Indexers ` are a type of task aiming at crawling
+the content of the :term:`archive` to extract derived information.
+
+This ranges from detecting the MIME type or license of individual files,
+to reading all types of metadata files at the root of repositories
+and storing them together in a unified format, CodeMeta_.
+
+Vault
+^^^^^
+
+The :term:`Vault ` is an internal API, in charge of cooking
+compressed archives (zip or tgz) of archived objects on request (via swh-web).
+These compressed objects are typically directories or repositories.
+
+Since this can be a rather long process, it is delegated to
+an asynchronous (celery) task, through the Scheduler.
+
+.. _architecture-tier-3:
+
+Extra services
+--------------
+
+Finally, |swh| provides additional tools that, although not necessary to operate
+the archive, provide convenient interfaces or performance benefits.
+
+Search
+^^^^^^
+
+TODO (upgrade to tier 2?)
+
+Graph
+^^^^^
+
+TODO (upgrade to tier 2?)
+
+Deposit
+^^^^^^^
+
+TODO
+
+Web Client, Fuse, Scanner
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+TODO
+
+Counters
+^^^^^^^^
+
+TODO
+
+Auth
+^^^^
+
+TODO
+
+Replayers
+^^^^^^^^^
+
+TODO (merge with mirror or storage/objstorage?)
+
+
+
 .. _celery: https://www.celeryproject.org
 .. _gitlab: https://gitlab.com
-
+.. _CodeMeta: https://codemeta.github.io/
+.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -56,7 +56,11 @@
 ----------
 
 Here is brief overview of the most relevant software components in the Software
-Heritage stack. Each component name is linked to the development documentation
+Heritage stack, in alphabetical order.
+For a better introduction to the architecture, see the :ref:`architecture-overview`,
+which presents each of them in a didactical order.
+
+Each component name is linked to the development documentation
 of the corresponding Python module.
 
 :ref:`swh.auth `