diff --git a/docs/architecture/overview.rst b/docs/architecture/overview.rst index 7fa850f..df8f8b7 100644 --- a/docs/architecture/overview.rst +++ b/docs/architecture/overview.rst @@ -1,94 +1,275 @@ .. _architecture-overview: Software Architecture Overview ============================== From an end-user point of view, the |swh| platform consists in the :term:`archive`, which can be accessed using the web interface or its REST API. -Behind the scene (and the web app) are several components that expose +Behind the scene (and the web app) are several components/services that expose different aspects of the |swh| :term:`archive` as internal RPC APIs. -Each of these internal APIs have a dedicated (Postgresql) database. +These internal APIs have a dedicated database, usually PostgreSQL_. A global (and incomplete) view of this architecture looks like: .. thumbnail:: ../images/general-architecture.svg General view of the |swh| architecture. -The front API components are: +.. _architecture-tier-1: -- :ref:`Storage API ` (including the Metadata Storage) -- :ref:`Deposit API ` -- :ref:`Vault API ` -- :ref:`Indexer API ` -- :ref:`Scheduler API ` +Core components +--------------- -On the back stage of this show, a celery_ based game of tasks and workers -occurs to perform all the required work to fill, maintain and update the |swh| -:term:`archive`. +The following components are the foundation of the entire |swh| architecture, +as they fetch data, store it, and make it available to every other service. -The main components involved in this choreography are: +Data storage +^^^^^^^^^^^^ -- :term:`Listers `: a lister is a type of task aiming at scraping a - web site, a forge, etc. to gather all the source code repositories it can - find. For each found source code repository, a :term:`loader` task is - created. +The :ref:`Storage ` provides an API to store and retrieve +elements of the :ref:`graph `, such as directory structure, +revision history, and their respective metadata. 
+It relies on the :ref:`Object Storage ` service to store +the content of source code files themselves. -- :term:`Loaders `: a loader is a type of task aiming at importing or - updating a source code repository. It is the one that inserts :term:`blob` - objects in the :term:`object storage`, and inserts nodes and edges in the - :ref:`graph `. +Both the Storage and Object Storage are designed as abstractions over possible +backends. The former supports both PostgreSQL (the current solution in production) +and Cassandra (a more scalable option we are exploring). +The latter supports a large variety of "cloud" object storage as backends, +as well as a simple local filesystem. -- :term:`Indexers `: an indexer is a type of task aiming at crawling - the content of the :term:`archive` to extract derived information (mimetype, - etc.) +Task management +^^^^^^^^^^^^^^^ -- :term:`Vault `: this type of celery task is responsible for cooking a - compressed archive (zip or tgz) of an archived object (typically a directory - or a repository). Since this can be a rather long process, it is delegated to - an asynchronous (celery) task. +The :ref:`Scheduler ` manages the entire choreography of jobs/tasks +in |swh|, from detecting and ingesting repositories, to extracting metadata from them, +to repackaging repositories into small downloadable archives. - -Tasks ------ +It does this by managing its own database of tasks that need to run +(either periodically or only once), +and passing them to celery_ for execution on dedicated workers. Listers -+++++++ +^^^^^^^ + +:term:`Listers ` are a type of task, run by the Scheduler, aiming at scraping a +web site, a forge, etc. to gather all the source code repositories they can +find, also known as :term:`origins `. +For each source code repository found, a :term:`loader` task is created. The following sequence diagram shows the interactions between these components when a new forge needs to be archived. 
This example depicts the case of a gitlab_ forge, but any other supported source type would be very similar. .. thumbnail:: images/tasks-lister.svg As one might observe in this diagram, it does two things: - it asks the forge (a gitlab_ instance in this case) the list of known repositories, and - it insert one :term:`loader` task for each source code repository that will be in charge of importing the content of that repository. Note that most listers usually work in incremental mode, meaning they store in a dedicated database the current state of the listing of the forge. Then, on a subsequent execution of the lister, it will ask only for new repositories. Also note that if the lister inserts a new loading task for a repository for which a loading task already exists, the existing task will be updated (if needed) instead of creating a new task. Loaders -+++++++ +^^^^^^^ + +:term:`Loaders ` are also a type of task, but aim at importing or +updating a source code repository. They are the ones that insert :term:`blob` +objects in the :term:`object storage`, and insert nodes and edges in the +:ref:`graph `. + The sequence diagram below describe this second step of importing the content of a repository. Once again, we take the example of a git repository, but any other type of repository would be very similar. .. thumbnail:: images/tasks-git-loader.svg +Journal +^^^^^^^ + +The last core component is the :term:`Journal `, which is a persistent logger +of every change in the archive, with publish-subscribe_ support, using Kafka. + +The Storage writes to it every time a new object is added to the archive; +and many components read from it to be notified of these changes. +For example, it allows the Scheduler to know how often software repositories are +updated by their developers, to decide when next to visit these repositories. + +It is also the foundation of the :ref:`mirror` infrastructure, as it allows +mirrors to stay up to date. + + +.. 
_architecture-tier-2: + +Other major components +---------------------- + +All the components we saw above are critical to the |swh| archive as they are +in charge of archiving source code. +But they are not enough to provide another important feature of |swh|: making +this archive accessible and searchable by anyone. + + +Archive website and API +^^^^^^^^^^^^^^^^^^^^^^^ + +First of all, the archive website and API, also known as :ref:`swh-web `, +is the main entry point of the archive. + +This is the component that serves https://archive.softwareheritage.org/, +which is the window into the entire archive, as it provides access to it +through a web browser or the HTTP API. + +It does so by querying most of the internal APIs of |swh|: +the Data Storage (to display source code repositories and their content), +the Scheduler (to allow manual scheduling of loader tasks through the +`Save Code Now `_ feature), +and many of the other services we will see below. + +Internal data mining +^^^^^^^^^^^^^^^^^^^^ + +:term:`Indexers ` are a type of task aiming at crawling +the content of the :term:`archive` to extract derived information. + +This ranges from detecting the MIME type or license of individual files, +to reading all types of metadata files at the root of repositories +and storing them together in a unified format, CodeMeta_. + +All results computed by Indexers are stored in a PostgreSQL database, +the Indexer Storage. + + +Vault +^^^^^ + +The :term:`Vault ` is an internal API, in charge of cooking +compressed archives (zip or tgz) of archived objects on request (via swh-web). +These archived objects are typically directories or repositories. + +Since this can be a rather long process, it is delegated to +an asynchronous (celery) task, through the Scheduler. + +.. _architecture-tier-3: + +Extra services +-------------- + +Finally, |swh| provides additional tools that, although not necessary to operate +the archive, provide convenient interfaces or performance benefits. 
+ +It is therefore possible to have a fully-functioning archive without any of these +services (our :ref:`development Docker environment ` disables +most of these by default). + +Search +^^^^^^ + +The :ref:`swh-search ` service complements both the Storage +and the Indexer Storage, to provide efficient advanced reverse-index search queries, +such as full-text search on origin URLs and metadata. + +This service, based on Elasticsearch, is a recent addition to the |swh| architecture, +and is currently in use only for URL search. + +Graph +^^^^^ + +:ref:`swh-graph ` is also a recent addition to the architecture +designed to complement the Storage using a specialized backend. +It leverages WebGraph_ to store a compressed in-memory representation of the +entire graph, and provides fast implementations of graph traversal algorithms. + +Counters +^^^^^^^^ + +The `archive's landing page `_ features +counts of the total number of files/directories/revisions/... in the archive. +Perhaps surprisingly, counting unique objects at |swh|'s scale is hard, +and a performance bottleneck when implemented purely in the Storage's SQL database. + +:ref:`swh-counters ` provides an alternative design to solve this issue, +by reading new objects from the Journal and counting them using Redis_' HyperLogLog_ +feature; and keeps the history of these counters over time using Prometheus_. + +Deposit +^^^^^^^ + +The :ref:`Deposit ` is an alternative way to add content to the archive. +While listers and loaders, as we saw above, **discover** repositories +and **pull** artifacts into the archive, the Deposit allows trusted partners to +**push** the content of their repository directly to the archive, +where it is then loaded internally by the +:mod:`Deposit Loader `. + +The Deposit is centered on the SWORDv2_ protocol, which allows depositing archives +(usually TAR or ZIP) along with metadata in XML. + +The Deposit has its own HTTP interface, independent of swh-web. 
+It also has its own SWORD client, which is specialized to interact with the Deposit +server. + +Authentication +^^^^^^^^^^^^^^ + +While the archive itself is public, |swh| reserves some features +to authenticated clients, such as higher rate limits, access to experimental APIs +(currently: the Graph service), or the Deposit. + +This is managed centrally by :ref:`swh-auth ` using Keycloak. + +Web Client, Fuse, Scanner +^^^^^^^^^^^^^^^^^^^^^^^^^ + +SWH provides a few tools to access the archive via the API: + +* :ref:`swh-web-client`, a command-line interface to authenticate with SWH + and a library to access the API from Python programs +* :ref:`swh-fuse`, a Filesystem in USErspace implementation, + that exposes the entire archive as a regular directory on your computer +* :ref:`swh-scanner`, a work-in-progress to check which of the files in + a project are already in the archive, without submitting them + +Replayers and backfillers +^^^^^^^^^^^^^^^^^^^^^^^^^ + +As the Journal and various databases may be out of sync for various reasons +(scrub of either of them, migration, database addition, ...), +and because some databases need to follow the content of the Journal (mirrors), +some parts of the |swh| codebase contain tools known as "replayers" and "backfillers", +designed to keep them in sync: + +* the :ref:`Object Storage Replayer ` copies the content + of an object storage to another one. It first performs a full copy, then streams + new objects using the Journal to stay up to date +* the Storage Replayer loads the entire content of the Journal into a Storage database, + and also keeps them in sync. + This is used for mirrors, and when creating a new database. +* the Storage Backfiller does the opposite. This was initially used to populate + the Journal from the database; and is occasionally used when one needs to clear a topic + in the Journal and recreate it. + + .. _celery: https://www.celeryproject.org +.. _CodeMeta: https://codemeta.github.io/ .. 
_gitlab: https://gitlab.com - +.. _PostgreSQL: https://www.postgresql.org/ +.. _Prometheus: https://prometheus.io/ +.. _publish-subscribe: https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern +.. _Redis: https://redis.io/ +.. _SWORDv2: http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html +.. _HyperLogLog: https://redislabs.com/redis-best-practices/counting/hyperloglog/ +.. _WebGraph: https://webgraph.di.unimi.it/ diff --git a/docs/index.rst b/docs/index.rst index 155b7f0..47ccc1b 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,218 +1,222 @@ .. _swh-docs: Software Heritage - Development Documentation ============================================= Getting started --------------- * :ref:`getting-started` → deploy a local copy of the Software Heritage software stack in less than 5 minutes, or * :ref:`developer-setup` → get a working development setup that allows to hack on the Software Heritage software stack Contributing ------------ * :ref:`patch-submission` → learn how to submit your patches to the Software Heritage codebase * :ref:`code-review` → rules and guidelines to review code in Software Heritage * :ref:`python-style-guide` → how to format the Python code you write Architecture ------------ * :ref:`architecture-overview` → get a glimpse of the Software Heritage software architecture * :ref:`mirror` → learn what a Software Heritage mirror is and how to set up one * :ref:`Keycloak ` → learn how to use Keycloak, the authentication system used by |swh|'s web interface and public APIs Data Model and Specifications ----------------------------- * :ref:`persistent-identifiers` Specifications of the SoftWare Heritage persistent IDentifiers (SWHID). * :ref:`data-model` Documentation of the main |swh| archive data model. * :ref:`journal-specs` Documentation of the Kafka journal of the |swh| archive. 
Tutorials --------- * :ref:`testing-guide` * :doc:`/tutorials/issue-debugging-monitoring` * :ref:`Listing the content of your favorite forge ` and :ref:`running a lister in Docker ` Roadmap ------- * :ref:`roadmap-2021` Components ---------- Here is brief overview of the most relevant software components in the Software -Heritage stack. Each component name is linked to the development documentation +Heritage stack, in alphabetical order. +For a better introduction to the architecture, see the :ref:`architecture-overview`, +which presents each of them in a didactical order. + +Each component name is linked to the development documentation of the corresponding Python module. :ref:`swh.auth ` low-level library used by modules needing keycloak authentication :ref:`swh.core ` low-level utilities and helpers used by almost all other modules in the stack :ref:`swh.counters ` service providing efficient estimates of the number of objects in the SWH archive, using Redis's Hyperloglog :ref:`swh.dataset ` public datasets and periodic data dumps of the archive released by Software Heritage :ref:`swh.deposit ` push-based deposit of software artifacts to the archive swh.docs developer documentation (used to generate this doc you are reading) :ref:`swh.fuse ` Virtual file system to browse the Software Heritage archive, based on `FUSE `_ :ref:`swh.graph ` Fast, compressed, in-memory representation of the archive, with tooling to generate and query it. :ref:`swh.indexer ` tools and workers used to crawl the content of the archive and extract derived information from any artifact stored in it :ref:`swh.journal ` persistent logger of changes to the archive, with publish-subscribe support :ref:`swh.lister ` collection of listers for all sorts of source code hosting and distribution places (forges, distributions, package managers, etc.) 
:ref:`swh.loader-core ` low-level loading utilities and helpers used by all other loaders :ref:`swh.loader-git ` loader for `Git `_ repositories :ref:`swh.loader-mercurial ` loader for `Mercurial `_ repositories :ref:`swh.loader-svn ` loader for `Subversion `_ repositories :ref:`swh.model ` implementation of the :ref:`data-model` to archive source code artifacts :ref:`swh.objstorage ` content-addressable object storage :ref:`swh.objstorage.replayer ` Object storage replication tool :ref:`swh.scanner ` source code scanner to analyze code bases and compare them with source code artifacts archived by Software Heritage :ref:`swh.scheduler ` task manager for asynchronous/delayed tasks, used for recurrent (e.g., listing a forge, loading new stuff from a Git repository) and one-off activities (e.g., loading a specific version of a source package) :ref:`swh.search ` search engine for the archive :ref:`swh.storage ` abstraction layer over the archive, allowing to access all stored source code artifacts as well as their metadata :ref:`swh.vault ` implementation of the vault service, allowing to retrieve parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.) :ref:`swh.web ` Web application(s) to browse the archive, for both interactive (HTML UI) and mechanized (REST API) use :ref:`swh.web.client ` Python client for :ref:`swh.web ` Dependencies ------------ The dependency relationships among the various modules are depicted below. .. _py-deps-swh: .. figure:: images/py-deps-swh.svg :width: 1024px :align: center Dependencies among top-level Python modules (click to zoom). Archive ------- * :ref:`Archive ChangeLog `: notable changes to the archive over time Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * `URLs index `_ * :ref:`search` * :ref:`glossary` .. ensure sphinx does not complain about index files not being included .. 
toctree:: :maxdepth: 2 :caption: Contents: :titlesonly: :hidden: getting-started/index architecture/index contributing/index tutorials/index API documentation roadmap/roadmap-2021.rst swh.auth swh.core swh.counters swh.dataset swh.deposit swh.fuse swh.graph swh.indexer swh.journal swh.lister swh.loader swh.model swh.objstorage swh.objstorage.replayer swh.scanner swh.scheduler swh.search swh.storage swh.vault swh.web swh.web.client archive-changelog journal