diff --git a/docs/archive-copies.rst b/docs/archive-copies.rst
index 271019e27..04f7e5ec3 100644
--- a/docs/archive-copies.rst
+++ b/docs/archive-copies.rst
@@ -1,11 +1,45 @@
 .. _archive-copies:
 
-Software Heritage archive copies
-================================
+Archive copies
+==============
 
 .. _swh-storage-copies-layout:
 .. figure:: images/swh-archive-copies.svg
    :width: 1024px
    :align: center
 
    Layout of Software Heritage archive copies (click to zoom).
+
+The Software Heritage archive exists in several copies, to minimize the risk of
+losing archived source code artifacts. The layout of existing copies, their
+relationships, as well as their geographical and administrative domains are
+shown in the layout diagram above.
+
+We recall that the archive is conceptually organized as a graph, and
+specifically a Merkle DAG, see :ref:`data-model` for more information.
+
+Ingested source code artifacts land directly on the **primary copy**, which is
+updated live and also used as reference for deduplication purposes. There,
+different parts of the Merkle DAG are stored using different backend
+technologies. The leaves of the graph, i.e., *content objects* (or "blobs"),
+are stored in a key-value object storage, using their SHA1 identifiers as keys
+(see :ref:`persistent-identifiers`). SHA1 collision avoidance is enforced by
+the :mod:`swh.storage` module. The *rest of the graph* is stored in a Postgres
+database (see :ref:`sql-storage`).
+
+At the time of writing, the primary object storage contains about 5 billion
+blobs with a median size of 3 KB---yes, that is *a lot of very small
+files*---for a total compressed size of about 200 TB. The Postgres database
+takes about 8 TB, half of which is required by indexes. In terms of graph
+metrics, the Merkle DAG has about 10 B nodes and 100 B edges.
+
+The **secondary copy** is hosted on Microsoft Azure cloud, using its native
+blob storage for the object storage and a large virtual machine to run a
+Postgres instance there. The database is kept up-to-date w.r.t. the primary
+copy using Postgres WAL replication. The object storage is kept up-to-date
+using :mod:`swh.archiver`.
+
+Archive copies (as opposed to archive mirrors) are operated by the Software
+Heritage Team at Inria. The primary archived copy is geographically located at
+Rocquencourt, France; the secondary copy is hosted in the Europe West region of
+the Azure cloud.
diff --git a/docs/sql-storage.rst b/docs/sql-storage.rst
index 86a797374..aa3d834a3 100644
--- a/docs/sql-storage.rst
+++ b/docs/sql-storage.rst
@@ -1,14 +1,14 @@
 .. _sql-storage:
 
-Software Heritage SQL storage
-=============================
+SQL storage
+===========
 
 Postgres DB schema
 ------------------
 
 .. _swh-storage-db-schema:
 .. figure:: ../sql/doc/sql/db-schema.svg
    :width: 1024px
    :align: center
 
    Postgres DB schema of high-level Software Heritage storage (click to zoom).