D2223.diff

diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -140,6 +140,7 @@
getting-started
developer-setup
manual-setup
+ Infrastructure <infrastructure/storage_sites>
API documentation <apidoc/modules>
swh.core <swh-core/index>
swh.dataset <swh-dataset/index>
diff --git a/docs/elasticsearch.rst b/docs/infrastructure/elasticsearch.rst
rename from docs/elasticsearch.rst
rename to docs/infrastructure/elasticsearch.rst
--- a/docs/elasticsearch.rst
+++ b/docs/infrastructure/elasticsearch.rst
@@ -1,3 +1,5 @@
+.. _elasticsearch:
+
==============
Elasticsearch
==============
@@ -20,7 +22,7 @@
Architecture diagram
====================
-.. graphviz:: images/elasticsearch.dot
+.. graphviz:: ../images/elasticsearch.dot
Per-node storage
================
diff --git a/docs/infrastructure/hypervisors.rst b/docs/infrastructure/hypervisors.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/hypervisors.rst
@@ -0,0 +1,26 @@
+===========
+Hypervisors
+===========
+
+Software Heritage uses a few hypervisors configured in a Proxmox cluster.
+
+List of Proxmox nodes
+=====================
+
+- beaubourg: Xeon E7-4809 server, 16 cores/512 GB RAM, bought in 2015
+- hypervisor3: EPYC 7301 server, 32 cores/256 GB RAM, bought in 2018
+
+Per-node storage
+================
+
+Each server has physically installed 2.5" SSDs (SAS or SATA), configured as
+mdadm RAID10 pools.
+A device mapper layer on top of these pools allows Proxmox to easily manage VM
+disk images.
+
+Network storage
+===============
+
+A :ref:`ceph_cluster` is set up as a shared storage resource.
+It can be used to temporarily transfer VM disk images from one hypervisor
+node to another, or to directly store virtual machine disk images.
diff --git a/docs/infrastructure/object_storage.rst b/docs/infrastructure/object_storage.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/object_storage.rst
@@ -0,0 +1,73 @@
+==============
+Object storage
+==============
+
+There is not one but at least four different object stores directly managed
+by the Software Heritage group:
+
+- Main archive
+- Rocquencourt replica archive
+- Azure archive
+- AWS archive
+
+The Main archive
+================
+
+The main archive is hosted on *uffizi*, located at Rocquencourt.
+
+Replica archive
+===============
+
+The replica archive is hosted on *banco*, located at Rocquencourt in a
+different building from the main one.
+
+Azure archive
+=============
+
+The Azure archive uses an Azure Blob Storage backend, implemented in the
+*swh.objstorage.backends.azure.AzureCloudObjStorage* Python class.
+
+Internally, that class uses the *block_blob_service* Azure API.
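+
+As a hedged illustration only, instantiating the backend directly might look
+like the following sketch; the constructor arguments are assumptions and not
+taken from the actual class definition:
+
+.. code-block:: python
+
+    # Sketch: the account name, credential and container arguments are
+    # illustrative assumptions, not the verified constructor signature.
+    from swh.objstorage.backends.azure import AzureCloudObjStorage
+
+    objstorage = AzureCloudObjStorage(
+        account_name="0euwestswh",       # one of the sharded storage accounts
+        api_secret_key="<secret>",       # hypothetical credential
+        container_name="contents",
+    )
+    obj_id = bytes.fromhex("6a8f" + "00" * 18)  # placeholder content hash
+    content = objstorage.get(obj_id)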
+
+AWS archive
+===========
+
+The AWS archive is stored in the *softwareheritage* Amazon S3 bucket, in the
+US-East (N. Virginia) region. That bucket is public.
+
+It is being continuously populated by the *content_replayer* program.
+
+Software Heritage Python programs access it using a libcloud backend.
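+
+As an illustration, read access through libcloud might look like this sketch
+(the key pair is a placeholder since the bucket is public, the object name is
+made up, and the exact driver options are assumptions):
+
+.. code-block:: python
+
+    # Sketch of read access via libcloud's S3 storage driver.
+    from libcloud.storage.types import Provider
+    from libcloud.storage.providers import get_driver
+
+    driver = get_driver(Provider.S3)("ACCESS_KEY", "SECRET_KEY",
+                                     region="us-east-1")
+    # Object names are content hashes below the "content" prefix.
+    obj = driver.get_object("softwareheritage", "content/6a8f" + "00" * 18)
+    driver.download_object(obj, "/tmp/blob", overwrite_existing=True)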
+
+URL
+---
+
+``s3://softwareheritage/content``
+
+content_replayer
+----------------
+
+A Python program which reads new object identifiers from Kafka and then copies
+the corresponding objects from the object storages on *banco* and *uffizi*
+into the S3 bucket.
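+
+Schematically, its main loop might look like the following sketch (not the
+actual implementation; the client library and message format are assumptions):
+
+.. code-block:: python
+
+    # Hedged sketch of a journal-driven replayer: consume content hashes from
+    # Kafka, fetch each blob from a source object storage, and add it to the
+    # destination (S3-backed) object storage.
+    from kafka import KafkaConsumer  # assumed client library
+
+    def replay(consumer: KafkaConsumer, src_objstorage, dst_objstorage) -> None:
+        for message in consumer:
+            obj_id = message.value  # assumed: the raw content hash
+            if obj_id not in dst_objstorage:
+                dst_objstorage.add(src_objstorage.get(obj_id), obj_id=obj_id)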
+
+
+Implementation details
+----------------------
+
+* Uses *swh.objstorage.backends.libcloud*
+
+* Uses *libcloud.storage.drivers.s3*
+
+
+Architecture diagram
+====================
+
+.. graph:: swh_archives
+ "Main archive" -- "Replica archive";
+ "Azure archive";
+ "AWS archive";
+ "Main archive" [shape=rectangle];
+ "Replica archive" [shape=rectangle];
+ "Azure archive" [shape=rectangle];
+ "AWS archive" [shape=rectangle];
diff --git a/docs/infrastructure/storage_site_amazon.rst b/docs/infrastructure/storage_site_amazon.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_amazon.rst
@@ -0,0 +1,9 @@
+.. _storage_amazon:
+
+Amazon storage
+==============
+
+The *softwareheritage* object storage S3 bucket is publicly hosted in the
+US-East AWS region.
+
+Data is reachable at the *s3://softwareheritage/content* URL.
diff --git a/docs/infrastructure/storage_site_azure_euwest.rst b/docs/infrastructure/storage_site_azure_euwest.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_azure_euwest.rst
@@ -0,0 +1,38 @@
+Azure Euwest
+============
+
+virtual machines
+----------------
+
+- dbreplica0: contains a read-only instance of the *softwareheritage* database
+- dbreplica1: contains a read-only instance of the *softwareheritage-indexer* database
+- kafka01 to 06
+- mirror-node-1 to 3
+- storage0
+- vangogh (vault implementation)
+- webapp0
+- worker01 to 13
+
+The PostgreSQL databases are populated using WAL streaming from *somerset*.
+
+storage accounts
+----------------
+
+16 Azure storage accounts (0euwestswh to feuwestswh) are dedicated to blob
+containers for object storage.
+The first hexadecimal digit of an account name is also the first digit of
+the content hashes it stores.
+Blobs are stored in locations of the form *6euwestswh/contents*.
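+
+The resulting mapping from a content hash to its blob location can be sketched
+as follows (illustrative only; the real lookup code may differ):
+
+.. code-block:: python
+
+    # Illustrative sharding rule: the first hex digit of the content hash
+    # selects one of the 16 storage accounts (0euwestswh to feuwestswh).
+    def blob_location(hex_hash: str) -> str:
+        account = hex_hash[0] + "euwestswh"
+        return f"{account}/contents/{hex_hash}"
+
+    # A hash starting with "6" lands in the 6euwestswh account:
+    print(blob_location("6a8f" + "00" * 18))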
+
+Other storage accounts:
+
+- archiveeuwestswh: mirrors of dead software forges like *code.google.com*
+- swhvaultstorage: cooked archives for the *vault* server running in Azure.
+- swhcontent: object storage content (individual blobs)
+
+
+TODO:
+
+- describe kafka* virtual machines
+- describe mirror-node* virtual machines
+- describe storage0 virtual machine
+- describe webapp0 virtual machine
+- describe worker* virtual machines
diff --git a/docs/infrastructure/storage_site_others.rst b/docs/infrastructure/storage_site_others.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_others.rst
@@ -0,0 +1,24 @@
+=========================================
+Other Software Heritage storage locations
+=========================================
+
+INRIA-provided storage at Rocquencourt
+======================================
+
+The *filer-backup:/swh1* NFS filesystem is used to store DAR backups.
+It is mounted on *uffizi:/srv/remote-backups*.
+
+The *uffizi:/srv/remote-backups* filesystem is regularly snapshotted; the
+snapshots are visible in *uffizi:/srv/remote-backups/.snapshot/*.
+
+Workstations
+============
+
+Staff workstations are located at INRIA Paris. The most important one from a
+storage point of view is *giverny.paris.inria.fr*, which has more than 10 TB
+of directly-attached storage, mostly used for research databases.
+
+Public website
+==============
+
+Hosted by Gandi, its storage (including WordPress) is located in one or more
+Gandi datacenters.
diff --git a/docs/infrastructure/storage_site_rocquencourt_physical.rst b/docs/infrastructure/storage_site_rocquencourt_physical.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_rocquencourt_physical.rst
@@ -0,0 +1,65 @@
+Physical machines at Rocquencourt
+=================================
+
+hypervisors
+-----------
+
+The :doc:`hypervisors <hypervisors>` mostly use local storage in the form of
+internal SSDs, but also have access to a :ref:`ceph_cluster`.
+
+NFS server
+----------
+
+There is only one NFS server managed by Software Heritage, *uffizi.internal.softwareheritage.org*.
+That machine is located at Rocquencourt and is directly attached to two SAS storage bays.
+
+NFS-exported data is present under these local filesystem paths::
+
+    /srv/storage/space
+    /srv/softwareheritage/objects
+
+belvedere
+---------
+
+This server is used for at least two separate PostgreSQL instances:
+
+- *softwareheritage* database (port 5433)
+- *swh-lister* and *softwareheritage-scheduler* databases (port 5434)
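+
+The port-to-instance mapping can be illustrated with a connection sketch (the
+host name and client library are assumptions based on naming used elsewhere in
+this document):
+
+.. code-block:: python
+
+    # Hedged sketch: each PostgreSQL instance is selected by its port.
+    import psycopg2
+
+    swh_db = psycopg2.connect(
+        host="belvedere.internal.softwareheritage.org",  # assumed host name
+        port=5433, dbname="softwareheritage")
+    scheduler_db = psycopg2.connect(
+        host="belvedere.internal.softwareheritage.org",
+        port=5434, dbname="softwareheritage-scheduler")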
+
+Data is stored on local SSDs. The operating system lies on an LSI hardware
+RAID 1 volume and each PostgreSQL instance uses a dedicated set of drives in
+mdadm RAID10 volume(s).
+
+It also uses a single NFS volume::
+
+    uffizi:/srv/storage/space/postgres-backups/prado
+
+banco
+-----
+
+This machine is located in its own building at Rocquencourt, along with a SAS
+storage bay.
+It is intended to serve as a backup for the main site in building 30.
+
+Elasticsearch cluster
+---------------------
+
+The :doc:`Elasticsearch cluster <elasticsearch>` only uses local storage on
+its nodes.
+
+Test / staging server
+---------------------
+
+There is also *orsay*, a refurbished machine only used for testing / staging
+new software versions.
+
+.. _ceph_cluster:
+
+Ceph cluster
+------------
+
+The Software Heritage Ceph cluster contains three nodes:
+
+- ceph-mon1
+- ceph-osd1
+- ceph-osd2
diff --git a/docs/infrastructure/storage_site_rocquencourt_virtual.rst b/docs/infrastructure/storage_site_rocquencourt_virtual.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_rocquencourt_virtual.rst
@@ -0,0 +1,45 @@
+Virtual machines at Rocquencourt
+================================
+
+The following virtual machines are hosted on Proxmox hypervisors located at Rocquencourt.
+All of them use local storage on their virtual hard drives.
+
+VMs without NFS mount points
+----------------------------
+
+- munin0
+- tate, used for public and private (intranet) wikis
+- getty
+- thyssen
+- jenkins-debian1.internal.softwareheritage.org
+- logstash0
+- kibana0
+- saatchi
+- louvre
+
+Containers and VMs with NFS storage
+-----------------------------------
+
+- somerset.internal.softwareheritage.org is an LXC container running on
+  *beaubourg*.
+  It serves as a host for the *softwareheritage* and *softwareheritage-indexer*
+  databases.
+
+- worker01 to worker16.internal.softwareheritage.org
+- pergamon
+- moma
+
+These VMs access one or more of these NFS volumes located on *uffizi*::
+
+ uffizi:/srv/softwareheritage/objects
+ uffizi:/srv/storage/space
+ uffizi:/srv/storage/space/annex
+ uffizi:/srv/storage/space/annex/public
+ uffizi:/srv/storage/space/antelink
+ uffizi:/srv/storage/space/oversize-objects
+ uffizi:/srv/storage/space/personal
+ uffizi:/srv/storage/space/postgres-backups/somerset
+ uffizi:/srv/storage/space/provenance-index
+ uffizi:/srv/storage/space/swh-deposit
+
diff --git a/docs/infrastructure/storage_sites.rst b/docs/infrastructure/storage_sites.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_sites.rst
@@ -0,0 +1,51 @@
+===============================
+Software Heritage storage sites
+===============================
+
+.. toctree::
+ :maxdepth: 2
+ :hidden:
+
+ storage_site_rocquencourt_physical
+ storage_site_rocquencourt_virtual
+ storage_site_azure_euwest
+ storage_site_amazon
+ storage_site_others
+ elasticsearch
+ hypervisors
+ object_storage
+
+Physical machines at Rocquencourt
+=================================
+
+INRIA Rocquencourt is the main Software Heritage datacenter.
+It is the only one to contain
+:doc:`directly-managed physical machines <storage_site_rocquencourt_physical>`.
+
+Virtual machines at Rocquencourt
+================================
+
+The :doc:`virtual machines at Rocquencourt <storage_site_rocquencourt_virtual>`
+are directly managed by Software Heritage staff as well and run on
+:doc:`Software Heritage hypervisors <hypervisors>`.
+
+Azure Euwest
+============
+
+Various virtual machines and other services are hosted at
+:doc:`Azure Euwest <storage_site_azure_euwest>`.
+
+Amazon S3
+=========
+
+A public :doc:`Amazon S3 bucket <storage_site_amazon>` holds a copy of the
+object storage.
+
+Object storage
+==============
+
+Even though there are different object storage implementations in different
+locations, it has been deemed useful to regroup all object storage-related
+information in a :doc:`single document <object_storage>`.
+
+Other locations
+===============
+
+Other storage locations are described in a
+:doc:`dedicated document <storage_site_others>`.
