D2223.diff
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -140,6 +140,7 @@
getting-started
developer-setup
manual-setup
+ Infrastructure <infrastructure/storage_sites>
API documentation <apidoc/modules>
swh.core <swh-core/index>
swh.dataset <swh-dataset/index>
diff --git a/docs/elasticsearch.rst b/docs/infrastructure/elasticsearch.rst
rename from docs/elasticsearch.rst
rename to docs/infrastructure/elasticsearch.rst
--- a/docs/elasticsearch.rst
+++ b/docs/infrastructure/elasticsearch.rst
@@ -1,3 +1,5 @@
+.. _elasticsearch:
+
==============
Elasticsearch
==============
@@ -20,7 +22,7 @@
Architecture diagram
====================
-.. graphviz:: images/elasticsearch.dot
+.. graphviz:: ../images/elasticsearch.dot
Per-node storage
================
diff --git a/docs/infrastructure/hypervisors.rst b/docs/infrastructure/hypervisors.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/hypervisors.rst
@@ -0,0 +1,26 @@
+===========
+Hypervisors
+===========
+
+Software Heritage uses a few hypervisors configured in a Proxmox cluster.
+
+List of Proxmox nodes
+=====================
+
+- beaubourg: Xeon E7-4809 server, 16 cores/512 GB RAM, bought in 2015
+- hypervisor3: EPYC 7301 server, 32 cores/256 GB RAM, bought in 2018
+
+Per-node storage
+================
+
+Each server has physically installed 2.5" SSDs (SAS or SATA), configured
+in mdadm RAID10 pools.
+A device mapper layer on top of these pools allows Proxmox to easily manage VM
+disk images.
+
+Network storage
+===============
+
+A :ref:`Ceph cluster <ceph_cluster>` is set up as a shared storage resource.
+It can be used to temporarily transfer VM disk images from one hypervisor
+node to another, or to directly store virtual machine disk images.
diff --git a/docs/infrastructure/object_storage.rst b/docs/infrastructure/object_storage.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/object_storage.rst
@@ -0,0 +1,73 @@
+==============
+Object storage
+==============
+
+There are at least four different object stores directly managed
+by the Software Heritage group:
+
+- Main archive
+- Rocquencourt replica archive
+- Azure archive
+- AWS archive
+
+The Main archive
+================
+
+Hosted on *uffizi*.
+Located in Rocquencourt.
+
+Replica archive
+===============
+
+Hosted on *banco*.
+Located in Rocquencourt, in a different building from the main one.
+
+Azure archive
+=============
+
+The Azure archive uses an Azure Blob Storage backend, implemented in the
+*swh.objstorage.backends.azure.AzureCloudObjStorage* Python class.
+
+Internally, that class uses the *block_blob_service* Azure API.
+
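+As an illustration only (this is not the Software Heritage implementation),
+fetching a blob through that legacy API could look like the following sketch;
+the account key and the blob naming scheme are assumptions::
+
+   # illustrative sketch, not the actual AzureCloudObjStorage code
+   from azure.storage.blob import BlockBlobService  # legacy azure-storage SDK
+
+   blob_service = BlockBlobService(account_name="0euwestswh",   # one of the storage accounts
+                                   account_key="<secret key>")  # assumed credential
+   # container "contents"; naming blobs after the object hash is an assumption
+   blob = blob_service.get_blob_to_bytes("contents", "<object hash>")
+   data = blob.content
+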
+AWS archive
+===========
+
+The AWS archive is stored in the *softwareheritage* Amazon S3 bucket, in the US-East
+(N. Virginia) region. That bucket is public.
+
+It is being continuously populated by the `content_replayer` program.
+
+Software Heritage Python programs access it using a libcloud backend.
+
+URL
+---
+
+``s3://softwareheritage/content``
+
+content_replayer
+----------------
+
+A Python program which reads new objects from Kafka and then copies them from the
+object storages on Banco and Uffizi to the S3 bucket.
+
+
+Implementation details
+----------------------
+
+* Uses *swh.objstorage.backends.libcloud*
+
+* Uses *libcloud.storage.drivers.s3*
+
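+The following is a rough, illustrative sketch of such a replay loop; the topic
+name, message format, credentials and on-disk object layout are assumptions and
+do not describe the actual `content_replayer` implementation::
+
+   from kafka import KafkaConsumer
+   from libcloud.storage.providers import get_driver
+   from libcloud.storage.types import Provider
+
+   s3 = get_driver(Provider.S3)("<access key>", "<secret key>")  # assumed credentials
+   bucket = s3.get_container("softwareheritage")
+
+   consumer = KafkaConsumer("swh.journal.objects.content",       # assumed topic name
+                            bootstrap_servers=["kafka01:9092"])  # assumed broker
+   for message in consumer:
+       obj_id = message.value.decode()  # assumed: the message carries the object hash
+       # assumed flat layout under the NFS-mounted object storage
+       with open("/srv/softwareheritage/objects/%s" % obj_id, "rb") as f:
+           s3.upload_object_via_stream(iter([f.read()]), bucket,
+                                       object_name="content/%s" % obj_id)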
+
+Architecture diagram
+====================
+
+.. graph:: swh_archives
+ "Main archive" -- "Replica archive";
+ "Azure archive";
+ "AWS archive";
+ "Main archive" [shape=rectangle];
+ "Replica archive" [shape=rectangle];
+ "Azure archive" [shape=rectangle];
+ "AWS archive" [shape=rectangle];
diff --git a/docs/infrastructure/storage_site_amazon.rst b/docs/infrastructure/storage_site_amazon.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_amazon.rst
@@ -0,0 +1,9 @@
+.. _storage_amazon:
+
+Amazon storage
+==============
+
+A *softwareheritage* object storage S3 bucket is publicly accessible in the
+US-East (N. Virginia) AWS region.
+
+Data is reachable from the *s3://softwareheritage/content* URL.
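+
+Since the bucket is public, objects can in principle be fetched anonymously.
+A minimal sketch (naming objects by their hash under the *content/* prefix is
+an assumption)::
+
+   import boto3
+   from botocore import UNSIGNED
+   from botocore.config import Config
+
+   # anonymous (unsigned) access to the public bucket
+   s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
+   obj = s3.get_object(Bucket="softwareheritage", Key="content/<object hash>")
+   data = obj["Body"].read()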
diff --git a/docs/infrastructure/storage_site_azure_euwest.rst b/docs/infrastructure/storage_site_azure_euwest.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_azure_euwest.rst
@@ -0,0 +1,38 @@
+Azure Euwest
+============
+
+virtual machines
+----------------
+
+- dbreplica0: contains a read-only instance of the *softwareheritage* database
+- dbreplica1: contains a read-only instance of the *softwareheritage-indexer* database
+- kafka01 to 06
+- mirror-node-1 to 3
+- storage0
+- vangogh (vault implementation)
+- webapp0
+- worker01 to 13
+
+The PostgreSQL databases are populated using WAL streaming from *somerset*.
+
+storage accounts
+----------------
+
+16 Azure storage accounts (0euwestswh to feuwestswh) are dedicated to blob
+containers for object storage.
+The first hexadecimal digit of an account name is also the first digit of
+the content hashes it stores.
+Blobs are stored in locations of the form *6euwestswh/contents*.
+
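+For illustration, the mapping from a content hash to its Azure location could
+be computed as follows; using the full hash as the blob name is an assumption::
+
+   def azure_location(content_hash):
+       """Map a hex content hash to an (account, container, blob name) triple."""
+       account = content_hash[0] + "euwestswh"   # first hex digit selects the account
+       return account, "contents", content_hash  # blob name: assumed to be the hash
+
+   # a hash starting with "6" lands in 6euwestswh/contents
+   account, container, blob_name = azure_location("6" + "0" * 39)
+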
+Other storage accounts:
+
+- archiveeuwestswh: mirrors of dead software forges like *code.google.com*
+- swhvaultstorage: cooked archives for the *vault* server running in Azure.
+- swhcontent: object storage content (individual blobs)
+
+
+TODO: describe kafka* virtual machines
+TODO: describe mirror-node* virtual machines
+TODO: describe storage0 virtual machine
+TODO: describe webapp0 virtual machine
+TODO: describe worker* virtual machines
diff --git a/docs/infrastructure/storage_site_others.rst b/docs/infrastructure/storage_site_others.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_others.rst
@@ -0,0 +1,24 @@
+=========================================
+Other Software Heritage storage locations
+=========================================
+
+INRIA-provided storage at Rocquencourt
+======================================
+
+The *filer-backup:/swh1* NFS filesystem is used to store DAR backups.
+It is mounted on *uffizi:/srv/remote-backups*.
+
+The *uffizi:/srv/remote-backups* filesystem is regularly snapshotted and the snapshots are visible in
+*uffizi:/srv/remote-backups/.snapshot/*.
+
+Workstations
+============
+
+Staff workstations are located at INRIA Paris. The most important one from a storage
+point of view is *giverny.paris.inria.fr*, which has more than 10 TB of directly-attached
+storage, mostly used for research databases.
+
+Public website
+==============
+
+Hosted by Gandi, its storage (including WordPress) is located in one or more Gandi datacenters.
diff --git a/docs/infrastructure/storage_site_rocquencourt_physical.rst b/docs/infrastructure/storage_site_rocquencourt_physical.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_rocquencourt_physical.rst
@@ -0,0 +1,65 @@
+Physical machines at Rocquencourt
+=================================
+
+hypervisors
+-----------
+
+The :doc:`hypervisors <hypervisors>` mostly use local storage in the form of internal
+SSDs but also have access to a :ref:`Ceph cluster <ceph_cluster>`.
+
+NFS server
+----------
+
+There is only one NFS server managed by Software Heritage, *uffizi.internal.softwareheritage.org*.
+That machine is located at Rocquencourt and is directly attached to two SAS storage bays.
+
+NFS-exported data is present under these local filesystem paths::
+
+   /srv/storage/space
+   /srv/softwareheritage/objects
+
+belvedere
+---------
+
+This server is used for at least two separate PostgreSQL instances:
+
+- *softwareheritage* database (port 5433)
+- *swh-lister* and *softwareheritage-scheduler* databases (port 5434)
+
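+For illustration, the two instances above could be reached from the internal
+network with something like the following sketch; the host FQDN and the
+credentials are assumptions::
+
+   import psycopg2
+
+   main_db = psycopg2.connect(host="belvedere.internal.softwareheritage.org",
+                              port=5433, dbname="softwareheritage",
+                              user="<user>", password="<secret>")
+   scheduler_db = psycopg2.connect(host="belvedere.internal.softwareheritage.org",
+                                   port=5434, dbname="softwareheritage-scheduler",
+                                   user="<user>", password="<secret>")
+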
+Data is stored on local SSDs. The operating system resides on an LSI hardware RAID 1 volume and
+each PostgreSQL instance uses a dedicated set of drives in mdadm RAID10 volume(s).
+
+It also uses a single NFS volume:
+::
+
+ uffizi:/srv/storage/space/postgres-backups/prado
+
+banco
+-----
+
+This machine is located in its own building in Rocquencourt, along
+with a SAS storage bay.
+It is intended to serve as a backup for the main site in building 30.
+
+Elasticsearch cluster
+---------------------
+
+The :doc:`Elasticsearch cluster <elasticsearch>` only uses local storage on
+its nodes.
+
+Test / staging server
+---------------------
+
+There is also *orsay*, a refurbished machine only used for testing / staging
+new software versions.
+
+.. _ceph_cluster:
+
+Ceph cluster
+------------
+
+The Software Heritage Ceph cluster contains three nodes:
+
+- ceph-mon1
+- ceph-osd1
+- ceph-osd2
diff --git a/docs/infrastructure/storage_site_rocquencourt_virtual.rst b/docs/infrastructure/storage_site_rocquencourt_virtual.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_site_rocquencourt_virtual.rst
@@ -0,0 +1,45 @@
+Virtual machines at Rocquencourt
+================================
+
+The following virtual machines are hosted on Proxmox hypervisors located at Rocquencourt.
+All of them use local storage on their virtual hard drives.
+
+VMs without NFS mount points
+----------------------------
+
+- munin0
+- tate, used for public and private (intranet) wikis
+- getty
+- thyssen
+- jenkins-debian1.internal.softwareheritage.org
+- logstash0
+- kibana0
+- saatchi
+- louvre
+
+Containers and VMs with NFS storage
+------------------------------------
+
+- somerset.internal.softwareheritage.org is an LXC container running on *beaubourg*.
+  It serves as a host for the *softwareheritage* and *softwareheritage-indexer*
+  databases.
+
+- worker01 to worker16.internal.softwareheritage.org
+- pergamon
+- moma
+
+These VMs access one or more of the following NFS volumes located on uffizi:
+
+::
+
+ uffizi:/srv/softwareheritage/objects
+ uffizi:/srv/storage/space
+ uffizi:/srv/storage/space/annex
+ uffizi:/srv/storage/space/annex/public
+ uffizi:/srv/storage/space/antelink
+ uffizi:/srv/storage/space/oversize-objects
+ uffizi:/srv/storage/space/personal
+ uffizi:/srv/storage/space/postgres-backups/somerset
+ uffizi:/srv/storage/space/provenance-index
+ uffizi:/srv/storage/space/swh-deposit
+
diff --git a/docs/infrastructure/storage_sites.rst b/docs/infrastructure/storage_sites.rst
new file mode 100644
--- /dev/null
+++ b/docs/infrastructure/storage_sites.rst
@@ -0,0 +1,51 @@
+===============================
+Software Heritage storage sites
+===============================
+
+.. toctree::
+ :maxdepth: 2
+ :hidden:
+
+ storage_site_rocquencourt_physical
+ storage_site_rocquencourt_virtual
+ storage_site_azure_euwest
+ storage_site_amazon
+ storage_site_others
+ elasticsearch
+ hypervisors
+ object_storage
+
+Physical machines at Rocquencourt
+=================================
+
+INRIA Rocquencourt is the main Software Heritage datacenter.
+It is the only one to contain
+:doc:`directly-managed physical machines <storage_site_rocquencourt_physical>`.
+
+Virtual machines at Rocquencourt
+================================
+
+The :doc:`virtual machines at Rocquencourt <storage_site_rocquencourt_virtual>`
+are directly managed by Software Heritage staff as well and run on
+:doc:`Software Heritage hypervisors <hypervisors>`.
+
+Azure Euwest
+============
+
+Various virtual machines and other services are hosted at
+:doc:`Azure Euwest <storage_site_azure_euwest>`.
+
+Amazon S3
+=========
+
+The public *softwareheritage* S3 bucket is described in :doc:`Amazon storage <storage_site_amazon>`.
+
+Object storage
+==============
+
+Even though there are different object storage implementations in different
+locations, it has been deemed useful to gather all object storage-related
+information in a :doc:`single document <object_storage>`.
+
+Other locations
+===============
+
+Other storage locations are described in a :doc:`dedicated document <storage_site_others>`.