diff --git a/docs/index.rst b/docs/index.rst
index ff58f2e..4e0c013 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,157 +1,158 @@
 .. _swh-docs:

 Software Heritage - Development Documentation
 =============================================

 Getting started
 ---------------

 * :ref:`getting-started` ← start here to get your own Software Heritage
   platform running in less than 5 minutes, or
 * :ref:`developer-setup` ← here to hack on the Software Heritage software stack

 Architecture
 ------------

 * :ref:`architecture` ← go there to get a glimpse of the Software Heritage
   software architecture

 Components
 ----------

 Here is a brief overview of the most relevant software components in the
 Software Heritage stack. Each component name is linked to the development
 documentation of the corresponding Python module.

 :ref:`swh.core `
     low-level utilities and helpers used by almost all other modules in the stack

 :ref:`swh.dataset `
     public datasets and periodic data dumps of the archive released by
     Software Heritage

 :ref:`swh.deposit `
     push-based deposit of software artifacts to the archive

 swh.docs
     developer documentation (used to generate this doc you are reading)

 :ref:`swh.graph `
     Fast, compressed, in-memory representation of the archive, with tooling
     to generate and query it.

 :ref:`swh.indexer `
     tools and workers used to crawl the content of the archive and extract
     derived information from any artifact stored in it

 :ref:`swh.journal `
     persistent logger of changes to the archive, with publish-subscribe support

 :ref:`swh.lister `
     collection of listers for all sorts of source code hosting and
     distribution places (forges, distributions, package managers, etc.)

 :ref:`swh.loader-core `
     low-level loading utilities and helpers used by all other loaders

 :ref:`swh.loader-debian `
     loader for `Debian `_ source packages

 :ref:`swh.loader-dir `
     loader for source directories (e.g., expanded tarballs)

 :ref:`swh.loader-git `
     loader for `Git `_ repositories

 :ref:`swh.loader-mercurial `
     loader for `Mercurial `_ repositories

 :ref:`swh.loader-pypi `
     loader for `PyPI `_ source code releases

 :ref:`swh.loader-svn `
     loader for `Subversion `_ repositories

 :ref:`swh.loader-tar `
     loader for source tarballs (including Tar, ZIP and other archive formats)

 :ref:`swh.model `
     implementation of the :ref:`data-model` to archive source code artifacts

 :ref:`swh.objstorage `
     content-addressable object storage

 :ref:`swh.scheduler `
     task manager for asynchronous/delayed tasks, used for recurrent (e.g.,
     listing a forge, loading new content from a Git repository) and one-off
     activities (e.g., loading a specific version of a source package)

 :ref:`swh.storage `
     abstraction layer over the archive, allowing access to all stored source
     code artifacts as well as their metadata

 :ref:`swh.vault `
     implementation of the vault service, allowing retrieval of parts of the
     archive as self-contained bundles (e.g., individual releases, entire
     repository snapshots, etc.)

 :ref:`swh.web `
     Web application(s) to browse the archive, for both interactive (HTML UI)
     and mechanized (REST API) use

 Dependencies
 ------------

 The dependency relationships among the various modules are depicted below.

 .. _py-deps-swh:

 .. figure:: images/py-deps-swh.svg
    :width: 1024px
    :align: center

    Dependencies among top-level Python modules (click to zoom).

 Indices and tables
 ==================

 * :ref:`genindex`
 * :ref:`modindex`
 * `URLs index `_
 * :ref:`search`
 * :ref:`glossary`
 .. ensure sphinx does not complain about index files not being included

 .. toctree::
    :maxdepth: 2
    :caption: Contents:
    :titlesonly:
    :hidden:

    architecture
    getting-started
    developer-setup
    manual-setup
+   Infrastructure
    API documentation
    swh.core
    swh.dataset
    swh.deposit
    swh.graph
    swh.indexer
    swh.journal
    swh.lister
    swh.loader
    swh.model
    swh.objstorage
    swh.scheduler
    swh.storage
    swh.vault
    swh.web
diff --git a/docs/elasticsearch.rst b/docs/infrastructure/elasticsearch.rst
similarity index 92%
rename from docs/elasticsearch.rst
rename to docs/infrastructure/elasticsearch.rst
index ea402e3..d94c94d 100644
--- a/docs/elasticsearch.rst
+++ b/docs/infrastructure/elasticsearch.rst
@@ -1,36 +1,38 @@
+.. _elasticsearch:
+
 ==============
 Elasticsearch
 ==============

 Software Heritage uses an Elasticsearch cluster for long-term log storage.

 Hardware implementation
 =======================

 - 3x Xeon E3v6 (Skylake) servers with 32GB of RAM and 3x 2TB hard drives each
 - 2x gigabit switches

 List of nodes
 -------------

 * esnode1.internal.softwareheritage.org.
 * esnode2.internal.softwareheritage.org.
 * esnode3.internal.softwareheritage.org.

 Architecture diagram
 ====================

-.. graphviz:: images/elasticsearch.dot
+.. graphviz:: ../images/elasticsearch.dot

 Per-node storage
 ================

 - one root hard drive with a small filesystem
 - 3x 2TB hard drives in RAID0
 - xfs filesystem on this volume, mounted on */srv/elasticsearch*

 Remark
 ======

 The root hard drive of the Elasticsearch nodes is also used to store a
 dedicated ext4 `Kafka` filesystem, mounted on */srv/kafka*.
diff --git a/docs/infrastructure/hypervisors.rst b/docs/infrastructure/hypervisors.rst
new file mode 100644
index 0000000..403779f
--- /dev/null
+++ b/docs/infrastructure/hypervisors.rst
@@ -0,0 +1,26 @@
+===========
+Hypervisors
+===========
+
+Software Heritage uses a few hypervisors configured in a Proxmox cluster.
+
+List of Proxmox nodes
+=====================
+
+- beaubourg: Xeon E7-4809 server, 16 cores/512 GB RAM, bought in 2015
+- hypervisor3: EPYC 7301 server, 32 cores/256 GB RAM, bought in 2018
+
+Per-node storage
+================
+
+Each server has locally installed 2.5" SSDs (SAS or SATA), configured in
+mdadm RAID10 pools.
+A device mapper layer on top of these pools allows Proxmox to easily manage
+VM disk images.
+
+Network storage
+===============
+
+A :ref:`ceph_cluster` is set up as a shared storage resource.
+It can be used as temporary storage when moving VM disk images from one
+hypervisor node to another, or to store virtual machine disk images directly.
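+
+Querying the cluster
+====================
+
+For a quick overview of the cluster nodes and their local storage usage, the
+Proxmox API can be queried directly. The snippet below is only a minimal
+sketch, not part of the current tooling: it assumes the *proxmoxer* Python
+client, and the hostname and credentials are placeholders.
+
+.. code-block:: python
+
+   from proxmoxer import ProxmoxAPI  # assumed client library
+
+   # Placeholder endpoint and credentials; any cluster node can answer.
+   proxmox = ProxmoxAPI("beaubourg.internal.softwareheritage.org",
+                        user="root@pam", password="...", verify_ssl=False)
+
+   # List every node in the Proxmox cluster with its local disk usage.
+   for node in proxmox.nodes.get():
+       used_gib = node.get("disk", 0) / 2**30
+       total_gib = node.get("maxdisk", 0) / 2**30
+       print(f"{node['node']}: {used_gib:.0f} GiB used of {total_gib:.0f} GiB")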
diff --git a/docs/infrastructure/index.rst b/docs/infrastructure/index.rst
new file mode 100644
index 0000000..ef797b5
--- /dev/null
+++ b/docs/infrastructure/index.rst
@@ -0,0 +1,51 @@
+===============================
+Software Heritage storage sites
+===============================
+
+.. toctree::
+   :maxdepth: 2
+   :hidden:
+
+   storage_site_rocquencourt_physical
+   storage_site_rocquencourt_virtual
+   storage_site_azure_euwest
+   storage_site_amazon
+   storage_site_others
+   elasticsearch
+   hypervisors
+   object_storage
+
+Physical machines at Rocquencourt
+=================================
+
+INRIA Rocquencourt is the main Software Heritage datacenter.
+It is the only one to contain
+:doc:`directly-managed physical machines `.
+
+Virtual machines at Rocquencourt
+================================
+
+The :doc:`virtual machines at Rocquencourt `
+are directly managed by Software Heritage staff as well and run on
+:doc:`Software Heritage hypervisors `.
+
+Azure Euwest
+============
+
+Various virtual machines and other services are hosted at
+:doc:`Azure Euwest `.
+
+Amazon S3
+=========
+
+A public *softwareheritage* object storage bucket is hosted in
+:doc:`Amazon S3 <storage_site_amazon>`.
+
+Object storage
+==============
+
+Even though there are different object storage implementations in different
+locations, it has been deemed useful to regroup all object storage-related
+information in a :doc:`single document `.
+
+Other locations
+===============
+
+:doc:`Other locations `.
diff --git a/docs/infrastructure/object_storage.rst b/docs/infrastructure/object_storage.rst
new file mode 100644
index 0000000..413fdec
--- /dev/null
+++ b/docs/infrastructure/object_storage.rst
@@ -0,0 +1,75 @@
+==============
+Object storage
+==============
+
+There is not one but at least four different object stores directly managed
+by the Software Heritage group:
+
+- Main archive
+- Rocquencourt replica archive
+- Azure archive
+- AWS archive
+
+The Main archive
+================
+
+Hosted on *uffizi*, located in Rocquencourt.
+
+Replica archive
+===============
+
+Hosted on *banco*, located in Rocquencourt, in a different building than the
+main one.
+
+Azure archive
+=============
+
+The Azure archive uses an Azure Block Storage backend, implemented in the
+*swh.objstorage.backends.azure.AzureCloudObjStorage* Python class.
+
+Internally, that class uses the *block_blob_service* Azure API.
+
+AWS archive
+===========
+
+The AWS archive is stored in the *softwareheritage* Amazon S3 bucket, in the
+US-East (N. Virginia) region. That bucket is public.
+
+It is being continuously populated by the :ref:`content_replayer` program.
+
+Software Heritage Python programs access it using a libcloud backend.
+
+URL
+---
+
+``s3://softwareheritage/content``
+
+.. _content_replayer:
+
+content_replayer
+----------------
+
+A Python program which reads new object identifiers from Kafka and then
+copies the corresponding objects from the Banco and Uffizi object storages
+into the AWS archive.
+
+
+Implementation details
+----------------------
+
+* Uses *swh.objstorage.backends.libcloud*
+
+* Uses *libcloud.storage.drivers.s3*
+
+
+Architecture diagram
+====================
+
+.. graph:: swh_archives
+
+   "Main archive" -- "Replica archive";
+   "Azure archive";
+   "AWS archive";
+   "Main archive" [shape=rectangle];
+   "Replica archive" [shape=rectangle];
+   "Azure archive" [shape=rectangle];
+   "AWS archive" [shape=rectangle];
diff --git a/docs/infrastructure/storage_site_amazon.rst b/docs/infrastructure/storage_site_amazon.rst
new file mode 100644
index 0000000..86fd02c
--- /dev/null
+++ b/docs/infrastructure/storage_site_amazon.rst
@@ -0,0 +1,9 @@
+.. _storage_amazon:
+
+Amazon storage
+==============
+
+A *softwareheritage* object storage S3 bucket is hosted publicly in the
+US-East AWS region.
+
+Data is reachable from the *s3://softwareheritage/content* URL.
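+
+Since the bucket is public, its objects can be fetched anonymously. The
+snippet below is a minimal sketch: it uses the *boto3* client rather than the
+libcloud backend used by the Software Heritage code itself, and the exact key
+layout (a hex-encoded content hash under the ``content/`` prefix) is an
+assumption.
+
+.. code-block:: python
+
+   import boto3
+   from botocore import UNSIGNED
+   from botocore.config import Config
+
+   # Anonymous (unsigned) access to the public bucket in US-East.
+   s3 = boto3.client("s3", region_name="us-east-1",
+                     config=Config(signature_version=UNSIGNED))
+
+   # Placeholder object identifier; replace it with a real content hash.
+   object_id = "0" * 40
+   obj = s3.get_object(Bucket="softwareheritage", Key=f"content/{object_id}")
+   print(len(obj["Body"].read()), "bytes retrieved")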
diff --git a/docs/infrastructure/storage_site_azure_euwest.rst b/docs/infrastructure/storage_site_azure_euwest.rst
new file mode 100644
index 0000000..7bf85d5
--- /dev/null
+++ b/docs/infrastructure/storage_site_azure_euwest.rst
@@ -0,0 +1,38 @@
+Azure Euwest
+============
+
+virtual machines
+----------------
+
+- dbreplica0: contains a read-only instance of the *softwareheritage* database
+- dbreplica1: contains a read-only instance of the *softwareheritage-indexer* database
+- kafka01 to 06
+- mirror-node-1 to 3
+- storage0
+- vangogh (vault implementation)
+- webapp0
+- worker01 to 13
+
+The PostgreSQL databases are populated using WAL streaming from *somerset*.
+
+storage accounts
+----------------
+
+16 Azure storage accounts (0euwestswh to feuwestswh) are dedicated to blob
+containers for object storage.
+The first hexadecimal digit of an account name is also the first digit of
+the content hashes it stores.
+Blobs are stored in locations of the form *6euwestswh/contents*, as sketched
+below.
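+
+A minimal sketch of this naming rule follows; the helper name is purely
+illustrative, and the hash flavour used as the blob name (a hex-encoded hash
+string here) is an assumption.
+
+.. code-block:: python
+
+   def azure_location(hash_hex: str) -> str:
+       """Map a content hash to its Azure storage account and container."""
+       account = f"{hash_hex[0]}euwestswh"  # first hex digit selects the account
+       return f"{account}/contents/{hash_hex}"
+
+   # A hash starting with "6" lands in the 6euwestswh account:
+   print(azure_location("6f3b9d2a" + "0" * 32))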
+
+Other storage accounts:
+
+- archiveeuwestswh: mirrors of dead software forges like *code.google.com*
+- swhvaultstorage: cooked archives for the *vault* server running in Azure.
+- swhcontent: object storage content (individual blobs)
+
+
+TODO: describe kafka* virtual machines
+TODO: describe mirror-node* virtual machines
+TODO: describe storage0 virtual machine
+TODO: describe webapp0 virtual machine
+TODO: describe worker* virtual machines
diff --git a/docs/infrastructure/storage_site_others.rst b/docs/infrastructure/storage_site_others.rst
new file mode 100644
index 0000000..c47975f
--- /dev/null
+++ b/docs/infrastructure/storage_site_others.rst
@@ -0,0 +1,24 @@
+=========================================
+Other Software Heritage storage locations
+=========================================
+
+INRIA-provided storage at Rocquencourt
+======================================
+
+The *filer-backup:/swh1* NFS filesystem is used to store DAR backups.
+It is mounted on *uffizi:/srv/remote-backups*.
+
+The *uffizi:/srv/remote-backups* filesystem is regularly snapshotted and the
+snapshots are visible in *uffizi:/srv/remote-backups/.snapshot/*.
+
+Workstations
+============
+
+Staff workstations are located at INRIA Paris. The most important one from a
+storage point of view is *giverny.paris.inria.fr*, which has more than 10 TB
+of directly-attached storage, mostly used for research databases.
+
+Public website
+==============
+
+The public website is hosted by Gandi; its storage (including the WordPress
+installation) is located in one or more Gandi datacenters.
diff --git a/docs/infrastructure/storage_site_rocquencourt_physical.rst b/docs/infrastructure/storage_site_rocquencourt_physical.rst
new file mode 100644
index 0000000..5e9693c
--- /dev/null
+++ b/docs/infrastructure/storage_site_rocquencourt_physical.rst
@@ -0,0 +1,64 @@
+Physical machines at Rocquencourt
+=================================
+
+hypervisors
+-----------
+
+The :doc:`hypervisors ` mostly use local storage in the form of internal
+SSDs, but also have access to a :ref:`Ceph cluster`.
+
+NFS server
+----------
+
+There is only one NFS server managed by Software Heritage,
+*uffizi.internal.softwareheritage.org*.
+That machine is located at Rocquencourt and is directly attached to two SAS
+storage bays.
+
+NFS-exported data is present under these local filesystem paths::
+
+  /srv/storage/space
+  /srv/softwareheritage/objects
+
+belvedere
+---------
+
+This server is used for at least two separate PostgreSQL instances:
+
+- *softwareheritage* database (port 5433)
+- *swh-lister* and *softwareheritage-scheduler* databases (port 5434)
+
+Data is stored on local SSDs. The operating system lies on an LSI hardware
+RAID 1 volume and each PostgreSQL instance uses a dedicated set of drives in
+mdadm RAID10 volume(s).
+
+It also uses a single NFS volume::
+
+  uffizi:/srv/storage/space/postgres-backups/prado
+
+banco
+-----
+
+This machine is located in its own building in Rocquencourt, along
+with a SAS storage bay.
+It is intended to serve as a backup for the main site in building 30.
+
+Elasticsearch cluster
+---------------------
+
+The :doc:`Elasticsearch cluster ` only uses local storage on
+its nodes.
+
+Test / staging server
+---------------------
+
+There is also *orsay*, a refurbished machine used only for testing / staging
+new software versions.
+
+.. _ceph_cluster:
+
+Ceph cluster
+------------
+
+The Software Heritage Ceph cluster contains three nodes:
+
+- ceph-mon1
+- ceph-osd1
+- ceph-osd2
diff --git a/docs/infrastructure/storage_site_rocquencourt_virtual.rst b/docs/infrastructure/storage_site_rocquencourt_virtual.rst
new file mode 100644
index 0000000..99bf1a7
--- /dev/null
+++ b/docs/infrastructure/storage_site_rocquencourt_virtual.rst
@@ -0,0 +1,43 @@
+Virtual machines at Rocquencourt
+================================
+
+The following virtual machines are hosted on Proxmox hypervisors located at
+Rocquencourt.
+All of them use local storage for their virtual hard drives.
+
+VMs without NFS mount points
+----------------------------
+
+- munin0
+- tate, used for public and private (intranet) wikis
+- getty
+- thyssen
+- jenkins-debian1.internal.softwareheritage.org
+- logstash0
+- kibana0
+- saatchi
+- louvre
+
+Containers and VMs with NFS storage
+-----------------------------------
+
+- somerset.internal.softwareheritage.org is an LXC container running on
+  *beaubourg*.
+  It serves as a host for the *softwareheritage* and *softwareheritage-indexer*
+  databases.
+
+- worker01 to worker16.internal.softwareheritage.org
+- pergamon
+- moma
+
+These VMs access one or more of these NFS volumes located on uffizi::
+
+   uffizi:/srv/softwareheritage/objects
+   uffizi:/srv/storage/space
+   uffizi:/srv/storage/space/annex
+   uffizi:/srv/storage/space/annex/public
+   uffizi:/srv/storage/space/antelink
+   uffizi:/srv/storage/space/oversize-objects
+   uffizi:/srv/storage/space/personal
+   uffizi:/srv/storage/space/postgres-backups/somerset
+   uffizi:/srv/storage/space/provenance-index
+   uffizi:/srv/storage/space/swh-deposit
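+
+As a quick sanity check from one of these VMs, the expected NFS mounts can be
+verified by parsing */proc/mounts*. This is only a minimal sketch, not an
+existing tool; the exact source names (short hostname vs. FQDN) depend on how
+the volumes are mounted.
+
+.. code-block:: python
+
+   # Check that the uffizi NFS exports this VM is supposed to use are mounted.
+   EXPECTED = [
+       "uffizi:/srv/softwareheritage/objects",
+       "uffizi:/srv/storage/space",
+   ]
+
+   with open("/proc/mounts") as mounts:
+       mounted = {line.split()[0] for line in mounts}
+
+   for export in EXPECTED:
+       print(export, "mounted" if export in mounted else "MISSING")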