diff --git a/docs/index.rst b/docs/index.rst
index 9d06f64..ccfb1be 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,162 +1,161 @@
 .. _swh-docs:
 
 Software Heritage - Development Documentation
 =============================================
 
 Getting started
 ---------------
 
 * :ref:`getting-started` ← start here to get your own Software Heritage
   platform running in less than 5 minutes, or
 * :ref:`developer-setup` ← here to hack on the Software Heritage software stack
 
 Architecture
 ------------
 
 * :ref:`architecture` ← go there to get a glimpse of the Software Heritage
   software architecture
 
 Components
 ----------
 
 Here is a brief overview of the most relevant software components in the
 Software Heritage stack. Each component name is linked to the development
 documentation of the corresponding Python module.
 
 :ref:`swh.core `
   low-level utilities and helpers used by almost all other modules in the stack
 
 :ref:`swh.dataset `
   public datasets and periodic data dumps of the archive released by
   Software Heritage
 
 :ref:`swh.deposit `
   push-based deposit of software artifacts to the archive
 
 swh.docs
   developer documentation (used to generate this doc you are reading)
 
 :ref:`swh.fuse `
   virtual file system to browse the Software Heritage archive, based on
   `FUSE `_
 
 :ref:`swh.graph `
   fast, compressed, in-memory representation of the archive, with tooling to
   generate and query it
 
 :ref:`swh.indexer `
   tools and workers used to crawl the content of the archive and extract
   derived information from any artifact stored in it
 
 :ref:`swh.journal `
   persistent logger of changes to the archive, with publish-subscribe support
 
 :ref:`swh.lister `
   collection of listers for all sorts of source code hosting and distribution
   places (forges, distributions, package managers, etc.)
 :ref:`swh.loader-core `
   low-level loading utilities and helpers used by all other loaders
 
 :ref:`swh.loader-git `
   loader for `Git `_ repositories
 
 :ref:`swh.loader-mercurial `
   loader for `Mercurial `_ repositories
 
 :ref:`swh.loader-svn `
   loader for `Subversion `_ repositories
 
 :ref:`swh.model `
   implementation of the :ref:`data-model` to archive source code artifacts
 
 :ref:`swh.objstorage `
   content-addressable object storage
 
 :ref:`swh.objstorage.replayer `
   object storage replication tool
 
 :ref:`swh.scanner `
   source code scanner to analyze code bases and compare them with source code
   artifacts archived by Software Heritage
 
 :ref:`swh.scheduler `
   task manager for asynchronous/delayed tasks, used for both recurrent
   activities (e.g., listing a forge, loading new content from a Git
   repository) and one-off ones (e.g., loading a specific version of a
   source package)
 
 :ref:`swh.storage `
   abstraction layer over the archive, allowing access to all stored source
   code artifacts as well as their metadata
 
 :ref:`swh.vault `
   implementation of the vault service, allowing parts of the archive to be
   retrieved as self-contained bundles (e.g., individual releases, entire
   repository snapshots, etc.)
 
 :ref:`swh.web `
   web application(s) to browse the archive, for both interactive (HTML UI)
   and mechanized (REST API) use
 
 :ref:`swh.web.client `
   Python client for :ref:`swh.web `
 
 Dependencies
 ------------
 
 The dependency relationships among the various modules are depicted below.
 
 .. _py-deps-swh:
 
 .. figure:: images/py-deps-swh.svg
    :width: 1024px
    :align: center
 
    Dependencies among top-level Python modules (click to zoom).
 
 Indices and tables
 ==================
 
 * :ref:`genindex`
 * :ref:`modindex`
 * `URLs index `_
 * :ref:`search`
 * :ref:`glossary`
 
 .. ensure sphinx does not complain about index files not being included
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
    :titlesonly:
    :hidden:
 
    architecture
    getting-started
    developer-setup
-   Infrastructure
    API documentation
    swh.core
    swh.dataset
    swh.deposit
    swh.fuse
    swh.graph
    swh.indexer
    swh.journal
    swh.lister
    swh.loader
    swh.model
    swh.objstorage
    swh.scanner
    swh.scheduler
    swh.storage
    swh.vault
    swh.web
    swh.web.client
diff --git a/docs/infrastructure/elasticsearch.rst b/docs/infrastructure/elasticsearch.rst
deleted file mode 100644
index d94c94d..0000000
--- a/docs/infrastructure/elasticsearch.rst
+++ /dev/null
@@ -1,38 +0,0 @@
-.. _elasticsearch:
-
-==============
-Elasticsearch
-==============
-
-Software Heritage uses an Elasticsearch cluster for long-term log storage.
-
-Hardware implementation
-=======================
-
-- 3x Xeon E3v6 (Skylake) servers with 32GB of RAM and 3x 2TB of hard drives each
-- 2x gigabit switches
-
-List of nodes
--------------
-
-* esnode1.internal.softwareheritage.org.
-* esnode2.internal.softwareheritage.org.
-* esnode3.internal.softwareheritage.org.
-
-Architecture diagram
-====================
-
-.. graphviz:: ../images/elasticsearch.dot
-
-Per-node storage
-================
-
-- one root hard drive with a small filesystem
-- 3x 2TB hard drives in RAID0
-- an xfs filesystem on this volume, mounted on */srv/elasticsearch*
-
-Remark
-======
-
-The root hard drive of the Elasticsearch nodes is also used to
-store a dedicated ext4 `Kafka` filesystem mounted on */srv/kafka*.
diff --git a/docs/infrastructure/hypervisors.rst b/docs/infrastructure/hypervisors.rst
deleted file mode 100644
index 05fe26a..0000000
--- a/docs/infrastructure/hypervisors.rst
+++ /dev/null
@@ -1,29 +0,0 @@
-===========
-Hypervisors
-===========
-
-Software Heritage uses a few hypervisors configured in a Proxmox cluster.
-
-List of Proxmox nodes
-=====================
-
-- beaubourg: Xeon E7-4809 server, 16 cores/512 GB RAM, bought in 2015
-- hypervisor3: EPYC 7301 server, 32 cores/256 GB RAM, bought in 2018
-- orsay: Opteron 6172, 48 cores/128 GB RAM, refurbished (2010 vintage)
-
-Orsay is not a production machine; its purpose is to run throw-away
-development/staging VMs.
-
-Per-node storage
-================
-
-Each server has physically installed 2.5" SSDs (SAS or SATA), configured
-in mdadm RAID10 pools.
-A device mapper layer on top of these pools allows Proxmox to easily manage
-VM disk images.
-
-Network storage
-===============
-
-A :ref:`ceph_cluster` is set up as a shared storage resource.
-It can be used to temporarily transfer VM disk images from one hypervisor
-node to another, or to directly store virtual machine disk images.
diff --git a/docs/infrastructure/index.rst b/docs/infrastructure/index.rst
deleted file mode 100644
index ef797b5..0000000
--- a/docs/infrastructure/index.rst
+++ /dev/null
@@ -1,51 +0,0 @@
-===============================
-Software Heritage storage sites
-===============================
-
-.. toctree::
-   :maxdepth: 2
-   :hidden:
-
-   storage_site_rocquencourt_physical
-   storage_site_rocquencourt_virtual
-   storage_site_azure_euwest
-   storage_site_amazon
-   storage_site_others
-   elasticsearch
-   hypervisors
-   object_storage
-
-Physical machines at Rocquencourt
-=================================
-
-INRIA Rocquencourt is the main Software Heritage datacenter.
-It is the only one to contain
-:doc:`directly-managed physical machines `.
-
-Virtual machines at Rocquencourt
-================================
-
-The :doc:`virtual machines at Rocquencourt `
-are also directly managed by Software Heritage staff and run on
-:doc:`Software Heritage hypervisors `.
-
-Azure Euwest
-============
-
-Various virtual machines and other services are hosted in
-:doc:`Azure Euwest `.
-
-Amazon S3
-=========
-
-Object storage
-==============
-
-Even though there are different object storage implementations in different
-locations, it has been deemed useful to regroup all object storage-related
-information in a :doc:`single document `.
-
-Other locations
-===============
-
-:doc:`Other locations `.
diff --git a/docs/infrastructure/object_storage.rst b/docs/infrastructure/object_storage.rst
deleted file mode 100644
index ab82e8e..0000000
--- a/docs/infrastructure/object_storage.rst
+++ /dev/null
@@ -1,76 +0,0 @@
-==============
-Object storage
-==============
-
-There is not one but at least four different object stores directly managed
-by the Software Heritage group:
-
-- Main archive
-- Rocquencourt replica archive
-- Azure archive
-- AWS archive
-
-The Main archive
-================
-
-Uffizi, located in Rocquencourt.
-
-Replica archive
-===============
-
-Banco, located in Rocquencourt, in a different building than the main one.
-
-Azure archive
-=============
-
-The Azure archive uses an Azure Block Storage backend, implemented in the
-*swh.objstorage.backends.azure.AzureCloudObjStorage* Python class.
-
-Internally, that class uses the *block_blob_service* Azure API.
-
-AWS archive
-===========
-
-The AWS archive is stored in the *softwareheritage* Amazon S3 bucket, in the
-US East (N. Virginia) region. That bucket is public.
-
-It is being continuously populated by the :ref:`content_replayer` program.
-
-Software Heritage Python programs access it using a libcloud backend.
-
-URL
----
-
-``s3://softwareheritage/content``
-
-.. _content_replayer:
-
-content_replayer
-----------------
-
-A Python program which reads new objects from Kafka and then copies them
-from the object storages on Banco and Uffizi to the S3 bucket.
-
-Implementation details
-----------------------
-
-* Uses *swh.objstorage.backends.libcloud*
-* Uses *libcloud.storage.drivers.s3*
-
-Architecture diagram
-====================
-
-.. graph:: swh_archives
-
-   "Main archive" -- "Replica archive";
-   "Azure archive";
-   "AWS archive";
-   "Main archive" [shape=rectangle];
-   "Replica archive" [shape=rectangle];
-   "Azure archive" [shape=rectangle];
-   "AWS archive" [shape=rectangle];
diff --git a/docs/infrastructure/storage_site_amazon.rst b/docs/infrastructure/storage_site_amazon.rst
deleted file mode 100644
index 86fd02c..0000000
--- a/docs/infrastructure/storage_site_amazon.rst
+++ /dev/null
@@ -1,9 +0,0 @@
-.. _storage_amazon:
-
-Amazon storage
-==============
-
-A *softwareheritage* object storage S3 bucket is hosted publicly in the
-US East AWS region.
-
-Data is reachable from the *s3://softwareheritage/content* URL.
diff --git a/docs/infrastructure/storage_site_azure_euwest.rst b/docs/infrastructure/storage_site_azure_euwest.rst
deleted file mode 100644
index 249c00e..0000000
--- a/docs/infrastructure/storage_site_azure_euwest.rst
+++ /dev/null
@@ -1,32 +0,0 @@
-Azure Euwest
-============
-
-virtual machines
-----------------
-
-- dbreplica0: contains a read-only instance of the *softwareheritage* database
-- dbreplica1: contains a read-only instance of the *softwareheritage-indexer*
-  database
-- kafka01 to 06: journal nodes
-- mirror-node-1 to 3
-- storage0: storage and object storage services used by the Azure workers
-- vangogh: vault service and r/w database for the vault workers
-- webapp0: webapp mirror using storage0 services to expose results
-- worker01 to 10 and worker13: indexer workers
-- worker11 to 12: vault workers (cooking)
-
-The PostgreSQL databases are populated using WAL streaming from *somerset*.
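Since the *softwareheritage* bucket is public, the ``s3://softwareheritage/content`` URL mentioned for the AWS archive above can also be read anonymously over HTTPS, using the standard S3 virtual-hosted addressing style. A minimal sketch (the helper name is hypothetical, not part of the swh codebase):

```python
from urllib.parse import urlparse

def s3_to_https(s3_url: str) -> str:
    """Convert an s3://bucket/key URL to its virtual-hosted-style HTTPS
    equivalent, usable for anonymous reads on a public bucket."""
    parts = urlparse(s3_url)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    # the bucket becomes the hostname; the key keeps its leading slash
    return f"https://{parts.netloc}.s3.amazonaws.com{parts.path}"
```

For example, ``s3://softwareheritage/content`` maps to ``https://softwareheritage.s3.amazonaws.com/content``.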
-
-storage accounts
-----------------
-
-16 Azure storage accounts (0euwestswh to feuwestswh) are dedicated to blob
-containers for object storage.
-The first hexadecimal digit of an account name is also the first digit of
-its content hashes.
-Blobs are stored in location names of the form *6euwestswh/contents*.
-
-Other storage accounts:
-
-- archiveeuwestswh: mirrors of dead software forges like *code.google.com*
-- swhvaultstorage: cooked archives for the *vault* server running in Azure
-- swhcontent: object storage content (individual blobs)
diff --git a/docs/infrastructure/storage_site_others.rst b/docs/infrastructure/storage_site_others.rst
deleted file mode 100644
index c47975f..0000000
--- a/docs/infrastructure/storage_site_others.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-=========================================
-Other Software Heritage storage locations
-=========================================
-
-INRIA-provided storage at Rocquencourt
-======================================
-
-The *filer-backup:/swh1* NFS filesystem is used to store DAR backups.
-It is mounted on *uffizi:/srv/remote-backups*.
-
-The *uffizi:/srv/remote-backups* filesystem is regularly snapshotted; the
-snapshots are visible in *uffizi:/srv/remote-backups/.snapshot/*.
-
-Workstations
-============
-
-Staff workstations are located at INRIA Paris. The most important one from a
-storage point of view is *giverny.paris.inria.fr*, which has more than 10 TB
-of directly-attached storage, mostly used for research databases.
-
-Public website
-==============
-
-Hosted by Gandi, its storage (including Wordpress) is located in one or more
-Gandi datacenters.
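The hash-prefix sharding described above for the Azure object storage (sixteen accounts, *0euwestswh* to *feuwestswh*, selected by the first hexadecimal digit of the content hash, each holding a *contents* blob container) can be sketched as follows; the helper names are illustrative, not part of the swh codebase:

```python
def azure_account_for(content_hash: str) -> str:
    """Pick the Azure storage account holding a given content hash:
    the first hex digit of the hash selects one of the sixteen
    0euwestswh..feuwestswh accounts."""
    digit = content_hash[0].lower()
    if digit not in "0123456789abcdef":
        raise ValueError(f"not a hex digit: {digit!r}")
    return f"{digit}euwestswh"

def azure_blob_location(content_hash: str) -> str:
    """Return the account/container location, e.g. 6euwestswh/contents."""
    return f"{azure_account_for(content_hash)}/contents"
```

A hash starting with ``6`` thus lands in *6euwestswh/contents*, matching the example given in the text.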
diff --git a/docs/infrastructure/storage_site_rocquencourt_physical.rst b/docs/infrastructure/storage_site_rocquencourt_physical.rst
deleted file mode 100644
index 1c4bbc8..0000000
--- a/docs/infrastructure/storage_site_rocquencourt_physical.rst
+++ /dev/null
@@ -1,64 +0,0 @@
-Physical machines at Rocquencourt
-=================================
-
-hypervisors
------------
-
-The :doc:`hypervisors ` mostly use local storage in the form of internal
-SSDs, but also have access to a :ref:`ceph_cluster`.
-
-NFS server
-----------
-
-There is only one NFS server managed by Software Heritage,
-*uffizi.internal.softwareheritage.org*.
-That machine is located at Rocquencourt and is directly attached to two SAS
-storage bays.
-
-NFS-exported data is present under these local filesystem paths::
-
-   /srv/storage/space
-   /srv/softwareheritage/objects
-
-belvedere
----------
-
-This server is used for at least two separate PostgreSQL instances:
-
-- the *softwareheritage* database (port 5433)
-- the *swh-lister* and *softwareheritage-scheduler* databases (port 5434)
-
-Data is stored on local SSDs. The operating system lies on an LSI hardware
-RAID 1 volume, and each PostgreSQL instance uses a dedicated set of drives
-in mdadm RAID10 volume(s).
-
-It also uses a single NFS volume::
-
-   uffizi:/srv/storage/space/postgres-backups/prado
-
-banco
------
-
-This machine is located in its own building at Rocquencourt, along with a
-SAS storage bay.
-It is intended to serve as a backup for the main site in building 30.
-
-Elasticsearch cluster
----------------------
-
-The :doc:`Elasticsearch cluster ` only uses local storage on its nodes.
-
-Test / staging server
----------------------
-
-There is also *orsay*, a refurbished machine only used for testing / staging
-new software versions.
-
-.. _ceph_cluster:
-
-Ceph cluster
-------------
-
-The Software Heritage Ceph cluster contains three nodes:
-
-- ceph-mon1
-- ceph-osd1
-- ceph-osd2
diff --git a/docs/infrastructure/storage_site_rocquencourt_virtual.rst b/docs/infrastructure/storage_site_rocquencourt_virtual.rst
deleted file mode 100644
index 664f735..0000000
--- a/docs/infrastructure/storage_site_rocquencourt_virtual.rst
+++ /dev/null
@@ -1,43 +0,0 @@
-Virtual machines at Rocquencourt
-================================
-
-The following virtual machines are hosted on Proxmox hypervisors located at
-Rocquencourt.
-All of them use local storage for their virtual hard drives.
-
-VMs without NFS mount points
-----------------------------
-
-- munin0
-- tate, used for public and private (intranet) wikis
-- getty
-- thyssen
-- jenkins-debian1.internal.softwareheritage.org
-- logstash0
-- kibana0
-- saatchi
-- louvre
-
-Containers and VMs with NFS storage
------------------------------------
-
-- somerset.internal.softwareheritage.org is an lxc container running on
-  *beaubourg*.
-  It serves as a host for the *softwareheritage* and *softwareheritage-indexer*
-  databases.
-
-- worker01 to worker16.internal.softwareheritage.org: loader and lister
-  workers
-- pergamon: internal system administration services (puppet master, grafana,
-  DNS resolver, etc.)
-- moma: webapp and deposit services exposed publicly
-
-These VMs access one or more of these NFS volumes located on uffizi::
-
-   uffizi:/srv/softwareheritage/objects
-   uffizi:/srv/storage/space
-   uffizi:/srv/storage/space/annex
-   uffizi:/srv/storage/space/annex/public
-   uffizi:/srv/storage/space/antelink
-   uffizi:/srv/storage/space/oversize-objects
-   uffizi:/srv/storage/space/personal
-   uffizi:/srv/storage/space/postgres-backups/somerset
-   uffizi:/srv/storage/space/provenance-index
-   uffizi:/srv/storage/space/swh-deposit