[Parent task for all related tasks]
# Current status
A [scale out object storage](https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html) design was proposed. It still has to be described in detail, and benchmarks need to be written to verify that it is efficient (in both space and speed) for the intended use cases. The hardware to run the benchmarks has to be specified and secured.
# Design
The design scales out for readers and scales up for writers: the Object HASH index is what prevents writes from scaling out (see below).
## Glossary
* Object: an opaque sequence of bytes.
* Object HASH (HASH for short): the hash of an Object, i.e. the intrinsic identifier in a [SWHID](https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers).
* Shard: a group of Objects.
* Shard [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier) (UUID for short): the unique identifier of a Shard.
* Object ID (ID for short): the Object HASH plus the UUID of the Shard containing the Object.
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.
* Write Storage: the fixed size storage from which clients can read or write Objects. If an Object is not found in the Write Storage, it must be retrieved from the Read Storage.
* Object Storage: the content of the Write Storage and the Read Storage combined.
* [LSM](https://en.wikipedia.org/wiki/Log-structured_merge-tree): an embedded database for key-value data, such as [RocksDB](https://en.wikipedia.org/wiki/RocksDB).
* [SST](https://forge.softwareheritage.org/T3068): a readonly Sorted String Table that consists of a sequence of key/value pairs and an index.
* Database: [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL), [Cassandra](https://en.wikipedia.org/wiki/Apache_Cassandra), etc.
* [Ceph](https://en.wikipedia.org/wiki/Ceph_(software)): a self-healing distributed storage.
* [RBD](https://docs.ceph.com/en/latest/rbd/) image: a Ceph block storage that can be used either via the librbd library or as a block device from /dev/rbd.
The key concepts are:
* Packing millions of Objects together in Shards to:
  * save space and
  * efficiently perform bulk actions such as mirroring or enumerations.
* Two different storages:
  * a Read Storage that takes advantage of the fact that Objects are immutable and never deleted, and
  * a Write Storage from which Shards are moved to the Read Storage when they are full.
* Identifying an Object by its Object HASH and the UUID of the Shard that contains it, so that its location can be determined from the Object ID alone.
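To make the Object ID concrete, here is a minimal sketch in Python; the class name and fields are illustrative, not an existing swh API:

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectId:
    """Object ID: the Object HASH plus the UUID of the containing Shard."""

    hash: bytes       # SHA256 of the Object content (32 bytes)
    shard: uuid.UUID  # UUID of the Shard that contains the Object

    def __str__(self) -> str:
        # One possible serialization, matching the UUID:SHA256 notation
        # used by the enumeration use case below.
        return f"{self.shard}:{self.hash.hex()}"
```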
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects with their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.

The content of the Object Storage (i.e. the Write Storage and the Read Storage combined) is **strongly consistent** at any point in time. As soon as an Object is written (i.e. the write operation returns to the caller), a reader can get the Object content from the Object Storage (it may require looking up the Object in both the Write Storage and the Read Storage).

The Read Storage is **eventually consistent**. It does not contain the latest Objects inserted in the Write Storage, but it will, eventually. It contains all Objects inserted in the Object Storage up to a given date.
## Objects lookup requires the Object ID
### Architecture
* Write Storage:
  * A fixed number of Databases
* Read Storage:
  * Shards implemented as Ceph RBD images named after their Shard UUID
  * The content of each Shard is formatted as an SST
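To illustrate what "formatted as an SST" could look like, here is a minimal sketch of a Shard writer (sorted key/value records, then a fixed-size index, then a footer pointing at the index). This layout is an assumption for illustration only; the actual on-disk format is discussed in [T3068](https://forge.softwareheritage.org/T3068):

```python
import struct


def write_shard(entries: dict, out) -> None:
    """Write a minimal SST-like Shard: sorted (SHA256 -> object) records,
    followed by an index of (SHA256, offset, size) entries and a footer
    giving the index offset, so a reader can binary search the index."""
    index = []
    for key in sorted(entries):              # keys are 32-byte SHA256 digests
        data = entries[key]
        index.append((key, out.tell(), len(data)))
        out.write(key)
        out.write(data)
    index_offset = out.tell()
    for key, offset, size in index:          # fixed-size index entries
        out.write(key + struct.pack("<QQ", offset, size))
    out.write(struct.pack("<Q", index_offset))
```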
### Writing
The Object is stored in one of the Databases of the Write Storage, chosen at random. Each Database has a current Shard UUID, also chosen at random, which is associated with all Objects written to that Database.
A successful Object write returns the Object ID. Writing the same Object twice may return different Object IDs: the Object HASH will be the same because it is derived from the content, but the Shard in which the Object is stored, and therefore the Shard UUID, may differ.
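A sketch of this write path, reusing `ObjectId` from the sketch above; `db.put` and `db.current_shard` are hypothetical Database primitives, not an existing API:

```python
import hashlib
import random


def write_object(databases: list, content: bytes) -> ObjectId:
    """Store an Object in the Write Storage and return its Object ID."""
    db = random.choice(databases)           # the Database is chosen at random
    h = hashlib.sha256(content).digest()    # the Object HASH
    # db.current_shard is the Shard UUID currently associated with this
    # Database: it becomes part of the Object ID.
    db.put(hash=h, shard=db.current_shard, content=content)
    return ObjectId(hash=h, shard=db.current_shard)
```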
### Packing
When a Database grows bigger than a threshold (for instance 100GB), its current Shard UUID is retired and a new one is allocated at random. A Shard is created in the Read Storage and the Objects that belong to the retired Shard UUID are sorted and copied to it. When the Shard is complete, those Objects are removed from the Database.
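A sketch of packing under the same hypothetical interfaces (`db.size`, `db.objects_in_shard`, `db.delete_shard`, and `read_storage.create_shard` standing for RBD image creation), reusing `write_shard` from the sketch above:

```python
import uuid

SHARD_THRESHOLD = 100 * 1024**3  # e.g. 100GB


def maybe_pack(db, read_storage) -> None:
    """Move the Objects of the retired Shard to the Read Storage once the
    Database exceeds the size threshold."""
    if db.size() < SHARD_THRESHOLD:
        return
    retired = db.current_shard
    db.current_shard = uuid.uuid4()          # allocate a fresh Shard UUID
    # Collect the Objects of the retired Shard and write them, sorted by
    # Object HASH, to a new Shard (an RBD image) in the Read Storage.
    entries = dict(db.objects_in_shard(retired))
    with read_storage.create_shard(retired) as shard:
        write_shard(entries, shard)
    db.delete_shard(retired)                 # only once the Shard is complete
```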
### Reading
The Shard UUID is extracted from the Object ID. If the Shard exists in the Read Storage, the Object HASH is used to look up the Object content there. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content. A reader that is not interested in the most up to date content can limit its search to the Read Storage.
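A sketch of the read path; `has_shard`, `get` and `database_for` are hypothetical Storage primitives:

```python
def read_object(object_id: ObjectId, read_storage, write_storage,
                latest: bool = True) -> bytes:
    """Resolve an Object ID against the Read Storage first, then the
    Write Storage (unless the caller is happy with packed content only)."""
    if read_storage.has_shard(object_id.shard):
        return read_storage.get(object_id.shard, object_id.hash)
    if not latest:
        raise KeyError(object_id)            # search limited to the Read Storage
    db = write_storage.database_for(object_id.shard)  # owner of the Shard UUID
    return db.get(object_id.hash)
```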
## Objects lookup with the Object HASH
An index mapping the Object HASH of every known Object to its Shard UUID is used to:
* allow clients to fetch Objects using their Object HASH alone instead of their full Object ID, and
* deduplicate identical Objects based on their Object HASH.
### Architecture
* Write Storage:
  * The index of Object IDs from both the Write Storage and the Read Storage is stored in an LSM.
* Read Storage:
  * The index of Object IDs from the Read Storage is stored in an LSM.
  * Copies of the index are stored in Ceph RBD images, formatted as SSTs.
### Writing
If the Object HASH already exists in the Read Storage index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage index.
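A sketch of this deduplicating write, reusing `write_object` above; the indexes are assumed to expose a `get`/`put` mapping from Object HASH to Shard UUID:

```python
import hashlib


def write_with_index(databases, read_index, write_index, content: bytes):
    """Skip the write when the HASH is already in the Read Storage index,
    otherwise write and record the new Object ID in the Write Storage index."""
    h = hashlib.sha256(content).digest()
    shard = read_index.get(h)                # Object HASH -> Shard UUID, or None
    if shard is not None:
        return ObjectId(hash=h, shard=shard) # duplicate: nothing to write
    object_id = write_object(databases, content)
    write_index.put(h, object_id.shard)
    return object_id
```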
### Packing
During packing, each Object HASH is looked up in the Read Storage index. If it exists there, the Object is discarded as a duplicate; otherwise its Object ID is inserted in the index. When packing is complete:
* the copies of the Read Storage index are updated with the newly added Object IDs, and
* the Object HASHes that were found to be duplicates are updated in the Write Storage index, using the Shard UUID found in the Read Storage index.
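A sketch of this deduplication pass, with the same hypothetical index interface; `update_copies` stands for refreshing the SST copies stored in Ceph:

```python
def pack_with_index(shard_uuid, objects, read_index, write_index) -> dict:
    """Deduplicate the Objects of a Shard against the Read Storage index
    while packing; `objects` iterates over (hash, content) pairs."""
    kept = {}
    duplicates = {}
    for h, content in objects:
        existing = read_index.get(h)
        if existing is not None:
            duplicates[h] = existing         # keep the Shard UUID already packed
        else:
            kept[h] = content
            read_index.put(h, shard_uuid)
    # When packing is complete, propagate the changes:
    read_index.update_copies()               # refresh the SST copies in Ceph
    for h, shard in duplicates.items():
        write_index.put(h, shard)            # repoint duplicates in the Write Storage index
    return kept                              # the entries to pack into the Shard
```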
### Reading
If the Object HASH is found in the Read Storage index, the associated Shard UUID is used to read the content from the Read Storage. Otherwise the Shard UUID is looked up in the Write Storage index and the read is performed with the resulting Object ID.
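A sketch of the HASH-only read path, combining the two indexes with `read_object` from the section above:

```python
def read_by_hash(h: bytes, read_index, write_index,
                 read_storage, write_storage) -> bytes:
    """Fetch an Object given only its Object HASH."""
    shard = read_index.get(h)                # packed Objects: Read Storage index
    if shard is not None:
        return read_storage.get(shard, h)
    shard = write_index.get(h)               # not yet packed: Write Storage index
    if shard is None:
        raise KeyError(h.hex())
    return read_object(ObjectId(hash=h, shard=shard),
                       read_storage, write_storage)
```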
# Use cases
### API implementation
* [swh-objstorage](https://docs.softwareheritage.org/devel/apidoc/swh.objstorage.objstorage.html)
* [swh-web API](https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html)
### Adding objects
* A loader:
  * gets a source code file from the net
  * calculates the SHA256 of the content
  * looks up the SHA256 in the index and stops if it already exists
  * otherwise writes the object to the Write Storage
  * publishes the SWHID of the newly added object to the Kafka bus
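A minimal sketch of this loader flow; `index`, `publish` and `swhid_of` are hypothetical stand-ins (e.g. a Kafka producer callback and a SWHID helper), and `write_object` comes from the Design section:

```python
import hashlib


def load_file(content: bytes, index, databases, publish, swhid_of) -> None:
    """Loader flow: hash, dedup check, write, then announce on the bus."""
    h = hashlib.sha256(content).digest()
    if index.get(h) is not None:
        return                               # already archived: stop here
    write_object(databases, content)
    publish(swhid_of(content))               # announce the newly added object
```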
### Reading objects
* Serving objects from the web interface:
  * GET the object using the SHA256
* Reconstruct an archived tarball or repo for the "vault":
  * Identical to serving objects from the web interface.
* Push objects mirroring:
  * Takes the UUID of one of the remaining Shards at random from the Read Storage
  * For each object in the Shard, sends the object to the mirror via S3
  * When done, adds the UUID of the Shard to the list of Shards already mirrored.
* Pull file mirroring:
  * The files are available at mirror.softwareheritage.com as mapped RBD devices /dev/rbd/readonly/UUID
  * Pick a UUID not already mirrored locally
  * Create an RBD image with the same UUID
  * GET the shard and write it to /dev/rbd/readonly/UUID
* Full enumeration for mining purposes:
  * Examples:
    * find all files whose names are LICENSE/COPYRIGHT/COPYING/etc. pointed to by any commit in the archive and retrieve them
    * find all archived files referenced by any commit in the top-1k (by number of stars) GitHub repos
  * Start the enumeration from UUID:SHA256, up to N objects
  * N objects are returned starting from UUID:SHA256, in order
  * The client is responsible for keeping track of the UUID:SHA256 from which to resume the enumeration
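A sketch of the enumeration contract described in the last item above; `storage.list_from` is a hypothetical primitive returning at most `n` objects ordered by UUID:SHA256, and the cursor serialization matches `ObjectId.__str__` from the Design section:

```python
def enumerate_objects(storage, cursor: str, n: int):
    """Return up to n objects starting from a UUID:SHA256 cursor, in order,
    together with the cursor from which the client can resume."""
    shard_uuid, _, sha256 = cursor.partition(":")
    results = storage.list_from(shard_uuid, bytes.fromhex(sha256), limit=n)
    next_cursor = str(results[-1].id) if results else cursor
    return results, next_cursor


# The client keeps next_cursor between calls:
# objects, cursor = enumerate_objects(storage, cursor, 1000)
```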
# Explorations
* Scale out data and metadata
* T3064 [[ https://github.com/linkedin/ambry | ambry ]]
* T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
* ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Object packing
* T3066 [RocksDB SST](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
* [ambry partition format](https://forge.softwareheritage.org/T3064) (append only)
* T3068 [Sorted String Table](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf) (read only)
* T3050 libcephsqlite or SQlite on top of RBD (read write)
* T3046 Using xz-file-format for 1TB archive
* T3045 Using pixz for 1TB archives
* T3048 Using a custom format for 1TB archive
* T3069 Using MZ as a file format
* Scale out data and scale up metadata. The metadata is in a database (RocksDB, etc.) that must be looked up to figure out where the data is to be found, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf | Finding a needle in Haystack: Facebook’s photo storage ]].
* T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
* T3051 EOS is too complex (uses RBD + Paxos + QuarkDB for namespace)
* T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
* https://github.com/open-io replication is a proprietary feature https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html
* https://ipfs.io/ does not provide replication or self-healing. Performance and space overhead are probably the same as with the current Software Heritage storage system.
* https://www.rozosystems.com/about claims a software patent on the implementation
* http://www.orangefs.org/ or http://beegfs.io/ have a focus on high-end computing
* https://www.lustre.org/ https://moosefs.com/ are distributed file systems, not object / block storage
* [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, a space overhead that is identical to the current Software Heritage storage system.
* [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system]], a space overhead that is identical to the current Software Heritage storage system.
* Inspiration
* T3065 git partial clone (in part because it does packing, in part because it is source code related)
* Hardware
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
# Discussions
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html | Scale out object storage design (take 1) ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimun size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes to 10MB/s
* reads currently perform at ~300 objects per second (25MB/s); they performed at ~500 objects per second (44MB/s) in the past
* 50TB of objects (30TB ZFS compressed) added every month
* Available space will be exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB ZFS compressed) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]])
# Goals
* Write > 100MB/s, ~3,000 objects/s
* Read > 100MB/s, ~3,000 objects/s
* Durability overhead (erasure coding) 50% (2+1, 4+2)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
* 100 billion objects