[Parent task for all related tasks]
# Current status
A [scale-out object storage](https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html) design was proposed. It has to be described in detail, and benchmarks need to be written to verify that it is efficient (in space and speed) for the intended use cases. The hardware to run the benchmarks has to be specified and secured.
# Design
It scales out for readers and scales up for writers.
# Object storage
## Glossary
* Object: an opaque sequence of bytes.
* Object HASH (HASH for short): the hash of an Object, i.e. the intrinsic identifier in a [SWHID](https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers).
* Shard: a group of Objects.
* Shard [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier) (UUID for short): the unique identifier of a Shard.
* Object ID (ID for short): the Object HASH and the Shard UUID containing the object.
* Read Storage: the unlimited size storage from which clients can only read Objects. It only contains Objects up to a given point in time.
* Write Storage: the fixed size storage from which clients can read or write. If an Object is not found in the Write storage, it must be retrieved from the Read Storage.
* Object Storage: the content of the Write Storage and the Read Storage combined.
* [LSM](https://en.wikipedia.org/wiki/Log-structured_merge-tree): an embedded database for key-value data such as [RocksDB](https://en.wikipedia.org/wiki/RocksDB).
* [SST](https://forge.softwareheritage.org/T3068): a read-only Sorted String Table that consists of a sequence of key/value pairs and an index.
* Database: [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL), [Cassandra](https://en.wikipedia.org/wiki/Apache_Cassandra), etc.
* [Ceph](https://en.wikipedia.org/wiki/Ceph_(software)): a self-healing distributed storage.
* [RBD](https://docs.ceph.com/en/latest/rbd/) image: a Ceph block storage that can either be used via the librbd library or as a block device from /dev/rbd.
The key concepts are:
* Packing millions of Objects together in Shards to:
* save space, and
* efficiently perform bulk actions such as mirroring or enumerations.
* Two different architectures:
* Read Storage, which takes advantage of the fact that Objects are immutable and never deleted, and
* Write Storage from which Shards are moved to the Read Storage when they are full.
* Identifying an Object by its Object HASH and the Shard UUID that contains it, so that its location can be determined from the Object ID alone (illustrated in the sketch below).
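As an illustration of the last concept, here is a minimal sketch (not the actual swh-objstorage API; sha256 is used as a stand-in for the intrinsic hash) of an Object ID carrying both the Object HASH and the Shard UUID:

```python
import hashlib
import uuid
from typing import NamedTuple


class ObjectID(NamedTuple):
    """An Object ID: the Object HASH plus the Shard UUID that contains it."""
    object_hash: bytes   # intrinsic identifier (sha256 used here as a stand-in)
    shard_uuid: uuid.UUID


def make_object_id(content: bytes, shard_uuid: uuid.UUID) -> ObjectID:
    """Build the Object ID of a content written to the given Shard."""
    return ObjectID(hashlib.sha256(content).digest(), shard_uuid)


if __name__ == "__main__":
    shard = uuid.uuid4()
    oid = make_object_id(b"hello world", shard)
    # The Shard holding the Object is known from the ID alone, no index needed.
    assert oid.shard_uuid == shard
```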
While the architecture based on these concepts scales out for writing and reading, it cannot be used to address Objects with their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out writes.
The content of the Object Storage (i.e. the Write Storage and the Read Storage combined) is **strongly consistent** at any point in time. As soon as an Object is written (i.e. the write operation returns to the caller), a reader can get the Object content from the Object Storage (it may require looking up the Object in both the Write Storage and the Read Storage).
The Read Storage is **eventually consistent**. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given date.
## Object lookup requires the Object ID
### Architecture
* Write Storage:
* A fixed number of Databases
* Read Storage:
* Shards implemented as Ceph RBD images named after their Shard UUID
* The content of the Shard is formatted as a SST
### Writing
The Object is stored in one of the Databases of the Write Storage, chosen at random. Each Database has a current Shard UUID, also chosen at random, which is associated with all Objects written to that Database.
A successful Object write returns the Object ID. Writing the same Object twice may return different Object IDs: the Object HASH will be the same because it is derived from the content, but the Shard in which the Object is stored, and therefore the Shard UUID, may differ.
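A minimal sketch of this write path, assuming a toy in-memory `Database` class (the class and its `insert()` method are illustrative, not an existing swh API):

```python
import hashlib
import random
import uuid


class Database:
    """One Write Storage Database, with its current Shard UUID."""

    def __init__(self) -> None:
        self.current_shard = uuid.uuid4()  # chosen at random
        self.objects: dict[tuple[bytes, uuid.UUID], bytes] = {}

    def insert(self, content: bytes) -> tuple[bytes, uuid.UUID]:
        object_hash = hashlib.sha256(content).digest()
        object_id = (object_hash, self.current_shard)
        self.objects[object_id] = content
        return object_id  # the Object ID: (Object HASH, Shard UUID)


def write(databases: list[Database], content: bytes) -> tuple[bytes, uuid.UUID]:
    # The Database is chosen at random, so writing the same content twice
    # may return Object IDs carrying different Shard UUIDs.
    return random.choice(databases).insert(content)
```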
### Packing
When a Database grows bigger than a threshold (for instance 100GB), its current Shard UUID is retired and a new one is allocated at random. A Shard is created in the Read Storage and the Objects that belong to the former Shard UUID are sorted and copied to it. When the Shard is complete, the Objects are removed from the Database.
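A sketch of the packing step under the same toy model; in practice the Shard would be an SST written to an RBD image named after the Shard UUID, here it is reduced to a sorted list kept in a dict:

```python
import uuid

PACK_THRESHOLD = 100 * 1024**3  # e.g. 100GB


def pack_if_needed(db_objects: dict, db_size: int, current_shard: uuid.UUID,
                   read_storage: dict) -> uuid.UUID:
    """Move the current Shard to the Read Storage when the Database is full.

    db_objects maps (object_hash, shard_uuid) -> content; read_storage maps
    shard_uuid -> a list of (object_hash, content) sorted by hash (the SST).
    Returns the Shard UUID that subsequent writes should use.
    """
    if db_size < PACK_THRESHOLD:
        return current_shard
    new_shard = uuid.uuid4()  # the former Shard UUID is no longer used
    # Copy the Objects of the former Shard, sorted by Object HASH, so the
    # resulting Shard can be binary searched like an SST.
    read_storage[current_shard] = sorted(
        (h, c) for (h, s), c in db_objects.items() if s == current_shard)
    # Only once the Shard is complete are the Objects removed from the Database.
    for key in [k for k in db_objects if k[1] == current_shard]:
        del db_objects[key]
    return new_shard
```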
### Reading
The Shard UUID is extracted from the Object ID. If the Shard exists in the Read Storage, the Object HASH is used to look up the content there. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content. If the reader is not interested in the most up-to-date content, it can limit its search to the Read Storage.
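A sketch of the Object ID read path, again with plain dicts standing in for the Read Storage (Shards) and the Write Storage (Databases):

```python
import uuid
from typing import Optional


def read(object_id: tuple[bytes, uuid.UUID],
         read_storage: dict[uuid.UUID, dict[bytes, bytes]],
         write_storage: dict[uuid.UUID, dict[bytes, bytes]],
         latest: bool = True) -> Optional[bytes]:
    """Fetch the content of an Object given its Object ID."""
    object_hash, shard_uuid = object_id
    # The Shard UUID embedded in the Object ID tells us where to look.
    if shard_uuid in read_storage:
        return read_storage[shard_uuid].get(object_hash)
    if latest and shard_uuid in write_storage:
        # The Shard still lives in the Database that owns this Shard UUID.
        return write_storage[shard_uuid].get(object_hash)
    return None  # readers not interested in the latest content stop at the Read Storage
```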
## Object lookup with the Object HASH
An index mapping the Object HASH of all known Objects to the Shard UUID is used to:
* allow clients to fetch Objects using their Object HASH only instead of their Object ID.
* deduplicate identical Objects based on their Object HASH
### Architecture
* Write Storage:
* The index of the Object IDs from both the Write Storage and the Read Storage is stored in an LSM
* Read Storage:
* The index of the Object IDs from the Read Storage is stored in an LSM
* Copies of the index are stored in Ceph RBD images, formatted as SSTs
### Writing
If the Object HASH exists in the Read Storage index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage index.
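A sketch of the deduplicated write, with both indexes modelled as dicts mapping Object HASH to Shard UUID (an assumption, not the actual LSM layout):

```python
import hashlib
import uuid


def write_dedup(content: bytes, current_shard: uuid.UUID,
                read_index: dict[bytes, uuid.UUID],
                write_index: dict[bytes, uuid.UUID],
                database: dict[tuple[bytes, uuid.UUID], bytes]) -> None:
    object_hash = hashlib.sha256(content).digest()
    if object_hash in read_index:
        return  # the Object is already archived: do nothing
    # Otherwise perform the write and record the Object ID in the
    # Write Storage index.
    database[(object_hash, current_shard)] = content
    write_index[object_hash] = current_shard
```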
### Packing
During packing, each Object HASH is looked up in the Read Storage index. If it exists, the Object is discarded. Otherwise its Object ID is inserted in the index. When packing is complete:
* copies of the Read Storage index are updated with the newly added Object IDs.
* Objects whose HASH was found to be a duplicate are updated in the Write Storage using the Shard UUID found in the Read Storage index (see the sketch below).
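A sketch of this deduplication pass, using the same toy dicts as above:

```python
import uuid


def dedup_while_packing(shard_uuid: uuid.UUID,
                        shard_objects: dict[bytes, bytes],
                        read_index: dict[bytes, uuid.UUID]):
    """Split a Shard being packed into kept Objects and duplicates."""
    packed: dict[bytes, bytes] = {}
    duplicates: dict[bytes, uuid.UUID] = {}
    for object_hash, content in sorted(shard_objects.items()):
        if object_hash in read_index:
            # Already in the Read Storage: discard this copy and remember
            # which Shard holds the canonical Object.
            duplicates[object_hash] = read_index[object_hash]
        else:
            packed[object_hash] = content
            read_index[object_hash] = shard_uuid
    # The caller writes `packed` out as the new Shard, propagates the
    # updated read_index to its copies, and applies `duplicates` to the
    # Write Storage.
    return packed, duplicates
```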
### Reading
If the Object HASH is found in the Read Storage index, use the Shard UUID to read the content from the Read Storage. Otherwise look up the Shard UUID in the Write Storage index and perform a read with the Object ID.
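A sketch of resolving an Object HASH into an Object ID via the two indexes; the returned ID can then be read as described in the previous section:

```python
import uuid
from typing import Optional


def resolve_hash(object_hash: bytes,
                 read_index: dict[bytes, uuid.UUID],
                 write_index: dict[bytes, uuid.UUID]
                 ) -> Optional[tuple[bytes, uuid.UUID]]:
    """Turn an Object HASH into an Object ID, or None if it is unknown."""
    if object_hash in read_index:
        return (object_hash, read_index[object_hash])
    if object_hash in write_index:
        return (object_hash, write_index[object_hash])
    return None
```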
# FAQ
## How durable is the Object Storage?
The durability of the Write Storage depends on the chosen Database. It could be designed with a replicated [PostgreSQL](https://www.postgresql.org/docs/13/different-replication-solutions.html) or [Cassandra](https://cassandra.apache.org/doc/latest/architecture/dynamo.html#replication-strategy) if the failure domain is the host running the database. If the only concern is a disk failure, using RAID5 or RAID6 is enough.
The durability of the Read Storage is implemented via Ceph, for instance by configuring RBD image data pool as a k=4,m=2 erasure coded pool that can sustain the loss of two hosts.
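As a worked example (not a benchmark), the raw space factor implied by a k=4,m=2 erasure coded pool:

```python
# k data chunks plus m coding chunks per object: any m chunks (e.g. two
# hosts holding one chunk each) can be lost without losing data.
k, m = 4, 2
overhead = (k + m) / k - 1
assert overhead == 0.5  # 50%: 10PB of Objects occupy 15PB of raw storage
```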
## How does packing Objects save space?
The short answer is: it does not when Objects are big enough, but it does when there are a lot of small Objects.
If there are billions of objects (i.e. less than one billion is not a lot), 50% of them are smaller than 4KB and 75% of them are smaller than 16KB (i.e. bigger than 16KB is not small), then packing will save space.
With the simplest packing method (appending each Object after another in a file), and since the Object HASH has a fixed size, the only per-Object overhead is the field storing the Object's size (8 bytes). Assuming the Shard containing the Objects is handled as a single 100GB Ceph RBD image, it adds R bytes of overhead. If the underlying Ceph pool is erasure coded with k=4,m=2, an additional 50% must be added.
Retrieving an Object from such a Shard would be O(n) because there is no index. It is better to add an index to the Shard so that finding an Object is O(log(n)) instead. That optimization requires an additional 8 bytes per Object to store its offset, i.e. a total of 16 bytes per Object.
If Objects are not packed together, each of them requires at least B bytes, the minimum space overhead imposed by the underlying storage system, plus an additional 50% for durability. The space used by Objects smaller than a given threshold will be amplified, depending on the underlying storage. For instance, all objects in Ceph have a minimum size of 4KB, therefore a 1KB Object will be amplified to 4KB. Since packing is considered in the context where there are many small Objects, let's assume a space amplification overhead of 25% (less than the 35% observed in the [case of Ceph](https://forge.softwareheritage.org/T3052#58864), but probably more than what can be observed with other object storages such as Swift).
To summarize, the overhead of storing M Objects totaling S bytes, when M is 100 billion and S is 10PB, is as follows (a back-of-the-envelope cross-check is sketched after the list):
* **packed:** ~15PB
* (S / 100GB) * R = (10PB / 100GB) * R = 100,000 * R bytes
* (M * 16) = 100G Objects * 16 bytes = 1.6TB
* 50% for durability = 10PB * 0.5 = 5PB
* **not packed:** ~17.5PB
* (B * 100G) = ??? what is the value of B for Ceph ? for ZFS ? for Swift ? ???
* 25% for space amplification = 10PB * 0.25 = 2.5PB
* 50% for durability = 10PB * 0.5 = 5PB
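The back-of-the-envelope cross-check of the packed figure; the value of R is a placeholder assumption, since the per-RBD-image overhead has not been measured yet:

```python
M = 100e9            # number of Objects (100 billion)
S = 10e15            # total payload in bytes (10PB)
SHARD_SIZE = 100e9   # one Shard is a single 100GB RBD image
R = 16e6             # placeholder per-image overhead in bytes (unknown)

shard_overhead = (S / SHARD_SIZE) * R   # 100,000 Shards * R bytes
index_overhead = M * 16                 # 16 bytes per Object = 1.6TB
durability = S * 0.5                    # k=4,m=2 erasure coding = 5PB

total = S + shard_overhead + index_overhead + durability
print(f"packed total: {total / 1e15:.2f} PB")  # ~15 PB as long as R is small
```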
## How does packing Objects help with enumeration?
For mirroring or running an algorithm on all Objects, they must be enumerated. When they are packed together, the reader can get all Objects contained in a Shard without looking them up individually. If looking up an individual Object takes 10 milliseconds and the Shard contains 10 million Objects, that's 100,000 seconds, i.e. ~28 hours. Assuming the Object Storage can be read at 100 MB/s and those 10 million Objects amount to 100GB, they can be transferred in ~17 minutes when packed instead of ~28 hours.
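The arithmetic behind this example, for reference:

```python
objects = 10_000_000        # Objects in one Shard
lookup_latency = 0.010      # 10 milliseconds per individual lookup
shard_bytes = 100e9         # the same Objects packed into a 100GB Shard
throughput = 100e6          # sequential read at 100 MB/s

print(objects * lookup_latency / 3600)   # ~28 hours of individual lookups
print(shard_bytes / throughput / 60)     # ~17 minutes reading the packed Shard
```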
# Use cases
## Glossary
* APIs used with an Object HASH instead of an Object ID are called HASH APIs
* APIs used with an Object ID are called ID APIs
## API implementation
The existing APIs (HASH APIs):
* [swh-web REST API](https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html)
* [swh-objstorage Python API](https://docs.softwareheritage.org/devel/apidoc/swh.objstorage.objstorage.html)
New APIs (ID APIs) can be added, or existing ones modified, to get Objects using their Object ID instead of their Object HASH, so that there is no need to look up the Shard UUID from an index using the Object HASH: the Shard UUID is part of the Object ID and can be extracted from it.
* [swh-web REST API](https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html) could support a new **hash_type** that would be the **Object ID** instead of the **Object HASH**. For instance **GET /api/1/content/[(hash_type):](hash)/** could support the **objectid** hash_type and look like **GET /api/1/content/objectid:Object HASH:Shard UUID/**.
* [swh-objstorage Python API](https://docs.softwareheritage.org/devel/apidoc/swh.objstorage.objstorage.html) could be derived into a module where all **obj_id** arguments are tuples of (Object HASH, Shard UUID), while providing the same interface (see the sketch below).
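A hypothetical sketch of such an ID API wrapper; the `read_storage`/`write_storage` objects and their methods are assumptions, not existing swh modules:

```python
import uuid


class IDObjStorage:
    """Same get() interface as swh-objstorage, but obj_id is an Object ID."""

    def __init__(self, read_storage, write_storage) -> None:
        # read_storage / write_storage and their methods are hypothetical.
        self.read_storage = read_storage
        self.write_storage = write_storage

    def get(self, obj_id: tuple[bytes, uuid.UUID]) -> bytes:
        object_hash, shard_uuid = obj_id
        # The Shard UUID comes with the Object ID: no index lookup needed.
        if self.read_storage.has_shard(shard_uuid):
            return self.read_storage.get(shard_uuid, object_hash)
        return self.write_storage.get(shard_uuid, object_hash)
```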
## Adding objects
* A loader:
* Gets a source code file from the net
* Calculates the Object HASH from its content
* Uses the HASH API to add the Object
## Reading objects
* Serving objects from the REST interface:
* Existing clients GET Objects using their Object HASH via the HASH API
* New clients may GET Objects using their Object ID via the ID API
* Reconstruct an archived tarball or repo for the "vault":
* Identical to serving objects from the web interface.
* Push Objects mirroring:
* Existing mirroring software uses the HASH API
* Assuming API endpoints are added to:
* Retrieve the list of Shard UUIDs
* Retrieve all Objects from a Shard (Object ID and content)
* New clients can:
* Retrieve the list of all Shard UUIDs from the Read Storage
* Retrieve all Objects from a Shard and mirror the Object IDs and their content via S3
* When done, add the UUID of the Shard to the list of Shards already mirrored.
* Pull Shard mirroring:
* The Shards are available at mirror.softwareheritage.com as mapped RBD devices "/dev/rbd/read/Shard UUID"
* Pick a Shard not already mirrored locally
* Create an RBD image with the same Shard UUID
* GET the Shard from mirror.softwareheritage.com and copy it to /dev/rbd/read/Shard UUID
* Full enumeration for mining purposes:
* Examples:
* find all files whose names are LICENSE/COPYRIGHT/COPYING/etc. pointed to by any commit in the archive and retrieve them
* find all archived files referenced by any commit in the top-1k (by number of stars) GitHub repos
* Start the enumeration from the Object ID of the last Object of the previous iteration, for up to N Objects
* N Objects are returned starting from that Object ID, in order
* The client is responsible for keeping track of the Object ID from which to resume the enumeration (see the sketch after this list)
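A sketch of the enumeration client loop described in the last use case; the `enumerate_objects()` endpoint and its parameters are assumptions, not an existing API:

```python
def enumerate_all(client, count: int = 1000):
    """Yield every (Object ID, content) pair, resuming batch after batch."""
    cursor = None  # the client keeps track of where to resume
    while True:
        # enumerate_objects() is assumed to return up to `count` pairs,
        # in Object ID order, starting after `cursor`.
        batch = client.enumerate_objects(after=cursor, count=count)
        if not batch:
            return
        yield from batch
        cursor = batch[-1][0]  # resume from the last Object ID seen
```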
# Choosing the underlying storage for the Read Storage
The Free Software, self-healing, distributed storage running on commodity hardware on top of which the Read Storage is built must:
* Provide object packing
* Provide detailed documentation and community support for system administrators operating the storage
* Be thoroughly tested before a stable release is published
* Be packaged for at least one well known distribution
* Have stable releases maintained for at least two years
* Have a sound approach to addressing vulnerabilities
If one aspect is missing, it must be implemented, which increases the [Total Cost of Ownership](https://en.wikipedia.org/wiki/Total_cost_of_ownership).
| Criterion | Ceph | EOS | Seaweedfs | MinIO | Swift | Ambry |
|-----------------|------|---------|-----------|---------|-------|-------|
| Packing | Yes | Yes | Yes | No | No | Yes |
| Documentation | Good | Average | Terse | Good | Good | Terse |
| Tests | Good | Few | Few | Average | Good | Few |
| Packages | Yes | No | No | No | Yes | No |
| Stable releases | Yes | No | No | Yes | Yes | No |
| CVE handling | Yes | No | No | Yes | Yes | No |
# Explorations
* Scale out data and metadata
* T3064 [[ https://github.com/linkedin/ambry | ambry ]]
* T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
* ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Object packing
* T3066 [RocksDB SST](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
* [ambry partition format](https://forge.softwareheritage.org/T3064) (append only)
* T3068 [Sorted String Table](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf) (read only)
* T3050 libcephsqlite or SQlite on top of RBD (read write)
* T3046 Using xz-file-format for 1TB archive
* T3045 Using pixz for 1TB archives
* T3048 Using a custom format for 1TB archive
* T3069 Using MZ as a file format
* Scale out data and scale up metadata. The metadata is in a database (RocksDB, etc.) that must be looked up to figure out where the data is stored, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf | Finding a needle in Haystack: Facebook’s photo storage ]].
* T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
* T3051 EOS is too complex (uses RBD + Paxos + QuarkDB for namespace)
* T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
* https://github.com/open-io : replication is a proprietary feature (https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html)
* https://ipfs.io/ does not provide replication or self-healing. Performance and space overhead are probably the same as the current Software Heritage storage system.
* https://www.rozosystems.com/about claims a software patent on the implementation
* http://www.orangefs.org/ or http://beegfs.io/ have a focus on high-end computing
* https://www.lustre.org/ https://moosefs.com/ are distributed file systems, not object / block storage
* [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, a space overhead that is identical to the current Software Heritage storage system.
* [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system]], a space overhead that is identical to the current Software Heritage storage system.
* Inspiration
* T3065 git partial clone (in part because it does packing, in part because it is source code related)
* Hardware
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
# Discussions
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html | Scale out object storage design (take 1) ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimun size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes at 10MB/s
* reads currently perform at ~300 objects per second (25MB/s) and performed at ~500 objects per second (44MB/s) in the past
* 50TB (30TB ZFS compressed) of objects added every month
* Available space exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB ZFS compressed) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]] )
# Goals
* Write > 100MB/s, ~3,000 objects/s
* Read > 100MB/s, ~3,000 objects/s
* Durability overhead (erasure coding) 50% (2+1, 4+2)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
* 100 billion objects