[Parent task for all related tasks]
# Current status
An object storage design was [discussed](https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html) and [described](https://hedgedoc.softwareheritage.org/EBmGBSMpS1esahFRASggFg?view). Benchmarks need to be written to verify that it is efficient (in both space and speed) for the intended use cases. The [hardware to run the benchmarks has to be secured](https://sympa.inria.fr/sympa/arc/swh-devel/2021-03/msg00007.html).
# Description
The draft of the design for the object storage can be found [here](https://hedgedoc.softwareheritage.org/EBmGBSMpS1esahFRASggFg#)
# Explorations
* Scale out data and metadata
* T3064 [[ https://github.com/linkedin/ambry | ambry ]]
* T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
* ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Object packing
* T3066 [RocksDB SST](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
* [ambry partition format](https://forge.softwareheritage.org/T3064) (append only)
* T3068 [Sorted String Table](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf) (read only)
* T3050 libcephsqlite or SQLite on top of RBD (read write)
* T3046 Using xz-file-format for 1TB archive
* T3045 Using pixz for 1TB archives
* T3048 Using a custom format for 1TB archive
* T3069 Using MZ as a file format
* Scale out data and scale up metadata: the metadata lives in a database (RocksDB, etc.) that must be looked up to find where the data is stored, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf | Finding a needle in Haystack: Facebook’s photo storage ]] (see the sketch after this list).
* T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
* T3051 EOS is too complex (uses RBD + Paxos + QuarkDB for namespace)
* T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
* [[ https://github.com/open-io | OpenIO ]]: replication is a [[ https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html | proprietary feature ]]
* https://ipfs.io/ does not provide replication or self-healing. Performance and space overhead are probably the same as with the current Software Heritage storage system.
* https://www.rozosystems.com/about claims a software patent on the implementation
* http://www.orangefs.org/ or http://beegfs.io/ have a focus on high-end computing
* https://www.lustre.org/ and https://moosefs.com/ are distributed file systems, not object / block storage
* [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, with a space overhead identical to that of the current Software Heritage storage system.
* [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system ]], with a space overhead identical to that of the current Software Heritage storage system.
* Inspiration
* T3065 git partial clone (in part because it does packing, in part because it is source code related)
* Hardware
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
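The object packing and "scale out data / scale up metadata" explorations above share a common shape: objects are appended to a few very large pack files, and a separate metadata index maps each object's hash to the pack, offset and length where it lives. The following is a minimal sketch of that shape only, not the actual design: the `PackedObjectStore` class, SHA-256 keying and the in-memory dict (standing in for RocksDB or a distributed K/V store) are illustrative assumptions, and replication, erasure coding and pack rotation (the 1TB archives above) are deliberately left out.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Dict


@dataclass
class Location:
    pack_path: str  # which pack (archive) file holds the object
    offset: int     # byte offset of the object inside the pack
    length: int     # size of the object in bytes


class PackedObjectStore:
    """Append objects to a large pack file and keep a hash -> location index.

    The in-memory dict stands in for the metadata database (RocksDB,
    a distributed K/V store, etc.) of the Haystack-style design.
    """

    def __init__(self, pack_path: str):
        self.pack_path = pack_path
        self.index: Dict[bytes, Location] = {}

    def put(self, content: bytes) -> bytes:
        key = hashlib.sha256(content).digest()
        if key in self.index:  # content addressed: deduplicate
            return key
        with open(self.pack_path, "ab") as pack:
            offset = pack.seek(0, os.SEEK_END)  # current end of the pack
            pack.write(content)                 # append only, no per-object file
        self.index[key] = Location(self.pack_path, offset, len(content))
        return key

    def get(self, key: bytes) -> bytes:
        loc = self.index[key]
        with open(loc.pack_path, "rb") as pack:
            pack.seek(loc.offset)
            return pack.read(loc.length)


if __name__ == "__main__":
    store = PackedObjectStore("objects.pack")
    key = store.put(b"hello, software heritage")
    assert store.get(key) == b"hello, software heritage"
```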
# Discussions
* [Redis as a K/V store for billions of objects](https://sympa.inria.fr/sympa/arc/swh-devel/2021-03/msg00010.html)
* [Looking for hardware to benchmark the object storage design](https://sympa.inria.fr/sympa/arc/swh-devel/2021-03/msg00007.html)
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html | Scale out object storage design (take 1) ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimum size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes to 10MB/s
* Reads currently perform at ~300 objects per second (25MB/s); they performed at ~500 objects per second (44MB/s) in the past
* 50TB of objects (30TB after ZFS compression) added every month
* Available space will be exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB after ZFS compression); see the [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]]
# Goals
* Write > 100MB/s, ~3,000 objects/s
* Read > 100MB/s, ~3,000 objects/s
* Durability overhead (erasure coding) 50% (2+1 or 4+2 schemes: k data + m parity chunks add m/k of raw space, i.e. 1/2 = 2/4 = 50%)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
* 100 billion objects
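As noted under Current status, benchmarks still need to be written to check these goals. Below is a minimal sketch of such a benchmark, assuming an object storage client that exposes `put()`/`get()`; the `DictStore` stand-in, the 10,000-object run and the 4KB object size are illustrative only, and real runs would target the backends explored above with the archive's actual object size distribution (75% of objects under 16KB).

```python
import hashlib
import os
import time


def benchmark(store, count: int = 10_000, size: int = 4_096) -> None:
    """Measure write/read rates (objects/s, MB/s) and a crude time-to-first-byte figure."""
    payloads = [os.urandom(size) for _ in range(count)]

    start = time.perf_counter()
    keys = [store.put(data) for data in payloads]
    elapsed = time.perf_counter() - start
    print(f"write: {count / elapsed:,.0f} objects/s, "
          f"{count * size / elapsed / 1e6:.1f} MB/s")

    start = time.perf_counter()
    first = None
    for key in keys:
        data = store.get(key)
        if first is None:
            first = time.perf_counter() - start  # proxy for time to first byte
        assert len(data) == size
    elapsed = time.perf_counter() - start
    print(f"read: {count / elapsed:,.0f} objects/s, "
          f"{count * size / elapsed / 1e6:.1f} MB/s, "
          f"first object after {first * 1e3:.2f} ms")


if __name__ == "__main__":
    class DictStore:
        """Trivial in-memory stand-in for an object storage client with
        put()/get(); replace with the real backend under test."""

        def __init__(self):
            self.objects = {}

        def put(self, content: bytes) -> bytes:
            key = hashlib.sha256(content).digest()
            self.objects[key] = content
            return key

        def get(self, key: bytes) -> bytes:
            return self.objects[key]

    benchmark(DictStore())
```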