[Parent task for all related tasks]
# Current status
Experimentation with T3049 RBD is paused in favor of T3052 RADOS, because the [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | space amplification problem on spinners is fixed ]] in the Ceph Pacific release, due in March 2021.
# Explorations
* Scale out data and metadata
** T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
** ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Scale out data and scale up metadata. The metadata is kept in a database (RocksDB, etc.) that must be looked up to find where the data is stored, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf | Finding a needle in Haystack: Facebook’s photo storage ]].
** T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
** T3051 EOS is too complex (uses RBD + Paxos + RocksDB for metadata)
** T3050 libcephsqlite has a hard limit at ~300TB
** T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + an internal database for metadata)
** OpenIO (https://github.com/open-io): replication is a proprietary feature (https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html)
** https://ipfs.io/ does not provide replication or self-healing; performance and space overhead are probably the same as the current Software Heritage storage system.
** https://www.rozosystems.com/about claims a software patent on the implementation
** http://www.orangefs.org/ and http://beegfs.io/ focus on high-end computing
** https://www.lustre.org/ and https://moosefs.com/ are distributed file systems, not object / block storage
** [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, a space overhead identical to the current Software Heritage storage system.
** [[ https://docs.openstack.org/swift/latest/ | Swift ]] [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | stores each object in an individual file on a file system ]], a space overhead identical to the current Software Heritage storage system.
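The "scale out data, scale up metadata" split above (as in Haystack) can be sketched as a small metadata index mapping an object id to its location in a large packed volume. This is an illustrative sketch, not code from any of the candidate systems: the `PackedStore` class and its layout are assumptions, with an in-memory dict standing in for the metadata database (RocksDB, etc.) and a byte buffer standing in for a volume file or RBD image.

```python
import hashlib
import io

class PackedStore:
    """Haystack-style sketch: objects are packed into one large append-only
    volume; a small index maps object id -> (offset, length)."""

    def __init__(self):
        self.volume = io.BytesIO()  # stands in for one large volume file / RBD image
        self.index = {}             # stands in for the scale-up metadata database

    def put(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()  # content-addressed id
        offset = self.volume.seek(0, io.SEEK_END)
        self.volume.write(data)               # append to the packed volume
        self.index[oid] = (offset, len(data)) # one small index entry per object
        return oid

    def get(self, oid: str) -> bytes:
        offset, length = self.index[oid]      # one index lookup...
        self.volume.seek(offset)
        return self.volume.read(length)       # ...then one read from the volume
```

Packing many small objects into large volumes avoids the per-file overhead of storing each object individually, at the cost of maintaining the metadata index.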
# Discussions
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimum size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes to 10MB/s
* reads currently perform at ~300 objects per second (25MB/s); they performed at ~500 objects per second (44MB/s) in the past
* 50TB of objects (30TB after ZFS compression) added every month
* Available space exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB after ZFS compression) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]])
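A back-of-the-envelope check on the figures above shows why small objects dominate (and why the RADOS 64KB minimum allocation discussed above matters). This only derives numbers already stated in this section:

```python
total_objects = 10e9        # 10 billion objects
raw_bytes = 750e12          # 750TB before compression
compressed_bytes = 350e12   # 350TB after ZFS compression

# Average object size: 750TB / 10 billion objects = 75KB,
# so a 64KB minimum allocation unit wastes space on many objects.
avg_object_size = raw_bytes / total_objects

# ZFS compresses the corpus to about 47% of its raw size.
compression_ratio = compressed_bytes / raw_bytes
```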
# Goals
* Write > 100MB/s
* Read > 100MB/s
* Durability overhead (erasure coding): 50% (e.g. 2+1 or 4+2)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
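The 50% durability overhead follows directly from the k+m erasure coding profiles listed above: m parity chunks are stored for every k data chunks, so the overhead is m/k of the payload size. A minimal check:

```python
def ec_overhead(k: int, m: int) -> float:
    """Space overhead of a k+m erasure code: m parity chunks
    are stored for every k data chunks, i.e. overhead = m / k."""
    return m / k

# Both profiles named in the goal give the same 50% overhead:
print(ec_overhead(2, 1))  # 2+1 -> 0.5
print(ec_overhead(4, 2))  # 4+2 -> 0.5
```

Note that 4+2 tolerates the loss of any two chunks while 2+1 tolerates only one, so the profiles trade fault tolerance against the number of hosts required, at identical space overhead.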