[Parent task for all related tasks]
# Current status
A "full scale out" approach (e.g. Swift or Ceph) requires optimizing the internals of the object storage for the "small immutable objects" workload. A "scale up metadata & scale out data" approach (e.g. EOS or SeaweedFS) requires writing glue to bundle the database and the object storage. Which direction is more advisable? (See the list of [[ https://forge.softwareheritage.org/T3054#58977 | pros & cons ]].)
# Explorations
* Scale out data and metadata
  * T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead while maintaining performance)
  * ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Scale out data and scale up metadata. The metadata lives in a database (RocksDB, etc.) that must be queried to find where the data is stored, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf| Finding a needle in Haystack: Facebook’s photo storage ]].
  * T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
  * T3051 EOS is too complex (uses RBD + Paxos + RocksDB for metadata)
  * T3050 libcephsqlite has a hard limit at ~300TB
  * T3057 [[ https://github.com/chrislusf/seaweedfs | SeaweedFS ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
  * https://github.com/open-io replication is a proprietary feature https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html
  * https://ipfs.io/ does not provide replication or self-healing. Performance and space overhead are probably the same as the current Software Heritage storage system.
  * https://www.rozosystems.com/about claims a software patent on the implementation
  * http://www.orangefs.org/ and http://beegfs.io/ focus on high-end computing
  * https://www.lustre.org/ and https://moosefs.com/ are distributed file systems, not object / block storage
  * [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, so its space overhead is identical to that of the current Software Heritage storage system.
  * [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system]], so its space overhead is identical to that of the current Software Heritage storage system.
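
To make the "scale out data and scale up metadata" idea more concrete, here is a minimal sketch of packing small immutable objects into one large append-only volume, with an index that records where each object lives. It is only an illustration: the `PackedVolume` class, the in-memory `dict` standing in for the metadata database, the `io.BytesIO` standing in for a large file or RBD image, and the use of SHA-256 as object id are all assumptions, not what any of the systems listed above actually do.

```python
import hashlib
import io


class PackedVolume:
    """Append-only volume that packs many small objects into one byte stream."""

    def __init__(self, stream=None):
        # Hypothetical stand-in for a large file or block device image.
        self.stream = stream if stream is not None else io.BytesIO()
        # Hypothetical stand-in for the metadata database: id -> (offset, size).
        self.index = {}

    def add(self, data: bytes) -> str:
        """Append an object and record its location; return its SHA-256 id."""
        object_id = hashlib.sha256(data).hexdigest()
        if object_id in self.index:
            # Objects are immutable, so identical content is stored only once.
            return object_id
        offset = self.stream.seek(0, io.SEEK_END)
        self.stream.write(data)
        self.index[object_id] = (offset, len(data))
        return object_id

    def get(self, object_id: str) -> bytes:
        """Look up the location in the index, then read the bytes from the volume."""
        offset, size = self.index[object_id]
        self.stream.seek(offset)
        return self.stream.read(size)
```

Every read costs one metadata lookup plus one data read, and the per-object space overhead is reduced to the index entry, which is why this layout is attractive for very small objects.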
# Discussions
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimun size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes at 10MB/s
* Reads currently run at ~300 objects per second (25MB/s), down from ~500 objects per second (44MB/s) in the past
* 50TB (30TB ZFS compressed) objects added every month
* Available space will be exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB ZFS compressed) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]] )
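
A few derived figures may help when comparing candidates. The sketch below only does arithmetic on the numbers listed above; it assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a 30-day month, neither of which is stated in this task.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month

object_count = 10 * 10**9
raw_bytes = 750 * TB
compressed_bytes = 350 * TB

# Average object size before compression, in bytes.
average_object_size = raw_bytes / object_count

# Space saved by ZFS compression, expressed as a ratio.
compression_ratio = raw_bytes / compressed_bytes

# Sustained ingest rate implied by the monthly growth, in MB/s.
ingest_raw_mb_s = 50 * TB / SECONDS_PER_MONTH / MB
ingest_compressed_mb_s = 30 * TB / SECONDS_PER_MONTH / MB

print(f"average object size: {average_object_size / 1000:.0f} KB")       # 75 KB
print(f"compression ratio: {compression_ratio:.2f}")                     # 2.14
print(f"ingest (raw): {ingest_raw_mb_s:.1f} MB/s")                       # 19.3 MB/s
print(f"ingest (compressed): {ingest_compressed_mb_s:.1f} MB/s")         # 11.6 MB/s
```

These are averages: the size distribution is heavily skewed toward small objects, so the mean says little about the typical object.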
# Goals
* Write > 100MB/s
* Read > 100MB/s
* Durability overhead (erasure coding) 50% (2+1, 4+2)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
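
The overhead goals can be turned into a rough capacity estimate. The sketch below assumes that a k+m erasure coding profile costs m/k extra space, and that the storage system overhead applies multiplicatively on top of it; the function names are hypothetical and the way the two overheads combine is an assumption, not something a specific storage system guarantees.

```python
def erasure_coding_overhead(k: int, m: int) -> float:
    """Extra space, as a fraction of the payload, for k data + m coding chunks."""
    return m / k


def raw_capacity_tb(payload_tb: float, k: int, m: int, storage_overhead: float) -> float:
    """Raw capacity needed for a payload, assuming the overheads multiply."""
    return payload_tb * (1 + erasure_coding_overhead(k, m)) * (1 + storage_overhead)


# Both profiles named in the goals cost 50% extra space.
print(erasure_coding_overhead(2, 1))  # 0.5
print(erasure_coding_overhead(4, 2))  # 0.5

# Worst case for the goals: 750TB uncompressed, 4+2 coding, 20% system overhead.
print(f"{raw_capacity_tb(750, 4, 2, 0.20):.0f} TB")  # 1350 TB
```

2+1 and 4+2 have the same space cost, but 4+2 tolerates the loss of two chunks instead of one.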