[Parent task for all related tasks]
# Current status
An object storage design was [discussed](https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html) and [described](https://hedgedoc.softwareheritage.org/EBmGBSMpS1esahFRASggFg?view). Benchmarks need to be written to verify that it is efficient (in both space and speed) for the intended use cases. The [hardware to run the benchmarks has to be secured](https://sympa.inria.fr/sympa/arc/swh-devel/2021-03/msg00007.html).
# Description
The draft of the design for the object storage can be found [here](https://hedgedoc.softwareheritage.org/EBmGBSMpS1esahFRASggFg#)
# Explorations
* Scale out data and metadata
* T3064 [[ https://github.com/linkedin/ambry | ambry ]]
* T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
* ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Object packing
* T3066 [RocksDB SST](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
* [ambry partition format](https://forge.softwareheritage.org/T3064) (append only)
* T3068 [Sorted String Table](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf) (read only)
* T3050 libcephsqlite or SQLite on top of RBD (read write)
* T3046 Using xz-file-format for 1TB archive
* T3045 Using pixz for 1TB archives
* T3048 Using a custom format for 1TB archive
* T3069 Using MZ as a file format
* Scale out data and scale up metadata: the metadata lives in a database (RocksDB, etc.) that must be looked up to find where the data is stored, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf | Finding a needle in Haystack: Facebook’s photo storage ]] (see the sketch after this list).
* T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
* T3051 EOS is too complex (uses RBD + Paxos + QuarkDB for namespace)
* T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
* [[ https://github.com/open-io | OpenIO ]]: replication is a [[ https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html | proprietary feature ]]
* https://ipfs.io/ does not provide replication or self-healing. Performance and space overhead are probably the same as with the current Software Heritage storage system.
* https://www.rozosystems.com/about claims a software patent on the implementation
* http://www.orangefs.org/ or http://beegfs.io/ have a focus on high-end computing
* https://www.lustre.org/ and https://moosefs.com/ are distributed file systems, not object / block storage
* [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, with a space overhead identical to that of the current Software Heritage storage system.
* [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system ]], with a space overhead identical to that of the current Software Heritage storage system.
* Inspiration
* T3065 git partial clone (in part because it does packing, in part because it is source code related)
* Hardware
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
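The object packing and "scale out data / scale up metadata" explorations above share a common shape: objects are appended to a few very large pack files, and a separate metadata index maps each object's hash to the pack, offset and length where it lives. The following is a minimal sketch of that shape only, not the actual design: the `PackedObjectStore` class, SHA-256 keying and the in-memory dict (standing in for RocksDB or a distributed K/V store) are illustrative assumptions, and replication, erasure coding and pack rotation (the 1TB archives above) are deliberately left out.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Dict


@dataclass
class Location:
    pack_path: str  # which pack (archive) file holds the object
    offset: int     # byte offset of the object inside the pack
    length: int     # size of the object in bytes


class PackedObjectStore:
    """Append objects to a large pack file and keep a hash -> location index.

    The in-memory dict stands in for the metadata database (RocksDB,
    a distributed K/V store, etc.) of the Haystack-style design.
    """

    def __init__(self, pack_path: str):
        self.pack_path = pack_path
        self.index: Dict[bytes, Location] = {}

    def put(self, content: bytes) -> bytes:
        key = hashlib.sha256(content).digest()
        if key in self.index:  # content addressed: deduplicate
            return key
        with open(self.pack_path, "ab") as pack:
            offset = pack.seek(0, os.SEEK_END)  # current end of the pack
            pack.write(content)                 # append only, no per-object file
        self.index[key] = Location(self.pack_path, offset, len(content))
        return key

    def get(self, key: bytes) -> bytes:
        loc = self.index[key]
        with open(loc.pack_path, "rb") as pack:
            pack.seek(loc.offset)
            return pack.read(loc.length)


if __name__ == "__main__":
    store = PackedObjectStore("objects.pack")
    key = store.put(b"hello, software heritage")
    assert store.get(key) == b"hello, software heritage"
```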
# Discussions
* [Redis as a K/V store for billions of objects](https://sympa.inria.fr/sympa/arc/swh-devel/2021-03/msg00010.html)
* [Looking for hardware to benchmark the object storage design](https://sympa.inria.fr/sympa/arc/swh-devel/2021-03/msg00007.html)
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html | Scale out object storage design (take 1) ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimum size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes to 10MB/s
* Reads currently perform at ~300 objects per second (25MB/s); they performed at ~500 objects per second (44MB/s) in the past
* 50TB of objects (30TB after ZFS compression) added every month
* Available space will be exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB after ZFS compression); see the [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]]
# Goals
* Write > 100MB/s, ~3,000 objects/s
* Read > 100MB/s, ~3,000 objects/s
* Durability overhead (erasure coding) 50% (2+1 or 4+2 schemes: k data + m parity chunks add m/k of raw space, i.e. 1/2 = 2/4 = 50%)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
* 100 billion objects
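As noted under Current status, benchmarks still need to be written to check these goals. Below is a minimal sketch of such a benchmark, assuming an object storage client that exposes `put()`/`get()`; the `DictStore` stand-in, the 10,000-object run and the 4KB object size are illustrative only, and real runs would target the backends explored above with the archive's actual object size distribution (75% of objects under 16KB).

```python
import hashlib
import os
import time


def benchmark(store, count: int = 10_000, size: int = 4_096) -> None:
    """Measure write/read rates (objects/s, MB/s) and a crude time-to-first-byte figure."""
    payloads = [os.urandom(size) for _ in range(count)]

    start = time.perf_counter()
    keys = [store.put(data) for data in payloads]
    elapsed = time.perf_counter() - start
    print(f"write: {count / elapsed:,.0f} objects/s, "
          f"{count * size / elapsed / 1e6:.1f} MB/s")

    start = time.perf_counter()
    first = None
    for key in keys:
        data = store.get(key)
        if first is None:
            first = time.perf_counter() - start  # proxy for time to first byte
        assert len(data) == size
    elapsed = time.perf_counter() - start
    print(f"read: {count / elapsed:,.0f} objects/s, "
          f"{count * size / elapsed / 1e6:.1f} MB/s, "
          f"first object after {first * 1e3:.2f} ms")


if __name__ == "__main__":
    class DictStore:
        """Trivial in-memory stand-in for an object storage client with
        put()/get(); replace with the real backend under test."""

        def __init__(self):
            self.objects = {}

        def put(self, content: bytes) -> bytes:
            key = hashlib.sha256(content).digest()
            self.objects[key] = content
            return key

        def get(self, key: bytes) -> bytes:
            return self.objects[key]

    benchmark(DictStore())
```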