[Parent task for all related tasks]
# Current status
A [scale out object storage](https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html) design was proposed. It has to be described in detail, and benchmarks need to be written to verify it is efficient (in space and speed) for the intended use cases. The hardware to run the benchmarks has to be specified and secured.
# Description
The draft of the object storage design can be found [here](https://hedgedoc.softwareheritage.org/EBmGBSMpS1esahFRASggFg#).
# Explorations
* Scale out data and metadata
  * T3064 [[ https://github.com/linkedin/ambry | ambry ]]
  * T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
  * ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Object packing
  * T3066 [RocksDB SST](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
  * [ambry partition format](https://forge.softwareheritage.org/T3064) (append only)
  * T3068 [Sorted String Table](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf) (read only)
  * T3050 libcephsqlite or SQLite on top of RBD (read write)
  * T3046 Using the xz file format for 1TB archives
  * T3045 Using pixz for 1TB archives
  * T3048 Using a custom format for 1TB archives
  * T3069 Using MZ as a file format
* Scale out data and scale up metadata. The metadata lives in a database (RocksDB, etc.) that must be looked up to find where the data is stored, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf | Finding a needle in Haystack: Facebook’s photo storage ]]. A minimal sketch of this pattern is included at the end of this section.
  * T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
  * T3051 EOS is too complex (uses RBD + Paxos + QuarkDB for namespace)
  * T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
  * [[ https://github.com/open-io | OpenIO ]]: replication is a [[ https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html | proprietary feature ]]
  * https://ipfs.io/ does not provide replication or self-healing. Performance and space overhead are probably the same as in the current Software Heritage storage system.
  * https://www.rozosystems.com/about claims a software patent on the implementation
  * http://www.orangefs.org/ and http://beegfs.io/ focus on high-end computing
  * https://www.lustre.org/ and https://moosefs.com/ are distributed file systems, not object/block storage
  * [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, which incurs the same space overhead as the current Software Heritage storage system.
  * [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system ]], which incurs the same space overhead as the current Software Heritage storage system.
* Inspiration
  * T3065 git partial clone (in part because it does packing, in part because it is source code related)
* Hardware
  * [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
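The packing-based explorations above share one underlying pattern: many small objects are appended to a large pack, and a small metadata index maps each object id to the pack, offset and length where it can be read back (the Haystack approach). Below is a minimal, illustrative sketch of that pattern only; the class and file names are hypothetical, and a plain Python dict stands in for the real metadata database (RocksDB or a distributed database):

```python
import hashlib


class PackWriter:
    """Append-only pack file: objects are written back to back."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "ab")

    def append(self, data: bytes) -> tuple[int, int]:
        offset = self.file.tell()
        self.file.write(data)
        self.file.flush()
        return offset, len(data)


class PackedObjectStorage:
    """Haystack-style layout: packed data plus a small metadata index.

    The index (a dict here, RocksDB or a distributed database in a real
    system) maps an object id to (pack path, offset, length), so a read
    costs one index lookup plus one seek in the pack.
    """

    def __init__(self, pack_path):
        self.pack = PackWriter(pack_path)
        self.index = {}  # object id -> (pack path, offset, length)

    def put(self, data: bytes) -> bytes:
        obj_id = hashlib.sha256(data).digest()
        if obj_id not in self.index:  # objects are immutable, write once
            offset, length = self.pack.append(data)
            self.index[obj_id] = (self.pack.path, offset, length)
        return obj_id

    def get(self, obj_id: bytes) -> bytes:
        pack_path, offset, length = self.index[obj_id]
        with open(pack_path, "rb") as f:
            f.seek(offset)
            return f.read(length)


storage = PackedObjectStorage("objects.pack")
obj_id = storage.put(b"hello world")
assert storage.get(obj_id) == b"hello world"
```

The point of the pattern is that per-object filesystem overhead disappears: only the packs are files, and each index entry is a few dozen bytes per object.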
# Discussions
* [Looking for hardware to benchmark the object storage design](https://sympa.inria.fr/sympa/arc/swh-devel/2021-03/msg00007.html)
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html | Scale out object storage design (take 1) ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimum size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* Writes are limited by I/O to 10MB/s
* Reads currently run at ~300 objects per second (25MB/s); they ran at ~500 objects per second (44MB/s) in the past
* 50TB of objects (30TB after ZFS compression) are added every month
* Available space will be exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB after ZFS compression) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]])
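A back-of-the-envelope calculation from the figures above makes the small-object problem explicit (approximate numbers, derived only from the statistics listed here):

```python
objects = 10e9                  # 10 billion objects
raw_bytes = 750e12              # 750TB before ZFS compression
compressed_bytes = 350e12       # 350TB after ZFS compression

print(raw_bytes / objects)         # ~75,000 bytes: ~75KB per object on average
print(compressed_bytes / objects)  # ~35,000 bytes: ~35KB per object after compression
```

With an average object around 75KB (and, per the ceph-users thread above, 75% of objects under 16KB), per-object costs such as the RADOS 64KB minimum allocation or one-file-per-object layouts dominate the space usage, which is why the explorations above focus on object packing.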
# Goals
* Write > 100MB/s, ~3,000 objects/s
* Read > 100MB/s, ~3,000 objects/s
* Durability overhead (erasure coding) 50% (2+1 or 4+2, see the calculation after this list)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to receive the first byte of an object after sending a read request to the server) < 100ms
* 100 billion objects
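A quick sanity check of two of the targets above (my own arithmetic, not part of the original goals): both proposed erasure coding schemes have the same 50% overhead, and the combined throughput and object-rate goals imply an average object size roughly in line with the current ~35KB compressed average.

```python
# Erasure coding overhead = parity chunks / data chunks.
for data_chunks, parity_chunks in [(2, 1), (4, 2)]:
    overhead = parity_chunks / data_chunks
    print(f"{data_chunks}+{parity_chunks}: {overhead:.0%} overhead")  # 50% in both cases

# Average object size implied by the throughput goals: 100MB/s at ~3,000 objects/s.
print(100e6 / 3000)  # ~33,000 bytes, i.e. ~33KB per object
```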