[Parent task for all related tasks]
# Current status
Experimentation with T3049 RBD is paused in favor of T3052 RADOS, because the [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | space amplification problem on spinners is fixed ]] in the Ceph Pacific release, due in March 2021.
# Explorations
* Scale out data and metadata
** T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
** ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Scale out data and scale up metadata. The metadata is kept in a database (RocksDB, etc.) that must be looked up to find where the data is stored, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf | Finding a needle in Haystack: Facebook’s photo storage ]].
** T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
** T3051 EOS is too complex (uses RBD + Paxos + RocksDB for metadata)
** T3050 libcephsqlite has a hard limit at ~300TB
** T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + an internal database for metadata)
** OpenIO (https://github.com/open-io): replication is a proprietary feature (https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html)
** https://ipfs.io/ does not provide replication or self-healing; performance and space overhead are probably the same as the current Software Heritage storage system.
** https://www.rozosystems.com/about claims a software patent on the implementation
** http://www.orangefs.org/ and http://beegfs.io/ focus on high-end computing
** https://www.lustre.org/ and https://moosefs.com/ are distributed file systems, not object / block storage
** [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, a space overhead identical to the current Software Heritage storage system.
** [[ https://docs.openstack.org/swift/latest/ | Swift ]] [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | stores each object in an individual file on a file system ]], a space overhead identical to the current Software Heritage storage system.
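The "scale out data, scale up metadata" split above (as in Haystack) can be sketched as a small metadata index mapping an object id to its location in a large packed volume. This is an illustrative sketch, not code from any of the candidate systems: the `PackedStore` class and its layout are assumptions, with an in-memory dict standing in for the metadata database (RocksDB, etc.) and a byte buffer standing in for a volume file or RBD image.

```python
import hashlib
import io

class PackedStore:
    """Haystack-style sketch: objects are packed into one large append-only
    volume; a small index maps object id -> (offset, length)."""

    def __init__(self):
        self.volume = io.BytesIO()  # stands in for one large volume file / RBD image
        self.index = {}             # stands in for the scale-up metadata database

    def put(self, data: bytes) -> str:
        oid = hashlib.sha1(data).hexdigest()  # content-addressed id
        offset = self.volume.seek(0, io.SEEK_END)
        self.volume.write(data)               # append to the packed volume
        self.index[oid] = (offset, len(data)) # one small index entry per object
        return oid

    def get(self, oid: str) -> bytes:
        offset, length = self.index[oid]      # one index lookup...
        self.volume.seek(offset)
        return self.volume.read(length)       # ...then one read from the volume
```

Packing many small objects into large volumes avoids the per-file overhead of storing each object individually, at the cost of maintaining the metadata index.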
# Discussions
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimum size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes to 10MB/s
* reads currently perform at ~300 objects per second (25MB/s); they performed at ~500 objects per second (44MB/s) in the past
* 50TB of objects (30TB after ZFS compression) added every month
* Available space exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB after ZFS compression) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]])
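A back-of-the-envelope check on the figures above shows why small objects dominate (and why the RADOS 64KB minimum allocation discussed above matters). This only derives numbers already stated in this section:

```python
total_objects = 10e9        # 10 billion objects
raw_bytes = 750e12          # 750TB before compression
compressed_bytes = 350e12   # 350TB after ZFS compression

# Average object size: 750TB / 10 billion objects = 75KB,
# so a 64KB minimum allocation unit wastes space on many objects.
avg_object_size = raw_bytes / total_objects

# ZFS compresses the corpus to about 47% of its raw size.
compression_ratio = compressed_bytes / raw_bytes
```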
# Goals
* Write > 100MB/s
* Read > 100MB/s
* Durability overhead (erasure coding): 50% (e.g. 2+1 or 4+2)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
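The 50% durability overhead follows directly from the k+m erasure coding profiles listed above: m parity chunks are stored for every k data chunks, so the overhead is m/k of the payload size. A minimal check:

```python
def ec_overhead(k: int, m: int) -> float:
    """Space overhead of a k+m erasure code: m parity chunks
    are stored for every k data chunks, i.e. overhead = m / k."""
    return m / k

# Both profiles named in the goal give the same 50% overhead:
print(ec_overhead(2, 1))  # 2+1 -> 0.5
print(ec_overhead(4, 2))  # 4+2 -> 0.5
```

Note that 4+2 tolerates the loss of any two chunks while 2+1 tolerates only one, so the profiles trade fault tolerance against the number of hosts required, at identical space overhead.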