[Parent task for all related tasks]
# Current status
A [scale-out object storage](https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html) design was proposed. It has to be described in detail, and benchmarks need to be written to verify that it is efficient (in space and speed) for the intended use cases. The hardware to run the benchmarks has to be specified and secured.
# Design
### Architecture
There are two pools: read/write and readonly. The read/write pool has a limited number of 100GB files (for instance 200 RBD images, each containing a SQLite database) and relies on fast hardware (for instance SSDs hosted by machines on a 10Gb/s network). The readonly pool has an ever growing number of readonly 100GB files (for instance RBD images formatted as RocksDB SST files). Files in both pools are identified by a unique UUID. Objects are identified by the SHA256 of their content and the UUID of the file that contains them.
An index of the SHA256 of all objects in the object storage is kept in the read/write pool for deduplication purposes (for instance a RocksDB database, fronted by a Bloom filter to quickly determine that a SHA256 is not in the index).
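To make the dedup lookup concrete, here is a minimal sketch in Python of such an index: a toy Bloom filter answers "definitely absent" without touching disk, and only "maybe present" falls through to the key-value store. `DedupIndex` and its `kv` backend are hypothetical stand-ins for the RocksDB database mentioned above.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: answers 'definitely absent' or 'maybe present'."""

    def __init__(self, size_bits: int = 1 << 24, hashes: int = 4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Derive the bit positions from salted SHA256 digests of the key.
        for salt in range(self.hashes):
            h = hashlib.sha256(bytes([salt]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class DedupIndex:
    """SHA256 -> file UUID index kept in the read/write pool."""

    def __init__(self, kv):
        self.kv = kv              # RocksDB-like mapping; a dict works to test
        self.bloom = BloomFilter()

    def lookup(self, sha256: bytes):
        if not self.bloom.may_contain(sha256):
            return None             # definitely new: no disk lookup needed
        return self.kv.get(sha256)  # maybe present: confirm in the store

    def insert(self, sha256: bytes, file_uuid: str) -> None:
        self.kv[sha256] = file_uuid
        self.bloom.add(sha256)
```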
### Writing
Before writing an object, the SHA256 of its content is looked up in the index of all object SHA256s. If it is found, nothing happens. Otherwise the object is written to one of the files in the read/write pool, chosen at random. The id of the object (SHA256 + UUID of the file) is returned once it is persisted in both the file and the index.
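A minimal sketch of this write path, reusing the hypothetical `DedupIndex` above; `rw_pool` is an assumed mapping from file UUIDs to append-only file objects:

```python
import hashlib
import random


def write_object(content: bytes, index, rw_pool):
    """Dedup on SHA256, then append to a randomly chosen read/write file."""
    sha256 = hashlib.sha256(content).digest()
    file_uuid = index.lookup(sha256)
    if file_uuid is not None:
        return sha256, file_uuid                # already archived: no-op
    file_uuid = random.choice(list(rw_pool))    # pick a r/w file at random
    rw_pool[file_uuid].append(sha256, content)  # hypothetical append-only API
    index.insert(sha256, file_uuid)             # file persisted first, index second
    return sha256, file_uuid                    # object id = SHA256 + file UUID
```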
### Packing
When a file grows bigger than a threshold (for instance 100GB), it becomes readonly. The readonly files from the read/write pool are periodically packed into a readonly format (for instance RBD images formatted as RocksDB SST files) and moved to the readonly pool in the process.
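A sketch of the packing loop under the same assumptions; `size()`, `freeze()`, `objects()` and `ro_pool.create()` are hypothetical APIs, and sorting by SHA256 stands in for rewriting the file as an SST / Sorted String Table:

```python
THRESHOLD = 100 * 10**9  # 100GB, the example threshold above


def pack_full_files(rw_pool, ro_pool) -> None:
    """Move files that crossed the size threshold to the readonly pool."""
    for file_uuid, f in list(rw_pool.items()):
        if f.size() < THRESHOLD:
            continue
        f.freeze()  # the file stops accepting writes
        # Sort objects by SHA256 so readers can binary-search the packed
        # file, which is what an SSTable-like layout provides.
        packed = sorted(f.objects(), key=lambda obj: obj.sha256)
        ro_pool.create(file_uuid, packed)  # keeps the same UUID
        del rw_pool[file_uuid]
```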
### Reading
Given the id of an object (SHA256 + UUID of the file), the file is looked up in the readonly pool first. If it does not exist there, it is looked up in the read/write pool. Once the file is located, the object content is retrieved from it. A reader that is not interested in the most up-to-date content from Software Heritage can limit its search to the readonly pool.
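The corresponding read path, sketched with the same hypothetical pool objects; `get()` returns the content or `None`:

```python
def read_object(sha256: bytes, file_uuid: str, ro_pool, rw_pool,
                latest: bool = True):
    """Fetch an object by its id (SHA256 + file UUID).

    A reader that can live without the most recent objects passes
    latest=False and never touches the read/write pool."""
    f = ro_pool.get(file_uuid)
    if f is None and latest:
        f = rw_pool.get(file_uuid)  # not packed yet: still read/write
    if f is None:
        raise KeyError(f"unknown file {file_uuid}")
    return f.get(sha256)            # retrieve the object content by SHA256
```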
# Consistency
Taken together, the two pools (read/write and readonly) are strongly consistent at any point in time: as soon as an object is written (i.e. the write operation returns to the caller), a reader can get the object from one of the two pools.
The readonly pool alone is only eventually consistent: it does not contain the latest objects inserted in the object storage, but it eventually will. At any point in time it contains all objects inserted in the object storage up to a given date.
# Use cases
### Adding objects
* A loader:
  * gets a source code file from the net
  * calculates the SHA256 of its content
  * looks up the SHA256 in the index and stops if it already exists
  * picks a read/write file at random
  * stores the object in the file and adds the UUID of the file to the SWHID
  * publishes the SWHID of the newly added object on the Kafka bus (see the sketch after this list)
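A sketch of the loader flow, reusing `write_object()` from the write-path sketch above. The SWHID syntax with a `file=` qualifier is purely illustrative (real SWHIDs are computed differently), and the topic name is made up; the producer uses the kafka-python package:

```python
from kafka import KafkaProducer  # kafka-python, an assumed dependency

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder


def load(content: bytes, index, rw_pool) -> str:
    """Archive one source code file, then announce it on the bus."""
    sha256, file_uuid = write_object(content, index, rw_pool)
    # Illustrative only: shows the file UUID carried along with the hash.
    swhid = f"swh:1:cnt:{sha256.hex()};file={file_uuid}"
    producer.send("swh.objects.added", swhid.encode())  # made-up topic
    return swhid
```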
### Reading objects
* Serving objects from the web interface:
  * takes the file UUID from the SWHID
  * GETs the object, using the SHA256 from the SWHID, from the readonly pool file with that UUID
  * if the object is found, returns it
  * otherwise GETs the object, using the SHA256 from the SWHID, from the read/write pool file with that UUID
* Reconstructing an archived tarball or repo for the "vault":
  * identical to serving objects from the web interface
* Push mirroring of objects (see the sketch after this list):
  * takes the UUID of one of the not yet mirrored files at random from the readonly pool
  * for each object in the file, sends the object to the mirror via S3
  * when done, adds the UUID of the file to the list of files already mirrored
* Pull mirroring of files (see the sketch after this list):
  * the files are available at mirror.softwareheritage.com as mapped RBD devices /dev/rbd/readonly/UUID
  * picks a UUID not already mirrored locally
  * creates an RBD image with the same UUID
  * GETs the file and writes it to /dev/rbd/readonly/UUID
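Sketches of both mirroring flows. The push side uses boto3 against any S3-compatible endpoint; the bucket layout and the pool/file APIs are assumptions:

```python
import random

import boto3  # AWS SDK; works against any S3-compatible endpoint

s3 = boto3.client("s3")  # credentials and endpoint come from the environment


def push_mirror(ro_pool, mirrored: set, bucket: str) -> None:
    """Send every object of one not-yet-mirrored readonly file via S3."""
    remaining = [u for u in ro_pool.uuids() if u not in mirrored]
    if not remaining:
        return  # everything is mirrored already
    file_uuid = random.choice(remaining)
    for obj in ro_pool.get(file_uuid).objects():
        s3.put_object(Bucket=bucket, Key=obj.sha256.hex(), Body=obj.content)
    mirrored.add(file_uuid)  # mark the file only once it is fully sent
```

The pull side drives the stock `rbd` CLI; the download URL scheme on mirror.softwareheritage.com is an assumption:

```python
import subprocess
import urllib.request


def pull_mirror(file_uuid: str, size: str = "100G") -> None:
    """Create an RBD image with the same UUID and stream the file onto it."""
    subprocess.run(["rbd", "create", f"readonly/{file_uuid}", "--size", size],
                   check=True)
    subprocess.run(["rbd", "map", f"readonly/{file_uuid}"], check=True)
    url = f"https://mirror.softwareheritage.com/{file_uuid}"  # assumed layout
    with urllib.request.urlopen(url) as src, \
         open(f"/dev/rbd/readonly/{file_uuid}", "wb") as dst:
        while chunk := src.read(1 << 20):  # copy in 1MB chunks
            dst.write(chunk)
```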
# Explorations
* Scale out data and metadata
  * T3064 [[ https://github.com/linkedin/ambry | ambry ]]
  * T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
  * ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Object packing
  * [RocksDB SST](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
  * [ambry partition format](https://forge.softwareheritage.org/T3064) (append only)
  * [Sorted String Table](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf) (read only), used [by ambry](https://engineering.linkedin.com/blog/2019/05/introducing-data-compaction-in-ambry)
  * T3050 libcephsqlite or SQLite on top of RBD (read write)
  * T3046 Using xz-file-format for 1TB archives
  * T3045 Using pixz for 1TB archives
  * T3048 Using a custom format for 1TB archives
* Scale out data and scale up metadata. The metadata is in a database (RocksDB, etc.) that must be looked up to figure out where the data is to be found, as described in [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf | Finding a needle in Haystack: Facebook's photo storage ]].
  * T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
  * T3051 EOS is too complex (uses RBD + Paxos + QuarkDB for namespace)
  * T3050 libcephsqlite has a hard limit at ~300TB
  * T3057 [[ https://github.com/chrislusf/seaweedfs | SeaweedFS ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
  * https://github.com/open-io: replication is a proprietary feature (https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html)
  * https://ipfs.io/ does not provide replication or self-healing; performance and space overhead are probably the same as with the current Software Heritage storage system
  * https://www.rozosystems.com/about claims a software patent on the implementation
  * http://www.orangefs.org/ and http://beegfs.io/ focus on high-performance computing
  * https://www.lustre.org/ and https://moosefs.com/ are distributed file systems, not object / block storage
  * [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, a space overhead identical to the current Software Heritage storage system
  * [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system ]], a space overhead identical to the current Software Heritage storage system
* Inspiration
  * T3065 git partial clone (in part because it does packing, in part because it is source code related)
* Hardware
  * [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
# Discussions
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html | Scale out object storage design (take 1) ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimum size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes to 10MB/s
* reads currently perform at ~300 objects per second (25MB/s); they performed at ~500 objects per second (44MB/s) in the past
* 50TB of objects (30TB ZFS-compressed) are added every month
* Available space will be exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB ZFS-compressed) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]])
# Goals
* Write > 100MB/s
* Read > 100MB/s
* Durability overhead (erasure coding) 50% (e.g. 2+1 or 4+2)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms