[Parent task for all related tasks]
# Current status
A [scale-out object storage](https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html) design was proposed. It has to be described in detail, and benchmarks need to be written to verify it is efficient (in space and speed) for the intended use cases. The hardware to run the benchmarks has to be specified and secured.
# Design
It scales out for readers and scales up for writers.
### Architecture
* A readonly Ceph pool (machines with HDD) containing:
  * An RBD image for each shard (named after the shard UUID), in a custom SST format mapping SHA256 => object
* A read/write database (machines with SSD) containing newly written objects
### Scale out write and read
Every object is uniquely identified by the SHA256 of its content and the UUID of the shard that contains it. It is the responsibility of the caller to save the object identifier. There is no guarantee that two objects with the same content will be deduplicated.
* **Writing:** The object is inserted in the read/write database and associated with the current unique shard UUID. Adding an object that already exists in the readonly pool will create a duplicate.
* **Packing to readonly:** When the database grows bigger than a threshold (100GB for instance), the current shard UUID is retired and a new unique shard UUID is allocated and becomes the current one. A shard is created in the readonly Ceph pool; the objects associated with the retired UUID are sorted and copied to it. When the readonly shard is complete, these objects are removed from the database.
* **Reading:** Given the id of an object (SHA256 + UUID of the shard), the shard is looked up in the readonly pool first. If it does not exist there, the object is looked up in the read/write database. Once the shard is located, the object content is retrieved from it. If the reader is not interested in the most up to date content from Software Heritage, it can limit its search to the readonly pool.
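The write and packing flow can be sketched as follows. This is a minimal in-memory sketch; the class and method names are illustrative, not part of the design.

```python
import hashlib
import uuid

SHARD_THRESHOLD = 100 * 1024**3  # packing threshold from the design (100GB)


class WritePool:
    """In-memory sketch of the read/write side: objects accumulate under
    the current shard UUID; when the threshold is crossed the shard is
    retired (in the real design: sorted, packed into a readonly Ceph
    shard, and removed from the database) and a new UUID becomes current."""

    def __init__(self):
        self.current_shard = uuid.uuid4()
        self.current_size = 0
        self.objects = {}  # (shard UUID, SHA256 hex) -> content

    def write(self, content: bytes):
        sha256 = hashlib.sha256(content).hexdigest()
        self.objects[(self.current_shard, sha256)] = content
        self.current_size += len(content)
        shard = self.current_shard
        if self.current_size > SHARD_THRESHOLD:
            # retire the full shard; a new UUID becomes the current shard
            self.current_shard = uuid.uuid4()
            self.current_size = 0
        # the caller must keep (shard UUID, SHA256): it is the object id
        return shard, sha256
```

Two writes that together exceed the threshold end up under different shard UUIDs, which is exactly the rollover that triggers packing.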
### Indexing
An index of all the SHA256 in the object storage is added for (i) deduplication and (ii) retrieval using only the SHA256 of an object as a key.
* The read/write database has an index mapping the SHA256 of an object to its content.
* The readonly index maps the SHA256 of the objects in the readonly pool to the UUID of the shard that contains them. It is a sorted list of fixed-size entries stored in an RBD image.
* **Writing:**
  * If the SHA256 exists in the readonly index, return
  * Otherwise, insert the SHA256 in the database index
* **Packing to readonly:** When the database grows bigger than a threshold (for instance 100GB):
  * An index of all the objects in the shard is exported
  * The exported index is merged into the global readonly index
* **Reading:**
  * If the SHA256 exists in the readonly index, return the object from the readonly shard
  * Otherwise, return the object from the database
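A sorted list of fixed-size entries can be searched and merged as in the sketch below. The 48-byte entry layout (32 bytes of SHA256 followed by the 16-byte shard UUID) is an assumption for illustration, not the on-disk format.

```python
import uuid

ENTRY = 48  # assumed layout: 32-byte SHA256 + 16-byte shard UUID


def lookup(index: bytes, sha256: bytes):
    """Binary search over the sorted fixed-size entries; returns the
    UUID of the shard holding the object, or None if the SHA256 is absent."""
    lo, hi = 0, len(index) // ENTRY
    while lo < hi:
        mid = (lo + hi) // 2
        key = index[mid * ENTRY : mid * ENTRY + 32]
        if key < sha256:
            lo = mid + 1
        elif key > sha256:
            hi = mid
        else:
            return uuid.UUID(bytes=index[mid * ENTRY + 32 : (mid + 1) * ENTRY])
    return None


def merge(a: bytes, b: bytes) -> bytes:
    """Merge two sorted indexes, as when the index exported from a freshly
    packed shard is folded into the global readonly index."""
    seen = {}
    for blob in (a, b):
        for i in range(0, len(blob), ENTRY):
            entry = blob[i : i + ENTRY]
            seen.setdefault(entry[:32], entry)  # first entry wins on duplicates
    return b"".join(seen[k] for k in sorted(seen))
```

The fixed entry size is what makes the binary search possible without any in-memory structure: the entry at position `i` always starts at byte `i * 48`.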
### Consistency
The content of the read/write database is strongly consistent at any point in time. As soon as an object is written (i.e. the write operation returns to the caller), a reader can get the object from the database.
The readonly pool is eventually consistent: it does not contain the latest objects inserted in the object storage, but it eventually will. At any point in time it contains all the objects inserted in the object storage up to a given date.
# Use cases
### API implementation
* [swh-objstorage](https://docs.softwareheritage.org/devel/apidoc/swh.objstorage.objstorage.html)
* [swh-web API](https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html)
### Adding objects
* A loader:
* gets a source code file from the net
* calculates the SHA256 of its content
* looks up the SHA256 in the index and stops if it already exists
* stores the object in the object storage (in the current read/write shard) and adds the shard UUID to the SWHID
* publishes the newly added object SWHID to the kafka bus
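The loader steps above can be sketched as follows; `objstorage` and `journal` (the kafka producer) are hypothetical interfaces, and the SWHID-with-shard layout is purely illustrative.

```python
import hashlib


def load_file(content: bytes, objstorage, journal):
    """Sketch of the loader flow: deduplicate on SHA256, store, publish.
    `objstorage` and `journal` are hypothetical interfaces, not the real
    client APIs."""
    sha256 = hashlib.sha256(content).hexdigest()
    if objstorage.index_contains(sha256):
        return None  # already archived: stop
    shard_uuid = objstorage.add(sha256, content)  # lands in the current shard
    # illustrative identifier layout only: SWHID plus the shard UUID
    swhid = f"swh:1:cnt:{sha256};shard={shard_uuid}"
    journal.publish(swhid)  # announce the new object on the kafka bus
    return swhid
```

Loading the same content twice stops at the index lookup, which is the deduplication point of the design.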
### Reading objects
* Serving objects from the web interface:
* Takes the shard UUID from the SWHID
* GET the object identified by the SHA256 of the SWHID from that shard in the readonly pool
* If the object is found, return it
* Otherwise GET the object from the same shard in the read/write pool
* Reconstruct an archived tarball or repo for the "vault":
* Identical to serving objects from the web interface.
* Push objects mirroring:
* Picks at random the UUID of a readonly shard that has not yet been mirrored
* For each object in the shard, sends the object to the mirror via S3
* When done, adds the UUID of the shard to the list of shards already mirrored
* Pull file mirroring:
* The files are available at mirror.softwareheritage.com as mapped RBD devices /dev/rbd/readonly/UUID
* Pick a UUID not already mirrored locally
* Create an RBD image with the same UUID
* GET the shard and write it to /dev/rbd/readonly/UUID
* Full enumeration for mining purposes:
* Examples:
* find all files whose names are LICENSE/COPYRIGHT/COPYING/etc. pointed by any commit in the archive and retrieve them
* find all archived files referenced by any commit in the top-1k (by number of stars) GitHub repos
* Start enumeration from UUID:SHA256 up to N objects
* N objects are returned starting from UUID:SHA256, in order
* The client is responsible for keeping track of the UUID:SHA256 from which to resume the enumeration
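The enumeration contract above can be sketched with a plain sorted list of "UUID:SHA256" strings standing in for the readonly index:

```python
import bisect


def enumerate_objects(index, after: str, n: int):
    """Return up to n object ids strictly after `after`, in sorted order.
    The client keeps the last id of each page and passes it back to
    resume; pass "" to start from the beginning."""
    start = bisect.bisect_right(index, after)
    return index[start : start + n]
```

Making the client hold the cursor keeps the server stateless: any server replica can answer the next page from the same sorted index.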
# Explorations
* Scale out data and metadata
* T3064 [[ https://github.com/linkedin/ambry | ambry ]]
* T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performance)
* ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Object packing
* T3066 [RocksDB SST](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
* [ambry partition format](https://forge.softwareheritage.org/T3064) (append only)
* T3068 [Sorted String Table](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf) (read only)
* T3050 libcephsqlite or SQlite on top of RBD (read write)
* T3046 Using xz-file-format for 1TB archive
* T3045 Using pixz for 1TB archives
* T3048 Using a custom format for 1TB archive
* T3069 Using MZ as a file format
* Scale out data and scale up metadata. The metadata is in a database (Rocksdb, etc.) that must be looked up to figure out where the data is to be found, as described in the [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf| Finding a needle in Haystack: Facebook’s photo storage ]].
* T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
* T3051 EOS is too complex (uses RBD + Paxos + QuarkDB for namespace)
* T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
* https://github.com/open-io replication is a proprietary feature https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html
* https://ipfs.io/ does not provide replication or self-healing. Performance and space overhead are probably the same as the current Software Heritage storage system.
* https://www.rozosystems.com/about claims a software patent on the implementation
* http://www.orangefs.org/ or http://beegfs.io/ have a focus on high-end computing
* https://www.lustre.org/ https://moosefs.com/ are distributed file systems, not object / block storage
* [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, a space overhead that is identical to the current Software Heritage storage system.
* [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system]], a space overhead that is identical to the current Software Heritage storage system.
* Inspiration
* T3065 git partial clone (in part because it does packing, in part because it is source code related)
* Hardware
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
# Discussions
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html | Scale out object storage design (take 1) ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimun size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes at 10MB/s
* reads currently perform at ~300 objects per second (25MB/s); they performed at ~500 objects per second (44MB/s) in the past
* 50TB (30TB ZFS compressed) of objects added every month
* Available space exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB ZFS compressed) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]] )
# Goals
* Write > 100MB/s, ~3,000 objects/s
* Read > 100MB/s, ~3,000 objects/s
* Durability overhead (erasure coding) 50% (2+1, 4+2)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
* 100 billion objects
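A quick back-of-the-envelope check relating the current figures to the goals (decimal units; all numbers come from the two sections above):

```python
# figures taken from the "Quantitative data" and "Goals" sections
current_avg = 750e12 / 10e9   # 750TB over 10 billion objects, in bytes
goal_avg = 100e6 / 3000       # 100MB/s at ~3,000 objects/s, in bytes
print(f"current average object: {current_avg / 1e3:.0f}KB")   # 75KB
print(f"average implied by goals: {goal_avg / 1e3:.0f}KB")    # 33KB
```

The throughput and object-rate goals are thus sized for objects noticeably smaller than today's 75KB average, in line with the observation (see the ceph-users discussion linked above) that most objects are small.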