[Parent task for all related tasks]
# Current status
A [scale-out object storage](https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html) design was proposed. It has to be described in detail, and benchmarks need to be written to verify that it is efficient (in space and speed) for the intended use cases. The hardware to run the benchmarks has to be specified and secured.
# Design
It scales out for readers and scales up for writers.
# Object storage
## Glossary
* Object: an opaque sequence of bytes.
* Object HASH (HASH for short): the hash of an Object, i.e. the intrinsic identifier in a [SWHID](https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#core-identifiers).
* Shard: a group of Objects.
* Shard [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier) (UUID for short): the unique identifier of a Shard.
* Object ID (ID for short): the Object HASH and the Shard UUID containing the object.
* Read Storage: the unlimited-size storage from which clients can only read Objects. It only contains Objects up to a given point in time.
* Write Storage: the fixed-size storage from which clients can read or write. If an Object is not found in the Write Storage, it must be retrieved from the Read Storage.
* Object Storage: the content of the Write Storage and the Read Storage combined.
* [LSM](https://en.wikipedia.org/wiki/Log-structured_merge-tree): an embedded database for key-value data such as [RocksDB](https://en.wikipedia.org/wiki/RocksDB).
* [SST](https://forge.softwareheritage.org/T3068): a read-only Sorted String Table consisting of a sequence of key/value pairs and an index.
* Database: [PostgreSQL](https://en.wikipedia.org/wiki/PostgreSQL), [Cassandra](https://en.wikipedia.org/wiki/Apache_Cassandra), etc.
* [Ceph](https://en.wikipedia.org/wiki/Ceph_(software)): a self-healing distributed storage.
* [RBD](https://docs.ceph.com/en/latest/rbd/) image: a Ceph block storage image that can be used either via the librbd library or as a block device from /dev/rbd.
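As an illustration, the identifiers defined above can be modeled in a few lines of Python (a hedged sketch: the hash function is assumed here to be SHA256, and all names are hypothetical):

```python
import hashlib
import uuid
from typing import NamedTuple

class ObjectId(NamedTuple):
    """An Object ID: the Object HASH plus the UUID of the Shard holding it."""
    hash: bytes       # Object HASH, assumed SHA256 of the content
    shard: uuid.UUID  # Shard UUID

def object_hash(content: bytes) -> bytes:
    # Intrinsic identifier: derived from the content alone.
    return hashlib.sha256(content).digest()

# The same content always yields the same Object HASH, but the Shard UUID
# depends on where the Object happens to be stored.
oid = ObjectId(object_hash(b"hello"), uuid.uuid4())
```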
The key concepts are:
* Packing millions of Objects together in Shards to:
* save space and,
* efficiently perform bulk actions such as mirroring or enumerations.
* Two different architectures:
* Read Storage that takes advantage of the fact that Objects are immutable and never deleted and,
* Write Storage from which Shards are moved to the Read Storage when they are full.
* Identifying an object by its Object HASH and the Shard UUID that contains it so that its location can be determined from the Object ID.
While the architecture based on these concepts scales out for writing and reading, it cannot address Objects by their Object HASH alone, which is inconvenient for a number of use cases. An index mapping the Object HASH to the Shard UUID must be added to provide this feature, but it does not scale out for writes.
The content of the Object Storage (i.e. the Write Storage and the Read Storage combined) is **strongly consistent** at any point in time. As soon as an Object is written (i.e. the write operation returns to the process), a reader can get the Object content from the Object Storage (it may require looking up the Object in both the Write Storage and the Read Storage).
The Read Storage is **eventually consistent**. It does not contain the latest Objects inserted in the Write Storage but it will, eventually. It contains all objects inserted in the Object Storage, up to a given date.
## Object lookup requires the Object ID
### Architecture
* Write Storage:
* A fixed number of Databases
* Read Storage:
* Shards implemented as Ceph RBD images named after their Shard UUID
* The content of the Shard is formatted as a SST
### Writing
The Object is stored in one of the Databases from the Write Storage. The Database is chosen at random. A Database has a current Shard UUID, chosen at random, which is associated with all Objects written to that Database.
A successful Object write returns the Object ID. Writing the same Object twice may return different Object IDs: the Object HASH will be the same because it is derived from the content of the Object, but the Shard in which the Object is stored may differ, and so will the Shard UUID.
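This write path can be sketched as follows (a minimal in-memory stand-in for the Databases; the `WriteStorage` class and its layout are hypothetical, not the real implementation):

```python
import random
import uuid

class WriteStorage:
    """Sketch of the Write Storage: a fixed number of Databases, each with a
    current Shard UUID tagged onto every Object it receives."""

    def __init__(self, n_databases: int = 4):
        # Each Database: its current Shard UUID plus a {hash: content} map.
        self.databases = [
            {"shard": uuid.uuid4(), "objects": {}} for _ in range(n_databases)
        ]

    def write(self, obj_hash: bytes, content: bytes):
        # The Database is chosen at random; the Object is associated with
        # that Database's current Shard UUID.
        db = random.choice(self.databases)
        db["objects"][obj_hash] = content
        # A successful write returns the Object ID: (HASH, Shard UUID).
        return (obj_hash, db["shard"])
```

Note how writing the same content twice may return two different Object IDs, because a different Database (hence a different current Shard UUID) may be picked each time.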
### Packing
When a Database grows bigger than a threshold (for instance 100GB), the current Shard UUID is retired and another one is allocated at random. A Shard is created in the Read Storage and the Objects that belong to the former Shard UUID are sorted and copied to it. When the Shard is complete, the Objects are removed from the Database.
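A minimal sketch of this packing step, with hypothetical in-memory structures standing in for real Databases and RBD-backed Shards:

```python
import uuid

def pack(db: dict, read_storage: dict) -> uuid.UUID:
    """Pack a Database's current Shard into the Read Storage.

    db is {"shard": UUID, "objects": {hash: content}}; read_storage maps
    Shard UUID -> sorted list of (hash, content) pairs (both hypothetical
    stand-ins for the real storage)."""
    full_shard = db["shard"]
    db["shard"] = uuid.uuid4()  # retire the full Shard, allocate a new one
    # Copy the Objects sorted by HASH, matching the SST layout of a Read
    # Storage Shard, then remove them from the Database.
    read_storage[full_shard] = sorted(db["objects"].items())
    db["objects"].clear()
    return full_shard
```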
### Reading
The Shard UUID is extracted from the Object ID. If the Shard exists in the Read Storage, the Object HASH is used to look up the content there. Otherwise the Database that owns the Shard UUID is looked up in the Write Storage and the Object HASH is used to look up the content. If the reader is not interested in the most up-to-date content, it can limit its search to the Read Storage.
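The ID read path can be sketched as follows (in-memory dictionaries standing in for the two storages; names are hypothetical):

```python
def read(object_id, read_storage, write_storage):
    """Read an Object by Object ID.

    read_storage and write_storage both map Shard UUID -> {hash: content}
    (hypothetical in-memory stand-ins)."""
    obj_hash, shard_uuid = object_id
    if shard_uuid in read_storage:
        # The Shard was packed: look the content up by Object HASH.
        return read_storage[shard_uuid][obj_hash]
    # Otherwise the Shard still lives in a Write Storage Database.
    return write_storage[shard_uuid][obj_hash]
```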
## Object lookup with the Object HASH
An index mapping the Object HASH of all known Objects to the Shard UUID is used to:
* allow clients to fetch Objects using their Object HASH only instead of their Object ID.
* deduplicate identical Objects based on their Object HASH
### Architecture
* Write Storage:
* The index of Object IDs from the Write Storage and the Read Storage is stored in a LSM
* Read Storage:
* The index of Object IDs from the Read Storage is stored in a LSM
* Copies of the index are stored in Ceph RBD images, formatted as SST
### Writing
If the Object HASH exists in the Read Storage index, do nothing. Otherwise perform the write and add the Object ID to the Write Storage index.
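The HASH write path can be sketched as follows (hypothetical in-memory stand-ins for the indexes and the Database):

```python
def write_with_index(obj_hash, content, read_index, write_index, write_db):
    """Deduplicating write keyed on the Object HASH.

    read_index and write_index map Object HASH -> Shard UUID; write_db is
    {"shard": UUID, "objects": {hash: content}} (all hypothetical)."""
    if obj_hash in read_index:
        # The Object is already packed in the Read Storage: do nothing.
        return (obj_hash, read_index[obj_hash])
    # Perform the write and record the Object ID in the Write Storage index.
    write_db["objects"][obj_hash] = content
    write_index[obj_hash] = write_db["shard"]
    return (obj_hash, write_db["shard"])
```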
### Packing
During packing, each Object HASH is looked up in the Read Storage index. If it exists, the Object is discarded. Otherwise its Object ID is inserted in the index. When packing is complete:
* copies of the Read Storage index are updated with the newly added Object IDs.
* Object HASHes that were found to be duplicates are updated in the Write Storage index using the Shard UUID found in the Read Storage index
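The index-aware packing step can be sketched as follows (hypothetical in-memory stand-ins for the Database, the indexes and the Read Storage):

```python
import uuid

def pack_with_index(db, read_index, write_index, read_storage):
    """Pack a Database's Shard, deduplicating against the Read Storage index.

    read_index/write_index map Object HASH -> Shard UUID; db is
    {"shard": UUID, "objects": {hash: content}} (all hypothetical)."""
    shard_uuid = db["shard"]
    db["shard"] = uuid.uuid4()  # retire the full Shard UUID
    packed = {}
    for obj_hash, content in sorted(db["objects"].items()):
        if obj_hash in read_index:
            # Duplicate: repoint the Write Storage index at the Shard that
            # already holds this Object in the Read Storage.
            write_index[obj_hash] = read_index[obj_hash]
        else:
            # New Object: it joins the Shard and the Read Storage index.
            packed[obj_hash] = content
            read_index[obj_hash] = shard_uuid
    read_storage[shard_uuid] = packed
    db["objects"].clear()
    return shard_uuid
```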
### Reading
If the Object HASH is found in the Read Storage index, use the Shard UUID to read the content from the Read Storage. Otherwise look up the Shard UUID in the Write Storage index and perform a read with the Object ID.
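The HASH read path can be sketched as follows (hypothetical in-memory stand-ins; the indexes map Object HASH to Shard UUID, the storages map Shard UUID to {hash: content}):

```python
def read_by_hash(obj_hash, read_index, write_index,
                 read_storage, write_storage):
    """Read an Object by Object HASH alone: resolve the Shard UUID from an
    index, then read as with an Object ID."""
    if obj_hash in read_index:
        # Packed Object: the Read Storage index gives its Shard.
        return read_storage[read_index[obj_hash]][obj_hash]
    # Recent Object: the Write Storage index gives the owning Shard.
    shard_uuid = write_index[obj_hash]
    return write_storage[shard_uuid][obj_hash]
```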
# Use cases
## Glossary
* APIs used with an Object HASH instead of an Object ID are called HASH APIs
* APIs used with an Object ID are called ID APIs
## API implementation
The existing API (HASH APIs):
* [swh-web REST API](https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html)
* [swh-objstorage Python API](https://docs.softwareheritage.org/devel/apidoc/swh.objstorage.objstorage.html)
New APIs (ID APIs) can be added, or existing APIs modified, to get Objects using their Object ID instead of their Object HASH, so that there is no need to look up the Shard UUID in an index using the Object HASH: the Shard UUID is part of the Object ID and can be extracted from it.
* The [swh-web REST API](https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html) could support a new **hash_type** that would be the **Object ID** instead of the **Object HASH**. For instance **GET /api/1/content/[(hash_type):](hash)/** could support the **objectid** hash_type to look like this: **GET /api/1/content/objectid:Object HASH:Shard UUID/**.
* The [swh-objstorage Python API](https://docs.softwareheritage.org/devel/apidoc/swh.objstorage.objstorage.html) could be derived into a module where all **obj_id** arguments are tuples of (Object HASH, Shard UUID) while providing the same interface.
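Parsing the proposed **objectid** hash_type value could look like this (a sketch only: the exact wire format is not settled; a hex-encoded HASH followed by a standard hyphenated UUID is an assumption):

```python
import uuid

def parse_objectid(value: str):
    """Parse a hypothetical "Object HASH:Shard UUID" value into an Object ID
    tuple (hash bytes, Shard UUID)."""
    # The UUID contains no ":", so splitting on the last ":" is unambiguous.
    obj_hash_hex, shard = value.rsplit(":", 1)
    return bytes.fromhex(obj_hash_hex), uuid.UUID(shard)
```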
## Adding objects
* A loader:
* Gets a source code file from the net
* Calculates the Object HASH from the content
* Uses the HASH API to add the Object
## Reading objects
* Serving objects from the REST interface:
* Existing clients GET Objects using their Object HASH via the HASH API
* New clients may GET Objects using their Object ID via the ID API
* Reconstruct an archived tarball or repo for the "vault":
* Identical to serving objects from the REST interface.
* Push Objects mirroring:
* Existing mirroring software uses the HASH API
* Assuming API endpoints are added to:
* Retrieve the list of Shard UUID
* Retrieve all Objects from a Shard (Object ID and content)
* New clients can:
* Retrieve the list of all Shard UUID from the Read Storage
* Retrieve all Objects from a Shard, mirror the Object ID and its content via S3
* When done, adds the UUID of the Shard to the list of Shards already mirrored.
* Pull Shard mirroring:
* The Shards are available at mirror.softwareheritage.com as mapped RBD devices "/dev/rbd/readonly/<Shard UUID>"
* Pick a Shard not already mirrored locally
* Create an RBD image named after the Shard UUID
* GET the Shard from mirror.softwareheritage.com and copy it to /dev/rbd/readonly/<Shard UUID>
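The pull mirroring steps could be sketched as follows (the URL layout of the mirror and the destination paths are assumptions; a real implementation would stream onto the mapped RBD device rather than a plain file):

```python
import shutil
import urllib.request

def mirror_shard(shard_uuid: str, base_url: str, rbd_dir: str):
    """Fetch one Shard from the mirror and copy it byte-for-byte onto the
    local device created under the same Shard UUID (hypothetical layout:
    <base_url>/<Shard UUID> on the wire, <rbd_dir>/<Shard UUID> locally)."""
    with urllib.request.urlopen(f"{base_url}/{shard_uuid}") as src, \
            open(f"{rbd_dir}/{shard_uuid}", "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream without loading it in memory
```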
* Full enumeration for mining purposes:
* Examples:
* find all files whose names are LICENSE/COPYRIGHT/COPYING/etc. pointed to by any commit in the archive and retrieve them
* find all archived files referenced by any commit in the top-1k (by number of stars) GitHub repos
* Start enumeration from the Object ID of the last object returned by the previous iteration, up to N objects
* N objects are returned starting from that Object ID, in order
* The client is responsible for keeping track of the Object ID from which to resume the enumeration
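The cursor-based enumeration can be sketched as follows (an in-memory stand-in; `enumerate_objects` and the string-encoded Object IDs are hypothetical):

```python
def enumerate_objects(object_ids, after=None, limit=1000):
    """Return up to `limit` Object IDs in order, starting strictly after the
    cursor `after` (the last Object ID seen by the client)."""
    ids = sorted(object_ids)
    if after is not None:
        # Resume past the cursor; the client keeps track of it.
        ids = [i for i in ids if i > after]
    return ids[:limit]
```

The client loops, feeding the last Object ID of each page back as the `after` cursor until an empty page is returned.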
# Explorations
* Scale out data and metadata
* T3064 [[ https://github.com/linkedin/ambry | ambry ]]
* T3052 RADOS [[ https://forge.softwareheritage.org/T3052#58917 | space benchmark ]] (requires development to reduce the space overhead and maintain performances)
* ??? [[ https://docs.ceph.com/en/latest/radosgw/ | RGW ]]
* Object packing
* T3066 [RocksDB SST](https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats)
* [ambry partition format](https://forge.softwareheritage.org/T3064) (append only)
* T3068 [Sorted String Table](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf) (read only)
* T3050 libcephsqlite or SQlite on top of RBD (read write)
* T3046 Using xz-file-format for 1TB archive
* T3045 Using pixz for 1TB archives
* T3048 Using a custom format for 1TB archive
* T3069 Using MZ as a file format
* Scale out data and scale up metadata. The metadata is in a database (Rocksdb, etc.) that must be looked up to figure out where the data is to be found, as described in the [[ https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf| Finding a needle in Haystack: Facebook’s photo storage ]].
* T3049 Distributed database + RBD [[ https://forge.softwareheritage.org/T3014#57836 | space benchmark ]] (requires development on top of these building blocks)
* Storage systems with blockers
* T3051 EOS is too complex (uses RBD + Paxos + QuarkDB for namespace)
* T3057 [[ https://github.com/chrislusf/seaweedfs | Seaweedfs ]] is not yet mature (uses large files to pack objects + Paxos + internal database for metadata)
* https://github.com/open-io replication is a proprietary feature https://docs.openio.io/latest/source/admin-guide/configuration_replicator.html
* https://ipfs.io/ does not provide replication or self-healing. Performance and space overhead are probably the same as with the current Software Heritage storage system.
* https://www.rozosystems.com/about claims a software patent on the implementation
* http://www.orangefs.org/ or http://beegfs.io/ have a focus on high-end computing
* https://www.lustre.org/ https://moosefs.com/ are distributed file systems, not object / block storage
* [[ https://min.io/ | min.io ]] stores each object in an individual file on a file system, a space overhead that is identical to the current Software Heritage storage system.
* [[ https://docs.openstack.org/swift/latest/ | Swift ]] stores [[ https://docs.openstack.org/swift/latest/overview_architecture.html#object-server | each object in an individual file on a file system]], a space overhead that is identical to the current Software Heritage storage system.
* Inspiration
* T3065 git partial clone (in part because it does packing, in part because it is source code related)
* Hardware
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
# Discussions
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00079.html | Scale out object storage design (take 1) ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00078.html | Hardware for object storage ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JSG2TXKNXPXEKZOJZGYF2ZPTQHOB4LHJ/ | Storing 20 billions of immutable objects in Ceph, 75% <16KB ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/ | Small RGW objects and RADOS 64KB minimun size ]]
* [[ https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/ | Using RBD to pack billions of small files ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-02/msg00055.html | Benchmarking RBD to store artifacts ]]
* [[ https://sympa.inria.fr/sympa/arc/swh-devel/2021-01/msg00026.html | Durable self healing distributed append only storage ]]
# Quantitative data
## Current
* I/O limits writes at 10MB/s
* reads currently perform at ~300 objects per second (25MB/s) and performed at ~500 objects per second (44MB/s) in the past
* 50TB (30TB ZFS compressed) objects added every month
* Available space exhausted by the end of 2021
* 10 billion objects
* Objects occupy 750TB (350TB ZFS compressed) (see [[ https://forge.softwareheritage.org/T3054#58868 | statistics as of February 2021 ]] )
# Goals
* Write > 100MB/s, ~3,000 objects/s
* Read > 100MB/s, ~3,000 objects/s
* Durability overhead (erasure coding) 50% (2+1, 4+2)
* Storage overhead (storage system) < 20%
* Time to first byte (i.e. how long it takes for a client to get the first byte of an object after sending a read request to the server) < 100ms
* 100 billion objects