QuarkDB is now used for the namespace. It stores 2.5 billion objects.
Feb 17 2021
Let's leave it open: although T3050 is a better fit, it is not ready yet and an interim solution may be required.
T3050 is a better fit as it does not require any specification or development.
Although it is not a good fit to store all objects, it is a better fit than RBD + a custom format to store 1TB worth of objects, provided support for multiple concurrent readers is added.
In the following, "small objects" are < 4KB and "object storage software" refers to the list of software from the description for which there are no blockers.
We'd want a reader to first try reading from the mirrored pool, and then fall back to the erasure coded pool if the object is larger than the cutoff. The increased latency in getting large objects may be worth the space savings? I don't know.
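A minimal sketch of that fallback with python-rados, where `cluster` is a connected rados.Rados handle; the pool names, and sizing the read via a stat, are assumptions for the example:

```python
import rados

def read_object(cluster, object_id):
    # Small objects live in the mirrored pool: try it first.
    ioctx = cluster.open_ioctx("objects-replicated")  # hypothetical pool name
    try:
        size, _mtime = ioctx.stat(object_id)
        return ioctx.read(object_id, length=size)
    except rados.ObjectNotFound:
        pass  # larger than the cutoff: fall back to the erasure coded pool
    finally:
        ioctx.close()

    ioctx = cluster.open_ioctx("objects-ec")  # hypothetical pool name
    try:
        size, _mtime = ioctx.stat(object_id)
        return ioctx.read(object_id, length=size)
    finally:
        ioctx.close()
```

The extra round trip to the replicated pool is what would drive the added latency on large objects.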
The bench script and full results are in the tarball.
In T3054#58874, @olasd wrote: @zack, very good point about having a target for the "time to first byte when reading an object".
I don't know what would be a "good" target for that metric; my gut says that staying within 100ms for any given object would be acceptable, as long as the number of parallel readers doesn't impact the amount too much (of course, within the IOPS of the underlying media, etc.).
If the size of the object were known to the reader of the object store, it would be a great way to develop storage strategies depending on the object size. So far I assumed the reader does not have that information and is therefore unable to figure out which object storage to use on that basis, but maybe I missed something?
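For illustration, a sketch of the size-based routing this would enable, again with python-rados; the pool names and the 128KB cutoff are made up. The writer always knows the size; the reader can route directly only if the size travels with the object reference:

```python
import rados

CUTOFF = 128 * 1024  # arbitrary boundary between the two pools

def pool_for(size):
    return "objects-replicated" if size < CUTOFF else "objects-ec"

def write_object(cluster, object_id, data):
    # The writer always knows the size, so routing is trivial on this side.
    ioctx = cluster.open_ioctx(pool_for(len(data)))
    try:
        ioctx.write_full(object_id, data)
    finally:
        ioctx.close()

def read_object(cluster, object_id, size):
    # Only works if the size is part of the object reference; otherwise
    # the reader is back to probing pools as discussed above.
    ioctx = cluster.open_ioctx(pool_for(size))
    try:
        return ioctx.read(object_id, length=size)
    finally:
        ioctx.close()
```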
Feb 16 2021
For the record, stats from January 2021.
bluestore_compression_min_blob_size_hdd:
  Description: Default value of bluestore compression min blob size for rotational media.
  Type:        Unsigned Integer
  Required:    No
  Default:     128K
With a 4KB min_alloc_size and a 4+2 erasure coded pool, objects smaller than 16KB will require 16KB anyway, plus 8KB for parity. T3054 suggests that 75% of objects have a size < 16KB. Since the space amplification makes even the smallest object 16KB big, that's a total of 16KB * 7.5B = 120TB, i.e. 120TB / 750TB = 16% of the total. Without the space amplification these objects would use only ~5% of the total space: the space amplification costs about 10% of the total uncompressed storage.
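Spelled out as a quick sketch (same k=4, m=2 and 4KB min_alloc_size as above; the helper name is made up):

```python
import math

def ec_stored_size(obj_size, k=4, m=2, min_alloc=4 * 1024):
    # Each of the k data shards holds ceil(obj_size / k) bytes, rounded up
    # to min_alloc_size; the m parity shards are the same size as a data shard.
    shard = math.ceil(obj_size / k)
    alloc = math.ceil(shard / min_alloc) * min_alloc
    return (k + m) * alloc

print(ec_stored_size(1))          # 24576: 16KB of data allocation + 8KB parity
print(ec_stored_size(16 * 1024))  # 24576: the break-even point
print(7.5e9 * 16e3 / 1e12)        # 120.0 TB, the figure used above
```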
Josh Durgin gave some more pointers to relevant pull requests:
Root cause analysis for space overhead with erasure coded pools.
Feb 15 2021
There is one concern that was not addressed: the metadata does not scale out, as it lives in a single RocksDB database.
At first glance EOS is an entire system that addresses all the needs of the researchers at CERN. It includes an object storage with data and metadata separated, which is what the Software Heritage object storage is likely to look like as well. However, this part is not standalone, although it is a great source of inspiration:
The Scalla software suite provides two fundamental building blocks: an xrootd server for low latency high bandwidth data access and an olbd server for building scalable xrootd clusters. This paper describes the architecture, how low latency is achieved, and the scaling opportunities the software allows. Actual performance measurements are presented and discussed. Scalla offers a readily deployable framework in which to construct large fault-tolerant high performance data access configurations using commodity hardware with a minimum amount of administrative overhead.
There is a hard limit on the size of an SQLite database (~280TB), so it would not work even if perfectly optimized.
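That ceiling follows from SQLite's documented limits, the maximum page count times the maximum 64KB page size; a quick check:

```python
# SQLite hard limit: (2**32 - 2) pages at the maximum 64KB page size.
print((2**32 - 2) * 65536 / 1e12)  # ~281.5 TB
```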