For the record, stats from January 2021:
Feb 16 2021
Description: Default value of bluestore compression min blob size for rotational media.
Type: Unsigned Integer
Required: No
Default: 128K
Here's the output of the following query, which computes exact aggregates for objects smaller than the size boundaries of the original quartiles:
Thanks for this summary/status, very useful. Regarding goals, I think we also want a read goal about time to first byte, a performance metric that is particularly bad in the current filesystem-based object storage. Not sure what would be a reasonable goal though. Poke @olasd: any idea about a good target for this?
With a 4KB min alloc and a 4+2 erasure coded pool, objects smaller than 16KB will require 16KB anyway, plus 8KB for parity. T3054 suggests that 75% of objects (about 7.5 billion) are smaller than 16KB. Since the space amplification makes even the smallest object occupy 16KB, that's a total of 16KB * 7.5B = 120TB, i.e. 120TB / 750TB = 16% of the total. Without the space amplification these objects only use ~5% of the total space, so the space amplification costs roughly 10% of the total uncompressed storage.
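A quick back-of-the-envelope check of these figures (a sketch only; the object count and the 750TB total are the values quoted above):

```
# Back-of-the-envelope check of the space amplification estimate above.
# Assumptions from the discussion: ~7.5 billion objects smaller than 16KB,
# 750TB of total uncompressed storage, 4KB min alloc, 4+2 EC pool.
small_objects = 7.5e9           # 75% of ~10 billion objects
min_footprint = 16e3            # bytes: 4 data chunks * 4KB min alloc (parity excluded)
total_storage = 750e12          # bytes, uncompressed

amplified = small_objects * min_footprint          # 1.2e14 bytes = 120TB
share = amplified / total_storage                  # 0.16 -> 16% of the total
print(f"{amplified / 1e12:.0f}TB, {share:.0%} of the total")
# The ~5% actual footprint of these objects then puts the cost of the
# amplification at roughly 10% of the total uncompressed storage.
```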
Josh Durgin gave some more pointers to relevant pull requests:
Root cause analysis for space overhead with erasure coded pools.
Feb 15 2021
There is one concern that was not addressed: the metadata does not scale out; it is a single RocksDB database.
At first glance, EOS is an entire system that addresses all the needs of the researchers at CERN. It includes an object storage with data and metadata separated, which is what the Software Heritage object storage is likely to look like as well. However, this part is not standalone. It is nevertheless a great source of inspiration:
The Scalla software suite provides two fundamental building blocks: an xrootd server for low latency high bandwidth data access and an olbd server for building scalable xrootd clusters. This paper describes the architecture, how low latency is achieved, and the scaling opportunities the software allows. Actual performance measurements are presented and discussed. Scalla offers a readily deployable framework in which to construct large fault-tolerant high performance data access configurations using commodity hardware with a minimum amount of administrative overhead.
There is a hard limit on the size of an SQLite database (~280TB), so it would not work even if perfectly optimized.
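For reference, the ~280TB figure corresponds to SQLite's documented maximum of 4294967294 pages at the maximum page size of 64KiB:

```
# SQLite hard limit: at most 4294967294 pages, each at most 65536 bytes.
max_pages = 4_294_967_294
max_page_size = 65_536
print(f"{max_pages * max_page_size / 1e12:.0f}TB")   # ~281TB, the hard limit mentioned above
```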
A mail was sent to Patrick Donnelly to ask for his opinion on the matter.
This preliminary exploration is complete; the work now moves to benchmarking to discover blockers.
Updated the description, even simpler.
Thanks for the comment. Let's keep just the SWHID then.
followed by a sequence of records, each made of:
- size of (SHA256 + SWHID + content)
- SHA256
- SWHID
- content
The object storage is a collection of RBD images containing a sequence of objects (SHA256 + SWHID + content).
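As an illustration only, a minimal sketch of how one such record could be serialized, assuming a little-endian 64-bit size prefix covering the SHA256, the SWHID and the content (the actual encoding is not specified here):

```
import hashlib
import struct

def pack_record(swhid: bytes, content: bytes) -> bytes:
    """Hypothetical on-disk record: size (u64 LE) + SHA256 + SWHID + content.

    The size prefix covers SHA256 + SWHID + content, so a reader can skip
    from one record to the next without parsing the content.
    """
    sha256 = hashlib.sha256(content).digest()   # 32 bytes
    payload = sha256 + swhid + content
    return struct.pack("<Q", len(payload)) + payload

# Example: a textual SWHID is 50 bytes ("swh:1:cnt:" + 40 hex characters).
record = pack_record(b"swh:1:cnt:" + b"0" * 40, b"hello world\n")
```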
Although simple and close to what is needed, Xz is not an exact match: the index would need to be maintained.
Xz format inadequate for long-term archiving
The zstd format is tightly associated with the compression algorithm and is therefore more complex. It can, however, be a sequence of independently compressed frames and could be used for the same purpose as xz.
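A hedged sketch of that idea with the python-zstandard bindings: each object becomes an independent zstd frame, the frames are concatenated, and a separate (offset, size) index allows decompressing a single object; that side index is exactly the part that would have to be maintained, as noted above for xz.

```
import zstandard as zstd

def pack(objects):
    """Concatenate one independent zstd frame per object; return the archive
    and an (offset, compressed size) index for random access."""
    cctx = zstd.ZstdCompressor()
    archive, index = bytearray(), []
    for data in objects:
        frame = cctx.compress(data)            # standalone frame, content size in header
        index.append((len(archive), len(frame)))
        archive.extend(frame)
    return bytes(archive), index

def get(archive, index, i):
    """Decompress a single object from its own frame only."""
    offset, size = index[i]
    return zstd.ZstdDecompressor().decompress(archive[offset:offset + size])
```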
The 7z format is more complex because it knows about files, directories, etc. It is not just a compressed data format.
There are two blockers:
When extracting a single file (-x file), the in-memory index is walked sequentially looking for the file.
The index is located at the end of the file.
The content of the archive is compressed as successive blocks of a given size.
The index is compressed as a single block of unlimited size.
Feb 14 2021
About Ceph RGW and the lack of packing: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/
https://github.com/vasi/pixz is a candidate for the 1TB archive content
For the record, yesterday's IRC log:
Feb 13 2021
For the record, today's IRC log:
Feb 6 2021
Benchmarking S3 in Ceph with COSBench could be interesting (the video is not yet available). In the past COSBench was difficult to use, but maybe it has improved. This is off-topic, but I don't know where else to write it down at the moment.
Feb 4 2021
Feb 3 2021
Feb 2 2021
Feb 1 2021
A trivial test case (attached) shows that an RBD image backed by a k=4,m=2 erasure coded pool (RAID6 equivalent) can store 4GB of data using 6GB of disk. The metadata overhead is small. It would be great if someone could repeat the test to make sure I did not accidentally obtain these results.
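For reference, the expected raw usage with k=4,m=2 is data * (k+m)/k, which matches the numbers above when the metadata overhead is negligible:

```
# Expected raw usage of a k=4,m=2 erasure coded pool (RAID6 equivalent).
k, m = 4, 2
data_gb = 4
raw_gb = data_gb * (k + m) / k
print(raw_gb)   # 6.0 -> 4GB of data should indeed use ~6GB of disk
```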
Jan 4 2021
Nov 3 2020
Oct 30 2020
Oct 29 2020
Oct 26 2020
See also T2706
Oct 16 2020
Same as before but with 1M (fresh) sha1s:
Since the results on uffizi above suffered from a few caveats, I've made a few more tests:
- a first result has been obtained with a dataset that had only objects stored on the XFS part of the objstorage
- a second dataset has been created (with the order by sha256 part to spread the sha1s)
- but the results are a mix of hot/cold cache tests
Oct 15 2020
Some results:
Sep 22 2020
This was, in fact, solved by adding more storage.