Software Heritage

Reducing Ceph bluestore_min_alloc_size from 64K to 4K
Closed, Invalid (Public)

Description

Small RGW objects and RADOS 64KB minimum size

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/

Not much interest in this thread on the mailing list so far. If it were not for this overhead, every Software Heritage object could be stored in its own RADOS object and the scale out problem would be solved (i.e. no need for T3049). Maybe there is a not-so-complicated fix to set it to 4K? What is the limit imposed by RocksDB? I don't see an obvious reason why bluestore would object to a lower limit. What about the overhead of 4+2 erasure coding? What were the follow-ups to last year's decision to go back to 64K because of allocation / performance problems (https://github.com/ceph/ceph/pull/32809)?

It is worth looking into.
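For reference, in current Ceph releases the allocation unit is controlled per device class by the `bluestore_min_alloc_size_hdd` / `bluestore_min_alloc_size_ssd` options. A sketch of the relevant ceph.conf fragment (note: the value is baked in at OSD creation time, so existing OSDs would have to be redeployed to pick up a change):

```ini
# ceph.conf fragment (illustrative; only affects OSDs created
# after the change, existing OSDs keep their min_alloc_size)
[osd]
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096
```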

Event Timeline

dachary changed the task status from Open to Work in Progress.Feb 15 2021, 11:42 PM
dachary triaged this task as Normal priority.
dachary created this task.
dachary created this object in space S1 Public.
dachary updated the task description.

Root cause analysis for space overhead with erasure coded pools.

https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/OHPO43J54TPBEUISYCK3SRV55SIZX2AT/

Issues:

Erasure coded pool might need much more disk space than expected https://tracker.ceph.com/issues/44213
Erasure-Coded storage in bluestore has larger disk usage than expected https://tracker.ceph.com/issues/41577

Pull request fixing the allocator issue and setting the default for HDD to 4KB: https://github.com/ceph/ceph/pull/34588, to be released in Pacific next month (https://docs.ceph.com/en/latest/releases/general/).

The default for SSD is already 4KB and there is no performance issue.

With a 4KB min alloc and a 4+2 erasure coded pool, an object smaller than 16KB still requires 16KB of data chunks anyway, plus 8KB for parity. T3054 suggests that 75% of objects have a size < 16KB. Since the space amplification makes even the smallest object occupy 16KB, that is a total of 16KB * 7.5 billion = 120TB, which is 120TB / 750TB = 16% of the total. Without the space amplification these objects would only use ~5% of the total space, so the amplification costs about 10% of the total uncompressed storage.
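The per-object arithmetic above can be checked with a simplified allocation model: each of the k data chunks stores ceil(size / k) bytes, rounded up to a multiple of min_alloc on disk, and parity chunks are the same size as data chunks. This is a sketch under those assumptions (`raw_size` is a hypothetical helper, not a Ceph tool; real allocation also depends on the pool's stripe configuration):

```shell
#!/bin/sh
# Simplified model of the raw space used by one object of $1 bytes
# in a k+m ($2+$3) erasure coded pool with min_alloc_size $4.
raw_size() {
  size=$1 k=$2 m=$3 min_alloc=$4
  # each of the k data chunks holds ceil(size / k) bytes
  chunk=$(( (size + k - 1) / k ))
  # each chunk is rounded up to a multiple of min_alloc on disk
  alloc=$(( (chunk + min_alloc - 1) / min_alloc * min_alloc ))
  # k data chunks + m parity chunks, all alloc bytes each
  echo $(( alloc * (k + m) ))
}

raw_size 4096 4 2 4096    # 24576  (16KB data + 8KB parity)
raw_size 4096 4 2 65536   # 393216 (6 chunks of 64KB each)
```

Under this model a 4KB object in a 4+2 pool occupies 24KB with a 4KB min alloc, matching the 16KB + 8KB figure above.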

When compressed, the storage uses 350TB, which is ~50% of 750TB. It follows that the space amplification affects all objects with a size < 32KB, because they cannot compress below the 16KB floor. And the objects with a size < 16KB still use 120TB, which is 120TB / 350TB = 34%. Therefore the space amplification costs at least 34% of the total compressed storage.
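Redoing that arithmetic with the numbers from the paragraphs above (7.5 billion objects under 16KB, a 16KB on-disk floor for their data chunks, 350TB total compressed; using 16KB = 16384 bytes reproduces roughly the 120TB figure, which was computed with the rounder 16KB ≈ 16,000 bytes):

```shell
#!/bin/sh
small_objects=7500000000          # 75% of ~10 billion objects
floor_bytes=16384                 # on-disk floor per small object (data chunks)
total_compressed_tb=350

small_tb=$(( small_objects * floor_bytes / 1000000000000 ))
echo "${small_tb} TB"             # 122 TB, i.e. the ~120TB figure above
echo "$(( small_tb * 100 / total_compressed_tb ))%"   # 34% of compressed total
```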

Maybe it would make sense to put the very small objects (e.g. those <= the min alloc size) into a 3- or 4-way mirrored pool instead of an erasure coded pool.

This would give us the same redundancy characteristics but reduce the size overhead to 3-4 × min alloc instead of having to eat the full stripe width for each of those objects. Of course this comes at the cost of having to look up objects twice, once in each pool.
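Concretely, for a single 4KB object with a 4KB min alloc (same simplified model as above, ignoring metadata overhead):

```shell
#!/bin/sh
min_alloc=4096
# 3-way replicated pool: 3 copies, one min_alloc unit each
echo $(( 3 * min_alloc ))        # 12288 bytes (12KB)
# 4+2 erasure coded pool: full stripe of 6 chunks, one min_alloc unit each
echo $(( (4 + 2) * min_alloc )); # 24576 bytes (24KB)
```

i.e. about half the raw footprint per small object in the replicated pool.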

If the size of the object were known to the reader of the object store, it would be a great way to develop storage strategies that depend on the object size. So far I assumed the reader does not have that information and is therefore unable to figure out which object storage to use, but maybe I missed something?

The bench script and full results are in the tarball.

#
# bluestore min alloc 65536 (default) write 250,000 4K objects, i.e. ~1GB
# expected result: ~15GB raw space used
#
echo 65536 4096 4096 250000 | ./bench-rados.sh main
#
# bluestore min alloc 1024 write 250,000 4K objects, i.e. ~1GB
# expected result: ~1.5GB raw space used
#
echo 1024 1024 4096 250000 | ./bench-rados.sh main
#
# bluestore min alloc 4096 write 250,000 4K objects, i.e. ~1GB
# expected result: ~6GB raw space used
#
echo 4096 4096 4096 250000 | ./bench-rados.sh main
#
# bluestore min alloc 4096 write 250,000 16K objects, i.e. ~4GB
# expected result: ~6GB raw space used
#
echo 4096 4096 16384 250000 | ./bench-rados.sh main
#
# bluestore min alloc 4096 write 250,000 20K objects, i.e. ~5GB
# expected result: ~12GB raw space used
#
echo 4096 4096 20480 250000 | ./bench-rados.sh main

If the size of the object were known to the reader of the object store, it would be a great way to develop storage strategies that depend on the object size. So far I assumed the reader does not have that information and is therefore unable to figure out which object storage to use, but maybe I missed something?

No, on the reader side, that's correct.

This is what I meant when saying:

Of course this comes at the cost of having to look up objects twice, once in either pool.

We'd want a reader to try reading from the mirrored pool, and then fall back to the erasure coded pool if the object is larger than the cutoff. Whether the increased latency in getting large objects is worth the space savings, I don't know.

We'd want a reader to try reading from the mirrored pool, and then fall back to the erasure coded pool if the object is larger than the cutoff. Whether the increased latency in getting large objects is worth the space savings, I don't know.

Or it could always look up both storages simultaneously and return from the one that does not 404. Interesting idea.
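The try-then-fall-back strategy discussed above can be sketched as follows; the two directories merely stand in for the replicated and erasure coded pools (a real reader would issue RADOS reads, and the pool names here are illustrative):

```shell
#!/bin/sh
# Simulate the two pools with directories.
pools=$(mktemp -d)
mkdir -p "$pools/replicated" "$pools/ec"
printf 'small blob' > "$pools/replicated/sha1-aaaa"
printf 'large blob' > "$pools/ec/sha1-bbbb"

get_object() {
  # Try the replicated (small-object) pool first;
  # on a miss, fall back to the erasure coded pool.
  cat "$pools/replicated/$1" 2>/dev/null || cat "$pools/ec/$1"
}

get_object sha1-aaaa   # served from the replicated pool
get_object sha1-bbbb   # falls back to the erasure coded pool
```

The simultaneous-lookup variant would issue both reads in parallel and keep whichever succeeds, trading extra load on both pools for lower worst-case latency.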

In the T3054 proposed design, objects are packed into larger files, and there is no reason to continue in the direction explored in this task (one RADOS object per Software Heritage object). There seems to be a consensus that tens of billions of individual objects is problematic: it takes a very long time to enumerate them, for one thing, and the fact that nobody operates Ceph at that object count is not a great sign.