This diff adds a design considerations document to the objstorage documentation.
It outlines the problem our object storage is trying to solve, the solutions
we've come up with so far, and a draft of a new design for a more
disk-efficient "packed" object storage, based on experimentation and some
literature review around Ceph.
And yes, this is the description of a (somewhat crude) filesystem, trying to
balance cramming tiny objects together to avoid wasting space with the ability
to store files (way) larger than RADOS supports efficiently.
There are a few TODO points that need to be resolved before this can be implemented:
- How to efficiently handle index blocks. There is some literature on B-Trees backed by RADOS/Ceph that might be worth investigating: https://ceph.com/wp-content/uploads/2017/01/CawthonKeyValueStore.pdf. The only issue I can see is that erasure-coded pools don't support OMAP metadata, which would force the index to be written to a separate, replicated pool.
- When adding a small object, how to select which data block to write it to. This is easy to solve for a single writer (just keep track of the last block you've written to for each object size), but harder to do properly with several distributed writers.
- How to handle object restores (i.e. overwriting data on an index node) and deletions. By default, erasure-coded data pools only support creating and appending to objects; overwriting them requires explicitly turning on a pool setting.
- Add some more links to the documents that inspired the design.
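
To make the single-writer case from the block-selection point above concrete, here is a minimal sketch in Python. It is not taken from the design document: the `DataBlock` and `SingleWriterAllocator` names, the power-of-two size classes and the `slots_per_block` parameter are all assumptions made for illustration, and nothing here talks to RADOS.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataBlock:
    """Hypothetical in-memory stand-in for one packed data block."""
    block_id: int
    slot_size: int   # every object in this block occupies one slot of this size
    capacity: int    # number of slots the block can hold
    used: int = 0    # number of slots already handed out

    def is_full(self) -> bool:
        return self.used >= self.capacity


@dataclass
class SingleWriterAllocator:
    """Remembers the last block written to for each slot size."""
    slots_per_block: int = 4   # assumption: fixed slot count per block
    next_block_id: int = 0
    last_block: dict[int, DataBlock] = field(default_factory=dict)

    @staticmethod
    def slot_size_for(length: int) -> int:
        # Assumption: objects are bucketed by rounding up to a power of two.
        return 1 << max(length - 1, 0).bit_length()

    def allocate(self, length: int) -> tuple[int, int]:
        """Return (block_id, slot_index) for an object of `length` bytes."""
        size = self.slot_size_for(length)
        block = self.last_block.get(size)
        if block is None or block.is_full():
            # No open block for this size class: start a new one.
            block = DataBlock(self.next_block_id, size, self.slots_per_block)
            self.next_block_id += 1
            self.last_block[size] = block
        slot = block.used
        block.used += 1
        return block.block_id, slot
```

With several distributed writers, the `last_block` map would have to be shared or partitioned between writers, which is exactly the part this sketch leaves open.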