
[WIP] "packing" object storage design documentation
Abandoned · Public

Authored by olasd on Jul 22 2018, 6:59 PM.

Details

Reviewers
zack
Group Reviewers
Reviewers
Maniphest Tasks
T3054: Scale out object storage design
Summary

This diff adds a design considerations document to the objstorage documentation.

It outlines the problem our object storage is trying to solve, the solutions
we've come up with so far, as well as a draft of a new design for a more
disk-efficient "packed" object storage, based on experimentations and some
literature review around Ceph.

And yes, this is the description of a (somewhat crude) filesystem, trying to
balance cramming tiny objects together to avoid wasting space with the ability
to store files (way) larger than RADOS supports efficiently.

There are a few TODO points that need to be cleared before this can be implemented:

  • How to efficiently handle index blocks. There is some literature regarding B-Trees backed with RADOS/Ceph which might be interesting to investigate: https://ceph.com/wp-content/uploads/2017/01/CawthonKeyValueStore.pdf. The only issue I can see is that Erasure Coded pools don't support OMAP metadata, which would force the index to be written to a separate, replicated pool.
  • When adding a small object, how to select which data block to write it to. This is easy to solve for a single writer (just keep a list of the last block you've written to for the given object size; see the sketch after this list), but harder to do properly with several distributed writers.
  • How to handle object restores (i.e. overwriting data on an index node) and deletions. Erasure coded data pools only support create and append; they don't support overwriting objects unless you turn a knob on.
  • Add some more links to the documents that inspired the design.
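
As a rough illustration of the single-writer heuristic mentioned in the second point (keep track, per object size class, of the block currently being filled), here is a minimal sketch; the size classes, block capacity and the `BlockAllocator` name are made up for illustration and are not part of the design:

```
# Hypothetical sketch of single-writer data block selection, keyed by object
# size class. The size classes, block capacity and naming are illustrative.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple
import uuid

SIZE_CLASSES = [4 << 10, 64 << 10, 1 << 20]   # illustrative size buckets (bytes)
BLOCK_CAPACITY = 4 << 20                      # illustrative packed block size

def size_class(length: int) -> int:
    """Return the index of the smallest size class that fits the object."""
    for i, limit in enumerate(SIZE_CLASSES):
        if length <= limit:
            return i
    return len(SIZE_CLASSES)  # larger objects get their own class

@dataclass
class BlockAllocator:
    """Track, per size class, the last block written to and its fill level."""
    current: Dict[int, Tuple[Optional[str], int]] = field(default_factory=dict)

    def place(self, length: int) -> Tuple[str, int]:
        """Return (block id, offset) for an object of `length` bytes."""
        cls = size_class(length)
        block_id, fill = self.current.get(cls, (None, 0))
        if block_id is None or fill + length > BLOCK_CAPACITY:
            block_id, fill = str(uuid.uuid4()), 0   # open a fresh block
        self.current[cls] = (block_id, fill + length)
        return block_id, fill
```

With several distributed writers, this per-class "current block" state would have to be coordinated (or partitioned per writer), which is exactly the open question above.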
Test Plan

cd docs; make html

Diff Detail

Repository
rDOBJS Object storage
Branch
wip/objstorage-design
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1360
Build 1704: arc lint + arc unit

Event Timeline

First of all, thanks for this design document. I have read through it (although I have not verified the details of offsets, sizes, etc., mind you :-)) and it looks reasonable to me. A few questions/comments below:

  • the document needs more details about the IDs of the index blocks themselves. They'll be truncated, but by how much? AFAIU that also determines the ratio of index blocks v. "real" blocks in Ceph (right?), so that's an important detail. Also, I'm not clear on how the 128 byte structs are put together within index blocks. I'm guessing you plan to do some direct addressing based on the ID that remains after you've removed the prefix of the index block from the real object index (right?). More details about this in the document would be helpful too
  • what does the RADOS API allow us to do with respect to accessing sub-objects that start at specific offsets? I'm guessing "nothing". So is the plan to retrieve the entire block and just resolve the offset in our object storage layer? (Which makes sense.) This will incur a penalty in random access patterns, which might end up re-requesting the same index/packed-block over and over again, so we should keep this in mind and evaluate the need to have caches. (Are you starting to smell dentry cache too? :-))
  • before going ahead and implementing this, it would make sense to post this draft (once finalized from our POV) to the ceph-users mailing list, for validation of the overall idea. There might be people there who can immediately point out potential pain points
  • I'm a bit scared that the indirection level of the indexes will eat up quite a bit of the potential performance improvements of going scale-out, but I have no sense of how it compares to our current performance penalties (it should arguably be much lower, and won't have the same bottlenecks, but who knows…). I see no easy way to benchmark the thing, though, short of actually implementing it and seeing how it goes. Suggestions welcome.
This comment was removed by olasd.
In D398#7709, @zack wrote:

First of all, thanks for this design document. I have read through it (although I have not verified the details of offsets, sizes, etc., mind you :-)) and it looks reasonable to me. A few questions/comments below:

  • the document needs more details about the IDs of the index blocks themselves. They'll be truncated, but by how much? AFAIU that also determines the ratio of index blocks v. "real" blocks in Ceph (right?), so that's an important detail.

That is correct.

Considering 8 billion objects, 128-byte index nodes and a target index block size of 4 MB, we would need around 240,000 index blocks; this means truncating the object id to its first 18 bits (262,144 index blocks) would be reasonable.

We can go down to 16 bits (65,536 index blocks) if we accept index blocks of up to 16 MB (which puts our split on a hex character boundary).
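
For reference, here is the back-of-the-envelope computation behind those figures (8 billion objects, 128-byte index nodes, 4 MB target index blocks):

```
# Back-of-the-envelope sizing of the index blocks, using the figures above.
OBJECTS = 8_000_000_000          # target number of stored objects
INDEX_NODE_SIZE = 128            # bytes per index node
TARGET_BLOCK_SIZE = 4 << 20      # 4 MB target index block size

blocks_needed = OBJECTS * INDEX_NODE_SIZE / TARGET_BLOCK_SIZE
print(f"index blocks needed: {blocks_needed:,.0f}")                # ~244,141

# An 18-bit object id prefix gives 2**18 index blocks:
print(f"18-bit prefix: {2 ** 18:,} blocks")                        # 262,144

# A 16-bit prefix (a 4-hex-digit boundary) gives fewer, larger blocks:
block_size_16 = OBJECTS * INDEX_NODE_SIZE / 2 ** 16
print(f"16-bit prefix: {block_size_16 / (1 << 20):.1f} MB/block")  # ~14.9 MB
```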

Also, I'm not clear on how the 128 byte structs are put together within index blocks. I'm guessing you plan to do some direct addressing based on the ID that remains after you've removed the prefix of the index block from the real object index (right?). More details about this in the document would be helpful too

The easy answer is to just pack the index nodes in order in the index block. Obviously this will make writes very expensive.

In the end, it'd be way better to use some kind of B/B*/B+-tree structure to store them sorted and avoid cascading writes too much, rather than doing a naive implementation of the index.

I've been looking into concurrent implementations of B-tree-like data structures, but only as a thought experiment for now...
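
To make the "pack the index nodes in order" option concrete, here is a minimal lookup sketch over a sorted index block; the 32-byte id field inside each 128-byte node is an assumption made purely for illustration, since the node layout isn't specified yet:

```
# Hypothetical lookup in an index block of sorted, fixed-size index nodes.
from typing import Optional

NODE_SIZE = 128   # bytes per index node (as discussed above)
ID_SIZE = 32      # assumed size of the object id field inside a node

def find_node(block: bytes, object_id: bytes) -> Optional[bytes]:
    """Binary search `block` for `object_id`; return the matching node."""
    lo, hi = 0, len(block) // NODE_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        node = block[mid * NODE_SIZE:(mid + 1) * NODE_SIZE]
        node_id = node[:ID_SIZE]
        if node_id == object_id:
            return node
        elif node_id < object_id:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Reads stay cheap (a binary search), but every insertion has to shift the nodes after the insertion point, which is the write amplification a B-tree-like layout would avoid.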

  • what does the RADOS API allow us to do with respect to accessing sub-objects that start at specific offsets? I'm guessing "nothing".

No, it actually supports reads at arbitrary offsets and lengths (by default, the maximum read length is 90MB)

So is the plan to retrieve the entire block and just resolve the offset in our object storage layer? (Which makes sense.)

The naive plan was therefore to just make reads read the exact data we need. I'm tempted to still read 4k blocks and index inside them (and cache them), but this reeks of premature optimization.
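
For reference, a minimal sketch of such an exact-extent read through the librados Python bindings; the pool name, block name, offset and length are placeholders:

```
# Minimal sketch: read exactly the extent we need out of a packed block,
# using the librados Python bindings. Pool/object names are placeholders;
# the offset and length would come from the object's index node.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("swh-objstorage-data")   # placeholder pool
    try:
        data = ioctx.read("packed-block-0001", length=4096, offset=1 << 20)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```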

This will incur a penalty in random access patterns, which might end up re-requesting the same index/packed-block over and over again, so we should keep this in mind and evaluate the need to have caches. (Are you starting to smell dentry cache too? :-))

Yes. I want to do the simplest implementation first, then we can see with actual usage patterns where we need caches or not.

  • before going ahead and implementing this, it would make sense to post this draft (once finalized from our POV) to the ceph-users mailing list, for validation of the overall idea. There might be people there who can immediately point out potential pain points

Yes, this is probably a good idea.

  • I'm a bit scared that the indirection level of the indexes will eat up quite a bit of the potential performance improvements of going scale-out, but I have no sense of how it compares to our current performance penalties (it should arguably be much lower, and won't have the same bottlenecks, but who knows…). I see no easy way to benchmark the thing, though, short of actually implementing it and seeing how it goes. Suggestions welcome.

I have no better suggestion than to just go with it.

Seeing this great piece of doc unpublished makes me a bit sad... 😢

zack retitled this revision from [WIP] Object Storage design documentation to [WIP] "packing" object storage design documentation. Nov 30 2018, 5:06 PM

Some related work: https://fosdem.org/2019/schedule/event/small_files_in_swift_cluster/

ISTR @zack also had some ideas on how to select which files to pack where (clustering by filename).

One interesting project we can look at is seaweedfs, which implements the ideas of Facebook's Haystack (presented in this paper).

To paraphrase the project's description:

SeaweedFS is a simple and highly scalable distributed file system. There are two objectives: to store billions of files, and to serve them fast. SeaweedFS implements an object store with O(1) disk seeks and an optional Filer with a POSIX interface, supporting the S3 API, FUSE mounts, and Hadoop compatibility.

It's especially designed to store small files, packing them into volumes. This also allows an arbitrary level of write parallelism: the more volumes there are in the cluster, the more concurrent writes can be made.

This is made possible by a 2-phase approach to storing a file: first, a request is made to get an object id assigned, then a second request uploads the object's content.
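
For illustration, a minimal sketch of that 2-phase write against SeaweedFS; the master address is the default of a local test setup, and the payload is a placeholder:

```
# Minimal sketch of SeaweedFS's two-phase write: ask the master for a file id,
# then upload the content to the volume server it designates. The master
# address is the default of a local test setup; the payload is a placeholder.
import requests

MASTER = "http://localhost:9333"

# Phase 1: get a file id (fid) and the volume server assigned to it.
assignment = requests.get(f"{MASTER}/dir/assign").json()
fid, volume_url = assignment["fid"], assignment["url"]

# Phase 2: upload the object's content to the assigned volume server.
resp = requests.post(
    f"http://{volume_url}/{fid}",
    files={"file": ("blob", b"object content goes here")},
)
resp.raise_for_status()
```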

No idea whether it's of some interest for our subject, but we may also have a look at openio.

No idea whether it's of some interest for our subject, but we may also have a look at openio.

From the documentation:

oio-replicator is part of our paid plans. Get in touch with the team or visit the page describing our plans.

🙄