diff --git a/docs/design.rst b/docs/design.rst
new file mode 100644
--- /dev/null
+++ b/docs/design.rst
@@ -0,0 +1,360 @@
+=====================
+Object storage design
+=====================
+
+Problem statement
+=================
+
+The main goal of the Software Heritage object storage is to store a large
+quantity (order of magnitude: several billion) of files ranging from 0 to
+100MB (median size: 3kB). Objects are addressed via a cryptographic hash of
+their contents. Objects are never deleted, only added.
+
+Some uses of the object storage may diverge from this baseline; for instance,
+the Vault uses our object storage as a cache backend, and there are a few
+differences in this usage:
+
+ - objects can be larger (up to 2GB)
+ - addressing is not done with an intrinsic hash but with an extrinsic
+   identifier
+ - objects can be removed (for cache expiry)
+
+Optimizations for the baseline should try not to hinder these other usages.
+
+Historical Implementations
+==========================
+
+Over the course of development of the Software Heritage project, we have taken
+several approaches for the (low-level) storage of source file contents. The
+scale (4.5 billion files and counting) as well as the size distribution
+(median raw file size of 3kB) mean that usual file storage solutions fall
+short in one or several ways.
+
+PathSlicingObjStorage
+---------------------
+
+The simplest approach we have taken is to use a regular filesystem, storing
+compressed files there. To avoid issues with very large directories, contents
+were sharded across several partitions, and then across directories.
+
+The first production deployments were sharded with the following pattern::
+
+  /srv/storage/A/AB/CD/EF/ABCDEFGH[...]
+               | |  |  |  |
+               `-|--|--|--|-> partition name (0-f, 16 partitions)
+                 `--|--|--|-> first level directory ([A]0-[A]f, 16 directories)
+                    `--|--|-> second level directory (00-ff, 256 directories)
+                       `--|-> third level directory (00-ff, 256 directories)
+                          `-> full hash as filename (40 hexadecimal digits)
+
+This layout gives a total of 2 ** 24 = 16 million directories, each containing
+a few hundred files.
+
+This approach works properly; however, it has a few shortcomings:
+
+ - The number of directories means the kernel dentry cache fills up pretty
+   quickly; random accesses therefore usually need to read the disk for all
+   levels of the directory hierarchy, making them fairly slow.
+ - As files on disk are compressed, we end up with hundreds of millions of
+   tiny files on disk. While filesystems have some optimizations to inline
+   very small files directly into the inodes, we still had to resort to tiny
+   block sizes to avoid wasting lots of space, which makes disk accesses
+   costly.
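+
+To make the sharding concrete, the following sketch maps an object identifier
+to its on-disk path under this scheme. It is illustrative only: the
+``sliced_path`` helper and the hard-coded root are not the actual (and
+configurable) ``PathSlicingObjStorage`` implementation.
+
+.. code-block:: python
+
+    import os
+
+    ROOT = "/srv/storage"  # illustrative root; the real mount points are configurable
+
+    def sliced_path(hex_object_id: str) -> str:
+        """Map a 40-digit hexadecimal object id to its sharded on-disk path."""
+        h = hex_object_id.lower()
+        return os.path.join(
+            ROOT,
+            h[0],    # partition name (16 partitions)
+            h[0:2],  # first level directory
+            h[2:4],  # second level directory
+            h[4:6],  # third level directory
+            h,       # full hash as filename
+        )
+
+    # sliced_path("abcdef0123456789abcdef0123456789abcdef01")
+    # -> '/srv/storage/a/ab/cd/ef/abcdef0123456789abcdef0123456789abcdef01'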
+
+Cloud-based Object Storage
+--------------------------
+
+Our implementations of cloud-based object storage are very thin layers on top
+of the respective backends. Objects are stored directly, using their full
+hexadecimal object identifier as key and the gzipped data as value.
+
+The Azure-based object storage also shards objects across a configurable
+number of storage accounts, increasing the IOPS quota available to the content
+storage (the Azure blob storage IOPS limits are set per storage account).
+
+Cloud providers don't offer ways to control the lower-level implementation
+details of the storage backend. Setting aside those factors out of our
+control, the main bottleneck for these implementations is the sheer number of
+HTTPS requests needed to fetch objects: pipelining those requests is critical,
+as it avoids setting up and tearing down lots of expensive TCP+TLS
+connections.
+
+In the case of the Azure object storage, the account sharding means that we
+have to connect to several domain names, possibly served by different hosts,
+which may or may not be on different IP addresses, making pipelining more
+challenging.
+
+Ceph implementation
+===================
+
+Ceph introduction
+-----------------
+
+From the Ceph website: "Ceph is a unified, distributed storage system designed
+for excellent performance, reliability and scalability." Ceph is broken down
+into several layers that implement low-level object storage, distributed block
+devices, as well as high-level distributed filesystems.
+
+This modular approach allows us to fine-tune our usage of Ceph to our exact
+workload, only using the components that we actually need. Considering the
+shortcomings of our previous implementations, we have elected to only use the
+lowest-level RADOS ("Reliable Autonomic Distributed Object Store") API layer
+for our backend.
+
+Our storage pool is set up to use erasure coding, with a profile of 5 data
+chunks + 2 parity chunks, giving us a raw storage overhead of 40% and the
+ability to recover from 2 concurrent disk failures.
+
+"Naive" Ceph-based Object Storage
+---------------------------------
+
+The simplest implementation just stored the objects uncompressed, directly in
+a RADOS storage pool, using the hexadecimal object identifier as object name.
+After using this implementation for a while, we noticed unexpectedly high disk
+usage, due to the way Ceph stores data on disk:
+
+ - Ceph's modern on-disk format, BlueStore, has a minimum allocation size (on
+   HDDs) of 64kB, which means that all small writes turn into 64kB zero-padded
+   objects;
+ - the RADOS erasure coding storage scheme needs to write a full stripe of
+   data chunks for each object. Data chunks on EC pools are 4kB by default,
+   and changing this setting is discouraged.
+
+Both those factors mean that objects are stored on disk in stripes that are
+multiples of *5 × 64kB = 320kB*, each of which uses *7 × 64kB = 448kB* of raw
+disk space.
+
+At the other end of the spectrum, one last issue is that RADOS has a maximum
+object size of *100MB*, and rejects writes of more than *90MB*. Those settings
+are once again tunable, but they have been picked carefully by upstream.
+
+All in all, on an almost random subset of 14 million objects, the storage
+overhead was *630%* instead of the expected *40%*.
+
+Packed Object Storage
+---------------------
+
+To handle our peculiar workload, consisting of many small files, we decided to
+pack several objects into the same RADOS data block. A set of index blocks is
+used to keep track of which objects are stored, as well as of their creation
+time.
+
+To properly account for RADOS specifics, we try to align our data accesses on
+4kB boundaries.
+
+Data block format
+~~~~~~~~~~~~~~~~~
+
+Objects less than *64 bytes*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+(2.2% of objects, 0.001% of total size)
+
+Those objects are only inlined in the index structure.
+
+Objects less than *4kB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(51.7% of objects, 1.0% of total size)
+
+Size hint: *EC data chunk size = 4kB*
+
+For those objects, we select a data block by size, rounding up to the next
+power of two, and pack the objects aligned to that boundary.
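+
+As an illustration of this bucketing, the sketch below computes the
+power-of-two size class such an object falls into; the helper name and the
+exact bounds are illustrative, not taken from the implementation.
+
+.. code-block:: python
+
+    def size_class(size: int) -> int:
+        """Return the power-of-two bucket (in bytes) used to pack an object.
+
+        Only meaningful for objects of 64 bytes up to (but excluding) 4kB:
+        smaller objects are inlined in the index, larger ones are handled by
+        the strategies described below.
+        """
+        assert 64 <= size < 4096
+        bucket = 64
+        while bucket < size:
+            bucket *= 2
+        return bucket
+
+    # size_class(100)  -> 128
+    # size_class(3000) -> 4096 (the median 3kB object lands in the 4kB bucket)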
+
+Objects less than *320kB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(43.9% of objects, 18.5% of total size)
+
+Size hint: *BlueStore minimum allocation size for HDDs × stripe width = 320kB*
+
+We pack those objects into data blocks of roughly *4MB*, aligning each object
+on a *4kB* boundary.
+
+Objects less than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(1.9% of objects, 27.0% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+Those objects are stored in standalone data blocks.
+
+Objects larger than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(0.02% of objects, 53.5% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+We split those objects into *4MB* chunks, and store each chunk in a separate
+data block.
+
+Index blocks
+~~~~~~~~~~~~
+
+As objects are addressed using a well-distributed cryptographic hash, the
+index can be sharded into blocks by truncating the (hexadecimal) object
+identifier to a given prefix length.
+
+Inside those index blocks, a number of 128-byte index nodes are stored. (TODO:
+how exactly?)
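+
+A minimal sketch of this prefix-based sharding, assuming a hypothetical
+``index_`` naming convention and a 4-digit prefix (the actual prefix length
+would have to be tuned to keep index blocks reasonably sized):
+
+.. code-block:: python
+
+    def index_block_name(hex_object_id: str, prefix_length: int = 4) -> str:
+        """Name of the index block holding the index node for this object."""
+        # 4 hexadecimal digits = 65536 index blocks; illustrative value only
+        return "index_" + hex_object_id[:prefix_length].lower()
+
+    # index_block_name("abcdef0123456789abcdef0123456789abcdef01")
+    # -> 'index_abcd'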
+
+Index node format
+~~~~~~~~~~~~~~~~~
+
+Index nodes are packed, *128 byte* wide data structures.
+
+::
+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+  0 |<-                     object identifier                     ->| 31
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 |      size     |     ctime     |T|<-                           | 63
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                             +
+ 64 |                                                               | 95
+    +                 index node type specific data                 +
+ 96 |                                                             ->| 127
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+All integers are stored in little-endian format.
+
+The common contents of the index nodes are:
+
++--------------+----------------------+-------------------------+----------------------------------------------+
+| byte address | content              | type                    | interpretation                               |
++==============+======================+=========================+==============================================+
+| 0-31         | object identifier    | byte array              | sha1s use the first 20 bytes, sha256s all 32 |
++--------------+----------------------+-------------------------+----------------------------------------------+
+| 32-39        | size                 | 64 bit unsigned integer | object size in bytes                         |
++--------------+----------------------+-------------------------+----------------------------------------------+
+| 40-47        | file creation time   | 64 bit signed integer   | number of milliseconds since the UNIX epoch  |
++--------------+----------------------+-------------------------+----------------------------------------------+
+| 48           | *T*: index node type | 8 bit unsigned integer  | currently known values: 0, 1, 2              |
++--------------+----------------------+-------------------------+----------------------------------------------+
+
+There are currently three index node types.
+
+*T = 0*: Inline Object (*<= 64B*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+  0 |<-                         object id                         ->| 31
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 |      size     |     ctime     |0|                             | 63
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<-                                                             | 95
+    +                          object data                          +
+ 96 |                                                             ->| 127
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+-------------+------------+---------------------------------+
+| byte address | content     | type       | interpretation                  |
++==============+=============+============+=================================+
+| 64-127       | object data | byte array | only use the first *size* bytes |
++--------------+-------------+------------+---------------------------------+
+
+*T = 1*: Small Object (*<= 4MB*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+  0 |<-                         object id                         ->| 31
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 |      size     |     ctime     |1|             |     offset    | 63
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<-                                                             | 95
+    +                     data block identifier                     +
+ 96 |                                                             ->| 127
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+----------------+-------------------------+-----------------------------------------------------+
+| byte address | content        | type                    | interpretation                                      |
++==============+================+=========================+=====================================================+
+| 56-63        | offset         | 64 bit unsigned integer | offset of the object data inside the data block     |
++--------------+----------------+-------------------------+-----------------------------------------------------+
+| 64-127       | data block id. | byte array              | RADOS object id for the data block (probably ASCII) |
++--------------+----------------+-------------------------+-----------------------------------------------------+
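+
+To make the byte layout above concrete, here is a sketch of decoding an index
+node with Python's ``struct`` module. The field offsets follow the tables in
+this section, but the helper names, the assumption that the data block id is
+NUL-padded ASCII, and the handling of unused bytes are illustrative guesses,
+not the actual implementation.
+
+.. code-block:: python
+
+    import struct
+
+    # common header, bytes 0-48: 32-byte id, u64 size, i64 ctime (ms), u8 type
+    COMMON = struct.Struct("<32sQqB")
+    # T = 1 tail, bytes 49-127: 7 unused bytes, u64 offset, 64-byte block id
+    SMALL_TAIL = struct.Struct("<7xQ64s")
+
+    def parse_index_node(node: bytes) -> dict:
+        """Decode a 128-byte index node (types 0 and 1 only, for brevity)."""
+        assert len(node) == 128
+        oid, size, ctime_ms, node_type = COMMON.unpack_from(node, 0)
+        parsed = {"id": oid, "size": size, "ctime_ms": ctime_ms,
+                  "type": node_type}
+        if node_type == 0:
+            # inline object: the data lives in bytes 64-127 of the node itself
+            parsed["data"] = node[64:64 + size]
+        elif node_type == 1:
+            # small object: reference to (data block, offset)
+            offset, block_id = SMALL_TAIL.unpack_from(node, 49)
+            parsed["offset"] = offset
+            parsed["data_block"] = block_id.rstrip(b"\x00").decode("ascii")
+        return parsed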
++--------------+--------------------+-------------------------+-----------------------------------------------------+ +| 64-127 | data block id. | byte array | RADOS object id for the data block (probably ASCII) | ++--------------+--------------------+-------------------------+-----------------------------------------------------+ + +*T = 2*: Large Object (*> 4MB*) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:: + + 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 0 |<- object id ->| 31 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 32 | size | ctime |2| chunk size | chunk count | 63 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 64 |<- | 95 + + data block id pattern + + 96 | ->| 127 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +Type-specific index node contents: + ++--------------+---------------------------+-------------------------+-----------------------------------------------------+ +| byte address | content | type | interpretation | ++==============+===========================+=========================+=====================================================+ +| 48-55 | chunk size | 64 bit unsigned integer | each chunk is this number of bytes (set LSB to 0) | ++--------------+---------------------------+-------------------------+-----------------------------------------------------+ +| 56-63 | chunk count | 64 bit unsigned integer | total number of chunks to fetch | ++--------------+---------------------------+-------------------------+-----------------------------------------------------+ +| 64-127 | data block id. pattern | byte array | RADOS object id for the data block (probably ASCII) | ++--------------+---------------------------+-------------------------+-----------------------------------------------------+ + + +Addendum: Object Size Statistics Generation +=========================================== + +Generate a CSV file containing the content lengths from the Software Heritage +database using: + +.. code-block:: postgresql + + copy ( + select sha1, length + from content + where exists ( + select 1 from tmp_sha1 + where tmp_sha1.sha1 = content.sha1 + ) + ) to 'test_lengths.csv' csv + + +Statistics are generated with Pandas:: + + import csv + import pandas as pd + + r = csv.reader(open('test_lengths.csv')) + + # s contains the list of sizes: size distribution + s = pd.Series(int(line[1]) for line in r) + + boundaries = [64, 4096, 65536*5, 4*1024*1024] + + for i in range(len(boundaries) + 1): + lower_bound = boundaries[i-1] if i > 0 else -1 + upper_bound = boundaries[i] if i < len(boundaries) else -1 + filter_lower = (s > lower_bound) if lower_bound > 0 else True + filter_upper = (s <= upper_bound) if upper_bound > 0 else True + matching = filter_lower & filter_upper + number_ratio = matching.sum() / s.count() + size_ratio = (s * matching).sum() / total + print(lower, upper, number_ratio, size_ratio) diff --git a/docs/index.rst b/docs/index.rst --- a/docs/index.rst +++ b/docs/index.rst @@ -7,7 +7,7 @@ :maxdepth: 2 :caption: Contents: - + design Indices and tables ==================