=====================
Object storage design
=====================

Problem statement
=================

The main goal of the Software Heritage object storage is to store a large
quantity (order of magnitude: several billions) of files ranging from 0 to
100MB (median size: 3kB). Objects are addressed via a cryptographic hash of
their contents. Objects are never deleted, only added.

Some uses of the object storage may diverge from this baseline; for instance,
the Vault uses our object storage as a cache backend, and there are a few
differences in this usage:

- objects can be larger (up to 2GB)
- addressing is not done with an intrinsic hash but with an extrinsic
  identifier
- objects can be removed (for cache expiry)

Optimizations for the baseline should try not to hinder these more general use
cases.

Historical Implementations
==========================

Over the course of development of the Software Heritage project, we have taken
several approaches to the (low-level) storage of source file contents. The
scale (4.5 billion files and counting) as well as the size distribution (median
raw file size of 3kB) means that usual file storage solutions fall short in one
or several ways.

PathSlicingObjStorage
---------------------

The simplest approach we have taken is to use a regular filesystem, storing
compressed files there. To avoid issues with very large directories, contents
were sharded across several partitions, and then across directories.

The first production deployments were sharded with the following pattern::

   /srv/storage/A/AB/CD/EF/ABCDEFGH[...]
                | |  |  |  |
                `-|--|--|--|-> partition name (0-f, 16 partitions)
                  `--|--|--|-> first level directory ([A]0-[A]f, 16 directories)
                     `--|--|-> second level directory (00-ff, 256 directories)
                        `--|-> third level directory (00-ff, 256 directories)
                           `-> full hash as filename (40 hexadecimal digits)

This layout gives a total number of 2 ** 24 = 16 million directories, each
containing a few hundred files.

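For illustration, here is a minimal sketch of how an object identifier maps to
such a path; the root directory and slicing depths simply mirror the layout
above and are not the exact production configuration:

.. code-block:: python

   import os

   def sliced_path(hex_id, root='/srv/storage'):
       """Map a 40-character hexadecimal object id to its on-disk path,
       following the partition/directory slicing shown above."""
       assert len(hex_id) == 40
       partition = hex_id[0]     # 16 partitions: 0-f
       level1 = hex_id[0:2]      # first level directory: [A]0-[A]f
       level2 = hex_id[2:4]      # second level directory: 00-ff
       level3 = hex_id[4:6]      # third level directory: 00-ff
       return os.path.join(root, partition, level1, level2, level3, hex_id)
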
This approach works properly; however, it has a few shortcomings:

- The number of directories means that the kernel dentry cache fills up pretty
  quickly; random accesses therefore usually need to hit the disk for all
  levels of the directory hierarchy, making them fairly slow.
- As files on disk are compressed, we end up with hundreds of millions of tiny
  files on disk. While filesystems have some optimizations to inline very small
  files directly into the inodes, we still had to resort to tiny block sizes to
  avoid wasting lots of space, which makes disk accesses costly.

Cloud-based Object Storage
--------------------------

Our cloud-based object storage implementations are a very thin layer on top of
the respective backends. Objects are stored directly, using their full
hexadecimal object identifier as key and the gzipped data as value.

The Azure-based object storage also shards objects across a configurable number
of storage accounts, increasing the IOPS quota available to the content storage
(the Azure blob storage IOPS limits are set per storage account).

Cloud providers don't offer ways to control the lower-level implementation
details of the storage backend. Setting aside factors out of our control, the
main bottleneck for those implementations is the large number of HTTPS requests
needed to fetch objects: pipelining those requests is critical, as it avoids
setting up and tearing down lots of expensive TCP+TLS connections.

In the case of the Azure object storage, the account sharding means that we
have to connect to several domain names, possibly on different hosts, which may
or may not be on different IP addresses, making pipelining more challenging.

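As an illustration, here is a minimal sketch of this thin layer, using an
in-memory dict per "storage account" as a stand-in for the real blob APIs; the
prefix-based account selection mirrors the Azure sharding described above, but
the exact mapping of identifiers to accounts is an assumption:

.. code-block:: python

   import gzip
   import hashlib

   class ToyCloudObjStorage:
       """Stand-in for cloud blob containers: key = hex id, value = gzipped data."""

       def __init__(self, num_accounts=16):
           # One dict per "storage account"; real backends would be blob containers.
           self.accounts = [{} for _ in range(num_accounts)]

       def _account(self, hex_id):
           # Shard on a prefix of the (well-distributed) object identifier.
           return self.accounts[int(hex_id[:2], 16) % len(self.accounts)]

       def put(self, data: bytes) -> str:
           hex_id = hashlib.sha1(data).hexdigest()
           self._account(hex_id)[hex_id] = gzip.compress(data)
           return hex_id

       def get(self, hex_id: str) -> bytes:
           return gzip.decompress(self._account(hex_id)[hex_id])
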
Ceph implementation
===================

Ceph introduction
-----------------

From the Ceph website: "Ceph is a unified, distributed storage system designed
for excellent performance, reliability and scalability." Ceph is broken down
into several layers that implement low-level object storage, distributed block
devices, as well as high-level distributed filesystems.

This modular approach allows us to fine-tune our usage of Ceph to our exact
workload, only using the components that we actually need. Considering the
shortcomings of our previous implementations, we've elected to only use the
lowest-level RADOS ("Reliable Autonomic Distributed Object Store") API layer
for our backend.

Our storage pool is set to use erasure coding, with a 5 data stripes + 2 parity
stripes mode, giving us a raw storage overhead of 40%, with the ability to
recover from 2 concurrent disk failures.

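As a quick sanity check on those figures, the raw storage overhead of a *k*
data + *m* parity erasure-coded pool is simply *m / k*:

.. code-block:: python

   k, m = 5, 2              # data stripes, parity stripes
   overhead = m / k         # extra raw space per byte of useful data
   print(f"raw overhead: {overhead:.0%}")   # 40%
   # The pool survives the loss of up to m = 2 chunks,
   # i.e. 2 concurrent disk failures.
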
"Naive" Ceph-based Object Storage | |||||
--------------------------------- | |||||
The simplest implementation just stored the objects uncompressed directly in a | |||||
RADOS storage pool, using the hexadecimal object identifier as object name. | |||||
After using this implementation for a while, we had an unexpectedly high disk | |||||
usage, due to the way Ceph stores data on disk: | |||||
- Ceph's modern on-disk format, BlueStore, has a minimal allocation size (on HDDs) of | |||||
64kB, which means that all small writes turn into 64kB zero-padded objects; | |||||
- the RADOS erasure coding storage scheme needs to write a full stripe of data | |||||
chunks for each object. Data chunks on EC pools are 4kB by default, and | |||||
changing this setting is discouraged. | |||||
Both those factors mean that objects are stored on disk within stripes of
multiples of *5 × 64kB = 320kB*, each of which uses *7 × 64kB = 448kB* of raw
space.

At the other end of the spectrum, one last issue is that RADOS has a maximum
object size of *100MB*, and rejects writes of more than *90MB*. Those settings
are, once again, tunable, but they have been carefully chosen upstream.

All in all, on an almost random subset of 14 million objects, the measured
storage overhead was *630%* instead of the expected *40%*.

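A back-of-the-envelope sketch of the per-object disk footprint under this naive
scheme, combining the figures above (4kB EC chunks, 5+2 stripes, 64kB BlueStore
allocations); this approximates, rather than reproduces, the exact BlueStore
accounting:

.. code-block:: python

   import math

   EC_DATA_CHUNKS = 5          # k
   EC_PARITY_CHUNKS = 2        # m
   MIN_ALLOC = 64 * 1024       # BlueStore minimum allocation size on HDDs

   def naive_footprint(size):
       """Approximate raw bytes used to store one object of `size` bytes."""
       # Each EC shard is padded up to a multiple of the 64kB allocation unit,
       # so the useful stripe is a multiple of 5 * 64kB = 320kB ...
       stripes = max(1, math.ceil(size / (EC_DATA_CHUNKS * MIN_ALLOC)))
       # ... while the raw footprint is 7 * 64kB = 448kB per stripe.
       return stripes * (EC_DATA_CHUNKS + EC_PARITY_CHUNKS) * MIN_ALLOC

   # A 3kB object (the median size) still uses a full 448kB stripe on disk:
   print(naive_footprint(3 * 1024))   # 458752 bytes = 448kB
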
Packed Object Storage
---------------------

To handle our peculiar workload, consisting mostly of many small files, we
decided to pack several objects into the same RADOS data block. A set of index
blocks is used to keep track of which objects are stored, as well as of the
object creation time.

To properly account for RADOS specifics, we try to align our data accesses on
4kB boundaries.

Data block format
~~~~~~~~~~~~~~~~~

Objects less than *64 bytes*
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(2.2% of objects, 0.001% of total size)

Those objects are inlined directly in the index structure.

Objects less than *4kB*
^^^^^^^^^^^^^^^^^^^^^^^

(51.7% of objects, 1.0% of total size)

Size hint: *EC data chunk size = 4kB*

For those objects, we select a data block according to the next power of two of
the object size, and pack the objects aligned to that boundary.

Objects less than *320kB*
^^^^^^^^^^^^^^^^^^^^^^^^^

(43.9% of objects, 18.5% of total size)

Size hint: *BlueStore minimum allocation size for HDDs × stripe width = 320kB*

We pack those objects in a *4MB*-ish data block, aligned at a *4kB* boundary.

Objects less than *4MB*
^^^^^^^^^^^^^^^^^^^^^^^

(1.9% of objects, 27.0% of total size)

Size hint: *4MB* is the "optimal" RADOS object size.

Those objects get stored in a standalone data block.

Objects larger than *4MB*
^^^^^^^^^^^^^^^^^^^^^^^^^

(0.02% of objects, 53.5% of total size)

Size hint: *4MB* is the "optimal" RADOS object size.

We split those objects into 4MB data blocks, and store each of these blocks
separately.

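The size classes above can be summarized by a small dispatch function; the
class names and the alignment helper are illustrative only, not part of the
on-disk format:

.. code-block:: python

   INLINE_MAX = 64
   CHUNK = 4 * 1024                # EC data chunk size
   SMALL_MAX = 5 * 64 * 1024       # 320kB: min alloc for HDDs * stripe width
   BLOCK = 4 * 1024 * 1024         # "optimal" RADOS object size

   def align_4k(n):
       """Round n up to the next 4kB boundary."""
       return (n + CHUNK - 1) // CHUNK * CHUNK

   def size_class(size):
       if size <= INLINE_MAX:
           return 'inline'                  # stored in the index node itself
       if size < CHUNK:
           # packed by next power of two inside a shared data block
           return 'packed-pow2', 1 << (size - 1).bit_length()
       if size < SMALL_MAX:
           return 'packed-4MB', align_4k(size)   # packed, 4kB-aligned
       if size <= BLOCK:
           return 'standalone'
       return 'chunked', BLOCK              # split into 4MB data blocks
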
Index blocks
~~~~~~~~~~~~

As objects are addressed using a well-distributed cryptographic hash, index
sharding into blocks can be done by truncating the (hexadecimal) object
identifier to a given length.

Inside those index blocks, a number of 128-byte index nodes are stored. (TODO:
how exactly?)

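As an illustration, a possible derivation of the index block name for an
object; the prefix length and naming scheme are assumptions, and the in-block
layout is deliberately left open (see the TODO above):

.. code-block:: python

   def index_block_name(hex_id, prefix_len=4):
       """Name of the index block responsible for an object identifier.

       With a 4-hex-digit prefix this yields 65536 index blocks, each holding
       the 128-byte index nodes of the matching objects.
       """
       return 'index_%s' % hex_id[:prefix_len]
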
Index node format
~~~~~~~~~~~~~~~~~

Index nodes are packed, *128 byte*-wide data structures::

        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     0 |<-                     object identifier                     ->| 31
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    32 |      size     |     ctime     |T|<-                           | 63
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                             +
    64 |                                                               | 95
       +                index node type specific data                  +
    96 |                                                             ->| 127
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

All integers are stored in little-endian format.

The common contents for the index nodes are:

+--------------+-----------------------+-------------------------+----------------------------------------------+
| byte address | content               | type                    | interpretation                               |
+==============+=======================+=========================+==============================================+
| 0-31         | object identifier     | byte array              | sha1s use the first 20 bytes, sha256s all 32 |
+--------------+-----------------------+-------------------------+----------------------------------------------+
| 32-39        | size                  | 64 bit unsigned integer |                                              |
+--------------+-----------------------+-------------------------+----------------------------------------------+
| 40-47        | file creation time    | 64 bit signed integer   | Number of milliseconds since the UNIX epoch  |
+--------------+-----------------------+-------------------------+----------------------------------------------+
| 48           | *T*: index node type  | 8 bit unsigned integer  | Currently known values: 0, 1, 2              |
+--------------+-----------------------+-------------------------+----------------------------------------------+

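As a sketch, the common header can be decoded with Python's ``struct`` module,
following the byte layout in the table above; the function and field names are
illustrative:

.. code-block:: python

   import struct

   NODE_SIZE = 128
   # Common header: 32-byte object id, u64 size, i64 ctime (ms since the
   # UNIX epoch), u8 type; little-endian, no padding (49 bytes).
   HEADER = struct.Struct('<32sQqB')

   def parse_index_node(node: bytes):
       """Split a 128-byte index node into its common header fields and the
       type-specific remainder (bytes 49-127)."""
       assert len(node) == NODE_SIZE
       object_id, size, ctime_ms, node_type = HEADER.unpack_from(node, 0)
       return object_id, size, ctime_ms, node_type, node[HEADER.size:]
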
There are currently three index node types:

*T = 0*: Inline Object (*<= 64B*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     0 |<-                         object id                         ->| 31
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    32 |      size     |     ctime     |0|                             | 63
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    64 |<-                                                             | 95
       +                          object data                          +
    96 |                                                             ->| 127
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

Type-specific index node contents:

+--------------+-------------+------------+---------------------------------+
| byte address | content     | type       | interpretation                  |
+==============+=============+============+=================================+
| 64-127       | object data | byte array | only use the first *size* bytes |
+--------------+-------------+------------+---------------------------------+

*T = 1*: Small Object (*<= 4MB*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     0 |<-                         object id                         ->| 31
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    32 |      size     |     ctime     |1|             |    offset     | 63
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    64 |<-                                                             | 95
       +                     data block identifier                     +
    96 |                                                             ->| 127
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

Type-specific index node contents:

+--------------+----------------+-------------------------+-----------------------------------------------------+
| byte address | content        | type                    | interpretation                                      |
+==============+================+=========================+=====================================================+
| 56-63        | offset         | 64 bit unsigned integer |                                                     |
+--------------+----------------+-------------------------+-----------------------------------------------------+
| 64-127       | data block id. | byte array              | RADOS object id for the data block (probably ASCII) |
+--------------+----------------+-------------------------+-----------------------------------------------------+

*T = 2*: Large Object (*> 4MB*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     0 |<-                         object id                         ->| 31
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    32 |      size     |     ctime     |2| chunk size  |  chunk count  | 63
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    64 |<-                                                             | 95
       +                     data block id pattern                     +
    96 |                                                             ->| 127
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

Type-specific index node contents:

+--------------+------------------------+-------------------------+-----------------------------------------------------+
| byte address | content                | type                    | interpretation                                      |
+==============+========================+=========================+=====================================================+
| 48-55        | chunk size             | 64 bit unsigned integer | each chunk is this number of bytes (set LSB to 0)   |
+--------------+------------------------+-------------------------+-----------------------------------------------------+
| 56-63        | chunk count            | 64 bit unsigned integer | total number of chunks to fetch                     |
+--------------+------------------------+-------------------------+-----------------------------------------------------+
| 64-127       | data block id. pattern | byte array              | RADOS object id for the data block (probably ASCII) |
+--------------+------------------------+-------------------------+-----------------------------------------------------+

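Putting the three types together, a hypothetical read path could look as
follows; it assumes ``parse_index_node`` from the sketch above, a
librados-style ``ioctx.read(oid, length, offset)``, and a made-up suffix
scheme for the *T = 2* data block id pattern, which the format above does not
specify:

.. code-block:: python

   import struct

   def read_object(ioctx, node: bytes) -> bytes:
       object_id, size, ctime_ms, node_type, _rest = parse_index_node(node)
       if node_type == 0:
           # Inline object: data lives at bytes 64-127 of the index node.
           return node[64:64 + size]
       if node_type == 1:
           # Small object: read `size` bytes at `offset` in the shared data block.
           (offset,) = struct.unpack_from('<Q', node, 56)
           block_id = node[64:128].rstrip(b'\x00').decode('ascii')
           return ioctx.read(block_id, size, offset)
       if node_type == 2:
           # Large object: fetch `chunk count` chunks of `chunk size` bytes.
           chunk_size, chunk_count = struct.unpack_from('<QQ', node, 48)
           # Assumption: "set LSB to 0" refers to the low byte, which overlaps
           # the type field at byte 48.
           chunk_size &= ~0xff
           pattern = node[64:128].rstrip(b'\x00').decode('ascii')
           data = b''.join(ioctx.read('%s_%d' % (pattern, i), chunk_size, 0)
                           for i in range(chunk_count))
           return data[:size]
       raise ValueError('unknown index node type %d' % node_type)
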
Addendum: Object Size Statistics Generation
===========================================

Generate a CSV file containing the content lengths from the Software Heritage
database using:

.. code-block:: postgresql

   -- tmp_sha1 holds the sha1 identifiers of the sampled contents
   copy (
     select sha1, length
     from content
     where exists (
       select 1 from tmp_sha1
       where tmp_sha1.sha1 = content.sha1
     )
   ) to 'test_lengths.csv' csv

Statistics are generated with Pandas::

   import csv
   import pandas as pd

   r = csv.reader(open('test_lengths.csv'))

   # s contains the list of sizes: size distribution
   s = pd.Series(int(line[1]) for line in r)
   total = s.sum()

   boundaries = [64, 4096, 65536*5, 4*1024*1024]

   for i in range(len(boundaries) + 1):
       lower_bound = boundaries[i-1] if i > 0 else -1
       upper_bound = boundaries[i] if i < len(boundaries) else -1
       filter_lower = (s > lower_bound) if lower_bound > 0 else True
       filter_upper = (s <= upper_bound) if upper_bound > 0 else True
       matching = filter_lower & filter_upper
       number_ratio = matching.sum() / s.count()
       size_ratio = (s * matching).sum() / total
       print(lower_bound, upper_bound, number_ratio, size_ratio)