diff --git a/docs/design.rst b/docs/design.rst
new file mode 100644
--- /dev/null
+++ b/docs/design.rst
@@ -0,0 +1,360 @@
+=====================
+Object storage design
+=====================
+
+Problem statement
+=================
+
+The main goal of the Software Heritage object storage is to store a large
+quantity (order of magnitude: several billions) of files ranging from 0 to 100MB
+(median size: 3kB). Objects are addressed via a cryptographic hash of their
+contents. Objects are never deleted, only added.
+
+Some uses of the object storage may diverge from this baseline; for instance,
+the Vault uses our object storage as a cache backend, and there are a few
+differences in this usage:
+
+ - objects can be larger (and can be up to 2GB)
+ - addressing is not done with an intrinsic hash but with an extrinsic
+ identifier
+ - objects can be removed (for cache expiry)
+
+Optimizations for the baseline should try not to hinder these more general use
+cases.
+
+Historical Implementations
+==========================
+
+Over the course of development of the Software Heritage project, we have taken
+several approaches for the (low-level) storage of source file contents. The
+scale (4.5 billion files and counting) and the size distribution (median raw
+file size of 3kB) mean that usual file storage solutions fall short in one or
+several ways.
+
+PathSlicingObjStorage
+---------------------
+
+The simplest approach we have taken is to use a regular filesystem, storing
+compressed files there. To avoid issues with very large directories, contents
+were sharded across several partitions, and then across directories.
+
+The first production deployments were sharded with the following pattern::
+
+ /srv/storage/A/AB/CD/EF/ABCDEFGH[...]
+ | | | | |
+ `-|--|--|--|-> partition name (0-f, 16 partitions)
+ `--|--|--|-> first level directory ([A]0-[A]f, 16 directories)
+ `--|--|-> second level directory (00-ff, 256 directories)
+ `--|-> third level directory (00-ff, 256 directories)
+ `-> full hash as filename (40 hexadecimal digits)
+
+This partitioning scheme gives a total of 2 ** 24 = 16 million directories,
+each containing a few hundred files.
+
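+As an illustration, the mapping from an object identifier to its on-disk path
+can be sketched as follows (a hypothetical helper mirroring the layout above,
+not the actual PathSlicingObjStorage code)::
+
+    import os
+
+    def sliced_path(root, hex_id):
+        """Map a 40-hexdigit object identifier to its on-disk path."""
+        return os.path.join(
+            root,           # e.g. /srv/storage
+            hex_id[0],      # partition name (0-f)
+            hex_id[0:2],    # first level directory
+            hex_id[2:4],    # second level directory
+            hex_id[4:6],    # third level directory
+            hex_id,         # full hash as filename
+        )
+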
+This approach works properly; however, it has a few shortcomings:
+
+ - The number of directories means that the kernel dentry cache fills up pretty
+   quickly; random accesses therefore usually need to hit the disk for every
+   level of the directory hierarchy, making them fairly slow.
+ - As files on disk are compressed, we end up with hundreds of millions of tiny
+   files. While filesystems have some optimizations to inline very small files
+   directly into the inodes, we still had to resort to tiny block sizes to avoid
+   wasting lots of space, which makes disk accesses costly.
+
+
+Cloud-based Object Storage
+--------------------------
+
+Our cloud-based object storage implementations are very thin layers on top of
+the respective backends. Objects are stored directly, using their full
+hexadecimal object identifier as key and the gzipped data as value.
+
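+Schematically, a write and a read boil down to the following (a sketch using a
+hypothetical put/get key-value client, not the actual provider APIs)::
+
+    import gzip
+
+    def cloud_put(kv_client, hex_id, data):
+        """Store an object: full hexadecimal identifier as key, gzipped data
+        as value; kv_client is a hypothetical key-value client."""
+        kv_client.put(key=hex_id, value=gzip.compress(data))
+
+    def cloud_get(kv_client, hex_id):
+        """Fetch and decompress an object by its hexadecimal identifier."""
+        return gzip.decompress(kv_client.get(key=hex_id))
+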
+The Azure-based object storage also shards objects across a configurable number
+of storage accounts, increasing the IOPS quota for operation of the content
+storage (the Azure blob storage IOPS limits are set per storage account).
+
+Cloud providers don't offer ways to control the lower-level implementation
+details of the storage on the backend. Setting aside factors out of our control,
+the main bottlenecks for those implementations are the numerous HTTPS requests
+needed to fetch objects: pipelining of those requests is critical, as it avoids
+setting up and tearing down lots of expensive TCP+TLS connections.
+
+In the case of the Azure object storage, the account sharding means that we have
+to connect to several domain names, possibly on different hosts, which may or
+may not be on different IP addresses, making pipelining more challenging.
+
+Ceph implementation
+===================
+
+Ceph introduction
+-----------------
+
+From the Ceph website: "Ceph is a unified, distributed storage system designed
+for excellent performance, reliability and scalability." Ceph is broken down
+into several layers that implement low-level object storage, distributed block
+devices, as well as high-level distributed filesystems.
+
+This modular approach allows us to fine-tune our usage of Ceph to our exact
+workload, only using the components that we actually need. Considering the
+shortcomings of our previous implementations, we've elected to only use the
+lowest-level RADOS ("Reliable Autonomic Distributed Object Store") API layer for
+our backend.
+
+Our storage pool is set to use erasure coding, with a 5 data chunks + 2 parity
+chunks profile: 7 chunks are written for every 5 chunks of actual data, giving
+us a raw storage overhead of 40%, with the ability to recover from 2 concurrent
+disk failures.
+
+"Naive" Ceph-based Object Storage
+---------------------------------
+
+The simplest implementation just stored the objects, uncompressed, directly in
+a RADOS storage pool, using the hexadecimal object identifier as object name.
+After using this implementation for a while, we noticed unexpectedly high disk
+usage, caused by the way Ceph stores data on disk:
+
+ - Ceph's modern on-disk format, BlueStore, has a minimal allocation size (on HDDs) of
+ 64kB, which means that all small writes turn into 64kB zero-padded objects;
+ - the RADOS erasure coding storage scheme needs to write a full stripe of data
+ chunks for each object. Data chunks on EC pools are 4kB by default, and
+ changing this setting is discouraged.
+
+Both those factors mean that objects are stored on disk in stripes whose data
+size is a multiple of *5 × 64kB = 320kB*, each of which uses *7 × 64kB = 448kB*
+of raw storage.
+
+At the other end of the spectrum, one last issue was that RADOS has a maximum
+object size of *100MB*, and rejects writes of more than *90MB*. Those settings
+are, once again, tunable, but they have been carefully picked by upstream.
+
+All in all, on an almost random subset of 14 million objects, the storage
+overhead was *630%* instead of the expected *40%*.
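+
+To make the overhead concrete, the raw on-disk footprint of a single object
+under this naive scheme can be estimated as follows (a back-of-the-envelope
+sketch based on the two factors above)::
+
+    import math
+
+    CHUNK = 64 * 1024   # BlueStore minimum allocation size on HDDs
+    K, M = 5, 2         # erasure coding profile: 5 data + 2 parity chunks
+
+    def naive_on_disk_size(object_size):
+        """Raw bytes used by one object: data is padded to whole stripes of
+        5 x 64kB = 320kB, and each stripe costs 7 x 64kB = 448kB on disk."""
+        stripes = max(1, math.ceil(object_size / (K * CHUNK)))
+        return stripes * (K + M) * CHUNK
+
+    # a median-sized 3kB object still uses a full 448kB stripe on disk
+    print(naive_on_disk_size(3 * 1024) // 1024, "kB")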
+
+Packed Object Storage
+---------------------
+
+To handle our peculiar workload, consisting of many small files, we decided to
+pack several objects into the same RADOS data block. A set of index blocks is
+used to keep track of which objects are stored, as well as of their creation
+time.
+
+To properly account for RADOS specifics, we try to align our data access on 4kB
+boundaries.
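+
+Aligning a size or offset up to the next 4kB boundary is the usual bit of
+arithmetic (illustrative helper, not part of the format itself)::
+
+    def align_up(n, boundary=4096):
+        """Round n up to the next multiple of boundary (4kB by default)."""
+        return (n + boundary - 1) // boundary * boundary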
+
+Data block format
+~~~~~~~~~~~~~~~~~
+
+Objects less than *64 bytes*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+(2.2% of objects, 0.001% of total size)
+
+Those objects are only inlined in the index structure; no data block is used.
+
+Objects less than *4kB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(51.7% of objects, 1.0% of total size)
+
+Size hint: *EC data chunk size = 4kB*
+
+For those objects, we select a data block by size class (the next power of two
+above the object size), and pack the objects aligned on that boundary.
+
+Objects less than *320kB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(43.9% of objects, 18.5% of total size)
+
+Size hint: *BlueStore minimum allocation size for HDDs × stripe width = 320kB*
+
+We pack those objects in a *4MB*-ish data block, aligned at a *4kB* boundary.
+
+Objects less than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(1.9% of objects, 27.0% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+Those objects get stored in a standalone data block.
+
+Objects larger than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(0.02% of objects, 53.5% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+We split those objects into 4MB data blocks, and store each of these blocks
+separately.
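+
+Putting these size classes together, the choice of packing strategy for a given
+object can be summarized as follows (an illustrative sketch, not actual
+implementation code)::
+
+    KB, MB = 1024, 1024 * 1024
+
+    def packing_strategy(object_size):
+        """Packing strategy for an object of the given size, following the
+        size classes described above."""
+        if object_size <= 64:
+            return 'inlined in the index node'        # index node type T = 0
+        elif object_size <= 4 * KB:
+            return 'packed, power-of-two slots'       # T = 1
+        elif object_size <= 320 * KB:
+            return 'packed in a ~4MB data block'      # T = 1
+        elif object_size <= 4 * MB:
+            return 'standalone data block'            # T = 1
+        else:
+            return 'split into 4MB data blocks'       # T = 2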
+
+Index blocks
+~~~~~~~~~~~~
+
+As objects are addressed using a well-distributed cryptographic hash, the index
+can be sharded into blocks by truncating the (hexadecimal) object identifier to
+a given prefix length.
+
+Inside those index blocks, a number of 128-byte index nodes are stored. (TODO:
+how exactly?)
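+
+For instance, with a two-character prefix, objects would be spread over 256
+index blocks (an illustrative sketch; the actual prefix length is not fixed
+here)::
+
+    def index_block_name(hex_id, prefix_length=2):
+        """Name of the index block an object is sharded into: a prefix of its
+        hexadecimal identifier (prefix_length is illustrative)."""
+        return hex_id[:prefix_length]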
+
+
+Index node format
+~~~~~~~~~~~~~~~~~
+
+Index nodes are packed data structures, *128 bytes* wide.
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object identifier ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |T|<- | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
+ 64 | | 95
+ + index node type specific data +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+
+All integers are stored in little-endian format.
+
+The common contents for the index nodes are:
+
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+======================+=========================+======================================================+
+| 0-31 | object identifier | byte array | sha1s use the first 20 bytes, sha256s all 32 |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 32-39 | size | 64 bit unsigned integer | |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 40-47 | file creation time | 64 bit signed integer | Number of milliseconds since the UNIX epoch |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 48 | *T*: index node type | 8 bit unsigned integer | Currently known values: 0, 1, 2 |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+
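+A sketch of decoding this common header with Python's ``struct`` module (field
+names are illustrative, not taken from an implementation)::
+
+    import struct
+
+    # little-endian: 32-byte object id, u64 size, i64 ctime (ms), u8 node type
+    COMMON_HEADER = struct.Struct('<32sQqB')
+
+    def parse_common_header(node):
+        """Decode the common fields (bytes 0-48) of a 128-byte index node."""
+        assert len(node) == 128
+        object_id, size, ctime_ms, node_type = COMMON_HEADER.unpack_from(node, 0)
+        return object_id, size, ctime_ms, node_type
+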
+There are currently three index node types:
+
+*T = 0*: Inline Object (*<= 64B*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |0| | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + object data +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+--------------------+-------------------------+---------------------------------+
+| byte address | content | type | interpretation |
++==============+====================+=========================+=================================+
+| 64-127 | object data | byte array | only use the first *size* bytes |
++--------------+--------------------+-------------------------+---------------------------------+
+
+
+*T = 1*: Small Object (*<= 4MB*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |1| | offset | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + data block identifier +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+--------------------+-------------------------+-----------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+====================+=========================+=====================================================+
+| 56-63 | offset | 64 bit unsigned integer | |
++--------------+--------------------+-------------------------+-----------------------------------------------------+
+| 64-127 | data block id. | byte array | RADOS object id for the data block (probably ASCII) |
++--------------+--------------------+-------------------------+-----------------------------------------------------+
+
+*T = 2*: Large Object (*> 4MB*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |2| chunk size | chunk count | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + data block id pattern +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+===========================+=========================+=====================================================+
+| 48-55 | chunk size | 64 bit unsigned integer | each chunk is this number of bytes (set LSB to 0) |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| 56-63 | chunk count | 64 bit unsigned integer | total number of chunks to fetch |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| 64-127 | data block id. pattern | byte array | RADOS object id for the data block (probably ASCII) |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+
+
+Addendum: Object Size Statistics Generation
+===========================================
+
+Generate a CSV file containing the content lengths from the Software Heritage
+database using:
+
+.. code-block:: postgresql
+
+    copy (
+      -- only keep contents whose sha1 is in the tmp_sha1 sample table
+      select sha1, length
+      from content
+      where exists (
+        select 1 from tmp_sha1
+        where tmp_sha1.sha1 = content.sha1
+      )
+    ) to 'test_lengths.csv' csv;
+
+
+Statistics are generated with Pandas::
+
+    import csv
+    import pandas as pd
+
+    r = csv.reader(open('test_lengths.csv'))
+
+    # s contains the list of object sizes (the size distribution)
+    s = pd.Series(int(line[1]) for line in r)
+
+    total_size = s.sum()
+
+    # size class boundaries: 64B, 4kB, 320kB, 4MB
+    boundaries = [64, 4096, 65536 * 5, 4 * 1024 * 1024]
+
+    for i in range(len(boundaries) + 1):
+        lower_bound = boundaries[i - 1] if i > 0 else -1
+        upper_bound = boundaries[i] if i < len(boundaries) else -1
+        filter_lower = (s > lower_bound) if lower_bound > 0 else True
+        filter_upper = (s <= upper_bound) if upper_bound > 0 else True
+        matching = filter_lower & filter_upper
+        # share of the number of objects, and of the total size, in this class
+        number_ratio = matching.sum() / s.count()
+        size_ratio = (s * matching).sum() / total_size
+        print(lower_bound, upper_bound, number_ratio, size_ratio)
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -7,7 +7,7 @@
:maxdepth: 2
:caption: Contents:
-
+ design
Indices and tables
==================