D398.id1237.diff

diff --git a/docs/design.rst b/docs/design.rst
new file mode 100644
--- /dev/null
+++ b/docs/design.rst
@@ -0,0 +1,360 @@
+=====================
+Object storage design
+=====================
+
+Problem statement
+=================
+
+The main goal of the Software Heritage object storage is to store a large
+quantity (order of magnitude: several billions) of files ranging from 0 to 100MB
+(median size: 3kB). Objects are addressed via a cryptographic hash of their
+contents. Objects are never deleted, only added.
+
+Some uses of the object storage may diverge from this baseline; for instance,
+the Vault uses our object storage as a cache backend, and there are a few
+differences in this usage:
+
+ - objects can be larger (up to 2GB)
+ - addressing is not done with an intrinsic hash but with an extrinsic
+   identifier
+ - objects can be removed (for cache expiry)
+
+Optimizations for the baseline should try not to hinder these diverging usage
+patterns.
+
+Historical Implementations
+==========================
+
+Over the course of development of the Software Heritage project, we have taken
+several approaches for the (low-level) storage of source file contents. The
+scale (4.5 billion files and counting) as well as the size distribution (median
+raw file size of 3kB) means that usual file storage solutions fall short in one
+or several ways.
+
+PathSlicingObjStorage
+---------------------
+
+The simplest approach we have taken is to use a regular filesystem, storing
+compressed files there. To avoid issues with very large directories, contents
+were sharded across several partitions, and then across directories.
+
+The first production deployments were sharded with the following pattern::
+
+ /srv/storage/A/AB/CD/EF/ABCDEFGH[...]
+ | | | | |
+ `-|--|--|--|-> partition name (0-f, 16 partitions)
+ `--|--|--|-> first level directory ([A]0-[A]f, 16 directories)
+ `--|--|-> second level directory (00-ff, 256 directories)
+ `--|-> third level directory (00-ff, 256 directories)
+ `-> full hash as filename (40 hexadecimal digits)
+
+This partitioning scheme gives a total of 2 ** 24 = 16 million directories,
+each containing a few hundred files.
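+
+For illustration, here is a minimal sketch of the path computation implied by
+this pattern (the ``sliced_path`` helper is hypothetical, not the actual
+``PathSlicingObjStorage`` code)::
+
+    import os
+
+    def sliced_path(hex_id, root='/srv/storage'):
+        """Compute the on-disk path of a 40-hex-digit object identifier."""
+        assert len(hex_id) == 40
+        return os.path.join(
+            root,
+            hex_id[0],     # partition (16 partitions)
+            hex_id[0:2],   # first level directory
+            hex_id[2:4],   # second level directory
+            hex_id[4:6],   # third level directory
+            hex_id,        # full hash as filename
+        )
+
+    # e.g. an identifier starting with 'abcdef' maps to
+    # /srv/storage/a/ab/cd/ef/<full hash>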
+
+This approach works properly; however, it has a few shortcomings:
+
+ - The number of directories means the kernel dentry cache fills up pretty
+   quickly; random accesses therefore usually need to hit the disk for all
+   levels of the directory hierarchy, making them fairly slow.
+ - As files are stored compressed, we end up with hundreds of millions of tiny
+   files on disk. While filesystems have some optimizations to inline very
+   small files directly into the inodes, we still had to resort to tiny block
+   sizes to avoid wasting lots of space, which makes disk accesses costly.
+
+
+Cloud-based Object Storage
+--------------------------
+
+Our cloud-based object storage implementations are very thin layers on top of
+the respective backends. Objects are stored directly, using their full
+hexadecimal object identifier as key and the gzipped data as value.
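+
+As a minimal sketch (assuming a generic mapping-like ``backend`` object; the
+``put``/``get`` helpers below are illustrative, not the actual cloud SDK
+calls)::
+
+    import gzip
+
+    def put(backend, hex_id, data):
+        # store the gzipped content under its full hexadecimal identifier
+        backend[hex_id] = gzip.compress(data)
+
+    def get(backend, hex_id):
+        # fetch and decompress the content
+        return gzip.decompress(backend[hex_id])
+
+    # usage, with a plain dict standing in for the cloud backend:
+    #   store = {}
+    #   put(store, '00' * 20, b'hello')
+    #   assert get(store, '00' * 20) == b'hello'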
+
+The Azure-based object storage also shards objects across a configurable number
+of storage accounts, increasing the IOPS quota for operation of the content
+storage (the Azure blob storage IOPS limits are set per storage account).
+
+Cloud providers don't offer ways to control the lower-level implementation
+details of the storage on the backend. Setting aside factors out of our control,
+the main bottlenecks for those implementations are the numerous HTTPS requests
+needed to fetch objects: pipelining of those requests is critical, as it avoids
+setting up and tearing down lots of expensive TCP+TLS connections.
+
+In the case of the Azure object storage, the account sharding means that we have
+to connect to several domain names, possibly on different hosts, which may or
+may not be on different IP addresses, making pipelining more challenging.
+
+Ceph implementation
+===================
+
+Ceph introduction
+-----------------
+
+From the Ceph website: "Ceph is a unified, distributed storage system designed
+for excellent performance, reliability and scalability." Ceph is broken down
+into several layers that implement low-level object storage, distributed block
+devices, as well as high-level distributed filesystems.
+
+This modular approach allows us to fine-tune our usage of Ceph to our exact
+workload, only using the components that we actually need. Considering the
+shortcomings of our previous implementations, we've elected to only use the
+lowest-level RADOS ("Reliable Autonomic Distributed Object Store") API layer for
+our backend.
+
+Our storage pool is set to use erasure coding, with a 5 data chunks + 2 parity
+chunks scheme, giving us a raw storage overhead of 40%, with the ability to
+recover from 2 concurrent disk failures.
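+
+As a quick check of those numbers (a back-of-the-envelope calculation, not code
+from the implementation)::
+
+    k, m = 5, 2               # data chunks, parity chunks
+    overhead = m / k          # 2 / 5 = 0.4, i.e. 40% raw storage overhead
+    raw_factor = (k + m) / k  # each object ideally occupies 1.4x its size
+    tolerated = m             # up to 2 chunks (hence disks) can be lost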
+
+"Naive" Ceph-based Object Storage
+---------------------------------
+
+The simplest implementation just stored the objects, uncompressed, directly in
+a RADOS storage pool, using the hexadecimal object identifier as the object
+name. After using this implementation for a while, we noticed an unexpectedly
+high disk usage, due to the way Ceph stores data on disk:
+
+ - Ceph's modern on-disk format, BlueStore, has a minimum allocation size (on
+   HDDs) of 64kB, which means that all small writes turn into 64kB zero-padded
+   objects;
+ - the RADOS erasure coding storage scheme needs to write a full stripe of data
+ chunks for each object. Data chunks on EC pools are 4kB by default, and
+ changing this setting is discouraged.
+
+Both those factors mean that objects are stored on disk in stripes whose
+payload is a multiple of *320kB* (5 × 64kB), each of which occupies *448kB*
+(7 × 64kB) of raw space.
+
+At the other end of the spectrum, one last issue is that RADOS has a maximum
+object size of *100MB*, and rejects writes of more than *90MB*. Those settings
+are, once again, tunable, but they have been carefully picked by upstream.
+
+All in all, on an almost random subset of 14 million objects, the storage
+overhead was *630%* instead of the expected *40%*.
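+
+As a rough model of where this overhead comes from (assuming every object's
+footprint is rounded up to whole 320kB stripes costing 448kB each, which
+simplifies the real allocator behaviour)::
+
+    import math
+
+    STRIPE_DATA = 5 * 64 * 1024   # 320kB of payload per stripe
+    STRIPE_DISK = 7 * 64 * 1024   # 448kB actually written per stripe
+
+    def naive_footprint(size):
+        """Approximate on-disk usage of one object in the naive scheme."""
+        stripes = max(1, math.ceil(size / STRIPE_DATA))
+        return stripes * STRIPE_DISK
+
+    # the 3kB median object alone already uses 448kB on disk; over a set of
+    # object sizes, the overhead is:
+    #   sum(map(naive_footprint, sizes)) / sum(sizes) - 1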
+
+Packed Object Storage
+---------------------
+
+To handle our peculiar workload, consisting of many small files, we decided to
+pack several objects in the same RADOS data block. A set of index blocks is
+used to keep track of which objects are stored, as well as their creation time.
+
+To properly account for RADOS specifics, we try to align our data access on 4kB
+boundaries.
+
+Data block format
+~~~~~~~~~~~~~~~~~
+
+Objects less than *64 bytes*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+(2.2% of objects, 0.001% of total size)
+
+Those objects are simply inlined in the index structure.
+
+Objects less than *4kB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(51.7% of objects, 1.0% of total size)
+
+Size hint: *EC data chunk size = 4kB*
+
+For those objects, we select a data block by size class, rounding the object
+size up to the next power of two, and pack objects within the block aligned to
+that power-of-two boundary.
+
+Objects less than *320kB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(43.9% of objects, 18.5% of total size)
+
+Size hint: *BlueStore minimum allocation size for HDD × stripe width = 320kB*
+
+We pack those objects in a *4MB*-ish data block, aligned at a *4kB* boundary.
+
+Objects less than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(1.9% of objects, 27.0% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+Those objects get stored in a standalone data block.
+
+Objects larger than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(0.02% of objects, 53.5% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+We split those objects into 4MB data blocks, and store each of these blocks
+separately.
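+
+To summarize the five classes, here is a minimal sketch of a size-based
+dispatch helper (the class labels are illustrative, not actual API names)::
+
+    KB = 1024
+    MB = 1024 * KB
+
+    def storage_class(size):
+        """Map an object size to the packing strategy described above."""
+        if size <= 64:
+            return 'inline'        # stored directly in the index node
+        elif size <= 4 * KB:
+            # packed in a data block sized to the next power of two
+            return 'packed-pow2'
+        elif size <= 320 * KB:
+            return 'packed-4MB'    # packed in a ~4MB block, 4kB-aligned
+        elif size <= 4 * MB:
+            return 'standalone'    # one dedicated data block
+        else:
+            return 'chunked'       # split into 4MB data blocks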
+
+Index blocks
+~~~~~~~~~~~~
+
+As objects are addressed using a well-distributed cryptographic hash, the index
+can be sharded into blocks by truncating the (hexadecimal) object identifier to
+a given length.
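+
+For instance, with a hypothetical prefix length of 4 hexadecimal digits (the
+actual length is a tuning parameter), the index block for a given object could
+be selected as::
+
+    def index_block_name(hex_id, prefix_len=4):
+        # 4 hex digits would shard the index into 16**4 = 65536 blocks;
+        # the 'index_' naming is illustrative only
+        return 'index_' + hex_id[:prefix_len]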
+
+Inside those index blocks, a number of 128-byte index nodes are stored. (TODO:
+how exactly?)
+
+
+Index node format
+~~~~~~~~~~~~~~~~~
+
+Each index node is a packed data structure, *128 bytes* wide.
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object identifier ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |T|<- | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
+ 64 | | 95
+ + index node type specific data +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+
+All integers are stored in little-endian format.
+
+The common contents for the index nodes are:
+
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+======================+=========================+======================================================+
+| 0-31 | object identifier | byte array | sha1s use the first 20 bytes, sha256s all 32 |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 32-39 | size | 64 bit unsigned integer | |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 40-47 | file creation time | 64 bit signed integer | Number of milliseconds since the UNIX epoch |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 48 | *T*: index node type | 8 bit unsigned integer | Currently known values: 0, 1, 2 |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+
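+A sketch of parsing this common header with Python's ``struct`` module (field
+names follow the table above; this is an illustration, not the reference
+implementation)::
+
+    import struct
+
+    # 32-byte id, u64 size, i64 ctime (ms since the epoch), u8 type,
+    # then 79 type-specific bytes up to the 128-byte boundary
+    INDEX_NODE = struct.Struct('<32sQqB79s')
+    assert INDEX_NODE.size == 128
+
+    def parse_index_node(raw):
+        obj_id, size, ctime_ms, node_type, specific = INDEX_NODE.unpack(raw)
+        return {'id': obj_id, 'size': size, 'ctime_ms': ctime_ms,
+                'type': node_type, 'specific': specific}
+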
+There are currently three index node types:
+
+*T = 0*: Inline Object (*<= 64B*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |0| | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + object data +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+--------------------+-------------------------+---------------------------------+
+| byte address | content | type | interpretation |
++==============+====================+=========================+=================================+
+| 64-127 | object data | byte array | only use the first *size* bytes |
++--------------+--------------------+-------------------------+---------------------------------+
+
+
+*T = 1*: Small Object (*<= 4MB*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |1| | offset | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + data block identifier +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+--------------------+-------------------------+-----------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+====================+=========================+=====================================================+
+| 56-63 | offset | 64 bit unsigned integer | |
++--------------+--------------------+-------------------------+-----------------------------------------------------+
+| 64-127 | data block id. | byte array | RADOS object id for the data block (probably ASCII) |
++--------------+--------------------+-------------------------+-----------------------------------------------------+
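+
+Reading a small object then boils down to a single ranged read in the
+referenced data block. A sketch, assuming the layout in the diagram above
+(bytes 49-55 unused) and a hypothetical ``read_range(block_id, offset, length)``
+primitive standing in for the actual RADOS read call::
+
+    import struct
+
+    # common header, 7 unused bytes, u64 offset, 64-byte data block id
+    SMALL_NODE = struct.Struct('<32sQqB7xQ64s')
+    assert SMALL_NODE.size == 128
+
+    def read_small_object(raw_node, read_range):
+        _id, size, _ctime, node_type, offset, block_id = SMALL_NODE.unpack(raw_node)
+        assert node_type == 1
+        # the data block id is probably NUL-padded ASCII
+        return read_range(block_id.rstrip(b'\x00'), offset, size)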
+
+*T = 2*: Large Object (*> 4MB*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |2| chunk size | chunk count | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + data block id pattern +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+===========================+=========================+=====================================================+
+| 48-55 | chunk size | 64 bit unsigned integer | each chunk is this number of bytes (set LSB to 0) |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| 56-63 | chunk count | 64 bit unsigned integer | total number of chunks to fetch |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| 64-127 | data block id. pattern | byte array | RADOS object id for the data block (probably ASCII) |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
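+
+The naming of the individual chunk blocks follows the id pattern above, which
+is not specified here; the sketch below therefore only covers the size
+arithmetic for a chunked object::
+
+    def chunk_count(size, chunk_size=4 * 1024 * 1024):
+        """Number of fixed-size chunks needed to hold `size` bytes."""
+        return (size + chunk_size - 1) // chunk_size
+
+    def locate(position, chunk_size=4 * 1024 * 1024):
+        """Map a byte position to (chunk index, offset within that chunk)."""
+        return divmod(position, chunk_size)
+
+    # a 10MB object split into 4MB chunks needs chunk_count(10 * 2**20) == 3
+    # data blocks; byte 9_000_000 lives at locate(9_000_000) == (2, 611_392)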
+
+
+Addendum: Object Size Statistics Generation
+===========================================
+
+Generate a CSV file containing the content lengths from the Software Heritage
+database using:
+
+.. code-block:: postgresql
+
+ copy (
+ select sha1, length
+ from content
+ where exists (
+ select 1 from tmp_sha1
+ where tmp_sha1.sha1 = content.sha1
+ )
+ ) to 'test_lengths.csv' csv
+
+
+Statistics are generated with Pandas::
+
+    import csv
+    import pandas as pd
+
+    r = csv.reader(open('test_lengths.csv'))
+
+    # s contains the size of every sampled object
+    s = pd.Series(int(line[1]) for line in r)
+    total = s.sum()
+
+    boundaries = [64, 4096, 65536*5, 4*1024*1024]
+
+    for i in range(len(boundaries) + 1):
+        lower_bound = boundaries[i-1] if i > 0 else -1
+        upper_bound = boundaries[i] if i < len(boundaries) else -1
+        filter_lower = (s > lower_bound) if lower_bound > 0 else True
+        filter_upper = (s <= upper_bound) if upper_bound > 0 else True
+        matching = filter_lower & filter_upper
+        # fraction of the objects, and of the total size, in this bucket
+        number_ratio = matching.sum() / s.count()
+        size_ratio = (s * matching).sum() / total
+        print(lower_bound, upper_bound, number_ratio, size_ratio)
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -7,7 +7,7 @@
:maxdepth: 2
:caption: Contents:
-
+ design
Indices and tables
==================
