D398.id1237.diff

diff --git a/docs/design.rst b/docs/design.rst
new file mode 100644
--- /dev/null
+++ b/docs/design.rst
@@ -0,0 +1,360 @@
+=====================
+Object storage design
+=====================
+
+Problem statement
+=================
+
+The main goal of the Software Heritage object storage is to store a large
+quantity (order of magnitude: several billions) of files ranging from 0 to 100MB
+(median size: 3kB). Objects are addressed via a cryptographic hash of their
+contents. Objects are never deleted, only added.
+
+Some uses of the object storage may diverge from this baseline; for instance,
+the Vault uses our object storage as a cache backend, and there are a few
+differences in this usage:
+
+ - objects can be larger (up to 2GB)
+ - addressing is not done with an intrinsic hash but with an extrinsic
+   identifier
+ - objects can be removed (for cache expiry)
+
+Optimizations for the baseline should try not to hinder these diverging usage
+patterns.
+
+Historical Implementations
+==========================
+
+Over the course of development of the Software Heritage project, we have taken
+several approaches for the (low-level) storage of source file contents. The
+scale (4.5 billion files and counting) as well as the size distribution (median
+raw file size of 3kB) means that usual file storage solutions fall short in one
+or several ways.
+
+PathSlicingObjStorage
+---------------------
+
+The simplest approach we have taken is to use a regular filesystem, storing
+compressed files there. To avoid issues with very large directories, contents
+were sharded across several partitions, and then across directories.
+
+The first production deployments were sharded with the following pattern::
+
+ /srv/storage/A/AB/CD/EF/ABCDEFGH[...]
+ | | | | |
+ `-|--|--|--|-> partition name (0-f, 16 partitions)
+ `--|--|--|-> first level directory ([A]0-[A]f, 16 directories)
+ `--|--|-> second level directory (00-ff, 256 directories)
+ `--|-> third level directory (00-ff, 256 directories)
+ `-> full hash as filename (40 hexadecimal digits)
+
+This partitioning scheme gives a total of 2 ** 24 = 16 million directories,
+each containing a few hundred files.
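+
+For illustration, here is a minimal sketch of the path computation implied by
+this pattern (the ``sliced_path`` helper is hypothetical, not the actual
+``PathSlicingObjStorage`` code)::
+
+    import os
+
+    def sliced_path(hex_id, root='/srv/storage'):
+        """Compute the on-disk path of a 40-hex-digit object identifier."""
+        assert len(hex_id) == 40
+        return os.path.join(
+            root,
+            hex_id[0],     # partition (16 partitions)
+            hex_id[0:2],   # first level directory
+            hex_id[2:4],   # second level directory
+            hex_id[4:6],   # third level directory
+            hex_id,        # full hash as filename
+        )
+
+    # e.g. an identifier starting with 'abcdef' maps to
+    # /srv/storage/a/ab/cd/ef/<full hash>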
+
+This approach works properly; however, it has a few shortcomings:
+
+ - The number of directories means the kernel dentry cache fills up pretty
+   quickly; random accesses therefore usually need to hit the disk for all
+   levels of the directory hierarchy, making them fairly slow.
+ - As files are stored compressed, we end up with hundreds of millions of tiny
+   files on disk. While filesystems have some optimizations to inline very
+   small files directly into the inodes, we still had to resort to tiny block
+   sizes to avoid wasting lots of space, which makes disk accesses costly.
+
+
+Cloud-based Object Storage
+--------------------------
+
+Our cloud-based object storage implementations are very thin layers on top of
+the respective backends. Objects are stored directly, using their full
+hexadecimal object identifier as key and the gzipped data as value.
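+
+As a minimal sketch (assuming a generic mapping-like ``backend`` object; the
+``put``/``get`` helpers below are illustrative, not the actual cloud SDK
+calls)::
+
+    import gzip
+
+    def put(backend, hex_id, data):
+        # store the gzipped content under its full hexadecimal identifier
+        backend[hex_id] = gzip.compress(data)
+
+    def get(backend, hex_id):
+        # fetch and decompress the content
+        return gzip.decompress(backend[hex_id])
+
+    # usage, with a plain dict standing in for the cloud backend:
+    #   store = {}
+    #   put(store, '00' * 20, b'hello')
+    #   assert get(store, '00' * 20) == b'hello'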
+
+The Azure-based object storage also shards objects across a configurable number
+of storage accounts, increasing the IOPS quota for operation of the content
+storage (the Azure blob storage IOPS limits are set per storage account).
+
+Cloud providers don't offer ways to control the lower-level implementation
+details of the storage on the backend. Setting aside factors out of our control,
+the main bottlenecks for those implementations are the numerous HTTPS requests
+needed to fetch objects: pipelining of those requests is critical, as it avoids
+setting up and tearing down lots of expensive TCP+TLS connections.
+
+In the case of the Azure object storage, the account sharding means that we have
+to connect to several domain names, possibly on different hosts, which may or
+may not be on different IP addresses, making pipelining more challenging.
+
+Ceph implementation
+===================
+
+Ceph introduction
+-----------------
+
+From the Ceph website: "Ceph is a unified, distributed storage system designed
+for excellent performance, reliability and scalability." Ceph is broken down
+into several layers that implement low-level object storage, distributed block
+devices, as well as high-level distributed filesystems.
+
+This modular approach allows us to fine-tune our usage of Ceph to our exact
+workload, only using the components that we actually need. Considering the
+shortcomings of our previous implementations, we've elected to only use the
+lowest-level RADOS ("Reliable Autonomic Distributed Object Store") API layer for
+our backend.
+
+Our storage pool is set to use erasure coding, with a 5 data chunks + 2 parity
+chunks scheme, giving us a raw storage overhead of 40%, with the ability to
+recover from 2 concurrent disk failures.
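+
+As a quick check of those numbers (a back-of-the-envelope calculation, not code
+from the implementation)::
+
+    k, m = 5, 2               # data chunks, parity chunks
+    overhead = m / k          # 2 / 5 = 0.4, i.e. 40% raw storage overhead
+    raw_factor = (k + m) / k  # each object ideally occupies 1.4x its size
+    tolerated = m             # up to 2 chunks (hence disks) can be lost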
+
+"Naive" Ceph-based Object Storage
+---------------------------------
+
+The simplest implementation just stored the objects, uncompressed, directly in
+a RADOS storage pool, using the hexadecimal object identifier as the object
+name. After using this implementation for a while, we noticed an unexpectedly
+high disk usage, due to the way Ceph stores data on disk:
+
+ - Ceph's modern on-disk format, BlueStore, has a minimum allocation size (on
+   HDDs) of 64kB, which means that all small writes turn into 64kB zero-padded
+   objects;
+ - the RADOS erasure coding storage scheme needs to write a full stripe of data
+ chunks for each object. Data chunks on EC pools are 4kB by default, and
+ changing this setting is discouraged.
+
+Both those factors mean that objects are stored on disk in stripes whose
+payload is a multiple of *320kB* (5 × 64kB), each of which occupies *448kB*
+(7 × 64kB) of raw space.
+
+At the other end of the spectrum, one last issue is that RADOS has a maximum
+object size of *100MB*, and rejects writes of more than *90MB*. Those settings
+are, once again, tunable, but they have been carefully picked by upstream.
+
+All in all, on an almost random subset of 14 million objects, the storage
+overhead was *630%* instead of the expected *40%*.
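+
+As a rough model of where this overhead comes from (assuming every object's
+footprint is rounded up to whole 320kB stripes costing 448kB each, which
+simplifies the real allocator behaviour)::
+
+    import math
+
+    STRIPE_DATA = 5 * 64 * 1024   # 320kB of payload per stripe
+    STRIPE_DISK = 7 * 64 * 1024   # 448kB actually written per stripe
+
+    def naive_footprint(size):
+        """Approximate on-disk usage of one object in the naive scheme."""
+        stripes = max(1, math.ceil(size / STRIPE_DATA))
+        return stripes * STRIPE_DISK
+
+    # the 3kB median object alone already uses 448kB on disk; over a set of
+    # object sizes, the overhead is:
+    #   sum(map(naive_footprint, sizes)) / sum(sizes) - 1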
+
+Packed Object Storage
+---------------------
+
+To handle our peculiar workload, consisting of many small files, we decided to
+pack several objects in the same RADOS data block. A set of index blocks is
+used to keep track of which objects are stored, as well as their creation time.
+
+To properly account for RADOS specifics, we try to align our data access on 4kB
+boundaries.
+
+Data block format
+~~~~~~~~~~~~~~~~~
+
+Objects less than *64 bytes*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+(2.2% of objects, 0.001% of total size)
+
+Those objects are simply inlined in the index structure.
+
+Objects less than *4kB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(51.7% of objects, 1.0% of total size)
+
+Size hint: *EC data chunk size = 4kB*
+
+For those objects, we select a data block by size class, rounding the object
+size up to the next power of two, and pack objects within the block aligned to
+that power-of-two boundary.
+
+Objects less than *320kB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(43.9% of objects, 18.5% of total size)
+
+Size hint: *BlueStore minimum allocation size for HDD × stripe width = 320kB*
+
+We pack those objects in a *4MB*-ish data block, aligned at a *4kB* boundary.
+
+Objects less than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(1.9% of objects, 27.0% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+Those objects get stored in a standalone data block.
+
+Objects larger than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(0.02% of objects, 53.5% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+We split those objects into 4MB data blocks, and store each of these blocks
+separately.
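+
+To summarize the five classes, here is a minimal sketch of a size-based
+dispatch helper (the class labels are illustrative, not actual API names)::
+
+    KB = 1024
+    MB = 1024 * KB
+
+    def storage_class(size):
+        """Map an object size to the packing strategy described above."""
+        if size <= 64:
+            return 'inline'        # stored directly in the index node
+        elif size <= 4 * KB:
+            # packed in a data block sized to the next power of two
+            return 'packed-pow2'
+        elif size <= 320 * KB:
+            return 'packed-4MB'    # packed in a ~4MB block, 4kB-aligned
+        elif size <= 4 * MB:
+            return 'standalone'    # one dedicated data block
+        else:
+            return 'chunked'       # split into 4MB data blocks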
+
+Index blocks
+~~~~~~~~~~~~
+
+As objects are addressed using a well-distributed cryptographic hash, the index
+can be sharded into blocks by truncating the (hexadecimal) object identifier to
+a given length.
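+
+For instance, with a hypothetical prefix length of 4 hexadecimal digits (the
+actual length is a tuning parameter), the index block for a given object could
+be selected as::
+
+    def index_block_name(hex_id, prefix_len=4):
+        # 4 hex digits would shard the index into 16**4 = 65536 blocks;
+        # the 'index_' naming is illustrative only
+        return 'index_' + hex_id[:prefix_len]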
+
+Inside those index blocks, a number of 128-byte index nodes are stored. (TODO:
+how exactly?)
+
+
+Index node format
+~~~~~~~~~~~~~~~~~
+
+Each index node is a packed data structure, *128 bytes* wide.
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object identifier ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |T|<- | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +
+ 64 | | 95
+ + index node type specific data +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+
+All integers are stored in little-endian format.
+
+The common contents for the index nodes are:
+
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+======================+=========================+======================================================+
+| 0-31 | object identifier | byte array | sha1s use the first 20 bytes, sha256s all 32 |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 32-39 | size | 64 bit unsigned integer | |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 40-47 | file creation time | 64 bit signed integer | Number of milliseconds since the UNIX epoch |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+| 48 | *T*: index node type | 8 bit unsigned integer | Currently known values: 0, 1, 2 |
++--------------+----------------------+-------------------------+------------------------------------------------------+
+
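+A sketch of parsing this common header with Python's ``struct`` module (field
+names follow the table above; this is an illustration, not the reference
+implementation)::
+
+    import struct
+
+    # 32-byte id, u64 size, i64 ctime (ms since the epoch), u8 type,
+    # then 79 type-specific bytes up to the 128-byte boundary
+    INDEX_NODE = struct.Struct('<32sQqB79s')
+    assert INDEX_NODE.size == 128
+
+    def parse_index_node(raw):
+        obj_id, size, ctime_ms, node_type, specific = INDEX_NODE.unpack(raw)
+        return {'id': obj_id, 'size': size, 'ctime_ms': ctime_ms,
+                'type': node_type, 'specific': specific}
+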
+There are currently three index node types:
+
+*T = 0*: Inline Object (*<= 64B*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |0| | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + object data +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+--------------------+-------------------------+---------------------------------+
+| byte address | content | type | interpretation |
++==============+====================+=========================+=================================+
+| 64-127 | object data | byte array | only use the first *size* bytes |
++--------------+--------------------+-------------------------+---------------------------------+
+
+
+*T = 1*: Small Object (*<= 4MB*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |1| | offset | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + data block identifier +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+--------------------+-------------------------+-----------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+====================+=========================+=====================================================+
+| 56-63 | offset | 64 bit unsigned integer | |
++--------------+--------------------+-------------------------+-----------------------------------------------------+
+| 64-127 | data block id. | byte array | RADOS object id for the data block (probably ASCII) |
++--------------+--------------------+-------------------------+-----------------------------------------------------+
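+
+Reading a small object then boils down to a single ranged read in the
+referenced data block. A sketch, assuming the layout in the diagram above
+(bytes 49-55 unused) and a hypothetical ``read_range(block_id, offset, length)``
+primitive standing in for the actual RADOS read call::
+
+    import struct
+
+    # common header, 7 unused bytes, u64 offset, 64-byte data block id
+    SMALL_NODE = struct.Struct('<32sQqB7xQ64s')
+    assert SMALL_NODE.size == 128
+
+    def read_small_object(raw_node, read_range):
+        _id, size, _ctime, node_type, offset, block_id = SMALL_NODE.unpack(raw_node)
+        assert node_type == 1
+        # the data block id is probably NUL-padded ASCII
+        return read_range(block_id.rstrip(b'\x00'), offset, size)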
+
+*T = 2*: Large Object (*> 4MB*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 |<- object id ->| 31
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 | size | ctime |2| chunk size | chunk count | 63
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<- | 95
+ + data block id pattern +
+ 96 | ->| 127
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| byte address | content | type | interpretation |
++==============+===========================+=========================+=====================================================+
+| 48-55 | chunk size | 64 bit unsigned integer | each chunk is this number of bytes (set LSB to 0) |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| 56-63 | chunk count | 64 bit unsigned integer | total number of chunks to fetch |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
+| 64-127 | data block id. pattern | byte array | RADOS object id for the data block (probably ASCII) |
++--------------+---------------------------+-------------------------+-----------------------------------------------------+
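+
+The naming of the individual chunk blocks follows the id pattern above, which
+is not specified here; the sketch below therefore only covers the size
+arithmetic for a chunked object::
+
+    def chunk_count(size, chunk_size=4 * 1024 * 1024):
+        """Number of fixed-size chunks needed to hold `size` bytes."""
+        return (size + chunk_size - 1) // chunk_size
+
+    def locate(position, chunk_size=4 * 1024 * 1024):
+        """Map a byte position to (chunk index, offset within that chunk)."""
+        return divmod(position, chunk_size)
+
+    # a 10MB object split into 4MB chunks needs chunk_count(10 * 2**20) == 3
+    # data blocks; byte 9_000_000 lives at locate(9_000_000) == (2, 611_392)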
+
+
+Addendum: Object Size Statistics Generation
+===========================================
+
+Generate a CSV file containing the content lengths from the Software Heritage
+database using:
+
+.. code-block:: postgresql
+
+ copy (
+ select sha1, length
+ from content
+ where exists (
+ select 1 from tmp_sha1
+ where tmp_sha1.sha1 = content.sha1
+ )
+ ) to 'test_lengths.csv' csv
+
+
+Statistics are generated with Pandas::
+
+    import csv
+    import pandas as pd
+
+    r = csv.reader(open('test_lengths.csv'))
+
+    # s contains the size of every sampled object
+    s = pd.Series(int(line[1]) for line in r)
+    total = s.sum()
+
+    boundaries = [64, 4096, 65536*5, 4*1024*1024]
+
+    for i in range(len(boundaries) + 1):
+        lower_bound = boundaries[i-1] if i > 0 else -1
+        upper_bound = boundaries[i] if i < len(boundaries) else -1
+        filter_lower = (s > lower_bound) if lower_bound > 0 else True
+        filter_upper = (s <= upper_bound) if upper_bound > 0 else True
+        matching = filter_lower & filter_upper
+        # fraction of the objects, and of the total size, in this bucket
+        number_ratio = matching.sum() / s.count()
+        size_ratio = (s * matching).sum() / total
+        print(lower_bound, upper_bound, number_ratio, size_ratio)
diff --git a/docs/index.rst b/docs/index.rst
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -7,7 +7,7 @@
:maxdepth: 2
:caption: Contents:
-
+ design
Indices and tables
==================
