=====================
Object storage design
=====================

Problem statement
=================

The main goal of the Software Heritage object storage is to store a large
quantity (order of magnitude: several billions) of files ranging from 0 to
100MB (median size: 3kB). Objects are addressed via a cryptographic hash of
their contents. Objects are never deleted, only added.

Some uses of the object storage may diverge from this baseline; for instance,
the Vault uses our object storage as a cache backend, and there are a few
differences in this usage:

- objects can be larger (up to 2GB)
- addressing is not done with an intrinsic hash but with an extrinsic
  identifier
- objects can be removed (for cache expiry)

Optimizations for the baseline should try not to hinder these more general use
cases.

Historical Implementations
==========================

Over the course of development of the Software Heritage project, we have taken
several approaches to the (low-level) storage of source file contents. The
scale (4.5 billion files and counting) as well as the size distribution (median
raw file size of 3kB) means that usual file storage solutions fall short in one
or several ways.

PathSlicingObjStorage
---------------------

The simplest approach we have taken is to use a regular filesystem, storing
compressed files there. To avoid issues with very large directories, contents
were sharded across several partitions, and then across directories.

The first production deployments were sharded with the following pattern::

   /srv/storage/A/AB/CD/EF/ABCDEFGH[...]
                | |  |  |  |
                `-|--|--|--|-> partition name (0-f, 16 partitions)
                  `--|--|--|-> first level directory ([A]0-[A]f, 16 directories)
                     `--|--|-> second level directory (00-ff, 256 directories)
                        `--|-> third level directory (00-ff, 256 directories)
                           `-> full hash as filename (40 hexadecimal digits)

This layout gives a total number of 2 ** 24 = 16 million directories, each
containing a few hundred files.

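For illustration, here is a minimal sketch of how an object identifier maps to
such a path; the root directory and slicing depths simply mirror the layout
above and are not the exact production configuration:

.. code-block:: python

   import os

   def sliced_path(hex_id, root='/srv/storage'):
       """Map a 40-character hexadecimal object id to its on-disk path,
       following the partition/directory slicing shown above."""
       assert len(hex_id) == 40
       partition = hex_id[0]     # 16 partitions: 0-f
       level1 = hex_id[0:2]      # first level directory: [A]0-[A]f
       level2 = hex_id[2:4]      # second level directory: 00-ff
       level3 = hex_id[4:6]      # third level directory: 00-ff
       return os.path.join(root, partition, level1, level2, level3, hex_id)
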
This approach works properly; however, it has a few shortcomings:

- The number of directories means that the kernel dentry cache fills up pretty
  quickly; random accesses therefore usually need to hit the disk for all
  levels of the directory hierarchy, making them fairly slow.
- As files on disk are compressed, we end up with hundreds of millions of tiny
  files on disk. While filesystems have some optimizations to inline very small
  files directly into the inodes, we still had to resort to tiny block sizes to
  avoid wasting lots of space, which makes disk accesses costly.

Cloud-based Object Storage
--------------------------

Our cloud-based object storage implementations are a very thin layer on top of
the respective backends. Objects are stored directly, using their full
hexadecimal object identifier as key and the gzipped data as value.

The Azure-based object storage also shards objects across a configurable number
of storage accounts, increasing the IOPS quota available to the content storage
(the Azure blob storage IOPS limits are set per storage account).

Cloud providers don't offer ways to control the lower-level implementation
details of the storage backend. Setting aside factors out of our control, the
main bottleneck for those implementations is the large number of HTTPS requests
needed to fetch objects: pipelining those requests is critical, as it avoids
setting up and tearing down lots of expensive TCP+TLS connections.

In the case of the Azure object storage, the account sharding means that we
have to connect to several domain names, possibly on different hosts, which may
or may not be on different IP addresses, making pipelining more challenging.

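As an illustration, here is a minimal sketch of this thin layer, using an
in-memory dict per "storage account" as a stand-in for the real blob APIs; the
prefix-based account selection mirrors the Azure sharding described above, but
the exact mapping of identifiers to accounts is an assumption:

.. code-block:: python

   import gzip
   import hashlib

   class ToyCloudObjStorage:
       """Stand-in for cloud blob containers: key = hex id, value = gzipped data."""

       def __init__(self, num_accounts=16):
           # One dict per "storage account"; real backends would be blob containers.
           self.accounts = [{} for _ in range(num_accounts)]

       def _account(self, hex_id):
           # Shard on a prefix of the (well-distributed) object identifier.
           return self.accounts[int(hex_id[:2], 16) % len(self.accounts)]

       def put(self, data: bytes) -> str:
           hex_id = hashlib.sha1(data).hexdigest()
           self._account(hex_id)[hex_id] = gzip.compress(data)
           return hex_id

       def get(self, hex_id: str) -> bytes:
           return gzip.decompress(self._account(hex_id)[hex_id])
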
Ceph implementation
===================

Ceph introduction
-----------------

From the Ceph website: "Ceph is a unified, distributed storage system designed
for excellent performance, reliability and scalability." Ceph is broken down
into several layers that implement low-level object storage, distributed block
devices, as well as high-level distributed filesystems.

This modular approach allows us to fine-tune our usage of Ceph to our exact
workload, only using the components that we actually need. Considering the
shortcomings of our previous implementations, we've elected to only use the
lowest-level RADOS ("Reliable Autonomic Distributed Object Store") API layer
for our backend.

Our storage pool is set to use erasure coding, with a 5 data stripes + 2 parity
stripes mode, giving us a raw storage overhead of 40%, with the ability to
recover from 2 concurrent disk failures.

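As a quick sanity check on those figures, the raw storage overhead of a *k*
data + *m* parity erasure-coded pool is simply *m / k*:

.. code-block:: python

   k, m = 5, 2              # data stripes, parity stripes
   overhead = m / k         # extra raw space per byte of useful data
   print(f"raw overhead: {overhead:.0%}")   # 40%
   # The pool survives the loss of up to m = 2 chunks,
   # i.e. 2 concurrent disk failures.
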
"Naive" Ceph-based Object Storage | |||||
--------------------------------- | |||||
The simplest implementation just stored the objects uncompressed directly in a | |||||
RADOS storage pool, using the hexadecimal object identifier as object name. | |||||
After using this implementation for a while, we had an unexpectedly high disk | |||||
usage, due to the way Ceph stores data on disk: | |||||
- Ceph's modern on-disk format, BlueStore, has a minimal allocation size (on HDDs) of | |||||
64kB, which means that all small writes turn into 64kB zero-padded objects; | |||||
- the RADOS erasure coding storage scheme needs to write a full stripe of data | |||||
chunks for each object. Data chunks on EC pools are 4kB by default, and | |||||
changing this setting is discouraged. | |||||
Both those factors mean that objects are stored on disk within stripes of
multiples of *5 × 64kB = 320kB*, each of which uses *7 × 64kB = 448kB* of raw
space.

At the other end of the spectrum, one last issue is that RADOS has a maximum
object size of *100MB*, and rejects writes of more than *90MB*. Those settings
are, once again, tunable, but they have been carefully chosen upstream.

All in all, on an almost random subset of 14 million objects, the measured
storage overhead was *630%* instead of the expected *40%*.

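A back-of-the-envelope sketch of the per-object disk footprint under this naive
scheme, combining the figures above (4kB EC chunks, 5+2 stripes, 64kB BlueStore
allocations); this approximates, rather than reproduces, the exact BlueStore
accounting:

.. code-block:: python

   import math

   EC_DATA_CHUNKS = 5          # k
   EC_PARITY_CHUNKS = 2        # m
   MIN_ALLOC = 64 * 1024       # BlueStore minimum allocation size on HDDs

   def naive_footprint(size):
       """Approximate raw bytes used to store one object of `size` bytes."""
       # Each EC shard is padded up to a multiple of the 64kB allocation unit,
       # so the useful stripe is a multiple of 5 * 64kB = 320kB ...
       stripes = max(1, math.ceil(size / (EC_DATA_CHUNKS * MIN_ALLOC)))
       # ... while the raw footprint is 7 * 64kB = 448kB per stripe.
       return stripes * (EC_DATA_CHUNKS + EC_PARITY_CHUNKS) * MIN_ALLOC

   # A 3kB object (the median size) still uses a full 448kB stripe on disk:
   print(naive_footprint(3 * 1024))   # 458752 bytes = 448kB
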
Packed Object Storage
---------------------

To handle our peculiar workload, consisting mostly of many small files, we
decided to pack several objects into the same RADOS data block. A set of index
blocks is used to keep track of which objects are stored, as well as of the
object creation time.

To properly account for RADOS specifics, we try to align our data accesses on
4kB boundaries.

Data block format
~~~~~~~~~~~~~~~~~

Objects less than *64 bytes*
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(2.2% of objects, 0.001% of total size)

Those objects are inlined directly in the index structure.

Objects less than *4kB*
^^^^^^^^^^^^^^^^^^^^^^^

(51.7% of objects, 1.0% of total size)

Size hint: *EC data chunk size = 4kB*

For those objects, we select a data block according to the next power of two of
the object size, and pack the objects aligned to that boundary.

Objects less than *320kB*
^^^^^^^^^^^^^^^^^^^^^^^^^

(43.9% of objects, 18.5% of total size)

Size hint: *BlueStore minimum allocation size for HDDs × stripe width = 320kB*

We pack those objects in a *4MB*-ish data block, aligned at a *4kB* boundary.

Objects less than *4MB*
^^^^^^^^^^^^^^^^^^^^^^^

(1.9% of objects, 27.0% of total size)

Size hint: *4MB* is the "optimal" RADOS object size.

Those objects get stored in a standalone data block.

Objects larger than *4MB*
^^^^^^^^^^^^^^^^^^^^^^^^^

(0.02% of objects, 53.5% of total size)

Size hint: *4MB* is the "optimal" RADOS object size.

We split those objects into 4MB data blocks, and store each of these blocks
separately.

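The size classes above can be summarized by a small dispatch function; the
class names and the alignment helper are illustrative only, not part of the
on-disk format:

.. code-block:: python

   INLINE_MAX = 64
   CHUNK = 4 * 1024                # EC data chunk size
   SMALL_MAX = 5 * 64 * 1024       # 320kB: min alloc for HDDs * stripe width
   BLOCK = 4 * 1024 * 1024         # "optimal" RADOS object size

   def align_4k(n):
       """Round n up to the next 4kB boundary."""
       return (n + CHUNK - 1) // CHUNK * CHUNK

   def size_class(size):
       if size <= INLINE_MAX:
           return 'inline'                  # stored in the index node itself
       if size < CHUNK:
           # packed by next power of two inside a shared data block
           return 'packed-pow2', 1 << (size - 1).bit_length()
       if size < SMALL_MAX:
           return 'packed-4MB', align_4k(size)   # packed, 4kB-aligned
       if size <= BLOCK:
           return 'standalone'
       return 'chunked', BLOCK              # split into 4MB data blocks
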
Index blocks
~~~~~~~~~~~~

As objects are addressed using a well-distributed cryptographic hash, index
sharding into blocks can be done by truncating the (hexadecimal) object
identifier to a given length.

Inside those index blocks, a number of 128-byte index nodes are stored. (TODO:
how exactly?)

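As an illustration, a possible derivation of the index block name for an
object; the prefix length and naming scheme are assumptions, and the in-block
layout is deliberately left open (see the TODO above):

.. code-block:: python

   def index_block_name(hex_id, prefix_len=4):
       """Name of the index block responsible for an object identifier.

       With a 4-hex-digit prefix this yields 65536 index blocks, each holding
       the 128-byte index nodes of the matching objects.
       """
       return 'index_%s' % hex_id[:prefix_len]
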
Index node format
~~~~~~~~~~~~~~~~~

Index nodes are packed, *128 byte*-wide data structures::

        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     0 |<-                     object identifier                     ->| 31
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    32 |      size     |     ctime     |T|<-                           | 63
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                             +
    64 |                                                               | 95
       +                index node type specific data                  +
    96 |                                                             ->| 127
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

All integers are stored in little-endian format.

The common contents for the index nodes are:

+--------------+-----------------------+-------------------------+----------------------------------------------+
| byte address | content               | type                    | interpretation                               |
+==============+=======================+=========================+==============================================+
| 0-31         | object identifier     | byte array              | sha1s use the first 20 bytes, sha256s all 32 |
+--------------+-----------------------+-------------------------+----------------------------------------------+
| 32-39        | size                  | 64 bit unsigned integer |                                              |
+--------------+-----------------------+-------------------------+----------------------------------------------+
| 40-47        | file creation time    | 64 bit signed integer   | Number of milliseconds since the UNIX epoch  |
+--------------+-----------------------+-------------------------+----------------------------------------------+
| 48           | *T*: index node type  | 8 bit unsigned integer  | Currently known values: 0, 1, 2              |
+--------------+-----------------------+-------------------------+----------------------------------------------+

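As a sketch, the common header can be decoded with Python's ``struct`` module,
following the byte layout in the table above; the function and field names are
illustrative:

.. code-block:: python

   import struct

   NODE_SIZE = 128
   # Common header: 32-byte object id, u64 size, i64 ctime (ms since the
   # UNIX epoch), u8 type; little-endian, no padding (49 bytes).
   HEADER = struct.Struct('<32sQqB')

   def parse_index_node(node: bytes):
       """Split a 128-byte index node into its common header fields and the
       type-specific remainder (bytes 49-127)."""
       assert len(node) == NODE_SIZE
       object_id, size, ctime_ms, node_type = HEADER.unpack_from(node, 0)
       return object_id, size, ctime_ms, node_type, node[HEADER.size:]
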
There are currently three index node types:

*T = 0*: Inline Object (*<= 64B*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     0 |<-                         object id                         ->| 31
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    32 |      size     |     ctime     |0|                             | 63
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    64 |<-                                                             | 95
       +                          object data                          +
    96 |                                                             ->| 127
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

Type-specific index node contents:

+--------------+-------------+------------+---------------------------------+
| byte address | content     | type       | interpretation                  |
+==============+=============+============+=================================+
| 64-127       | object data | byte array | only use the first *size* bytes |
+--------------+-------------+------------+---------------------------------+

*T = 1*: Small Object (*<= 4MB*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     0 |<-                         object id                         ->| 31
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    32 |      size     |     ctime     |1|             |    offset     | 63
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    64 |<-                                                             | 95
       +                     data block identifier                     +
    96 |                                                             ->| 127
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

Type-specific index node contents:

+--------------+----------------+-------------------------+-----------------------------------------------------+
| byte address | content        | type                    | interpretation                                      |
+==============+================+=========================+=====================================================+
| 56-63        | offset         | 64 bit unsigned integer |                                                     |
+--------------+----------------+-------------------------+-----------------------------------------------------+
| 64-127       | data block id. | byte array              | RADOS object id for the data block (probably ASCII) |
+--------------+----------------+-------------------------+-----------------------------------------------------+

*T = 2*: Large Object (*> 4MB*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     0 |<-                         object id                         ->| 31
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    32 |      size     |     ctime     |2| chunk size  |  chunk count  | 63
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    64 |<-                                                             | 95
       +                     data block id pattern                     +
    96 |                                                             ->| 127
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
        0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

Type-specific index node contents:

+--------------+------------------------+-------------------------+-----------------------------------------------------+
| byte address | content                | type                    | interpretation                                      |
+==============+========================+=========================+=====================================================+
| 48-55        | chunk size             | 64 bit unsigned integer | each chunk is this number of bytes (set LSB to 0)   |
+--------------+------------------------+-------------------------+-----------------------------------------------------+
| 56-63        | chunk count            | 64 bit unsigned integer | total number of chunks to fetch                     |
+--------------+------------------------+-------------------------+-----------------------------------------------------+
| 64-127       | data block id. pattern | byte array              | RADOS object id for the data block (probably ASCII) |
+--------------+------------------------+-------------------------+-----------------------------------------------------+

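Putting the three types together, a hypothetical read path could look as
follows; it assumes ``parse_index_node`` from the sketch above, a
librados-style ``ioctx.read(oid, length, offset)``, and a made-up suffix
scheme for the *T = 2* data block id pattern, which the format above does not
specify:

.. code-block:: python

   import struct

   def read_object(ioctx, node: bytes) -> bytes:
       object_id, size, ctime_ms, node_type, _rest = parse_index_node(node)
       if node_type == 0:
           # Inline object: data lives at bytes 64-127 of the index node.
           return node[64:64 + size]
       if node_type == 1:
           # Small object: read `size` bytes at `offset` in the shared data block.
           (offset,) = struct.unpack_from('<Q', node, 56)
           block_id = node[64:128].rstrip(b'\x00').decode('ascii')
           return ioctx.read(block_id, size, offset)
       if node_type == 2:
           # Large object: fetch `chunk count` chunks of `chunk size` bytes.
           chunk_size, chunk_count = struct.unpack_from('<QQ', node, 48)
           # Assumption: "set LSB to 0" refers to the low byte, which overlaps
           # the type field at byte 48.
           chunk_size &= ~0xff
           pattern = node[64:128].rstrip(b'\x00').decode('ascii')
           data = b''.join(ioctx.read('%s_%d' % (pattern, i), chunk_size, 0)
                           for i in range(chunk_count))
           return data[:size]
       raise ValueError('unknown index node type %d' % node_type)
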
Addendum: Object Size Statistics Generation
===========================================

Generate a CSV file containing the content lengths from the Software Heritage
database using:

.. code-block:: postgresql

   -- tmp_sha1 holds the sha1 identifiers of the sampled contents
   copy (
     select sha1, length
     from content
     where exists (
       select 1 from tmp_sha1
       where tmp_sha1.sha1 = content.sha1
     )
   ) to 'test_lengths.csv' csv

Statistics are generated with Pandas::

   import csv
   import pandas as pd

   r = csv.reader(open('test_lengths.csv'))

   # s contains the list of sizes: size distribution
   s = pd.Series(int(line[1]) for line in r)
   total = s.sum()

   boundaries = [64, 4096, 65536*5, 4*1024*1024]

   for i in range(len(boundaries) + 1):
       lower_bound = boundaries[i-1] if i > 0 else -1
       upper_bound = boundaries[i] if i < len(boundaries) else -1
       filter_lower = (s > lower_bound) if lower_bound > 0 else True
       filter_upper = (s <= upper_bound) if upper_bound > 0 else True
       matching = filter_lower & filter_upper
       number_ratio = matching.sum() / s.count()
       size_ratio = (s * matching).sum() / total
       print(lower_bound, upper_bound, number_ratio, size_ratio)