diff --git a/docs/design.rst b/docs/design.rst
new file mode 100644
--- /dev/null
+++ b/docs/design.rst
@@ -0,0 +1,360 @@
+=====================
+Object storage design
+=====================
+
+Problem statement
+=================
+
+The main goal of the Software Heritage object storage is to store a large
+quantity (order of magnitude: several billion) of files ranging from 0 to
+100MB (median size: 3kB). Objects are addressed via a cryptographic hash of
+their contents. Objects are never deleted, only added.
+
+Some uses of the object storage may diverge from this baseline; for instance,
+the Vault uses our object storage as a cache backend, and there are a few
+differences in this usage:
+
+ - objects can be larger (up to 2GB)
+ - addressing is not done with an intrinsic hash but with an extrinsic
+   identifier
+ - objects can be removed (for cache expiry)
+
+Optimizations for the baseline should try not to hinder these other usages.
+
+Historical Implementations
+==========================
+
+Over the course of development of the Software Heritage project, we have taken
+several approaches for the (low-level) storage of source file contents. The
+scale (4.5 billion files and counting) as well as the size distribution
+(median raw file size of 3kB) mean that usual file storage solutions fall
+short in one or several ways.
+
+PathSlicingObjStorage
+---------------------
+
+The simplest approach we have taken is to use a regular filesystem, storing
+compressed files there. To avoid issues with very large directories, contents
+were sharded across several partitions, and then across directories.
+
+The first production deployments were sharded with the following pattern::
+
+  /srv/storage/A/AB/CD/EF/ABCDEFGH[...]
+               | |  |  |  |
+               `-|--|--|--|-> partition name (0-f, 16 partitions)
+                 `--|--|--|-> first level directory ([A]0-[A]f, 16 directories)
+                    `--|--|-> second level directory (00-ff, 256 directories)
+                       `--|-> third level directory (00-ff, 256 directories)
+                          `-> full hash as filename (40 hexadecimal digits)
+
+This layout gives a total of 2 ** 24 = 16 million directories, each containing
+a few hundred files.
+
+This approach works properly; however, it has a few shortcomings:
+
+ - The number of directories means the kernel dentry cache fills up pretty
+   quickly; random accesses therefore usually need to read the disk for all
+   levels of the directory hierarchy, making them fairly slow.
+ - As files on disk are compressed, we end up with hundreds of millions of
+   tiny files on disk. While filesystems have some optimizations to inline
+   very small files directly into the inodes, we still had to resort to tiny
+   block sizes to avoid wasting lots of space, which makes disk accesses
+   costly.
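+
+To make the sharding concrete, the following sketch maps an object identifier
+to its on-disk path under this scheme. It is illustrative only: the
+``sliced_path`` helper and the hard-coded root are not the actual (and
+configurable) ``PathSlicingObjStorage`` implementation.
+
+.. code-block:: python
+
+    import os
+
+    ROOT = "/srv/storage"  # illustrative root; the real mount points are configurable
+
+    def sliced_path(hex_object_id: str) -> str:
+        """Map a 40-digit hexadecimal object id to its sharded on-disk path."""
+        h = hex_object_id.lower()
+        return os.path.join(
+            ROOT,
+            h[0],    # partition name (16 partitions)
+            h[0:2],  # first level directory
+            h[2:4],  # second level directory
+            h[4:6],  # third level directory
+            h,       # full hash as filename
+        )
+
+    # sliced_path("abcdef0123456789abcdef0123456789abcdef01")
+    # -> '/srv/storage/a/ab/cd/ef/abcdef0123456789abcdef0123456789abcdef01'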
+
+Cloud-based Object Storage
+--------------------------
+
+Our implementations of cloud-based object storage are very thin layers on top
+of the respective backends. Objects are stored directly, using their full
+hexadecimal object identifier as key and the gzipped data as value.
+
+The Azure-based object storage also shards objects across a configurable
+number of storage accounts, increasing the IOPS quota available to the content
+storage (the Azure blob storage IOPS limits are set per storage account).
+
+Cloud providers don't offer ways to control the lower-level implementation
+details of the storage backend. Setting aside those factors out of our
+control, the main bottleneck for these implementations is the sheer number of
+HTTPS requests needed to fetch objects: pipelining those requests is critical,
+as it avoids setting up and tearing down lots of expensive TCP+TLS
+connections.
+
+In the case of the Azure object storage, the account sharding means that we
+have to connect to several domain names, possibly served by different hosts,
+which may or may not be on different IP addresses, making pipelining more
+challenging.
+
+Ceph implementation
+===================
+
+Ceph introduction
+-----------------
+
+From the Ceph website: "Ceph is a unified, distributed storage system designed
+for excellent performance, reliability and scalability." Ceph is broken down
+into several layers that implement low-level object storage, distributed block
+devices, as well as high-level distributed filesystems.
+
+This modular approach allows us to fine-tune our usage of Ceph to our exact
+workload, only using the components that we actually need. Considering the
+shortcomings of our previous implementations, we have elected to only use the
+lowest-level RADOS ("Reliable Autonomic Distributed Object Store") API layer
+for our backend.
+
+Our storage pool is set up to use erasure coding, with a profile of 5 data
+chunks + 2 parity chunks, giving us a raw storage overhead of 40% and the
+ability to recover from 2 concurrent disk failures.
+
+"Naive" Ceph-based Object Storage
+---------------------------------
+
+The simplest implementation just stored the objects uncompressed, directly in
+a RADOS storage pool, using the hexadecimal object identifier as object name.
+After using this implementation for a while, we noticed unexpectedly high disk
+usage, due to the way Ceph stores data on disk:
+
+ - Ceph's modern on-disk format, BlueStore, has a minimum allocation size (on
+   HDDs) of 64kB, which means that all small writes turn into 64kB zero-padded
+   objects;
+ - the RADOS erasure coding storage scheme needs to write a full stripe of
+   data chunks for each object. Data chunks on EC pools are 4kB by default,
+   and changing this setting is discouraged.
+
+Both those factors mean that objects are stored on disk in stripes that are
+multiples of *5 × 64kB = 320kB*, each of which uses *7 × 64kB = 448kB* of raw
+disk space.
+
+At the other end of the spectrum, one last issue is that RADOS has a maximum
+object size of *100MB*, and rejects writes of more than *90MB*. Those settings
+are once again tunable, but they have been picked carefully by upstream.
+
+All in all, on an almost random subset of 14 million objects, the storage
+overhead was *630%* instead of the expected *40%*.
+
+Packed Object Storage
+---------------------
+
+To handle our peculiar workload, consisting of many small files, we decided to
+pack several objects into the same RADOS data block. A set of index blocks is
+used to keep track of which objects are stored, as well as of their creation
+time.
+
+To properly account for RADOS specifics, we try to align our data accesses on
+4kB boundaries.
+
+Data block format
+~~~~~~~~~~~~~~~~~
+
+Objects less than *64 bytes*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+(2.2% of objects, 0.001% of total size)
+
+Those objects are only inlined in the index structure.
+
+Objects less than *4kB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(51.7% of objects, 1.0% of total size)
+
+Size hint: *EC data chunk size = 4kB*
+
+For those objects, we select a data block by size, rounding up to the next
+power of two, and pack the objects aligned to that boundary.
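+
+As an illustration of this bucketing, the sketch below computes the
+power-of-two size class such an object falls into; the helper name and the
+exact bounds are illustrative, not taken from the implementation.
+
+.. code-block:: python
+
+    def size_class(size: int) -> int:
+        """Return the power-of-two bucket (in bytes) used to pack an object.
+
+        Only meaningful for objects of 64 bytes up to (but excluding) 4kB:
+        smaller objects are inlined in the index, larger ones are handled by
+        the strategies described below.
+        """
+        assert 64 <= size < 4096
+        bucket = 64
+        while bucket < size:
+            bucket *= 2
+        return bucket
+
+    # size_class(100)  -> 128
+    # size_class(3000) -> 4096 (the median 3kB object lands in the 4kB bucket)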
+
+Objects less than *320kB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(43.9% of objects, 18.5% of total size)
+
+Size hint: *BlueStore minimum allocation size for HDDs × stripe width = 320kB*
+
+We pack those objects into data blocks of roughly *4MB*, aligning each object
+on a *4kB* boundary.
+
+Objects less than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^
+(1.9% of objects, 27.0% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+Those objects are stored in standalone data blocks.
+
+Objects larger than *4MB*
+^^^^^^^^^^^^^^^^^^^^^^^^^
+(0.02% of objects, 53.5% of total size)
+
+Size hint: *4MB* is the "optimal" RADOS object size.
+
+We split those objects into *4MB* chunks, and store each chunk in a separate
+data block.
+
+Index blocks
+~~~~~~~~~~~~
+
+As objects are addressed using a well-distributed cryptographic hash, the
+index can be sharded into blocks by truncating the (hexadecimal) object
+identifier to a given prefix length.
+
+Inside those index blocks, a number of 128-byte index nodes are stored. (TODO:
+how exactly?)
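+
+A minimal sketch of this prefix-based sharding, assuming a hypothetical
+``index_`` naming convention and a 4-digit prefix (the actual prefix length
+would have to be tuned to keep index blocks reasonably sized):
+
+.. code-block:: python
+
+    def index_block_name(hex_object_id: str, prefix_length: int = 4) -> str:
+        """Name of the index block holding the index node for this object."""
+        # 4 hexadecimal digits = 65536 index blocks; illustrative value only
+        return "index_" + hex_object_id[:prefix_length].lower()
+
+    # index_block_name("abcdef0123456789abcdef0123456789abcdef01")
+    # -> 'index_abcd'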
+
+Index node format
+~~~~~~~~~~~~~~~~~
+
+Index nodes are packed, *128 byte* wide data structures.
+
+::
+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+  0 |<-                     object identifier                     ->| 31
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 |      size     |     ctime     |T|<-                           | 63
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                             +
+ 64 |                                                               | 95
+    +                 index node type specific data                 +
+ 96 |                                                             ->| 127
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+All integers are stored in little-endian format.
+
+The common contents of the index nodes are:
+
++--------------+----------------------+-------------------------+----------------------------------------------+
+| byte address | content              | type                    | interpretation                               |
++==============+======================+=========================+==============================================+
+| 0-31         | object identifier    | byte array              | sha1s use the first 20 bytes, sha256s all 32 |
++--------------+----------------------+-------------------------+----------------------------------------------+
+| 32-39        | size                 | 64 bit unsigned integer | object size in bytes                         |
++--------------+----------------------+-------------------------+----------------------------------------------+
+| 40-47        | file creation time   | 64 bit signed integer   | number of milliseconds since the UNIX epoch  |
++--------------+----------------------+-------------------------+----------------------------------------------+
+| 48           | *T*: index node type | 8 bit unsigned integer  | currently known values: 0, 1, 2              |
++--------------+----------------------+-------------------------+----------------------------------------------+
+
+There are currently three index node types.
+
+*T = 0*: Inline Object (*<= 64B*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+  0 |<-                         object id                         ->| 31
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 |      size     |     ctime     |0|                             | 63
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<-                                                             | 95
+    +                          object data                          +
+ 96 |                                                             ->| 127
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+-------------+------------+---------------------------------+
+| byte address | content     | type       | interpretation                  |
++==============+=============+============+=================================+
+| 64-127       | object data | byte array | only use the first *size* bytes |
++--------------+-------------+------------+---------------------------------+
+
+*T = 1*: Small Object (*<= 4MB*)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+  0 |<-                         object id                         ->| 31
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 32 |      size     |     ctime     |1|             |     offset    | 63
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ 64 |<-                                                             | 95
+    +                     data block identifier                     +
+ 96 |                                                             ->| 127
+    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+     0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
+     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+
+Type-specific index node contents:
+
++--------------+----------------+-------------------------+-----------------------------------------------------+
+| byte address | content        | type                    | interpretation                                      |
++==============+================+=========================+=====================================================+
+| 56-63        | offset         | 64 bit unsigned integer | offset of the object data inside the data block     |
++--------------+----------------+-------------------------+-----------------------------------------------------+
+| 64-127       | data block id. | byte array              | RADOS object id for the data block (probably ASCII) |
++--------------+----------------+-------------------------+-----------------------------------------------------+
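+
+To make the byte layout above concrete, here is a sketch of decoding an index
+node with Python's ``struct`` module. The field offsets follow the tables in
+this section, but the helper names, the assumption that the data block id is
+NUL-padded ASCII, and the handling of unused bytes are illustrative guesses,
+not the actual implementation.
+
+.. code-block:: python
+
+    import struct
+
+    # common header, bytes 0-48: 32-byte id, u64 size, i64 ctime (ms), u8 type
+    COMMON = struct.Struct("<32sQqB")
+    # T = 1 tail, bytes 49-127: 7 unused bytes, u64 offset, 64-byte block id
+    SMALL_TAIL = struct.Struct("<7xQ64s")
+
+    def parse_index_node(node: bytes) -> dict:
+        """Decode a 128-byte index node (types 0 and 1 only, for brevity)."""
+        assert len(node) == 128
+        oid, size, ctime_ms, node_type = COMMON.unpack_from(node, 0)
+        parsed = {"id": oid, "size": size, "ctime_ms": ctime_ms,
+                  "type": node_type}
+        if node_type == 0:
+            # inline object: the data lives in bytes 64-127 of the node itself
+            parsed["data"] = node[64:64 + size]
+        elif node_type == 1:
+            # small object: reference to (data block, offset)
+            offset, block_id = SMALL_TAIL.unpack_from(node, 49)
+            parsed["offset"] = offset
+            parsed["data_block"] = block_id.rstrip(b"\x00").decode("ascii")
+        return parsed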
++--------------+--------------------+-------------------------+-----------------------------------------------------+ +| 64-127 | data block id. | byte array | RADOS object id for the data block (probably ASCII) | ++--------------+--------------------+-------------------------+-----------------------------------------------------+ + +*T = 2*: Large Object (*> 4MB*) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:: + + 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 0 |<- object id ->| 31 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 32 | size | ctime |2| chunk size | chunk count | 63 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 64 |<- | 95 + + data block id pattern + + 96 | ->| 127 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 + +Type-specific index node contents: + ++--------------+---------------------------+-------------------------+-----------------------------------------------------+ +| byte address | content | type | interpretation | ++==============+===========================+=========================+=====================================================+ +| 48-55 | chunk size | 64 bit unsigned integer | each chunk is this number of bytes (set LSB to 0) | ++--------------+---------------------------+-------------------------+-----------------------------------------------------+ +| 56-63 | chunk count | 64 bit unsigned integer | total number of chunks to fetch | ++--------------+---------------------------+-------------------------+-----------------------------------------------------+ +| 64-127 | data block id. pattern | byte array | RADOS object id for the data block (probably ASCII) | ++--------------+---------------------------+-------------------------+-----------------------------------------------------+ + + +Addendum: Object Size Statistics Generation +=========================================== + +Generate a CSV file containing the content lengths from the Software Heritage +database using: + +.. code-block:: postgresql + + copy ( + select sha1, length + from content + where exists ( + select 1 from tmp_sha1 + where tmp_sha1.sha1 = content.sha1 + ) + ) to 'test_lengths.csv' csv + + +Statistics are generated with Pandas:: + + import csv + import pandas as pd + + r = csv.reader(open('test_lengths.csv')) + + # s contains the list of sizes: size distribution + s = pd.Series(int(line[1]) for line in r) + + boundaries = [64, 4096, 65536*5, 4*1024*1024] + + for i in range(len(boundaries) + 1): + lower_bound = boundaries[i-1] if i > 0 else -1 + upper_bound = boundaries[i] if i < len(boundaries) else -1 + filter_lower = (s > lower_bound) if lower_bound > 0 else True + filter_upper = (s <= upper_bound) if upper_bound > 0 else True + matching = filter_lower & filter_upper + number_ratio = matching.sum() / s.count() + size_ratio = (s * matching).sum() / total + print(lower, upper, number_ratio, size_ratio) diff --git a/docs/index.rst b/docs/index.rst --- a/docs/index.rst +++ b/docs/index.rst @@ -7,7 +7,7 @@ :maxdepth: 2 :caption: Contents: - + design Indices and tables ==================