**WARNING: work in progress blueprint**

Introduction
------------

A repository snapshot object, or simply **snapshot object**, is a
[Merkle](https://en.wikipedia.org/wiki/Merkle_tree) DAG node used to
capture the current state of a VCS repository.

Conceptually, a snapshot object is a complete map from repository entry
points ("branches" in [Software Heritage](Software_Heritage "wikilink")
terminology, "refs" in Git) to other objects in the repository,
including other snapshot objects when an entry point targets one.\
Practically, the map is serialized into a **manifest** consisting of a
list of triples *<object type, object ID, branch name>*. Specifying the
object *type* is not strictly needed, but it eases traversal and allows
for better integrity checking, among other benefits.

Entries in snapshots can point to the following object kinds:

-   contents (Git terminology: blobs)
-   directories (tree)
-   releases (annotated tags)
-   revisions (commits)
-   snapshots

The object ID of a snapshot object is the cryptographic hash of its
manifest, computed as for any other node of the Merkle DAG.

Manifest
--------

The manifest of a snapshot object is a **canonical representation** of
it as a sequence of bytes.\
Two alternative serialization formats for such manifests are proposed
below:

-   *a-la Software Heritage*: how we would implement it on our own,
    without regard for compatibility with, or the stylistic choices of,
    other VCSs
-   *a-la Git*: manifest implementation similar to how Git implements
    manifests for other DAG objects

Regardless of how the manifest *itself* is serialized, the **object ID**
of the snapshot is obtained as follows:

-   take the byte string obtained by concatenating:
    1.  the header string "snapshot " (without quotes)
    2.  the length of the manifest in bytes, serialized as a decimal
        integer in ASCII (e.g., "42")
    3.  the NULL byte "\\0"
    4.  the manifest itself
-   compute the SHA1 checksum of the obtained string

This is equivalent to the current implementation of
[git-hash-object(1)](https://git-scm.com/docs/git-hash-object) with
object type set to "snapshot" (note that to use it you will need to pass
`--literally`, as "snapshot" is currently not a supported Git object
type).
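
The steps above can be sketched in Python as follows. This is a minimal
illustration, not a reference implementation: the function names
`object_id` and `snapshot_id` are hypothetical, and the manifest is
assumed to be already serialized as bytes in one of the two formats
discussed below.

```python
import hashlib


def object_id(object_type: str, manifest: bytes) -> str:
    """Hash a manifest with a Git-style header: "<type> <length>\\0<manifest>".

    `object_type` is assumed to be plain ASCII (e.g., "snapshot").
    """
    header = (
        object_type.encode("ascii")
        + b" "
        + str(len(manifest)).encode("ascii")  # length in bytes, decimal ASCII
        + b"\x00"
    )
    return hashlib.sha1(header + manifest).hexdigest()


def snapshot_id(manifest: bytes) -> str:
    """Object ID of a snapshot, given its serialized manifest."""
    return object_id("snapshot", manifest)
```

As a sanity check, applying the same header scheme to an empty Git blob
(`object_id("blob", b"")`) should yield the ID that `git hash-object`
reports for an empty file.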

Note: branch/ref names may contain arbitrary bytes, except the NULL
byte, which the manifest formats below use as a delimiter.

Implementation
--------------

Two possible implementations of snapshot objects are detailed below,
one more in line with Software Heritage conventions, the other more
aligned with Git ones.

### a-la Software Heritage

Snapshot objects will contain one entry for each branch that would be
listed in the occurrence table while visiting an origin. (In this
context "branches" roughly correspond to Git refs.) Each entry will
point to a fully resolved object ID (i.e., a SHA1). The equivalent of
Git symbolic refs is not stored in non-resolved form (i.e., as a ref
name).

#### Manifest serialization

A snapshot object manifest is a list of entries, each named after its
branch and sorted by branch name, where each entry consists of the
following fields, in order:

1.  object kind (one of: "content", "directory", "release",
    "revision", "snapshot")
2.  ASCII space
3.  SHA1 of the target object serialized as a string of ASCII, lowercase
    hex digits (e.g., "585f6e27f540012af621a18d0155aae2a8ec0276")
4.  ASCII space
5.  entry name as a sequence of bytes (e.g., "refs/heads/master")
6.  NULL byte "\\0"
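
A possible serializer for this format is sketched below. The function
name, the in-memory representation of branches (a mapping from branch
name to a pair of object kind and hex SHA1), and the choice of plain
byte-wise ordering for "sorted by branch" are assumptions made for
illustration; the text above does not fix them.

```python
from typing import Mapping, Tuple

SWH_KINDS = {"content", "directory", "release", "revision", "snapshot"}
HEX_DIGITS = set("0123456789abcdef")


def serialize_swh_manifest(branches: Mapping[bytes, Tuple[str, str]]) -> bytes:
    """Serialize {branch name: (kind, hex SHA1)} as "kind SP sha1 SP name NUL"."""
    out = bytearray()
    # Assumption: entries are ordered by byte-wise comparison of branch names.
    for name in sorted(branches):
        kind, hex_id = branches[name]
        if kind not in SWH_KINDS:
            raise ValueError(f"unknown object kind: {kind!r}")
        if b"\x00" in name:
            raise ValueError("branch names must not contain the NULL byte")
        if len(hex_id) != 40 or not set(hex_id) <= HEX_DIGITS:
            raise ValueError(f"not a lowercase hex SHA1: {hex_id!r}")
        out += kind.encode("ascii") + b" "
        out += hex_id.encode("ascii") + b" "
        out += name + b"\x00"  # the NULL byte terminates the entry
    return bytes(out)
```

Because the branch name comes last and cannot contain the NULL byte, a
reader can split the manifest on "\\0" to recover the entries, then
split each entry on its first two spaces.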

### a-la Git

Snapshot objects will contain an entry for each ref that would exist
in a bare repository.

Possible variants:

-   also store symbolic refs, in their non-resolved form (i.e., as
    ref names), in the manifests. To be useful, this needs the guarantee
    that refs pointed to by symbolic refs are included in the manifest

#### Manifest serialization

A snapshot object manifest is a list of entries, each named after its
ref and sorted by ref name, where each entry consists of the following
fields, in order:

1.  object kind (one of: "blob", "tree", "tag", "commit", "snapshot")
2.  ASCII space
3.  entry name as a sequence of bytes (e.g., "refs/heads/master")
4.  NULL byte "\\0"
5.  SHA1 of the target object serialized as 20 raw bytes

Notes:

-   there is no separator between entries
-   the above is inspired by the compact serialization format of tree
    objects, but other variants are possible:
    -   reorder the columns so that entry name comes last and "\\0" acts
        as entry terminator
    -   serialize SHA1 as ASCII instead of binary (it will take more
        space, but arguably snapshot objects will be less common than
        tree objects)
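
The following sketch serializes the default column order listed above
(kind, space, name, NULL byte, raw SHA1), not the variants mentioned in
the notes. As before, the function name, the input mapping from ref name
to a pair of object kind and hex SHA1, and byte-wise ordering of ref
names are assumptions made for illustration.

```python
from typing import Mapping, Tuple

GIT_KINDS = {"blob", "tree", "tag", "commit", "snapshot"}


def serialize_git_style_manifest(refs: Mapping[bytes, Tuple[str, str]]) -> bytes:
    """Serialize {ref name: (kind, hex SHA1)} as "kind SP name NUL <20 raw bytes>"."""
    out = bytearray()
    # Assumption: entries are ordered by byte-wise comparison of ref names.
    for name in sorted(refs):
        kind, hex_id = refs[name]
        if kind not in GIT_KINDS:
            raise ValueError(f"unknown object kind: {kind!r}")
        if b"\x00" in name:
            raise ValueError("ref names must not contain the NULL byte")
        raw_id = bytes.fromhex(hex_id)  # raises ValueError on invalid hex
        if len(raw_id) != 20:
            raise ValueError("a SHA1 must be exactly 20 bytes long")
        # No separator follows the raw SHA1: the next entry starts immediately.
        out += kind.encode("ascii") + b" " + name + b"\x00" + raw_id
    return bytes(out)
```

Since the raw SHA1 may itself contain NULL bytes, a reader of this
layout cannot simply split on "\\0"; it has to read up to the first
NULL byte, then consume exactly 20 bytes, and repeat.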