diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst
index e08c0a8..29bf797 100644
--- a/docs/persistent-identifiers.rst
+++ b/docs/persistent-identifiers.rst
@@ -1,193 +1,193 @@
 .. _persistent-identifiers:

 Persistent identifiers
 ======================

 You can point to objects present in the Software Heritage archive by the means
 of **persistent identifiers** that are guaranteed to remain stable (persistent)
 over time. Their syntax, meaning, and usage is described below. Note that they
 are identifiers and not URLs, even though an URL-based resolver for Software
 Heritage persistent identifiers is also provided.

 A persistent identifier can point to any software artifact (or "object")
 available in the Software Heritage archive. Objects come in different types,
 and most notably:

 * contents
 * directories
 * revisions
 * releases
 * snapshots

 Each object is identified by an intrinsic, type-specific object identifier that
 is embedded in its persistent identifier as described below. Object identifiers
 are strong cryptographic hashes computed on the entire set of object properties
 to form a `Merkle structure `_.

 See :ref:`data-model` for an overview of object types and how they are linked
 together. See :py:mod:`swh.model.identifiers` for details on how intrinsic
 object identifiers are computed.

 Syntax
 ------

 Syntactically, persistent identifiers are generated by the ```` entry point of
 the grammar:

 .. code-block:: bnf

   ::= "swh" ":" ":" ":" ;
   ::= "1" ;
   ::=
       "snp"  (* snapshot *)
     | "rel"  (* release *)
     | "rev"  (* revision *)
     | "dir"  (* directory *)
     | "cnt"  (* content *)
     ;
   ::= 40 * ; (* intrinsic object id, as hex-encoded SHA1 *)
   ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
   ::= | "a" | "b" | "c" | "d" | "e" | "f" ;

 Semantics
 ---------

 ``:`` is used as separator between the logical parts of identifiers. The
 ``swh`` prefix makes explicit that these identifiers are related to *SoftWare
 Heritage*.
 ``1`` (````) is the current version of this identifier *scheme*; future
 editions will use higher version numbers, possibly breaking backward
 compatibility (but without breaking the resolvability of identifiers that
 conform to previous versions of the scheme).

 A persistent identifier points to a single object, whose type is explicitly
 captured by ````:

 * ``snp`` identifiers points to **snapshots**,
 * ``rel`` to **releases**,
 * ``rev`` to **revisions**,
 * ``dir`` to **directories**,
 * ``cnt`` to **contents**.

 The actual object pointed to is identified by the intrinsic identifier ````,
 which is a hex-encoded (using lowercase ASCII characters) SHA1 computed on the
 content and metadata of the object itself, as follows:

 * for **snapshots**, intrinsic identifiers are computed as per
   :py:func:`swh.model.identifiers.snapshot_identifier`

 * for **releases**, as per
   :py:func:`swh.model.identifiers.release_identifier`

 * for **revisions**, as per
   :py:func:`swh.model.identifiers.revision_identifier`

 * for **directories**, as per
   :py:func:`swh.model.identifiers.directory_identifier`

 * for **contents**, the intrinsic identifier is the ``sha1_git`` hash of the
   multiple hashes returned by
   :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte
   sequence obtained by juxtaposing the ASCII string ``"blob"`` (without
   quotes), a space, the length of the content as decimal digits, a NULL byte,
   and the actual content of the file.

 Git compatibility
 ~~~~~~~~~~~~~~~~~

 Intrinsic object identifiers for contents, directories, revisions, and releases
 are, at present, compatible with the `Git `_ way of `computing identifiers `_
 for its objects. A Software Heritage content identifier will be identical to a
 Git blob identifier of any file with the same content, a Software Heritage
 revision identifier will be identical to the corresponding Git commit
 identifier, etc. This is not the case for snapshot identifiers as Git doesn't
 have a corresponding object type.
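
The rule for **contents** described above can be sketched in a few lines of Python. This is only an illustration built on the standard ``hashlib`` module; the helper name ``content_swhid`` is hypothetical and is not part of ``swh.model``, which remains the reference implementation.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Build a version-1 persistent identifier for raw file content.

    Hypothetical helper for illustration only; it follows the rule stated
    in the text: SHA1 over "blob", a space, the decimal length, a NULL
    byte, and then the content itself.
    """
    # Header: the ASCII string "blob", a space, the length in decimal, NUL.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    # hexdigest() yields lowercase hexadecimal, as the scheme requires.
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


if __name__ == "__main__":
    # The empty file hashes to the same value as Git's empty blob.
    print(content_swhid(b""))
```

Because the same byte sequence is hashed, the digest part of the result should match what ``git hash-object`` reports for a file with identical content, which is the compatibility property this section describes.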

 Note that Git compatibility is incidental and is not guaranteed to be
 maintained in future versions of this scheme (or Git).

 Examples
 --------

 * ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
   of a file containing the full text of the GPL3 license
 * ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
   containing the source code of the Darktable photography application as it was
   at some point on 4 May 2017
 * ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
   the development history of Darktable, dated 16 January 2017, that added
   undo/redo supports for masks
 * ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
   release 2.3.0, dated 24 December 2016
 * ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
   of the entire Darktable Git repository taken on 4 May 2017 from GitHub

 Resolution
 ----------

 Persistent identifiers can be resolved using the Software Heritage Web
 application (see :py:mod:`swh.web`). In particular, the root endpoint ``/``
 can be given a persistent identifier and will lead to the browsing page of the
 corresponding object, like this: ``https://archive.softwareheritage.org/``.
 For example:

 * ``_
 * ``_
 * ``_
 * ``_
 * ``_

 Contextual information
 ======================

 It is often useful to complement persistent identifiers with **contextual
 information** about where the identified object has been found as well as which
 specific parts of it are of interest. To that end it is possible, via a
 dedicated syntax, to extend persistent identifiers with the following pieces of
 information:

 * the **software origin** where an object has been found/observed
 * the **line number(s)** of interest, usually within a content object

 Syntax
 ------

 The full-syntax to complement identifiers with contextual information is given
 by the ```` entry point of the grammar:

 .. code-block:: bnf

   ::= [] []
   ::= ";" "lines" "=" ["-" ]
   ::= ";" "origin" "="
   ::= +
   ::= (* RFC 3986 compliant URLs *)

 Semantics
 ---------

-";" is used a separator between persistent identifiers and additional optional
-contextual information. Each piece of contextual information is specified as a
-key/value pair, using "=" as a separator.
+``;`` is used as separator between persistent identifiers and additional
+optional contextual information. Each piece of contextual information is
+specified as a key/value pair, using ``=`` as a separator.

 The following piece of contextual information are supported:

 * line numbers: it is possible to specify a single line number or a line range,
-  separating two numbers with "-". Note that line numbers are purely indicative
-  and are not meant to be stable, as in some degenerate cases (e.g., text files
-  which mix different types of line terminators) it is impossible to resolve
-  them unambiguously.
+  separating two numbers with ``-``. Note that line numbers are purely
+  indicative and are not meant to be stable, as in some degenerate cases
+  (e.g., text files which mix different types of line terminators) it is
+  impossible to resolve them unambiguously.

 * software origin: where a given object has been found or observed in the wild,
   as the URI that was used by Software Heritage to ingest the object into the
   archive
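
To make the contextual syntax concrete, here is a minimal parsing sketch in Python. It assumes that the optional ``lines`` part, when present, comes before the optional ``origin`` part, following the order in which the grammar lists them, and it does not validate the origin against RFC 3986. The names ``parse_identifier`` and ``ParsedIdentifier`` are hypothetical, and the origin URL used in the usage note is a placeholder.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumed layout: core identifier, then optional ";lines=N[-M]",
# then optional ";origin=URL" (taken verbatim, not validated).
_PATTERN = re.compile(
    r"swh:1:(?P<type>snp|rel|rev|dir|cnt):(?P<id>[0-9a-f]{40})"
    r"(?:;lines=(?P<start>[0-9]+)(?:-(?P<end>[0-9]+))?)?"
    r"(?:;origin=(?P<origin>.+))?"
)


class ParsedIdentifier(NamedTuple):
    """Hypothetical container for the parts of an identifier."""
    object_type: str
    object_id: str
    lines: Optional[Tuple[int, int]]  # (first, last); equal for one line
    origin: Optional[str]


def parse_identifier(text: str) -> ParsedIdentifier:
    """Split an identifier, with optional context, into its parts."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("not a version-1 persistent identifier: %r" % text)
    lines = None
    if match.group("start") is not None:
        first = int(match.group("start"))
        # A single line number is reported as a one-line range.
        last = int(match.group("end")) if match.group("end") else first
        lines = (first, last)
    return ParsedIdentifier(
        match.group("type"), match.group("id"), lines, match.group("origin")
    )
```

For instance, ``parse_identifier("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2;lines=9-15;origin=https://example.org/repo.git")`` yields the object type ``cnt``, the line range ``(9, 15)``, and the placeholder origin. Since line numbers are purely indicative, a consumer of this result should treat the range as a hint rather than a stable reference.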