diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst
index 03d213e..c34273f 100644
--- a/docs/persistent-identifiers.rst
+++ b/docs/persistent-identifiers.rst
@@ -1,241 +1,271 @@
 .. _persistent-identifiers:

 ======================
 Persistent identifiers
 ======================

 Description
 ===========

 You can point to objects present in the Software Heritage archive by means of
 **persistent identifiers** that are guaranteed to remain stable (persistent)
 over time. Their syntax, meaning, and usage are described below. Note that they
 are identifiers and not URLs, even though a URL-based resolver for Software
 Heritage persistent identifiers is also provided.

 A persistent identifier can point to any software artifact (or "object")
 available in the Software Heritage archive. Objects come in different types,
 most notably:

 * contents
 * directories
 * revisions
 * releases
 * snapshots

 Each object is identified by an intrinsic, type-specific object identifier that
 is embedded in its persistent identifier as described below. Object identifiers
 are strong cryptographic hashes computed on the entire set of object properties
 to form a Merkle structure.

 See :ref:`data-model` for an overview of object types and how they are linked
 together. See :py:mod:`swh.model.identifiers` for details on how intrinsic
 object identifiers are computed.

 Syntax
 ------

 Syntactically, persistent identifiers are generated by the ``<identifier>``
 entry point of the grammar:

 .. code-block:: bnf

   <identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
   <scheme_version> ::= "1" ;
   <object_type> ::=
       "snp"  (* snapshot *)
     | "rel"  (* release *)
     | "rev"  (* revision *)
     | "dir"  (* directory *)
     | "cnt"  (* content *)
     ;
   <object_id> ::= 40 * <hex_digit> ;  (* intrinsic object id, as hex-encoded SHA1 *)
   <dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
   <hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;

 Semantics
 ---------

 ``:`` is used as separator between the logical parts of identifiers. The
 ``swh`` prefix makes explicit that these identifiers are related to *SoftWare
 Heritage*.
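The core syntax described so far (the ``swh`` prefix, ``:`` separators, scheme version ``1``, the five object types, and a 40-digit lowercase hexadecimal object id) can be sketched as a regular expression check. This is a minimal illustration, not the reference implementation in ``swh.model``; the helper name ``is_valid_core_swhid`` is hypothetical.

```python
import re

# Mirrors the grammar: "swh" prefix, scheme version 1, one of the five
# object types, then exactly 40 lowercase hexadecimal digits.
_CORE_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):[0-9a-f]{40}")


def is_valid_core_swhid(text: str) -> bool:
    """Return True if `text` matches the core identifier grammar.

    Hypothetical helper, for illustration only.
    """
    return _CORE_RE.fullmatch(text) is not None


# Lowercase hex digits are accepted.
print(is_valid_core_swhid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"))  # True
# Uppercase hex digits are not part of the grammar.
print(is_valid_core_swhid("swh:1:cnt:94A9ED024D3859793618152EA559A168BBCBB5E2"))  # False
```

The check is purely syntactic: it says nothing about whether the designated object actually exists in the archive.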
 ``1`` (``<scheme_version>``) is the current version of this identifier
 *scheme*; future editions will use higher version numbers, possibly breaking
 backward compatibility (but without breaking the resolvability of identifiers
 that conform to previous versions of the scheme).

 A persistent identifier points to a single object, whose type is explicitly
 captured by ``<object_type>``:

 * ``snp`` to **snapshots**,
 * ``rel`` to **releases**,
 * ``rev`` to **revisions**,
 * ``dir`` to **directories**,
 * ``cnt`` to **contents**.

 The actual object pointed to is identified by the intrinsic identifier
 ``<object_id>``, which is a hex-encoded (using lowercase ASCII characters) SHA1
 computed on the content and metadata of the object itself, as follows:

 * for **snapshots**, intrinsic identifiers are computed as per
   :py:func:`swh.model.identifiers.snapshot_identifier`

 * for **releases**, as per
   :py:func:`swh.model.identifiers.release_identifier`

 * for **revisions**, as per
   :py:func:`swh.model.identifiers.revision_identifier`

 * for **directories**, as per
   :py:func:`swh.model.identifiers.directory_identifier`

 * for **contents**, the intrinsic identifier is the ``sha1_git`` hash among the
   multiple hashes returned by
   :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte
   sequence obtained by juxtaposing the ASCII string ``"blob"`` (without
   quotes), a space, the length of the content as decimal digits, a NULL byte,
   and the actual content of the file.

 Git compatibility
 ~~~~~~~~~~~~~~~~~

 Intrinsic object identifiers for contents, directories, revisions, and releases
 are, at present, compatible with the Git way of computing identifiers for its
 objects. A Software Heritage content identifier will be identical to a Git blob
 identifier of any file with the same content, a Software Heritage revision
 identifier will be identical to the corresponding Git commit identifier, etc.
 This is not the case for snapshot identifiers, as Git doesn't have a
 corresponding object type.
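The content identifier computation described above, and its agreement with Git blob identifiers, can be sketched with the Python standard library alone. This is a minimal sketch under the rules stated in this section, not the code of ``swh.model``; the function names ``content_sha1_git`` and ``content_swhid`` are hypothetical.

```python
import hashlib


def content_sha1_git(data: bytes) -> str:
    """Hex SHA1 of: ASCII "blob", a space, the decimal length, a NUL byte, the content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_swhid(data: bytes) -> str:
    """Build a version 1 persistent identifier for a content object (hypothetical helper)."""
    return "swh:1:cnt:" + content_sha1_git(data)


# The empty file hashes to the well-known identifier of the empty Git blob.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

For a file on disk, the hexadecimal part can be cross-checked against the output of ``git hash-object <file>``.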
 Note that Git compatibility is incidental and is not guaranteed to be maintained
 in future versions of this scheme (or Git).

 Examples
 --------

 * ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
   of a file containing the full text of the GPL3 license
 * ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
   containing the source code of the Darktable photography application as it was
   at some point on 4 May 2017
 * ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
   the development history of Darktable, dated 16 January 2017, that added
   undo/redo support for masks
 * ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
   release 2.3.0, dated 24 December 2016
 * ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot of
   the entire Darktable Git repository taken on 4 May 2017 from GitHub

 Contextual information
 ======================

-It is often useful to complement persistent identifiers with **contextual
-information** about where the identified object has been found as well as which
-specific parts of it are of interest. To that end it is possible, via a
-dedicated syntax, to extend persistent identifiers with the following pieces of
-information:
-
-* the **software origin** where an object has been found/observed
-* the **line number(s)** of interest, usually within a content object
+Persistent identifiers may be equipped with **qualifiers** to provide
+*contextual information* about the object designated by the identifier.
+Qualifiers come in different kinds:
+
+* origin
+* visit
+* anchor
+* path
+* lines

 Syntax
 ------

 The full syntax to complement identifiers with contextual information is given
 by the ``<identifier_with_context>`` entry point of the grammar:

 .. code-block:: bnf

-  <identifier_with_context> ::= <identifier> [<lines_ctxt>] [<origin_ctxt>]
-  <lines_ctxt> ::= ";" "lines" "=" <line_number> ["-" <line_number>]
+  <identifier_with_context> ::= <identifier> [ <qualifiers> ]
+  <qualifiers> ::= <qualifier> [ <qualifiers> ]
+  <qualifier> ::= <origin_ctxt> | <visit_ctxt> | <anchor_ctxt> | <path_ctxt> | <lines_ctxt>
   <origin_ctxt> ::= ";" "origin" "=" <url>
+  <visit_ctxt> ::= ";" "visit" "=" <identifier>
+  <anchor_ctxt> ::= ";" "anchor" "=" <identifier>
+  <path_ctxt> ::= ";" "path" "=" <path_absolute>
+  <lines_ctxt> ::= ";" "lines" "=" <line_number> ["-" <line_number>]
   <line_number> ::= <dec_digit> +
   <url> ::= (* RFC 3986 compliant URLs *)
+  <path_absolute> ::= (* RFC 3986 compliant absolute file path *)
+
+For ``<path_absolute>`` see Section 3.3 of RFC 3986.

 Semantics
 ---------

-``;`` is used as separator between persistent identifiers and additional
-optional contextual information. Each piece of contextual information is
+``;`` is used as separator between persistent identifiers and the
+optional contextual information qualifiers. Each contextual information qualifier is
 specified as a key/value pair, using ``=`` as a separator.

 The following pieces of contextual information are supported:

-* line numbers: it is possible to specify a single line number or a line range,
-  separating two numbers with ``-``. Note that line numbers are purely
-  indicative and are not meant to be stable, as in some degenerate cases
-  (e.g., text files which mix different types of line terminators) it is
-  impossible to resolve them unambiguously.
-
-* software origin: where a given object has been found or observed in the wild,
-  as the URI that was used by Software Heritage to ingest the object into the
-  archive
-
+* **origin**: the *software origin* where an object has been found or observed in the wild,
+  as the URI that was used by Software Heritage to ingest the object into the archive;
+* **visit**: the *status of a full repository* containing the designated object, as a *snapshot*
+  corresponding to a specific *visit* of that repository;
+* **anchor**: a *designated node* in the Merkle DAG relative to which a *path to the object* is specified,
+  as a persistent identifier of a directory, a revision, a release or a snapshot;
+* **path**: the *absolute file path* from the *root directory* associated with the *anchor node* to the object;
+  when the anchor denotes a directory or a revision, and almost always when it's a release,
+  the root directory is uniquely determined; when the anchor denotes a snapshot, the root
+  directory is considered to be the one associated with the main branch of that snapshot;
+* **lines**: *line number(s)* of interest, usually within a content object.
+
+We recommend equipping identifiers meant to be shared with as many qualifiers
+as possible. Redundant information should be omitted: for example, if the *visit*
+is present, and the *path* is relative to the snapshot indicated there, then
+the *anchor* qualifier is superfluous.
+
+Example
+-------
+
+The following fully qualified identifier
+denotes lines 9 to 15 of a file content that
+can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory
+of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained
+in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from
+the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``.
+
+.. code-block:: url
+
+  swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
+  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
+  path=/Examples/SimpleFarm/simplefarm.ml;
+  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
+  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
+  lines=9-15

 Resolution
 ==========

 Dedicated resolvers
 -------------------

 Persistent identifiers can be resolved using the Software Heritage Web
 application (see :py:mod:`swh.web`). In particular, the **root endpoint**
 ``/`` can be given a persistent identifier and will lead to the browsing page of
 the corresponding object, like this:
 ``https://archive.softwareheritage.org/<identifier>``.

 A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to
 explicitly request persistent identifier resolution; see:
 :http:get:`/api/1/resolve/(swh_id)/`.

 Examples:

 * ``_
 * ``_
 * ``_
 * ``_
 * ``_

 External resolvers
 ------------------

 The following **independent resolvers** support resolution of Software
 Heritage persistent identifiers:

 * Identifiers.org; see: ``_ (registry identifier MIR:00000655).
 * Name-to-Thing (N2T)

 Examples:

 * ``_
 * ``_
 * ``_
 * ``_
 * ``_

 Note that resolution via Identifiers.org does not support contextual
 information, due to syntactic incompatibilities.

 References
 ==========

 * Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Identifiers for
   Digital Objects: the Case of Software Source Code Preservation. In
   Proceedings of iPRES 2018: 15th International Conference on Digital
   Preservation, Boston, MA, USA, September 2018, 9 pages.