diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst index a25e421..f78aee3 100644 --- a/docs/persistent-identifiers.rst +++ b/docs/persistent-identifiers.rst @@ -1,294 +1,307 @@ .. _persistent-identifiers: -====================== -Persistent identifiers -====================== +================================================ +SoftWare Heritage persistent IDentifiers (SWHID) +================================================ + +**version 1.2** + Description =========== You can point to objects present in the Software Heritage archive by the means -of **persistent identifiers** that are guaranteed to remain stable (persistent) -over time. Their syntax, meaning, and usage is described below. Note that they -are identifiers and not URLs, even though an URL-based resolver for Software -Heritage persistent identifiers is also provided. +of **SoftWare Heritage persistent IDentifiers**, or **SWHID** for short, that +are guaranteed to remain stable (persistent) over time. Their syntax, meaning, +and usage is described below. Note that they are identifiers and not URLs, even +though an URL-based resolver for Software Heritage persistent identifiers is +also provided. -A persistent identifier can point to any software artifact (or "object") -available in the Software Heritage archive. Objects come in different types, -and most notably: +A SWHID can point to any software artifact (or "object") available in the +Software Heritage archive. Objects come in different types, and most notably: * contents * directories * revisions * releases * snapshots Each object is identified by an intrinsic, type-specific object identifier that -is embedded in its persistent identifier as described below. Object identifiers -are strong cryptographic hashes computed on the entire set of object properties -to form a `Merkle structure `_. +is embedded in its SWHID as described below. 
The object identifiers embedded in SWHIDs are strong cryptographic +hashes computed on the entire set of object properties to form a `Merkle +structure `_. -See :ref:`data-model` for an overview of object types and how they are linked -together. See :py:mod:`swh.model.identifiers` for details on how intrinsic -object identifiers are computed. +See the :ref:`Software Heritage data model ` for an overview of +object types and how they are linked together. See +:py:mod:`swh.model.identifiers` for details on how SWHIDs are computed. Syntax ------ -Syntactically, persistent identifiers are generated by the ```` -entry point of the grammar: +Syntactically, SWHIDs are generated by the ```` entry point of the +grammar: .. code-block:: bnf ::= "swh" ":" ":" ":" ; ::= "1" ; ::= "snp" (* snapshot *) | "rel" (* release *) | "rev" (* revision *) | "dir" (* directory *) | "cnt" (* content *) ; ::= 40 * ; (* intrinsic object id, as hex-encoded SHA1 *) ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ; ::= | "a" | "b" | "c" | "d" | "e" | "f" ; Semantics --------- -``:`` is used as separator between the logical parts of identifiers. The -``swh`` prefix makes explicit that these identifiers are related to *SoftWare +``:`` is used as a separator between the logical parts of SWHIDs. The ``swh`` +prefix makes explicit that these identifiers are related to *SoftWare Heritage*. ``1`` (````) is the current version of this identifier *scheme*; future editions will use higher version numbers, possibly breaking backward compatibility (but without breaking the resolvability of -identifiers that conform to previous versions of the scheme). +SWHIDs that conform to previous versions of the scheme). -A persistent identifier points to a single object, whose type is explicitly -captured by ````: +A SWHID points to a single object, whose type is explicitly captured by +````: * ``snp`` to **snapshots**, * ``rel`` to **releases**, * ``rev`` to **revisions**, * ``dir`` to **directories**, * ``cnt`` to **contents**. 
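The syntax and semantics above can be illustrated with a minimal validation sketch. It assumes only what the grammar states: the ``swh`` prefix, scheme version ``1``, one of the five object type tags, and 40 lowercase hexadecimal digits. The names ``CoreSWHID`` and ``parse_core_swhid`` are hypothetical helpers introduced for illustration; they are not part of :py:mod:`swh.model`.

```python
import re
from typing import NamedTuple

# Hypothetical helper, not the swh.model API: a regular expression that
# mirrors the grammar above (prefix, scheme version, object type, object id).
_CORE_SWHID_RE = re.compile(r"swh:(1):(snp|rel|rev|dir|cnt):([0-9a-f]{40})")


class CoreSWHID(NamedTuple):
    scheme_version: int
    object_type: str  # one of "snp", "rel", "rev", "dir", "cnt"
    object_id: str    # 40 lowercase hex digits


def parse_core_swhid(text: str) -> CoreSWHID:
    """Split a SWHID without qualifiers into its three logical parts."""
    match = _CORE_SWHID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid SWHID: {text!r}")
    version, object_type, object_id = match.groups()
    return CoreSWHID(int(version), object_type, object_id)


if __name__ == "__main__":
    print(parse_core_swhid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"))
```

Using ``fullmatch`` rejects trailing characters and uppercase hexadecimal digits, in line with the lowercase-only object identifier of the grammar.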
The actual object pointed to is identified by the intrinsic identifier ````, which is a hex-encoded (using lowercase ASCII characters) SHA1 computed on the content and metadata of the object itself, as follows: * for **snapshots**, intrinsic identifiers are computed as per :py:func:`swh.model.identifiers.snapshot_identifier` * for **releases**, as per :py:func:`swh.model.identifiers.release_identifier` * for **revisions**, as per :py:func:`swh.model.identifiers.revision_identifier` * for **directories**, as per :py:func:`swh.model.identifiers.directory_identifier` * for **contents**, the intrinsic identifier is the ``sha1_git`` hash of the multiple hashes returned by :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte sequence obtained by juxtaposing the ASCII string ``"blob"`` (without quotes), a space, the length of the content as decimal digits, a NULL byte, and the actual content of the file. Git compatibility ~~~~~~~~~~~~~~~~~ -Intrinsic object identifiers for contents, directories, revisions, and releases -are, at present, compatible with the `Git `_ way of -`computing identifiers +SWHIDs for contents, directories, revisions, and releases are, at present, +compatible with the `Git `_ way of `computing identifiers `_ for its objects. -A Software Heritage content identifier will be identical to a Git blob -identifier of any file with the same content, a Software Heritage revision -identifier will be identical to the corresponding Git commit identifier, etc. -This is not the case for snapshot identifiers as Git doesn't have a -corresponding object type. +A SWHID for a content object will correspond (in its ```` part) to a +Git blob identifier of any file with the same content; a SWHID for a revision +will correspond to the Git commit identifier for the same revision, etc. This +is not the case for snapshot identifiers, as Git does not have a corresponding +object type. 
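The content case described above, and its correspondence with Git blob identifiers, can be sketched with the Python standard library alone. The function name ``content_swhid`` is a hypothetical helper for illustration, not the :py:mod:`swh.model` implementation; it simply follows the byte layout given above (``blob``, a space, the decimal length, a NULL byte, then the content).

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a content object from its raw bytes.

    Hypothetical helper: hashes the header "blob <length>\\0" followed by
    the content with SHA1, as described for contents above.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # The empty content hashes to the well-known Git empty blob identifier.
    print(content_swhid(b""))
```

For the same bytes, the hexadecimal part should match what ``git hash-object`` reports for a file with that content.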
Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git). Examples -------- * ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content of a file containing the full text of the GPL3 license * ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory containing the source code of the Darktable photography application as it was at some point on 4 May 2017 * ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in the development history of Darktable, dated 16 January 2017, that added undo/redo support for masks * ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable release 2.3.0, dated 24 December 2016 * ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub Contextual information ====================== -The Software Heritage persistent identifiers described above are *intrinsic identifiers*, as they are computed from the designated object itself, and it is often useful to provide *contextual information* about a particular -occurrence of the object, like the origin from where the object has been found. -To this end, persistent identifiers can be equipped with **qualifiers** that -contain this *contextual information*. Qualifiers come in different kinds : +The SWHIDs described above are *intrinsic identifiers*, as they are computed +from the designated object itself. It is often useful to also provide +*contextual information* about a particular occurrence of the object, like the +origin where the object was found. To this end, SWHIDs can be +coupled with **qualifiers** that capture such *contextual information*. 
+Qualifiers come in different kinds: * origin * visit * anchor * path * lines + Syntax ------ -The full-syntax to complement identifiers with contextual information is given -by the ```` entry point of the grammar: +The full syntax to complement SWHIDs with contextual information is given by +the ```` entry point of the grammar: .. code-block:: bnf ::= [ ] ::= [ ] ::= | | | | ::= ";" "origin" "=" ::= ";" "visit" "=" ::= ";" "anchor" "=" ::= ";" "path" "=" ::= ";" "lines" "=" ["-" ] ::= + ::= (* RFC 3986 compliant URLs *) ::= (* RFC 3986 compliant absolute file path, percent-escaped *) -Here ```` is the ```` in `Section 3.3 of RFC 3986 `_ where all occurrences of ``;`` and ``%`` must be percent-encoded (as `%3B` and `%25` respectively). +Here ```` is the ```` in `Section 3.3 of +RFC 3986 `_ where all +occurrences of ``;`` and ``%`` must be percent-encoded (as ``%3B`` and ``%25`` +respectively). + Semantics --------- -``;`` is used as separator between persistent identifiers and the -optional contextual information qualifiers. Each contextual information qualifier is -specified as a key/value pair, using ``=`` as a separator. +``;`` is used as a separator between SWHIDs and the optional contextual +information qualifiers. Each contextual information qualifier is specified as a +key/value pair, using ``=`` as a separator. 
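The separator rules above can be sketched as a small splitting routine. It relies on two points stated in this section: ``;`` never appears unescaped inside a path (it must be written ``%3B``), and each qualifier is a key/value pair joined by ``=``. The name ``split_qualified_swhid`` is a hypothetical helper, and the sketch assumes a qualified SWHID written with no whitespace around separators.

```python
from typing import Dict, Tuple
from urllib.parse import unquote

# Qualifier keys listed in this section.
KNOWN_QUALIFIERS = ("origin", "visit", "anchor", "path", "lines")


def split_qualified_swhid(text: str) -> Tuple[str, Dict[str, str]]:
    """Split a qualified SWHID into (core identifier, qualifiers).

    Hypothetical helper: the core identifier is returned unvalidated.
    """
    core, *raw_qualifiers = text.split(";")
    qualifiers: Dict[str, str] = {}
    for raw in raw_qualifiers:
        # Split on the first "=" only: values may themselves contain "=".
        key, sep, value = raw.partition("=")
        if not sep or key not in KNOWN_QUALIFIERS:
            raise ValueError(f"invalid qualifier: {raw!r}")
        qualifiers[key] = value
    if "path" in qualifiers:
        # Paths are percent-escaped, e.g. "%3B" stands for ";".
        qualifiers["path"] = unquote(qualifiers["path"])
    return core, qualifiers


if __name__ == "__main__":
    core, quals = split_qualified_swhid(
        "swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b"
        ";origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git"
        ";path=/Examples/SimpleFarm/simplefarm.ml"
        ";lines=9-15"
    )
    print(core)
    print(quals)
```

Only the ``path`` value is percent-decoded here; whether other qualifier values need decoding is left open in this sketch.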
The following pieces of contextual information are supported: -* **origin** : the *software origin* where an object has been found or observed in the wild, - as an URI; -* **visit** : persistent identifier of a *snapshot* corresponding to a specific *visit* of a repository containing the designated object; -* **anchor** : a *designated node* in the Merkle DAG relative to which a *path to the object* is specified, - as a persistent identifier of a directory, a revision, a release or a snapshot; -* **path** : the *absolute file path*, from the *root directory* associated to the *anchor node*, to the object; - when the anchor denotes a directory or a revision, and almost always when it's a release, - the root directory is uniquely determined; when the anchor denotes a snapshot, the root - directory is the one pointed to by ``HEAD`` (possibly indirectly), - and undefined if such a reference is missing; +* **origin** : the *software origin* where an object has been found or observed + in the wild, as a URI; +* **visit** : persistent identifier of a *snapshot* corresponding to a specific + *visit* of a repository containing the designated object; +* **anchor** : a *designated node* in the Merkle DAG relative to which a *path + to the object* is specified, as a persistent identifier of a directory, a + revision, a release or a snapshot; +* **path** : the *absolute file path*, from the *root directory* associated with + the *anchor node*, to the object; when the anchor denotes a directory or a + revision, and almost always when it's a release, the root directory is + uniquely determined; when the anchor denotes a snapshot, the root directory + is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if + such a reference is missing; * **lines** : *line number(s)* of interest, usually within a content object We recommend to equip identifiers meant to be shared with as many qualifiers as -possible. 
While qualifiers may be listed in any order, it is good practice -to present them in the order given above, i.e. ``origin``, ``visit``, ``anchor``, ``path``, ``lines``. -Redundant information should be omitted: for example, if the *visit* -is present, and the *path* is relative to the snapshot indicated there, then the -*anchor* qualifier is superfluous. +possible. While qualifiers may be listed in any order, it is good practice to +present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``, +``path``, ``lines``. Redundant information should be omitted: for example, if +the *visit* is present, and the *path* is relative to the snapshot indicated +there, then the *anchor* qualifier is superfluous. + Example ------- -The following `fully qualified identifier `_ -denotes the lines 9 to 15 of a file content that -can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory -of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained -in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from -the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. +The following `fully qualified SWHID +`_ +denotes the lines 9 to 15 of a file content that can be found at absolute path +``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision +``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the +snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the +origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. .. 
code-block:: url swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b; origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git; visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9; anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0; path=/Examples/SimpleFarm/simplefarm.ml; lines=9-15 -And this is an example of `a fully qualified identifier with a percent escaped file path `_ +And this is an example of `a fully qualified SWHID with a percent escaped file +path +`_ .. code-block:: url swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04; origin=https://github.com/web-platform-tests/wpt; visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499; anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96; path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/ Resolution ========== Dedicated resolvers ------------------- -Persistent identifiers can be resolved using the Software Heritage Web -application (see :py:mod:`swh.web`). In particular, the **root endpoint** -``/`` can be given a persistent identifier and will lead to the browsing page -of the corresponding object, like this: -``https://archive.softwareheritage.org/``. +SWHIDs can be resolved using the Software Heritage Web application (see +:py:mod:`swh.web`). In particular, the **root endpoint** ``/`` can be given a +SWHID and will lead to the browsing page of the corresponding object, like +this: ``https://archive.softwareheritage.org/``. A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to -explicitly request persistent identifier resolution; see: -:http:get:`/api/1/resolve/(swh_id)/`. +explicitly request SWHID resolution; see: :http:get:`/api/1/resolve/(swh_id)/`. 
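The two resolution entry points just described can be sketched as plain URL builders. The host and the ``/api/1/resolve/`` path come from this section; the assumption that the HTTP API is served from the same host as the root endpoint, and the function names ``browse_url`` and ``resolve_api_url``, are illustrative only. No request is sent.

```python
# Host taken from the root endpoint example in this section.
ARCHIVE_BASE = "https://archive.softwareheritage.org"


def browse_url(swhid: str) -> str:
    """URL of the root endpoint for a SWHID (leads to the browsing page)."""
    return f"{ARCHIVE_BASE}/{swhid}"


def resolve_api_url(swhid: str) -> str:
    """URL of the dedicated resolve endpoint of the HTTP API.

    Assumption: the API lives under the same host as the root endpoint.
    """
    return f"{ARCHIVE_BASE}/api/1/resolve/{swhid}/"


if __name__ == "__main__":
    swhid = "swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d"
    print(browse_url(swhid))
    print(resolve_api_url(swhid))
```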
Examples: * ``_ * ``_ * ``_ * ``_ * ``_ External resolvers ------------------ -The following **independent resolvers** support resolution of Software -Heritage persistent identifiers: +The following **independent resolvers** support resolution of SWHIDs: * `Identifiers.org `_; see: ``_ (registry identifier `MIR:00000655 `_). * `Name-to-Thing (N2T) `_ Examples: * ``_ * ``_ * ``_ * ``_ * ``_ Note that resolution via Identifiers.org does not support contextual information, due to `syntactic incompatibilities `_. References ========== * Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Identifiers for Digital Objects: the Case of Software Source Code Preservation `_. In Proceedings of `iPRES 2018 `_: 15th International Conference on Digital Preservation, Boston, MA, USA, September 2018, 9 pages. * Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Referencing Source Code Artifacts: a Separate Concern in Software Citation _`. In Computing in Science and Engineering, volume 22, issue 2, pages 33-43. ISSN 1521-9615, IEEE. March 2020.