diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst --- a/docs/persistent-identifiers.rst +++ b/docs/persistent-identifiers.rst @@ -4,22 +4,29 @@ SoftWare Heritage persistent IDentifiers (SWHIDs) ================================================= -**version 1.3, last modified 2020-04-28** +**version 1.4, last modified 2020-04-30** + +.. contents:: + :local: + :depth: 2 Overview ======== -You can point to objects present in the Software Heritage archive by the means -of **SoftWare Heritage persistent IDentifiers**, or **SWHIDs** for short, that -are guaranteed to remain stable (persistent) over time. Their syntax, meaning, -and usage is described below. Note that they are identifiers and not URLs, even -though URL-based resolvers for SWHIDs are also available. +You can point to objects present in the `Software Heritage +`_ `archive +`_ by the means of **SoftWare Heritage +persistent IDentifiers**, or **SWHIDs** for short, that are guaranteed to +remain stable (persistent) over time. Their syntax, meaning, and usage is +described below. Note that they are identifiers and not URLs, even though +URL-based `resolvers`_ for SWHIDs are also available. -A SWHID consists of two separate parts, a *core identifier* that can point to -any software artifact (or "object") available in the Software Heritage archive, -and an *optional list of qualifiers* that allows to specify the context where -the object is meant to be seen, or point to a subpart of the object itself. +A SWHID consists of two separate parts, a mandatory *core identifier* that can +point to any software artifact (or "object") available in the Software Heritage +archive, and an optional list of *qualifiers* that allows to specify the +context where the object is meant to be seen and point to a subpart of the +object itself. Objects come in different types: @@ -33,7 +40,8 @@ is embedded in its SWHID as described below. 
The intrinsic identifiers embedded in SWHIDs are strong cryptographic hashes computed on the entire set of object properties. Together, these identifiers form a `Merkle structure -`_, specifically a Merkle DAG. +`_, specifically a Merkle `DAG +`_. See the :ref:`Software Heritage data model ` for an overview of object types and how they are linked together. See @@ -42,23 +50,24 @@ The optional qualifiers are of two kinds: -* *context qualifiers* carry information about the context where a given - object is meant to be seen; this is particularly important, as the same object - can be reached in the Merkle graph following different *paths* from different - nodes (or *anchors*), and it may have been retrieved from different *origins*, - that may evolve between different *visits*, -* *fragment qualifiers* allow to pinpoint specific subparts of an object +* **context qualifiers:** carry information about the context where a given + object is meant to be seen. This is particularly important, as the same + object can be reached in the Merkle graph following different *paths* + starting from different nodes (or *anchors*), and it may have been retrieved + from different *origins*, that may evolve between different *visits* +* **fragment qualifiers:** allow to pinpoint specific subparts of an object Syntax ------- +====== -Syntactically, SWHIDs are generated by the ```` entry point of the -grammar: +Syntactically, SWHIDs are generated by the ```` entry point in the +following grammar: .. 
code-block:: bnf - ::= [ ] ; + ::= [ ] ; + ::= "swh" ":" ":" ":" ; ::= "1" ; ::= @@ -71,7 +80,8 @@ ::= 40 * ; (* intrinsic object id, as hex-encoded SHA1 *) ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ; ::= | "a" | "b" | "c" | "d" | "e" | "f" ; - := [ ] ; + + := ";" [ ] ; ::= | @@ -82,14 +92,14 @@ | | ; - ::= ";" "origin" "=" ; - ::= ";" "visit" "=" ; - ::= ";" "anchor" "=" ; - ::= ";" "path" "=" ; - ::= ";" "lines" "=" ["-" ] ; + ::= "origin" "=" ; + ::= "visit" "=" ; + ::= "anchor" "=" ; + ::= "path" "=" ; + ::= "lines" "=" ["-" ] ; ::= + ; - ::= (* RFC 3986 compliant URLs, percent-escaped *) - ::= (* RFC 3986 compliant absolute file path, percent-escaped *) + ::= (* RFC 3987 IRI *) + ::= (* RFC 3987 absolute path *) Where: @@ -105,17 +115,18 @@ Semantics ---------- +========= + Core identifiers -~~~~~~~~~~~~~~~~ +---------------- -``:`` is used as separator between the logical parts of core identifiers. The ``swh`` -prefix makes explicit that these identifiers are related to *SoftWare +``:`` is used as separator between the logical parts of core identifiers. The +``swh`` prefix makes explicit that these identifiers are related to *SoftWare Heritage*. ``1`` (````) is the current version of this -identifier *scheme*; future editions will use higher version numbers, possibly -breaking backward compatibility (but without breaking the resolvability of -SWHIDs that conform to previous versions of the scheme). +identifier *scheme*. Future editions will use higher version numbers, possibly +breaking backward compatibility, but without breaking the resolvability of +SWHIDs that conform to previous versions of the scheme. A SWHID points to a single object, whose type is explicitly captured by ````: @@ -151,23 +162,27 @@ quotes), a space, the length of the content as decimal digits, a NULL byte, and the actual content of the file. 
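
The hashing rule for content objects described above can be sketched in a few lines of Python. This is a minimal illustration, not the ``swh.model`` implementation: the function name ``content_swhid`` is hypothetical, and the ``blob`` header prefix is assumed from Git's blob object format, which SWHIDs for contents are stated to be compatible with.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the core SWHID of a content object (hypothetical helper).

    Assumes the Git blob convention: SHA1 over the header
    b"blob <length>\\0" followed by the raw bytes of the file.
    """
    # Header: the string "blob", a space, the content length as decimal
    # digits, and a NULL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    # Scheme prefix, scheme version 1, object type "cnt", hex-encoded SHA1.
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    print(content_swhid(b""))
    print(content_swhid(b"hello world\n"))
```

For empty content this yields ``swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391``, whose hash part is the identifier Git assigns to the empty blob.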
+ Qualifiers -~~~~~~~~~~ +---------- ``;`` is used as separator between the core identifier and the optional -qualifiers, and optional qualifiers. Each qualifier is specified as a +qualifiers, as well as between qualifiers. Each qualifier is specified as a key/value pair, using ``=`` as a separator. The following *context qualifiers* are available: -* **origin** : the *software origin* where an object has been found or observed +* **origin:** the *software origin* where an object has been found or observed in the wild, as an URI; -* **visit** : the core identifier of a *snapshot* corresponding to a specific + +* **visit:** the core identifier of a *snapshot* corresponding to a specific *visit* of a repository containing the designated object; -* **anchor** : a *designated node* in the Merkle DAG relative to which a *path + +* **anchor:** a *designated node* in the Merkle DAG relative to which a *path to the object* is specified, as the core identifier of a directory, a revision, a release or a snapshot; -* **path** : the *absolute file path*, from the *root directory* associated to + +* **path:** the *absolute file path*, from the *root directory* associated to the *anchor node*, to the object; when the anchor denotes a directory or a revision, and almost always when it's a release, the root directory is uniquely determined; when the anchor denotes a snapshot, the root directory @@ -176,7 +191,7 @@ The following *fragment qualifier* is available: -* **lines** : *line number(s)* of interest, usually within a content object +* **lines:** *line number(s)* of interest, usually within a content object We recommend to equip identifiers meant to be shared with as many qualifiers as possible. While qualifiers may be listed in any order, it is good practice to @@ -186,44 +201,69 @@ there, then the *anchor* qualifier is superfluous; similarly, if the *path* is empty, it may be omitted. 
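
The separator rules above (``:`` inside the core identifier, ``;`` before each qualifier, ``=`` between key and value) can be sketched as a small parser. This is an illustrative sketch under stated assumptions, not the reference implementation: ``parse_swhid`` and the shape of its return value are hypothetical, the object type tags are taken from the examples in this document, and qualifier values are assumed to be percent-escaped so that a literal ``;`` never appears inside a value.

```python
import re
from urllib.parse import unquote

# Core identifier: "swh", scheme version "1", an object type tag, and a
# 40-character lowercase hex SHA1, separated by ":".
CORE_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})")

# Context qualifiers plus the single fragment qualifier.
QUALIFIER_KEYS = {"origin", "visit", "anchor", "path", "lines"}


def parse_swhid(swhid: str) -> dict:
    """Split a SWHID into object type, object id and qualifiers (hypothetical helper)."""
    core, *raw_qualifiers = swhid.split(";")
    match = CORE_RE.fullmatch(core)
    if match is None:
        raise ValueError(f"invalid core identifier: {core!r}")

    qualifiers = {}
    for raw in raw_qualifiers:
        # Only the first "=" separates key from value; later ones belong
        # to the value (for example inside an origin URL).
        key, sep, value = raw.partition("=")
        if not sep or key not in QUALIFIER_KEYS:
            raise ValueError(f"invalid qualifier: {raw!r}")
        # Undo percent-escaping, e.g. "%3B" back to ";".
        qualifiers[key] = unquote(value)

    return {
        "object_type": match.group(1),
        "object_id": match.group(2),
        "qualifiers": qualifiers,
    }


if __name__ == "__main__":
    # Illustrative combination only: the core identifier and the qualifier
    # values are not claimed to belong together in the archive.
    example = (
        "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
        ";origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git"
        ";anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0"
        ";path=/Examples/SimpleFarm/simplefarm.ml"
        ";lines=9-15"
    )
    print(parse_swhid(example))
```

The sketch does not check qualifier ordering or reject repeated keys. A stricter validator would also parse the ``visit`` and ``anchor`` values as core identifiers.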
+ +Interoperability +================ + + +URI scheme +---------- + +The ``swh`` URI scheme is registered at IANA for SWHIDs. The present document +constitutes the scheme specification for this URI scheme. + + Git compatibility -~~~~~~~~~~~~~~~~~ +----------------- SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the `Git `_ way of `computing identifiers `_ for its objects. The ```` part of a SWHID for a content object is the Git blob identifier of any file with the same content; for a revision it is the Git -commit identifier for the same revision, etc. This is not the case for snapshot -identifiers, as Git does not have a corresponding object type. +commit identifier for the same revision, etc. This is not the case for +snapshot identifiers, as Git does not have a corresponding object type. Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git). Examples --------- +======== + Core identifiers -~~~~~~~~~~~~~~~~ +---------------- * ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content of a file containing the full text of the GPL3 license + * ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory containing the source code of the Darktable photography application as it was at some point on 4 May 2017 + * ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in the development history of Darktable, dated 16 January 2017, that added undo/redo supports for masks + * ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable release 2.3.0, dated 24 December 2016 + * ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub + Identifiers with qualifiers -~~~~~~~~~~~~~~~~~~~~~~~~~~~ +--------------------------- -* The following `fully qualified SWHID `_ denotes the lines 9 to 15 of a file 
content that can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git`` +* The following `SWHID + `_ + denotes the lines 9 to 15 of a file content that can be found at absolute + path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the + revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is + contained in the snapshot + ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the origin + ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``: .. code-block:: url @@ -234,8 +274,9 @@ path=/Examples/SimpleFarm/simplefarm.ml; lines=9-15 - -* This is an example of `a fully qualified SWHID with a percent escaped file path `_ +* Here is an example of a `SWHID + `_ + with a file path that requires percent-escaping: .. code-block:: url @@ -246,11 +287,23 @@ path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/ -Computing and resolving SWHIDs -============================== +Implementation +============== + + +Computing +--------- + +An important property of any SWHID is that its core identifier is *intrinsic*: +it can be *computed from the object itself*, without having to rely on any +third party. An implementation of SWHID that allows to do so locally is the +`swh identify `_ +tool, available from the `swh.model `_ +Python package under the GPL license. -An important property of SWHIDs is that a core identifier is *intrinsic*: it can -be *computed from the object itself* using the `swh-identify `_ utility, or equivalently using standard git tools. 
+SWHIDs are also automatically computed by Software Heritage for all archived +objects as part of its archival activity, and can be looked up via the project +`Web interface `_. This has various practical implications: @@ -259,19 +312,26 @@ just compute the core identifier from the artefact itself, and check that it is the same as the core identifier part of the SHWID -* the core identifier of a software artifact can be computed *before* its archival on - Software Heritage +* the core identifier of a software artifact can be computed *before* its + archival on Software Heritage + Resolvers --------- -SWHIDs can be resolved using the Software Heritage Web application (see -:py:mod:`swh.web`). In particular, the **root endpoint** ``/`` can be given a -SWHID and will lead to the browsing page of the corresponding object, like -this: ``https://archive.softwareheritage.org/``. -A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to -explicitly request SWHID resolution; see: :http:get:`/api/1/resolve/(swh_id)/`. +Software Heritage resolver +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +SWHIDs can be resolved using the Software Heritage `Web interface +`_. In particular, the **root endpoint** +``/`` can be given a SWHID and will lead to the browsing page of the +corresponding object, like this: +``https://archive.softwareheritage.org/``. + +A **dedicated** ``/resolve`` **endpoint** of the Software Heritage `Web API +`_ is also available to +programmatically resolve SWHIDs; see: :http:get:`/api/1/resolve/(swh_id)/`. 
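
As a rough sketch of the two resolution paths just described, the following Python helpers build the corresponding URLs without performing any network request. The function names are hypothetical, and the sketch assumes that the ``/api/1/resolve/`` endpoint is served from the same ``archive.softwareheritage.org`` host as the root endpoint and that the SWHID passed in is already percent-escaped.

```python
# Base URL of the Software Heritage Web interface, as used in this document.
ARCHIVE_BASE = "https://archive.softwareheritage.org"


def browse_url(swhid: str) -> str:
    """URL for the root endpoint, which leads to the browsing page of the object."""
    return f"{ARCHIVE_BASE}/{swhid}"


def resolve_api_url(swhid: str) -> str:
    """URL for the dedicated resolve endpoint of the Web API.

    Assumption: the API lives under the same host as the Web interface.
    """
    return f"{ARCHIVE_BASE}/api/1/resolve/{swhid}/"


if __name__ == "__main__":
    swhid = "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
    print(browse_url(swhid))
    print(resolve_api_url(swhid))
```

Fetching either URL is left to the HTTP client of your choice.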
Examples: @@ -283,10 +343,11 @@ * ``_ * ``_ -External resolvers -~~~~~~~~~~~~~~~~~~ -The following **independent resolvers** support resolution of SWHIDs: +Third-party resolvers +~~~~~~~~~~~~~~~~~~~~~ + +The following **third party resolvers** support SWHID resolution: * `Identifiers.org `_; see: ``_ (registry identifier `MIR:00000655 @@ -294,6 +355,10 @@ * `Name-to-Thing (N2T) `_ +Note that resolution via Identifiers.org currently only supports *core +identifiers* due to `syntactic incompatibilities with qualifiers +`_. + Examples: * ``_ @@ -304,8 +369,6 @@ * ``_ * ``_ -Note that resolution via Identifiers.org currently only supports *core identifiers* due to `syntactic incompatibilities with qualifiers `_. - References ==========