diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst index b6d8c81..8538b19 100644 --- a/docs/persistent-identifiers.rst +++ b/docs/persistent-identifiers.rst @@ -1,315 +1,323 @@ .. _persistent-identifiers: ================================================= SoftWare Heritage persistent IDentifiers (SWHIDs) ================================================= **version 1.3, last modified 2020-04-28** Overview ======== You can point to objects present in the Software Heritage archive by means of **SoftWare Heritage persistent IDentifiers**, or **SWHIDs** for short, which are guaranteed to remain stable (persistent) over time. Their syntax, meaning, and usage are described below. Note that they are identifiers and not URLs, even though URL-based resolvers for SWHIDs are also available. -A SWHID can point to any software artifact (or "object") available in the -Software Heritage archive. Objects come in different types: +A SWHID consists of two separate parts: a *core identifier* that can point to +any software artifact (or "object") available in the Software Heritage archive, +and an *optional list of qualifiers* that specify the context in which +the object is meant to be seen, or point to a subpart of the object itself. + +Objects come in different types: * contents * directories * revisions * releases * snapshots Each object is identified by an intrinsic, type-specific object identifier that is embedded in its SWHID as described below. The intrinsic identifiers embedded in SWHIDs are strong cryptographic hashes computed on the entire set of object properties. Together, these identifiers form a `Merkle structure `_, specifically a Merkle DAG. See the :ref:`Software Heritage data model ` for an overview of object types and how they are linked together. See :py:mod:`swh.model.identifiers` for details on how the intrinsic identifiers embedded in SWHIDs are computed. 
+The optional qualifiers are of two kinds: + +* *context qualifiers* carry information about the context where a given + object is meant to be seen; this is particularly important, as the same object + can be reached in the Merkle graph following different *paths* from different + nodes (or *anchors*), and it may have been retrieved from different *origins*, + which may evolve between different *visits*; +* *fragment qualifiers* pinpoint specific subparts of an object. + Syntax ------ Syntactically, SWHIDs are generated by the ```` entry point of the grammar: .. code-block:: bnf - ::= "swh" ":" ":" ":" ; + ::= [ ] ; + ::= "swh" ":" ":" ":" ; ::= "1" ; ::= "snp" (* snapshot *) | "rel" (* release *) | "rev" (* revision *) | "dir" (* directory *) | "cnt" (* content *) ; ::= 40 * ; (* intrinsic object id, as hex-encoded SHA1 *) - ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" + ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ; ::= | "a" | "b" | "c" | "d" | "e" | "f" ; + ::= [ ] ; + ::= + + | + ; + ::= + + | + | + | + ; + ::= ";" "origin" "=" ; + ::= ";" "visit" "=" ; + ::= ";" "anchor" "=" ; + ::= ";" "path" "=" ; + ::= ";" "lines" "=" ["-" ] ; + ::= + ; + ::= (* RFC 3986 compliant URLs, percent-escaped *) + ::= (* RFC 3986 compliant absolute file path, percent-escaped *) + +Where: + +- ```` is an ```` from `RFC 3987`_, and +- ```` is an `RFC 3987`_ IRI + +in either case all occurrences of ``;`` (and ``%``, as required by the RFC) +have been percent-encoded (as ``%3B`` and ``%25`` respectively). Other +characters *can* be percent-encoded, e.g., to improve readability and/or +embeddability of SWHIDs in other contexts. + +.. _RFC 3987: https://tools.ietf.org/html/rfc3987 Semantics --------- -``:`` is used as separator between the logical parts of SWHIDs. The ``swh`` +Core identifiers +~~~~~~~~~~~~~~~~ + +``:`` is used as a separator between the logical parts of core identifiers. 
The ``swh`` prefix makes explicit that these identifiers are related to *SoftWare Heritage*. ``1`` (````) is the current version of this identifier *scheme*; future editions will use higher version numbers, possibly breaking backward compatibility (but without breaking the resolvability of SWHIDs that conform to previous versions of the scheme). A SWHID points to a single object, whose type is explicitly captured by ````: * ``snp`` to **snapshots**, * ``rel`` to **releases**, * ``rev`` to **revisions**, * ``dir`` to **directories**, * ``cnt`` to **contents**. The actual object pointed to is identified by the intrinsic identifier ````, which is a hex-encoded (using lowercase ASCII characters) SHA1 computed on the content and metadata of the object itself, as follows: * for **snapshots**, intrinsic identifiers are computed as per :py:func:`swh.model.identifiers.snapshot_identifier` * for **releases**, as per :py:func:`swh.model.identifiers.release_identifier` + that produces the same result as a Git tag hash * for **revisions**, as per :py:func:`swh.model.identifiers.revision_identifier` + that produces the same result as a Git commit hash -* for **directories**, as per +* for **directories**, per :py:func:`swh.model.identifiers.directory_identifier` + that produces the same result as a Git tree hash -* for **contents**, the intrinsic identifier is the ``sha1_git`` hash of the - multiple hashes returned by +* for **contents**, the intrinsic identifier is the ``sha1_git`` hash returned by :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte sequence obtained by juxtaposing the ASCII string ``"blob"`` (without quotes), a space, the length of the content as decimal digits, a NUL byte, and the actual content of the file. +Qualifiers +~~~~~~~~~~ + +``;`` is used as a separator between the core identifier and the optional +qualifiers, as well as between qualifiers. Each qualifier is specified as a +key/value pair, using ``=`` as a separator. 
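The separator rules above can be illustrated with a minimal Python sketch that splits a SWHID into its core identifier and its qualifiers. The function name ``parse_swhid`` and the layout of the returned dictionary are hypothetical and are not part of ``swh.model``. The sketch assumes the identifier is given on a single line without whitespace, and it only percent-decodes the ``path`` value, leaving ``origin`` untouched so that escapes that belong to the URL itself are not altered.

```python
import re
from urllib.parse import unquote

# Core identifier: "swh", scheme version 1, object type, 40 lowercase hex digits.
_CORE_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})")
# Qualifier keys named in this document.
_QUALIFIER_KEYS = {"origin", "visit", "anchor", "path", "lines"}


def parse_swhid(swhid):
    """Split a SWHID into core identifier parts and a dict of qualifiers."""
    # ";" separates the core identifier from the qualifiers, and the
    # qualifiers from each other; literal ";" in values is encoded as %3B.
    core, *raw_qualifiers = swhid.split(";")
    match = _CORE_RE.fullmatch(core)
    if match is None:
        raise ValueError(f"invalid core identifier: {core!r}")

    qualifiers = {}
    for item in raw_qualifiers:
        # Each qualifier is a key/value pair split at the first "=".
        key, sep, value = item.partition("=")
        if not sep or key not in _QUALIFIER_KEYS:
            raise ValueError(f"invalid qualifier: {item!r}")
        # Illustrative choice: decode percent-escapes only for paths.
        qualifiers[key] = unquote(value) if key == "path" else value

    return {
        "core": core,
        "object_type": match.group(1),
        "object_id": match.group(2),
        "qualifiers": qualifiers,
    }


if __name__ == "__main__":
    parsed = parse_swhid(
        "swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b"
        ";origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git"
        ";lines=9-15"
    )
    print(parsed["object_type"], parsed["qualifiers"])
```

Splitting on ``;`` before any decoding is safe only because the grammar requires every literal ``;`` inside a qualifier value to be percent-encoded.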
+ +The following *context qualifiers* are available: + +* **origin** : the *software origin* where an object has been found or observed + in the wild, as a URI; +* **visit** : the core identifier of a *snapshot* corresponding to a specific + *visit* of a repository containing the designated object; +* **anchor** : a *designated node* in the Merkle DAG relative to which a *path + to the object* is specified, as the core identifier of a directory, a + revision, a release or a snapshot; +* **path** : the *absolute file path*, from the *root directory* associated with + the *anchor node*, to the object; when the anchor denotes a directory or a + revision, and almost always when it's a release, the root directory is + uniquely determined; when the anchor denotes a snapshot, the root directory + is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if + such a reference is missing. + +The following *fragment qualifier* is available: + +* **lines** : *line number(s)* of interest, usually within a content object + +We recommend equipping identifiers meant to be shared with as many qualifiers as +possible. While qualifiers may be listed in any order, it is good practice to +present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``, +``path``, ``lines``. Redundant information should be omitted: for example, if +the *visit* is present, and the *path* is relative to the snapshot indicated +there, then the *anchor* qualifier is superfluous; similarly, if the *path* is +empty, it may be omitted. Git compatibility ~~~~~~~~~~~~~~~~~ SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the `Git `_ way of `computing identifiers `_ for its objects. -A SWHID for a content object will correspond (in its ```` part) to a -Git blob identifier of any file with the same content; a SWHID for a revision -will correspond to the Git commit identifier for the same revision, etc. 
This -is not the case for snapshot identifiers, as Git does not have a corresponding -object type. +The ```` part of a SWHID for a content object is the Git blob +identifier of any file with the same content; for a revision it is the Git +commit identifier for the same revision, etc. This is not the case for snapshot +identifiers, as Git does not have a corresponding object type. Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git). Examples -------- +Core identifiers +~~~~~~~~~~~~~~~~ + * ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content of a file containing the full text of the GPL3 license * ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory containing the source code of the Darktable photography application as it was at some point on 4 May 2017 * ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in the development history of Darktable, dated 16 January 2017, that added undo/redo supports for masks * ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable release 2.3.0, dated 24 December 2016 * ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub +Identifiers with qualifiers +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Contextual information -====================== - -The SWHIDs as described above are *intrinsic identifiers*, as they are computed -from the designated object itself, and it is often useful to provide -*contextual information* about a particular occurrence of the object, like the -origin from where the object has been found. To this end, SWHIDs can be -coupled with **qualifiers** that capture such *contextual information*. 
-Qualifiers come in different kinds: - -* origin -* visit -* anchor -* path -* lines - - -Syntax ------- - -The full-syntax to complement SWHIDs with contextual information is given by -the ```` entry point of the grammar: - -.. code-block:: bnf - - ::= [ ] - := [ ] - ::= | | | | - ::= ";" "origin" "=" - ::= ";" "visit" "=" - ::= ";" "anchor" "=" - ::= ";" "path" "=" - ::= ";" "lines" "=" ["-" ] - ::= + - ::= (* RFC 3986 compliant URLs, percent-escaped *) - ::= (* RFC 3986 compliant absolute file path, percent-escaped *) - -Where: - -- ```` is an ```` from `RFC 3987`_, and -- ```` is a `RFC 3987`_ IRI - -in either case all occurrences of ``;`` (and ``%``, as required by the RFC) -have been percent-encoded (as ``%3B`` and ``%25`` respectively). Other -characters *can* be percent-encoded, e.g., to improve readability and/or -embeddability of SWHID in other contexts. - -.. _RFC 3987: https://tools.ietf.org/html/rfc3987 - - -Semantics ---------- - -``;`` is used as separator between SWHIDs and the optional contextual -information qualifiers. Each contextual information qualifier is specified as a -key/value pair, using ``=`` as a separator. 
- -The following piece of contextual information are supported: - -* **origin** : the *software origin* where an object has been found or observed - in the wild, as an URI; -* **visit** : persistent identifier of a *snapshot* corresponding to a specific - *visit* of a repository containing the designated object; -* **anchor** : a *designated node* in the Merkle DAG relative to which a *path - to the object* is specified, as a persistent identifier of a directory, a - revision, a release or a snapshot; -* **path** : the *absolute file path*, from the *root directory* associated to - the *anchor node*, to the object; when the anchor denotes a directory or a - revision, and almost always when it's a release, the root directory is - uniquely determined; when the anchor denotes a snapshot, the root directory - is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if - such a reference is missing; -* **lines** : *line number(s)* of interest, usually within a content object - -We recommend to equip identifiers meant to be shared with as many qualifiers as -possible. While qualifiers may be listed in any order, it is good practice to -present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``, -``path``, ``lines``. Redundant information should be omitted: for example, if -the *visit* is present, and the *path* is relative to the snapshot indicated -there, then the *anchor* qualifier is superfluous. - - -Example -------- - -The following `fully qualified SWHID -`_ -denotes the lines 9 to 15 of a file content that can be found at absolute path -``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision -``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the -snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the -origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. 
+* The following `fully qualified SWHID `_ denotes the lines 9 to 15 of a file content that can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git`` .. code-block:: url swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b; origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git; visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9; anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0; path=/Examples/SimpleFarm/simplefarm.ml; lines=9-15 -And this is an example of `a fully qualified SWHID with a percent escaped file -path -`_ +* This is an example of `a fully qualified SWHID with a percent escaped file path `_ .. code-block:: url swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04; origin=https://github.com/web-platform-tests/wpt; visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499; anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96; path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/ -Resolution -========== +Computing and resolving SWHIDs +============================== + +An important property of SWHIDs is that a core identifier is *intrinsic*: it can +be *computed from the object itself* using the `swh-identify `_ utility, or equivalently using standard git tools. 
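To make the "computed from the object itself" property concrete, here is a minimal Python sketch for content objects only. It follows the ``sha1_git`` recipe given earlier in this document (``"blob"``, a space, the decimal length, a NUL byte, then the content). The helper names ``content_swhid`` and ``matches_content`` are hypothetical; they are not the ``swh-identify`` implementation, and other object types (directories, revisions, releases, snapshots) need their own serialization, which is not covered here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the core SWHID of a content object given its raw bytes."""
    # Git-style blob header: "blob", a space, the length in decimal, a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def matches_content(swhid: str, data: bytes) -> bool:
    """Check that data is the content designated by the given SWHID."""
    # Only the core identifier (the part before the first ";") is intrinsic;
    # qualifiers carry context and are ignored for verification.
    core = swhid.split(";", 1)[0]
    return core == content_swhid(data)


if __name__ == "__main__":
    # The empty content has the same identifier as Git's empty blob.
    print(content_swhid(b""))
```

Because the computation needs nothing but the bytes of the file, the same function can be used both to check an artifact retrieved from the archive and to obtain the identifier of a file that has not been archived yet.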
+ +This has various practical implications: +* when a software artifact is obtained from Software Heritage by resolving a + SWHID, it is straightforward to verify that it is exactly the intended one: + just compute the core identifier from the artifact itself, and check that it + is the same as the core identifier part of the SWHID; -Dedicated resolvers -------------------- +* the core identifier of a software artifact can be computed *before* its archival on + Software Heritage. + +Resolvers +--------- SWHIDs can be resolved using the Software Heritage Web application (see :py:mod:`swh.web`). In particular, the **root endpoint** ``/`` can be given a SWHID and will lead to the browsing page of the corresponding object, like this: ``https://archive.softwareheritage.org/``. A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to explicitly request SWHID resolution; see: :http:get:`/api/1/resolve/(swh_id)/`. Examples: * ``_ * ``_ * ``_ * ``_ * ``_ - +* ``_ +* ``_ External resolvers ------------------- +~~~~~~~~~~~~~~~~~~ The following **independent resolvers** support resolution of SWHIDs: * `Identifiers.org `_; see: ``_ (registry identifier `MIR:00000655 `_). * `Name-to-Thing (N2T) `_ Examples: * ``_ * ``_ * ``_ * ``_ * ``_ +* ``_ +* ``_ -Note that resolution via Identifiers.org does not support contextual -information, due to `syntactic incompatibilities -`_. +Note that resolution via Identifiers.org currently only supports *core identifiers* due to `syntactic incompatibilities with qualifiers `_. References ========== * Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Identifiers for Digital Objects: the Case of Software Source Code Preservation `_. In Proceedings of `iPRES 2018 `_: 15th International Conference on Digital Preservation, Boston, MA, USA, September 2018, 9 pages. * Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Referencing Source Code Artifacts: a Separate Concern in Software Citation `_. 
In Computing in Science and Engineering, volume 22, issue 2, pages 33-43. ISSN 1521-9615, IEEE. March 2020.