diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst --- a/docs/persistent-identifiers.rst +++ b/docs/persistent-identifiers.rst @@ -16,8 +16,12 @@ and usage is described below. Note that they are identifiers and not URLs, even though URL-based resolvers for SWHIDs are also available. -A SWHID can point to any software artifact (or "object") available in the -Software Heritage archive. Objects come in different types: +A SWHID consists of two separate parts: a *core identifier* that can point to +any software artifact (or "object") available in the Software Heritage archive, +and an *optional list of qualifiers* that specifies the context in which +the object is meant to be seen, or points to a subpart of the object itself. + +Objects come in different types: * contents * directories @@ -36,6 +40,15 @@ :py:mod:`swh.model.identifiers` for details on how the intrinsic identifiers embedded in SWHIDs are computed. +The optional qualifiers are of two kinds: + +* *context qualifiers* carry information about the context in which a given + object is meant to be seen; this is particularly important, as the same object + can be reached in the Merkle graph following different *paths* from different + nodes (or *anchors*), and it may have been retrieved from different *origins* + that may evolve between different *visits*; +* *fragment qualifiers* pinpoint specific subparts of an object. + Syntax ------ @@ -45,7 +58,8 @@ ..
code-block:: bnf - ::= "swh" ":" ":" ":" ; + ::= [ ] ; + ::= "swh" ":" ":" ":" ; ::= "1" ; ::= "snp" (* snapshot *) @@ -55,14 +69,48 @@ | "cnt" (* content *) ; ::= 40 * ; (* intrinsic object id, as hex-encoded SHA1 *) - ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" + ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ; ::= | "a" | "b" | "c" | "d" | "e" | "f" ; + := [ ] ; + ::= + + | + ; + ::= + + | + | + | + ; + ::= ";" "origin" "=" ; + ::= ";" "visit" "=" ; + ::= ";" "anchor" "=" ; + ::= ";" "path" "=" ; + ::= ";" "lines" "=" ["-" ] ; + ::= + ; + ::= (* RFC 3986 compliant URLs, percent-escaped *) + ::= (* RFC 3986 compliant absolute file path, percent-escaped *) + +Where: + +- ```` is an ```` from `RFC 3987`_, and +- ```` is a `RFC 3987`_ IRI + +in either case all occurrences of ``;`` (and ``%``, as required by the RFC) +have been percent-encoded (as ``%3B`` and ``%25`` respectively). Other +characters *can* be percent-encoded, e.g., to improve readability and/or +embeddability of SWHID in other contexts. + +.. _RFC 3987: https://tools.ietf.org/html/rfc3987 Semantics --------- -``:`` is used as separator between the logical parts of SWHIDs. The ``swh`` +Core identifiers +~~~~~~~~~~~~~~~~ + +``:`` is used as separator between the logical parts of core identifiers. The ``swh`` prefix makes explicit that these identifiers are related to *SoftWare Heritage*. 
``1`` (````) is the current version of this identifier *scheme*; future editions will use higher version numbers, possibly @@ -87,20 +135,56 @@ * for **releases**, as per :py:func:`swh.model.identifiers.release_identifier` + that produces the same result as a git release hash * for **revisions**, as per :py:func:`swh.model.identifiers.revision_identifier` + that produces the same result as a git commit hash -* for **directories**, as per +* for **directories**, per :py:func:`swh.model.identifiers.directory_identifier` + that produces the same result as a git tree hash -* for **contents**, the intrinsic identifier is the ``sha1_git`` hash of the - multiple hashes returned by +* for **contents**, the intrinsic identifier is the ``sha1_git`` hash returned by :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte sequence obtained by juxtaposing the ASCII string ``"blob"`` (without quotes), a space, the length of the content as decimal digits, a NULL byte, and the actual content of the file. +Qualifiers +~~~~~~~~~~ + +``;`` is used as separator between the core identifier and the optional +qualifiers, and between consecutive qualifiers. Each qualifier is specified as a +key/value pair, using ``=`` as a separator.
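The separator rules above can be illustrated with a minimal Python sketch. It is only an illustration: ``parse_swhid`` is a hypothetical helper name (it is not part of ``swh.model``), it performs no grammar validation, and it assumes that percent-decoding qualifier values is what the caller wants.

```python
from urllib.parse import unquote


def parse_swhid(swhid):
    """Split a SWHID into its core identifier fields and a qualifier dict.

    Hypothetical helper for illustration only; no validation is performed.
    """
    # ";" separates the core identifier from the qualifiers,
    # and the qualifiers from one another.
    core, *raw_qualifiers = swhid.split(";")
    # ":" separates the logical parts of the core identifier.
    scheme, version, object_type, object_id = core.split(":")
    qualifiers = {}
    for item in raw_qualifiers:
        # "=" separates key and value; only the first "=" counts,
        # so values such as "x%3Burl=foo/" are kept whole.
        key, _, value = item.partition("=")
        qualifiers[key] = unquote(value)
    return (scheme, version, object_type, object_id), qualifiers


core, qualifiers = parse_swhid(
    "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"
    ";path=/support/x%3Burl=foo/;lines=9-15"
)
print(core)
print(qualifiers)
```

The identifier used here combines a core identifier from the examples in this document with made-up qualifier values; it is not claimed to resolve in the archive. Splitting on ``;`` before decoding is what makes the mandatory ``%3B`` escaping necessary.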
+ +The following *context qualifiers* are available: + +* **origin** : the *software origin* where an object has been found or observed + in the wild, as a URI; +* **visit** : the core identifier of a *snapshot* corresponding to a specific + *visit* of a repository containing the designated object; +* **anchor** : a *designated node* in the Merkle DAG relative to which a *path + to the object* is specified, as the core identifier of a directory, a + revision, a release or a snapshot; +* **path** : the *absolute file path*, from the *root directory* associated with + the *anchor node*, to the object; when the anchor denotes a directory or a + revision, and almost always when it's a release, the root directory is + uniquely determined; when the anchor denotes a snapshot, the root directory + is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if + such a reference is missing. + +The following *fragment qualifier* is available: + +* **lines** : *line number(s)* of interest, usually within a content object + +We recommend equipping identifiers meant to be shared with as many qualifiers as +possible. While qualifiers may be listed in any order, it is good practice to +present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``, +``path``, ``lines``. Redundant information should be omitted: for example, if +the *visit* is present, and the *path* is relative to the snapshot indicated +there, then the *anchor* qualifier is superfluous; similarly, if the *path* is +empty, it may be omitted. Git compatibility ~~~~~~~~~~~~~~~~~ @@ -108,11 +192,10 @@ SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the `Git `_ way of `computing identifiers `_ for its objects. -A SWHID for a content object will correspond (in its ```` part) to a -Git blob identifier of any file with the same content; a SWHID for a revision -will correspond to the Git commit identifier for the same revision, etc.
This -is not the case for snapshot identifiers, as Git does not have a corresponding -object type. +The ```` part of a SWHID for a content object is the Git blob +identifier of any file with the same content; for a revision it is the Git +commit identifier for the same revision, etc. This is not the case for snapshot +identifiers, as Git does not have a corresponding object type. Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git). @@ -121,6 +204,9 @@ Examples -------- +Core identifiers +~~~~~~~~~~~~~~~~ + * ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content of a file containing the full text of the GPL3 license * ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory @@ -134,99 +220,10 @@ * ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot of the entire Darktable Git repository taken on 4 May 2017 from GitHub +Identifiers with qualifiers +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Contextual information -====================== - -The SWHIDs as described above are *intrinsic identifiers*, as they are computed -from the designated object itself, and it is often useful to provide -*contextual information* about a particular occurrence of the object, like the -origin from where the object has been found. To this end, SWHIDs can be -coupled with **qualifiers** that capture such *contextual information*. -Qualifiers come in different kinds: - -* origin -* visit -* anchor -* path -* lines - - -Syntax ------- - -The full-syntax to complement SWHIDs with contextual information is given by -the ```` entry point of the grammar: - -.. 
code-block:: bnf - - ::= [ ] - := [ ] - ::= | | | | - ::= ";" "origin" "=" - ::= ";" "visit" "=" - ::= ";" "anchor" "=" - ::= ";" "path" "=" - ::= ";" "lines" "=" ["-" ] - ::= + - ::= (* RFC 3986 compliant URLs, percent-escaped *) - ::= (* RFC 3986 compliant absolute file path, percent-escaped *) - -Where: - -- ```` is an ```` from `RFC 3987`_, and -- ```` is a `RFC 3987`_ IRI - -in either case all occurrences of ``;`` (and ``%``, as required by the RFC) -have been percent-encoded (as ``%3B`` and ``%25`` respectively). Other -characters *can* be percent-encoded, e.g., to improve readability and/or -embeddability of SWHID in other contexts. - -.. _RFC 3987: https://tools.ietf.org/html/rfc3987 - - -Semantics ---------- - -``;`` is used as separator between SWHIDs and the optional contextual -information qualifiers. Each contextual information qualifier is specified as a -key/value pair, using ``=`` as a separator. - -The following piece of contextual information are supported: - -* **origin** : the *software origin* where an object has been found or observed - in the wild, as an URI; -* **visit** : persistent identifier of a *snapshot* corresponding to a specific - *visit* of a repository containing the designated object; -* **anchor** : a *designated node* in the Merkle DAG relative to which a *path - to the object* is specified, as a persistent identifier of a directory, a - revision, a release or a snapshot; -* **path** : the *absolute file path*, from the *root directory* associated to - the *anchor node*, to the object; when the anchor denotes a directory or a - revision, and almost always when it's a release, the root directory is - uniquely determined; when the anchor denotes a snapshot, the root directory - is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if - such a reference is missing; -* **lines** : *line number(s)* of interest, usually within a content object - -We recommend to equip identifiers meant to be shared with as many 
qualifiers as -possible. While qualifiers may be listed in any order, it is good practice to -present them in the order given above, i.e., ``origin``, ``visit``, ``anchor``, -``path``, ``lines``. Redundant information should be omitted: for example, if -the *visit* is present, and the *path* is relative to the snapshot indicated -there, then the *anchor* qualifier is superfluous. - - -Example -------- - -The following `fully qualified SWHID -`_ -denotes the lines 9 to 15 of a file content that can be found at absolute path -``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision -``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the -snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the -origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. +* The following `fully qualified SWHID `_ denotes the lines 9 to 15 of a file content that can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git`` .. code-block:: url @@ -238,9 +235,7 @@ lines=9-15 -And this is an example of `a fully qualified SWHID with a percent escaped file -path -`_ +* This is an example of `a fully qualified SWHID with a percent escaped file path `_ .. code-block:: url @@ -251,12 +246,24 @@ path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/ -Resolution -========== +Computing and resolving SWHIDs +============================== + +An important property of SWHIDs is that a core identifier is *intrinsic*: it can +be *computed from the object itself* using the `swh-identify `_ utility, or equivalently using standard git tools. 
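For content objects, the computation can be sketched in a few lines of Python, following the byte layout given in the Semantics section (the ASCII string ``blob``, a space, the length in decimal digits, a NULL byte, then the content). This is a minimal sketch under that description: ``content_swhid`` is a hypothetical name, and ``swh-identify`` and :py:mod:`swh.model.identifiers` remain the reference implementations.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the core identifier of a content object (illustrative sketch)."""
    # Git-style blob header: "blob", a space, the decimal length, a NULL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    # The intrinsic identifier is the hex-encoded SHA1 of header + content.
    object_id = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + object_id


# For empty input the object id is Git's well-known empty-blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Directories, revisions, and releases need their own serialization before hashing, so this sketch covers contents only.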
+ +This has various practical implications: +* when a software artifact is obtained from Software Heritage by resolving a + SWHID, it is straightforward to verify that it is exactly the intended one: + just compute the core identifier from the artifact itself, and check that it + is the same as the core identifier part of the SWHID -Dedicated resolvers -------------------- +* the core identifier of a software artifact can be computed *before* its archival on + Software Heritage + +Resolvers +--------- SWHIDs can be resolved using the Software Heritage Web application (see :py:mod:`swh.web`). In particular, the **root endpoint** ``/`` can be given a @@ -273,10 +280,11 @@ * ``_ * ``_ * ``_ - +* ``_ +* ``_ External resolvers ------------------- +~~~~~~~~~~~~~~~~~~ The following **independent resolvers** support resolution of SWHIDs: @@ -293,10 +301,10 @@ * ``_ * ``_ * ``_ +* ``_ +* ``_ -Note that resolution via Identifiers.org does not support contextual -information, due to `syntactic incompatibilities -`_. +Note that resolution via Identifiers.org currently only supports *core identifiers* due to `syntactic incompatibilities with qualifiers `_. References