diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst
index b1e7bfc..b01897b 100644
--- a/docs/persistent-identifiers.rst
+++ b/docs/persistent-identifiers.rst
@@ -1,274 +1,288 @@
 .. _persistent-identifiers:

 ======================
 Persistent identifiers
 ======================

 Description
 ===========

 You can point to objects present in the Software Heritage archive by means of
 **persistent identifiers** that are guaranteed to remain stable (persistent)
 over time. Their syntax, meaning, and usage are described below. Note that they
 are identifiers and not URLs, even though a URL-based resolver for Software
 Heritage persistent identifiers is also provided.

 A persistent identifier can point to any software artifact (or "object")
 available in the Software Heritage archive. Objects come in different types,
 most notably:

 * contents
 * directories
 * revisions
 * releases
 * snapshots

 Each object is identified by an intrinsic, type-specific object identifier that
 is embedded in its persistent identifier as described below. Object identifiers
 are strong cryptographic hashes computed on the entire set of object properties
 to form a `Merkle structure `_.

 See :ref:`data-model` for an overview of object types and how they are linked
 together. See :py:mod:`swh.model.identifiers` for details on how intrinsic
 object identifiers are computed.

 Syntax
 ------

 Syntactically, persistent identifiers are generated by the ``<identifier>``
 entry point of the grammar:

 .. code-block:: bnf

   <identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
   <scheme_version> ::= "1" ;
   <object_type> ::=
       "snp"  (* snapshot *)
     | "rel"  (* release *)
     | "rev"  (* revision *)
     | "dir"  (* directory *)
     | "cnt"  (* content *)
     ;
   <object_id> ::= 40 * <hex_digit> ;  (* intrinsic object id, as hex-encoded SHA1 *)
   <dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
   <hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;

 Semantics
 ---------

 ``:`` is used as separator between the logical parts of identifiers. The
 ``swh`` prefix makes explicit that these identifiers are related to *SoftWare
 Heritage*.
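The grammar above is small enough to check with a single regular expression. The following is a minimal sketch, not the ``swh.model`` implementation: the names ``CoreIdentifier`` and ``parse_core_identifier`` are hypothetical and introduced here only for illustration.

```python
import re
from typing import NamedTuple

# Hypothetical helper, not part of swh.model: it mirrors the grammar only.
# "swh", scheme version "1", one of five object types, 40 lowercase hex digits.
_CORE_RE = re.compile(
    r"swh:(?P<version>1):(?P<type>snp|rel|rev|dir|cnt):(?P<id>[0-9a-f]{40})"
)


class CoreIdentifier(NamedTuple):
    scheme_version: int
    object_type: str
    object_id: str


def parse_core_identifier(text: str) -> CoreIdentifier:
    """Split an unqualified persistent identifier into its three fields."""
    match = _CORE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid persistent identifier: {text!r}")
    return CoreIdentifier(int(match["version"]), match["type"], match["id"])
```

Because the grammar only allows lowercase hexadecimal digits, an identifier whose hash is written in uppercase is rejected by this sketch rather than normalized.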
 ``1`` (``<scheme_version>``) is the current version of this identifier *scheme*;
 future editions will use higher version numbers, possibly breaking backward
 compatibility (but without breaking the resolvability of identifiers that
 conform to previous versions of the scheme).

 A persistent identifier points to a single object, whose type is explicitly
 captured by ``<object_type>``:

 * ``snp`` to **snapshots**,
 * ``rel`` to **releases**,
 * ``rev`` to **revisions**,
 * ``dir`` to **directories**,
 * ``cnt`` to **contents**.

 The actual object pointed to is identified by the intrinsic identifier
 ``<object_id>``, which is a hex-encoded (using lowercase ASCII characters) SHA1
 computed on the content and metadata of the object itself, as follows:

 * for **snapshots**, intrinsic identifiers are computed as per
   :py:func:`swh.model.identifiers.snapshot_identifier`

 * for **releases**, as per
   :py:func:`swh.model.identifiers.release_identifier`

 * for **revisions**, as per
   :py:func:`swh.model.identifiers.revision_identifier`

 * for **directories**, as per
   :py:func:`swh.model.identifiers.directory_identifier`

 * for **contents**, the intrinsic identifier is the ``sha1_git`` hash of the
   multiple hashes returned by
   :py:func:`swh.model.identifiers.content_identifier`, i.e., the SHA1 of a byte
   sequence obtained by juxtaposing the ASCII string ``"blob"`` (without
   quotes), a space, the length of the content as decimal digits, a NULL byte,
   and the actual content of the file.

 Git compatibility
 ~~~~~~~~~~~~~~~~~

 Intrinsic object identifiers for contents, directories, revisions, and releases
 are, at present, compatible with the `Git `_ way of `computing identifiers `_
 for its objects. A Software Heritage content identifier will be identical to a
 Git blob identifier of any file with the same content, a Software Heritage
 revision identifier will be identical to the corresponding Git commit
 identifier, etc. This is not the case for snapshot identifiers as Git doesn't
 have a corresponding object type.
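The rule for contents can be reproduced with the standard library alone. This is a minimal sketch of the ``sha1_git`` computation as described above; the function name ``content_swhid`` is hypothetical and the code does not call ``swh.model``.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Build a `cnt` persistent identifier for raw file content.

    Hypothetical helper: hashes "blob", a space, the decimal length,
    a NULL byte, and then the content itself, as described in the text.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()  # lowercase hex
    return f"swh:1:cnt:{digest}"
```

For an empty file this yields ``swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391``, whose hash part is the identifier Git assigns to the empty blob.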
 Note that Git compatibility is incidental and is not guaranteed to be maintained
 in future versions of this scheme (or Git).

 Examples
 --------

 * ``swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2`` points to the content
   of a file containing the full text of the GPL3 license
 * ``swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505`` points to a directory
   containing the source code of the Darktable photography application as it
   was at some point on 4 May 2017
 * ``swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d`` points to a commit in
   the development history of Darktable, dated 16 January 2017, that added
   undo/redo support for masks
 * ``swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f`` points to Darktable
   release 2.3.0, dated 24 December 2016
 * ``swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453`` points to a snapshot
   of the entire Darktable Git repository taken on 4 May 2017 from GitHub

 Contextual information
 ======================

 The Software Heritage persistent identifiers described above are *intrinsic
 identifiers*, as they are computed from the designated object itself. It is
 often useful to also provide *contextual information* about a particular
 occurrence of the object, such as the origin where the object was found. To
 this end, persistent identifiers can be equipped with **qualifiers** that
 contain this *contextual information*. Qualifiers come in different kinds:

 * origin
 * visit
 * anchor
 * path
 * lines

 Syntax
 ------

 The full syntax to complement identifiers with contextual information is given
 by the ``<identifier_with_context>`` entry point of the grammar:

 ..
code-block:: bnf ::= [ ] := [ ] ::= | | | | ::= ";" "origin" "=" ::= ";" "visit" "=" ::= ";" "anchor" "=" - ::= ";" "path" "=" + ::= ";" "path" "=" ::= ";" "lines" "=" ["-" ] ::= + ::= (* RFC 3986 compliant URLs *) - ::= (* RFC 3986 compliant absolute file path, percent-encoded *) + ::= (* RFC 3986 compliant absolute file path, percent-encoded *) -Here ```` is a percent-encoded version of the ```` in `Section 3.3 of RFC 3986 `_ +Here ```` is the ```` in `Section 3.3 of RFC 3986 `_ where all occurrences of ``;`` and ``%`` must be percent-encoded (as `%3B` and `%25` respectively). Semantics --------- ``;`` is used as separator between persistent identifiers and the optional contextual information qualifiers. Each contextual information qualifier is specified as a key/value pair, using ``=`` as a separator. The following piece of contextual information are supported: * **origin** : the *software origin* where an object has been found or observed in the wild, as an URI; * **visit** : persistent identifier of a *snapshot* corresponding to a specific *visit* of a repository containing the designated object; * **anchor** : a *designated node* in the Merkle DAG relative to which a *path to the object* is specified, as a persistent identifier of a directory, a revision, a release or a snapshot; * **path** : the *absolute file path*, from the *root directory* associated to the *anchor node*, to the object; when the anchor denotes a directory or a revision, and almost always when it's a release, the root directory is uniquely determined; when the anchor denotes a snapshot, the root - directory is the one associated to the branch pointed to by the ``HEAD`` symbolic reference, + directory is the one pointed to by ``HEAD`` (possibly indirectly), and undefined if such a reference is missing; * **lines** : *line number(s)* of interest, usually within a content object We recommend to equip identifiers meant to be shared with as many qualifiers as -possible. 
Redundant information should be omitted: for example, if the *visit*
+possible. While qualifiers may be listed in any order, it is good practice
+to present them in the order given above, i.e. ``origin``, ``visit``, ``anchor``, ``path``, ``lines``.
+Redundant information should be omitted: for example, if the *visit*
 is present, and the *path* is relative to the snapshot indicated there, then the
 *anchor* qualifier is superfluous.

 Example
 -------

-The following `fully qualified identifier `_
+The following `fully qualified identifier `_
 denotes the lines 9 to 15 of a file content that can be found at absolute path
 ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory of the revision
 ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained in the
 snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from the
 origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``.

 .. code-block:: url

   swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
+    origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
+    visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
     anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
     path=/Examples/SimpleFarm/simplefarm.ml;
-    visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
-    origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
     lines=9-15
+
+And this is an example of `a fully qualified identifier with a percent escaped file path `_
+
+.. code-block:: url
+
+  swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04;
+    origin=https://github.com/web-platform-tests/wpt;
+    visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499;
+    anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;
+    path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/
+
+

 Resolution
 ==========

 Dedicated resolvers
 -------------------

 Persistent identifiers can be resolved using the Software Heritage Web
 application (see :py:mod:`swh.web`).
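Before a qualified identifier can be resolved, a client has to separate the core identifier from its qualifiers and undo the percent-encoding of the path. The following is a minimal sketch of that step, assuming the identifier is given on a single line without the line breaks used for display above. The name ``split_qualifiers`` is hypothetical, and decoding only the ``path`` value (leaving ``origin`` untouched) is an assumption of this sketch rather than documented resolver behaviour.

```python
from typing import Dict, Tuple
from urllib.parse import unquote

# Qualifier keys listed in this document.
KNOWN_QUALIFIERS = ("origin", "visit", "anchor", "path", "lines")


def split_qualifiers(qualified: str) -> Tuple[str, Dict[str, str]]:
    """Return (core identifier, {qualifier: value}) for a one-line identifier.

    Hypothetical helper: a literal ";" can only be a separator, because
    any ";" inside a path must be percent-encoded as "%3B".
    """
    core, *parts = qualified.split(";")
    qualifiers: Dict[str, str] = {}
    for part in parts:
        # Split on the first "=" only; values may contain further "=".
        key, sep, value = part.partition("=")
        if not sep or key not in KNOWN_QUALIFIERS:
            raise ValueError(f"unsupported qualifier: {part!r}")
        # Assumption: only the path is percent-decoded here.
        qualifiers[key] = unquote(value) if key == "path" else value
    return core, qualifiers
```

With the second example above, the ``path`` value ends in ``x%3Burl=foo/`` and is returned as ``x;url=foo/``: the encoded semicolon survives the split and is only decoded afterwards.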
 In particular, the **root endpoint** ``/`` can be given a persistent identifier
 and will lead to the browsing page of the corresponding object, like this:
 ``https://archive.softwareheritage.org/<identifier>``.

 A **dedicated** ``/resolve`` **endpoint** of the HTTP API is also available to
 explicitly request persistent identifier resolution; see:
 :http:get:`/api/1/resolve/(swh_id)/`.

 Examples:

 * ``_
 * ``_
 * ``_
 * ``_
 * ``_

 External resolvers
 ------------------

 The following **independent resolvers** support resolution of Software
 Heritage persistent identifiers:

 * `Identifiers.org `_; see: ``_ (registry identifier `MIR:00000655 `_).

 * `Name-to-Thing (N2T) `_

 Examples:

 * ``_
 * ``_
 * ``_
 * ``_
 * ``_

 Note that resolution via Identifiers.org does not support contextual
 information, due to `syntactic incompatibilities `_.

 References
 ==========

 * Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. `Identifiers for
   Digital Objects: the Case of Software Source Code Preservation `_. In
   Proceedings of `iPRES 2018 `_: 15th International Conference on Digital
   Preservation, Boston, MA, USA, September 2018, 9 pages.