diff --git a/docs/persistent-identifiers.rst b/docs/persistent-identifiers.rst --- a/docs/persistent-identifiers.rst +++ b/docs/persistent-identifiers.rst @@ -135,15 +135,16 @@ Contextual information ====================== -It is often useful to complement persistent identifiers with **contextual -information** about where the identified object has been found as well as which -specific parts of it are of interest. To that end it is possible, via a -dedicated syntax, to extend persistent identifiers with the following pieces of -information: - -* the **software origin** where an object has been found/observed -* the **line number(s)** of interest, usually within a content object +The Software Heritage persistent identifiers described above are *intrinsic identifiers*, as they are computed from the designated object itself. It is often useful to provide *contextual information* about a particular +occurrence of the object, such as the origin where the object was found. +To this end, persistent identifiers can be equipped with **qualifiers** that +contain this *contextual information*. Qualifiers come in different kinds: +* origin +* visit +* anchor +* path +* lines Syntax ------ @@ -153,31 +154,77 @@ .. code-block:: bnf - ::= [] [] - ::= ";" "lines" "=" ["-" ] + ::= [ ] + ::= [ ] + ::= | | | | ::= ";" "origin" "=" + ::= ";" "visit" "=" + ::= ";" "anchor" "=" + ::= ";" "path" "=" + ::= ";" "lines" "=" ["-" ] ::= + ::= (* RFC 3986 compliant URLs *) + ::= (* RFC 3986 compliant absolute file path, percent-escaped *) +Here ```` is the ```` in `Section 3.3 of RFC 3986 `_ where all occurrences of ``;`` and ``%`` must be percent-encoded (as ``%3B`` and ``%25`` respectively). Semantics --------- -``;`` is used as separator between persistent identifiers and additional -optional contextual information. Each piece of contextual information is +``;`` is used as a separator between persistent identifiers and the +optional contextual information qualifiers.
Each contextual information qualifier is specified as a key/value pair, using ``=`` as a separator. The following piece of contextual information are supported: -* line numbers: it is possible to specify a single line number or a line range, - separating two numbers with ``-``. Note that line numbers are purely - indicative and are not meant to be stable, as in some degenerate cases - (e.g., text files which mix different types of line terminators) it is - impossible to resolve them unambiguously. - -* software origin: where a given object has been found or observed in the wild, - as the URI that was used by Software Heritage to ingest the object into the - archive +* **origin**: the *software origin* where an object has been found or observed in the wild, + as a URI; +* **visit**: persistent identifier of a *snapshot* corresponding to a specific *visit* of a repository containing the designated object; +* **anchor**: a *designated node* in the Merkle DAG relative to which a *path to the object* is specified, + as a persistent identifier of a directory, a revision, a release or a snapshot; +* **path**: the *absolute file path*, from the *root directory* associated with the *anchor node*, to the object; + when the anchor denotes a directory or a revision, and almost always when it's a release, + the root directory is uniquely determined; when the anchor denotes a snapshot, the root + directory is the one pointed to by ``HEAD`` (possibly indirectly), + and undefined if such a reference is missing; +* **lines**: *line number(s)* of interest, usually within a content object + +We recommend equipping identifiers meant to be shared with as many qualifiers as +possible. While qualifiers may be listed in any order, it is good practice +to present them in the order given above, i.e. ``origin``, ``visit``, ``anchor``, ``path``, ``lines``.
+Redundant information should be omitted: for example, if the *visit* +is present, and the *path* is relative to the snapshot indicated there, then the +*anchor* qualifier is superfluous. + +Example +------- + +The following `fully qualified identifier `_ +denotes lines 9 to 15 of a file content that +can be found at absolute path ``/Examples/SimpleFarm/simplefarm.ml`` from the root directory +of the revision ``swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0`` that is contained +in the snapshot ``swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9`` taken from +the origin ``https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git``. + +.. code-block:: url + + swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b; + origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git; + visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9; + anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0; + path=/Examples/SimpleFarm/simplefarm.ml; + lines=9-15 + + +And this is an example of `a fully qualified identifier with a percent-escaped file path `_: + +.. code-block:: url + + swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04; + origin=https://github.com/web-platform-tests/wpt; + visit=swh:1:snp:b37d435721bbd450624165f334724e3585346499; + anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96; + path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/ Resolution