Page MenuHomeSoftware Heritage

Extend SWH PID definition with additional context qualifiers.
ClosedPublic

Authored by rdicosmo on Mar 28 2020, 3:17 PM.

Diff Detail

Repository
rDMOD Data model
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D2924 (id=10362)

Rebasing onto 4a2233c5f7...

Current branch diff-target is up to date.
Changes applied before test
commit 0767c811c859a798a7cfaa4247c6b2b15cc7e8aa
Author: Roberto Di Cosmo <roberto@dicosmo.org>
Date:   Sat Mar 28 15:16:04 2020 +0100

    Extend SWH PID definition with additional context qualifiers.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/2/ for more details.

vlorentz added inline comments.
docs/persistent-identifiers.rst
138

I think it would be nice to mention that these qualifiers are extrinsic and only contextualize one occurence of the object, even if that's redundant with the formal definition

163–165
path-absolute = "/" [ segment-nz *( "/" segment ) ]

segment       = *pchar
segment-nz    = 1*pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

So the grammar is ambiguous because both ; and = are allowed in path_absolute.

A possible fix is to require them to be percent-encoded.

177–212

Object don't have to be ingested by SWH for their swh-id to be valid

178–179

I'm really confused by this. Is it the PID of a snapshot?

182–185

I'd change the punctuation to make this more readable:

* **path** : the *absolute file path* from the *root directory* associated to the *anchor node*, to the object.
  When the anchor denotes a directory or a revision, and almost always when it's a release,
  the root directory is uniquely determined.
  When the anchor denotes a snapshot, the root directory is considered to
  be the one associated to the main branch of that snapshot;

And you must define what the "main branch" is. You should also mention what to do if the anchor resolves to a snapshot doesn't have any branch that fits your definition of main branch.

rdicosmo marked 3 inline comments as done.
  • Clarify ambiguities in PID extensions

Build is green

Patch application report for D2924 (id=10368)

Rebasing onto 4a2233c5f7...

Current branch diff-target is up to date.
Changes applied before test
commit d14883e8d24efcf0c43e1037bb19179d3cbfb57b
Author: Roberto Di Cosmo <roberto@dicosmo.org>
Date:   Sat Mar 28 18:22:11 2020 +0100

    Clarify ambiguities in PID extensions

commit 0767c811c859a798a7cfaa4247c6b2b15cc7e8aa
Author: Roberto Di Cosmo <roberto@dicosmo.org>
Date:   Sat Mar 28 15:16:04 2020 +0100

    Extend SWH PID definition with additional context qualifiers.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/3/ for more details.

You could add this example of a path that need extra percent escapes:

swh:1:cnt:f10371aa7b8ccabca8479196d6cd640676fd4a04?origin=https://github.com/web-platform-tests/wpt;anchor=swh:1:rev:259d0612af038d14f2cd889a14a3adb6c9e96d96;path=/html/semantics/document-metadata/the-meta-element/pragma-directives/attr-meta-http-equiv-refresh/support/x%3Burl=foo/

(It's almost the only file/dir I could find in a real-life repository with a ;...)

Unrelatedly, could you add a (non-normative) recommendation that qualifiers should be in the same order as the spec (unlike the current example), so they are ordered from the most generic (origin) to the most specific (line)? I doesn't change anything to how they are parsed, but it makes them more readable imo

docs/persistent-identifiers.rst
163–165

So now this doesn't match the examples, as the definition requires all characters to be percent-encoded. You should redefine <path_absolute_encoded> as <path_absolute_escaped>, which is <path_absolute> where all ; must be percent-encoded (as %3B)

178–179

I think it's confusing to call it visit if it's a snapshot PID. If it's for the lack of visit identifiers, maybe we should fix that issue first.

182–185

I think "the one pointed to by the HEAD (possibly indirectly)" should be enough. HEAD might be a branch by itself

rdicosmo marked 5 inline comments as done.
  • Further clarifications in the PID extension

Build is green

Patch application report for D2924 (id=10379)

Rebasing onto 4a2233c5f7...

Current branch diff-target is up to date.
Changes applied before test
commit accca603c42ad68252532222ca6467a19691524e
Author: Roberto Di Cosmo <roberto@dicosmo.org>
Date:   Mon Mar 30 14:13:59 2020 +0200

    Typo

commit b6e92eae7a798dcb203ce786e984fd96a6f7ee07
Author: Roberto Di Cosmo <roberto@dicosmo.org>
Date:   Mon Mar 30 14:11:55 2020 +0200

    Further clarifications in the PID extension

commit d14883e8d24efcf0c43e1037bb19179d3cbfb57b
Author: Roberto Di Cosmo <roberto@dicosmo.org>
Date:   Sat Mar 28 18:22:11 2020 +0100

    Clarify ambiguities in PID extensions

commit 0767c811c859a798a7cfaa4247c6b2b15cc7e8aa
Author: Roberto Di Cosmo <roberto@dicosmo.org>
Date:   Sat Mar 28 15:16:04 2020 +0100

    Extend SWH PID definition with additional context qualifiers.

See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/13/ for more details.

This revision is now accepted and ready to land.Mar 30 2020, 4:18 PM