
Persistent identifiers (PIDs): add a way to describe Merkle DAG paths
Closed, Resolved, Public

Description

[updated on the basis of T1241#42722 below]

The goal of this task is to define the canonical way of describing paths in the SWH Merkle DAG.
Formally, such a path describes how to go from a given node in the Merkle DAG, which we call the anchor, to another node, the endpoint, by following the edges of the DAG.

We observe that when the anchor denotes a revision (and most often when it's a release), it's trivial to find in the DAG the root directory of the source code, and we only need the file path to identify the content we are interested in. When it's a snapshot, there is a default root directory to point to.

Hence we have concluded that for the vast majority of use cases it is enough to
extend the syntax and semantics of our SWH-IDs with the following optional
elements:

  • anchor : the swh-id of the anchor node in the DAG: this can be a snapshot, a release, a revision, or a directory
  • path : the full path from the root directory of the anchor to the endpoint object, which can be a directory or a file content
  • visit : the swh-id of the snapshot in whose context the anchor must be shown

Here is a full example:

swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  lines=12-23
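
To illustrate the syntax above, here is a minimal Python sketch that splits such an identifier into its core swh-id and its optional elements. The names parse_qualified_swhid and QualifiedSwhid are hypothetical (they are not taken from swh-model), and the sketch assumes that values never contain a literal ';', since no escaping rule is defined in this task.

```python
from typing import Dict, NamedTuple


class QualifiedSwhid(NamedTuple):
    core: str                   # e.g. "swh:1:cnt:<40 hex digits>"
    qualifiers: Dict[str, str]  # e.g. {"path": "/a/b.ml", "lines": "12-23"}


def parse_qualified_swhid(text: str) -> QualifiedSwhid:
    """Split a qualified identifier into its core id and its qualifiers."""
    # The example is wrapped over several lines for readability, so strip
    # whitespace around each ';'-separated part and ignore empty parts.
    parts = [p.strip() for p in text.split(";")]
    parts = [p for p in parts if p]
    if not parts:
        raise ValueError("empty identifier")
    core = parts[0]
    qualifiers: Dict[str, str] = {}
    for part in parts[1:]:
        # Split on the first '=' only: values such as URLs may contain '='.
        key, sep, value = part.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed qualifier: {part!r}")
        qualifiers[key] = value
    return QualifiedSwhid(core, qualifiers)


EXAMPLE = """
swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  lines=12-23
"""

parsed = parse_qualified_swhid(EXAMPLE)
```

With this sketch, parsed.core holds the cnt identifier, and parsed.qualifiers maps anchor, path, visit, origin and lines to their values.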

We checked with @anlambert that all the pieces of information needed to generate such optional elements for the swh-ids are already available in the WebApp view, so it will be straightforward to provide end users with this kind of link.

On the receiving end, we also have all the information needed to check that the object and its context match:

  • for the path part, we just need to follow the path from the anchor and check that the endpoint has the declared swh-id
  • for the visit part, we might (if we want) make a request to the swh graph to check that the anchor node is indeed in the subgraph rooted at the given snapshot

Event Timeline

rdicosmo created this task.

This is a spin-off of the discussion started in T1098

zack renamed this task from Describing paths in the Merkle DAG to Persistent identifiers (PIDs): add a way to describe Merkle DAG paths. Apr 13 2019, 4:47 PM

For file paths, it would be nice to also support steps that use ordinary file/directory names (e.g., foo/bar/baz), as a more readable alternative to number-based steps.

A related (and very popular) use case is the need to reference a file content in the archive (by hash) while also specifying its filename, so that it can be downloaded with a meaningful default file name when the browser offers to save it.

rdicosmo changed the task status from Open to Work in Progress. Edited Mar 23 2020, 3:56 PM

As part of the discussion about the revamped UX, we have simplified the proposal for describing paths in the Merkle DAG. When the anchor denotes a revision (and most often when it's a release), it's trivial to find in the DAG the root directory of the source code, and we only need the file path to identify the content we are interested in. When it's a snapshot, there is a default root directory to point to.

Hence it is enough to extend the syntax and semantics of our SWH-IDs with the following optional elements:

  • anchor : the swh-id of the anchor node in the DAG: this can be a snapshot, a release, a revision, or a directory
  • path : the full path from the root directory of the anchor to the endpoint object, which can be a directory or a file content
  • visit : the swh-id of the snapshot in whose context the anchor must be shown

Here is a full example:

swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  lines=12-23

We checked with @anlambert that all the pieces of information needed to generate such optional elements for the swh-ids are already available in the WebApp view, so it will be straightforward to provide end users with this kind of link.

On the receiving end, we also have all the information needed to check that the object and its context match:

  • for the path part, we just need to follow the path from the anchor and check that the endpoint has the declared swh-id
  • for the visit part, we might (if we want) make a request to the swh graph to check that the anchor node is indeed in the subgraph rooted at the given snapshot

rdicosmo raised the priority of this task from Low to Normal. Mar 23 2020, 4:00 PM

LGTM in general.

A couple of questions/nitpicks follow.

First, about anchor:

  • in our data model, releases do not necessarily point to revisions (or other releases); they can point to any arbitrary object. So when the anchor is a release object, we cannot /always/ find a (root) directory by peeling it.
  • conversely, most (recent) snapshots in the archive have a HEAD symbolic ref that points to the "default branch" of that snapshot.

Taken together, these two aspects make me think that we are either not strict enough (we should also exclude releases, if we want to be sure of having a root dir) or not liberal enough in what type of objects are allowed in anchor.

I'm tempted to say that we should apply Occam's Razor and allow any object type in anchor. In the case of snapshots, following the HEAD branch (if it exists) will be no less of a heuristic than following the directory pointed to by the release (if the release points to one).
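
As a rough illustration of this heuristic, the Python sketch below peels an anchor of any type down to a root directory when one exists. The object store, the shortened ids and the ("alias", name) encoding of the symbolic HEAD ref are assumptions made up for this example; they do not reflect the actual swh-model API.

```python
from typing import Dict, Optional, Tuple, Union

# Toy object store (hypothetical ids): what each release/revision points to.
TARGETS: Dict[str, str] = {
    "swh:1:rel:v1": "swh:1:rev:r1",    # release pointing to a revision
    "swh:1:rel:blob": "swh:1:cnt:c1",  # release pointing to a file content
    "swh:1:rev:r1": "swh:1:dir:root",  # revision pointing to its root dir
}
# Snapshot branches: name -> target id, or ("alias", other branch name).
Branch = Union[str, Tuple[str, str]]
SNAPSHOTS: Dict[str, Dict[str, Branch]] = {
    "swh:1:snp:s1": {
        "HEAD": ("alias", "refs/heads/master"),
        "refs/heads/master": "swh:1:rev:r1",
    },
    "swh:1:snp:nohead": {"refs/heads/dev": "swh:1:rev:r1"},
}


def object_type(swhid: str) -> str:
    """Return the type tag of an id, e.g. 'rev' for 'swh:1:rev:...'."""
    return swhid.split(":")[2]


def root_directory(anchor: str) -> Optional[str]:
    """Peel `anchor` until a directory is found; None if there is none."""
    current: Optional[str] = anchor
    seen = set()
    while current is not None and current not in seen:
        seen.add(current)
        kind = object_type(current)
        if kind == "dir":
            return current
        if kind == "cnt":
            return None  # a file content has no root directory
        if kind == "snp":
            branches = SNAPSHOTS.get(current, {})
            target = branches.get("HEAD")
            hops = 0
            # Follow alias branches, bounded to avoid looping forever.
            while isinstance(target, tuple) and hops < len(branches):
                target = branches.get(target[1])
                hops += 1
            current = target if isinstance(target, str) else None
        else:
            current = TARGETS.get(current)  # release or revision
    return None
```

In this toy store, a release pointing to a revision and a snapshot with a HEAD alias both peel to swh:1:dir:root, whereas a release pointing to a content, or a snapshot without HEAD, yields no root directory.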

Second, about snp, I wonder if we can find a more general name, as that seems to be an implementation detail for a naming scheme that is supposed to be readable.

(removed the last point, the hierarchy thing is in fact not relevant here, as we're pointing upward, not downward)

About the anchor point: no objection to also having snapshot as a possible anchor in the schema.

Also fully agree we need something better than "snp"; we had thought of "visit", what about that?

Updated the proposal to use visit instead of snp.

@rdicosmo: the current version of the full example above LGTM (the surrounding text is inconsistent, e.g., it still mentions "snp" as a key and forbids snapshot anchors, but I suspect it's just that you didn't bother editing everything. Hence, we're good! :-))

@zack thanks for spotting the missing pieces... now fixed in the description, we're ready to go! :-)
Would you take care of extending the definition in https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html ?

Just a question about using a path with a different branch, for example for a tag of a version (which is not a release):

  • in this case, the anchor is the snp and the branch name (the tag) is in the path?

Here is a link as an example:
https://archive.softwareheritage.org/browse/origin/https://gitlab.inria.fr/cado-nfs/cado-nfs/directory/?branch=refs/tags/2.3.0

swh:1:dir:da1f541c4b85fc216fbe1ca512cbd8718f8356cb;
anchor=swh:1:snp:c34b5ed5c5e737e4d2f8d6b5bd887fae92af0362;
path=refs/tags/2.3.0;
visit=swh:1:snp:c34b5ed5c5e737e4d2f8d6b5bd887fae92af0362;
origin=https://gitlab.inria.fr/cado-nfs/cado-nfs;

anlambert reopened this task as Work in Progress. Edited Apr 17 2020, 9:59 AM

@moranegg, for the branch case, the anchor will be the revision that the branch points to. For your example, it will be:

swh:1:dir:da1f541c4b85fc216fbe1ca512cbd8718f8356cb;
anchor=swh:1:rev:2acb184f417f8629946a0ca2db36dbdbd1741d11;
path=/;
visit=swh:1:snp:c34b5ed5c5e737e4d2f8d6b5bd887fae92af0362;
origin=https://gitlab.inria.fr/cado-nfs/cado-nfs;

It is then easy to recover the branch from the revision by looking for that revision among the branches of the snapshot.
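
A minimal Python sketch of that lookup, assuming the snapshot content is available as a mapping from branch names to target ids. The mapping below is illustrative: the refs/tags/2.3.0 entry follows the example above, while the refs/heads/master entry and its all-zero id are made up.

```python
from typing import Dict, List

# Illustrative snapshot content: branch name -> target swh-id.
SNAPSHOT_BRANCHES: Dict[str, str] = {
    # Made-up placeholder target, for the sake of having a second branch.
    "refs/heads/master": "swh:1:rev:" + "0" * 40,
    # Target taken from the anchor in the example above.
    "refs/tags/2.3.0": "swh:1:rev:2acb184f417f8629946a0ca2db36dbdbd1741d11",
}


def branches_pointing_to(anchor: str, branches: Dict[str, str]) -> List[str]:
    """Return the names of all branches whose target is `anchor`, sorted."""
    return sorted(name for name, target in branches.items() if target == anchor)


matching = branches_pointing_to(
    "swh:1:rev:2acb184f417f8629946a0ca2db36dbdbd1741d11", SNAPSHOT_BRANCHES
)
```

Several branches may point to the same revision, which is why the sketch returns a list rather than a single name.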

It seems Phabricator reopened the task automatically with my last comment, which was not intended.