Persistent identifiers (PIDs): add a way to describe Merkle DAG paths
Closed, Migrated

Authored By

	rdicosmo
	Oct 4 2018, 12:27 PM

Description

[updated on the basis of T1241#42722 below]

The goal of this task is to define the canonical way of describing paths in the SWH Merkle DAG.
Formally, such a path describes how one goes from a given node in the Merkle DAG, which we call the anchor, to another node, the endpoint, by following the edges of the DAG.

We observe that when the anchor denotes a revision (and most often when it's a release), it's trivial to find in the DAG the root directory of the source code, and we only need the file path to identify the content we are interested in. When it's a snapshot, there is a default root directory to point to.

Hence we have concluded that for the vast majority of use cases it is enough to
extend the syntax and semantics of our SWH-IDs with the following optional
elements:

anchor : the swh-id of the anchor node in the DAG: this can be a snapshot, a release, a revision, or a directory
path : the full path from the root directory of the anchor to the endpoint object, that can be a directory or a file content
visit : the swh-id of the snapshot in whose context the anchor must be shown

Here is a full example:

swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  lines=12-23
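
As an illustration of the syntax above, here is a minimal Python sketch that splits such an identifier into its core part and its optional elements. The names parse_qualified_swhid and QualifiedSWHID are hypothetical and not taken from any SWH library, and the sketch assumes that values contain no literal ";".

```python
import re
from typing import Dict, NamedTuple

# Core identifier: scheme, version 1, object type, 40 hex digits.
CORE_RE = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")


class QualifiedSWHID(NamedTuple):
    core: str
    object_type: str
    object_id: str
    qualifiers: Dict[str, str]


def parse_qualified_swhid(text: str) -> QualifiedSWHID:
    """Split 'core;key=value;...' into its parts (hypothetical helper)."""
    # Strip the whitespace and line breaks used only for layout.
    parts = [p.strip() for p in text.split(";") if p.strip()]
    if not parts:
        raise ValueError("empty identifier")
    match = CORE_RE.match(parts[0])
    if match is None:
        raise ValueError(f"invalid core identifier: {parts[0]!r}")
    qualifiers: Dict[str, str] = {}
    for part in parts[1:]:
        # Split on the first '=' only, so URLs with '=' stay intact.
        key, sep, value = part.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed element: {part!r}")
        qualifiers[key] = value
    return QualifiedSWHID(parts[0], match.group(1), match.group(2), qualifiers)


EXAMPLE = """swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  lines=12-23"""

parsed = parse_qualified_swhid(EXAMPLE)
print(parsed.object_type, parsed.qualifiers["path"])
```

Keeping the optional elements in a plain dictionary means unknown keys are preserved rather than rejected, which seemed the safer choice while the set of elements was still under discussion.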

We checked with @anlambert that all the pieces of information needed to generate these optional elements for the swh-ids are already available in the WebApp view, so it will be straightforward to provide end users with this kind of link.

On the receiving end, we also have all the information needed to check that the object and its context match:

for the path part, we just need to follow the path from the anchor and check that the endpoint has the declared swh-id
for the visit part, we might (if we want) make a request to the swh graph to check that the anchor node is indeed in the subgraph rooted at the given snapshot
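
The two checks above can be sketched as follows. This is only an illustration over a hypothetical in-memory toy archive: the dictionaries, the short fake identifiers, and the function names are all assumptions, and a real implementation would query the archive storage and the swh graph instead.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical toy archive: short fake ids stand in for real swh-ids.
REVISION_ROOT: Dict[str, str] = {"rev:r1": "dir:root"}
DIRECTORY_ENTRIES: Dict[str, Dict[str, str]] = {
    "dir:root": {"Examples": "dir:ex"},
    "dir:ex": {"SimpleFarm": "dir:sf"},
    "dir:sf": {"simplefarm.ml": "cnt:c1"},
}
# Outgoing edges of the DAG, used only for the visit check.
EDGES: Dict[str, List[str]] = {
    "snp:s1": ["rev:r1"],
    "snp:s2": [],
    "rev:r1": ["dir:root"],
}


def resolve_path(anchor: str, path: str) -> Optional[str]:
    """Follow `path` from the anchor's root directory; None if it breaks."""
    # A directory anchor is its own root; a revision maps to its root dir.
    current = REVISION_ROOT.get(anchor, anchor)
    for name in (step for step in path.split("/") if step):
        entries = DIRECTORY_ENTRIES.get(current)
        if entries is None or name not in entries:
            return None
        current = entries[name]
    return current


def path_matches(declared: str, anchor: str, path: str) -> bool:
    """Path check: the endpoint reached must have the declared id."""
    return resolve_path(anchor, path) == declared


def anchor_in_visit(visit: str, anchor: str) -> bool:
    """Visit check: breadth-first search from the snapshot to the anchor."""
    queue = deque([visit])
    seen = {visit}
    while queue:
        node = queue.popleft()
        if node == anchor:
            return True
        for child in EDGES.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return False


print(path_matches("cnt:c1", "rev:r1", "/Examples/SimpleFarm/simplefarm.ml"))
print(anchor_in_visit("snp:s1", "rev:r1"))
```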

Revisions and Commits

rDWAPPS Web applications
	D3129	rDWAPPS6d00ef0a2829 common/identifiers: Add SWHIDs contextual information computation

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T2190 Archive Navigation (Web UI)
Migrated	gitlab-migration	T2192 UX improvements
Migrated	gitlab-migration	T2330 Simplify Permalinks box
Migrated	gitlab-migration	T2366 Review Persistent identifiers (PIDs) with context in deposit
Migrated	gitlab-migration	T1241 Persistent identifiers (PIDs): add a way to describe Merkle DAG paths

Event Timeline

This is a spin-off of the discussion started in T1098

zack renamed this task from Describing paths in the Merkle DAG to Persistent identifiers (PIDs): add a way to describe Merkle DAG paths. Apr 13 2019, 4:47 PM

For file paths it would be nice to also support steps that use usual file/dir names foo/bar/baz, as a more readable alternative to number-based steps.

A related (and very popular) use case is the need to reference a file content in the archive (by hash) while also specifying its filename, so that it can be downloaded with a meaningful default file name when the browser offers to save it.

As part of the discussion about the revamped UX, we have simplified the proposal for describing paths in the Merkle DAG. When the anchor denotes a revision (and most often when it's a release), it's trivial to find in the DAG the root directory of the source code, and we only need the file path to identify the content we are interested in. When it's a snapshot, there is a default root directory to point to.

Hence it is enough to extend the syntax and semantics of our SWH-IDs with the following optional elements:

anchor : the swh-id of the anchor node in the DAG: this can be a snapshot, a release, a revision, or a directory
path : the full path from the root directory of the anchor to the endpoint object, that can be a directory or a file content
visit : the swh-id of the snapshot in whose context the anchor must be shown

Here is a full example:

swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;
  anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;
  path=/Examples/SimpleFarm/simplefarm.ml;
  visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;
  origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;
  lines=12-23

On the receiving end, we also have all the information needed to check that the object and its context match:

for the path part, we just need to follow the path from the anchor and check that the endpoint has the declared swh-id
for the visit part, we might (if we want) make a request to the swh graph to check that the anchor node is indeed in the subgraph rooted at the given snapshot

rdicosmo raised the priority of this task from Low to Normal. Mar 23 2020, 4:00 PM

rdicosmo added projects: Roadmap 2020, 2019 UX audit. Mar 23 2020, 4:11 PM

rdicosmo mentioned this in T2330: Simplify Permalinks box. Mar 23 2020, 4:27 PM

rdicosmo removed a project: Roadmap 2020.

rdicosmo added a parent task: T2330: Simplify Permalinks box.

LGTM in general.

A couple of questions/nitpicks follow.

First, about anchor:

in our data model, releases do not necessarily point to revisions (or other releases), they can point to any arbitrary object. So in the case anchor is a release object, we cannot /always/ find a (root) directory by peeling it.
conversely, most (recent) snapshots in the archive have a HEAD symbolic ref that points to the "default branch" of that snapshot.

Taken together, these two aspects make me think that we are either not strict enough (we should also exclude releases, if we want to be sure of having a root dir) or not liberal enough in what types of objects are allowed in anchor.

I'm tempted to say that we should apply Occam's Razor and allow any object type in anchor. In the case of snapshots, following the HEAD branch (if it exists) will be no less of a heuristic than following the directory pointed to by the release (if the release points to one).
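
To make the two heuristics concrete, here is a minimal Python sketch of how a root directory might be looked up when any object type is allowed in anchor. The toy object table and the function name root_directory are hypothetical, and HEAD is simplified to point directly at its target rather than being a symbolic ref to another branch.

```python
from typing import Optional

# Hypothetical toy model: id -> (object type, payload).
# For snapshots the payload maps branch names to targets;
# for releases and revisions it is the single target id.
OBJECTS = {
    "snp1": ("snp", {"HEAD": "rev1", "refs/tags/v1": "rel1"}),
    "snp2": ("snp", {"refs/heads/dev": "rev1"}),  # no HEAD branch
    "rel1": ("rel", "rev1"),
    "rel2": ("rel", "cnt1"),  # a release pointing at a content
    "rev1": ("rev", "dir1"),
    "dir1": ("dir", None),
    "cnt1": ("cnt", None),
}


def root_directory(anchor: str) -> Optional[str]:
    """Peel the anchor down to a directory, or return None if impossible."""
    seen = set()
    current = anchor
    while current not in seen:  # guard against accidental cycles
        seen.add(current)
        kind, payload = OBJECTS[current]
        if kind == "dir":
            return current
        if kind == "snp":
            # Heuristic: follow HEAD when the snapshot has one.
            current = payload.get("HEAD")
            if current is None:
                return None
        elif kind in ("rel", "rev"):
            # Releases may target any object, so keep peeling.
            current = payload
        else:
            return None  # a content has no root directory
    return None


for anchor in ("snp1", "snp2", "rel1", "rel2"):
    print(anchor, root_directory(anchor))
```

In this sketch both the snapshot without HEAD and the release pointing at a content yield no root directory, which is the symmetry the comment above points out.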

Second, about snp, I wonder if we can find a more general name, as that seems to be an implementation detail for a naming scheme that is supposed to be readable.

(removed the last point, the hierarchy thing is in fact not relevant here, as we're pointing upward, not downward)

About the anchor point: no objection to also having snapshot as a possible anchor in the schema.

Also fully agree we need something better than "snp"; we had thought of "visit", what about that?

Updated the proposal with visit instead of snp.

@rdicosmo: the current version of the full example above LGTM (the surrounding text is inconsistent, e.g., it still mentions "snp" as a key and forbids snapshot anchors, but I suspect it's just that you didn't bother editing everything. Hence, we're good! :-))

@zack thanks for spotting the missing pieces... now fixed in the description, we're ready to go! :-)
Would you take care of extending the definition in https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html ?

rdicosmo updated the task description. Mar 27 2020, 1:40 PM

rdicosmo mentioned this in D2924: Extend SWH PID definition with additional context qualifiers. Mar 28 2020, 3:17 PM

rdicosmo mentioned this in T2340: Improve handling of file names with special characters. Mar 30 2020, 4:56 PM

This is now done in the few commits leading to https://forge.softwareheritage.org/rDMODaccca603c42ad68252532222ca6467a19691524e

anlambert mentioned this in T2342: Add resolving of new SWHIDs contextual information. Mar 31 2020, 4:13 PM

Just a question about using a path with a different branch, for example for a tag of a version (which is not a release):

In this case, is the anchor the snp, with the branch name (the tag) in the path?

Here is a link as an example:
https://archive.softwareheritage.org/browse/origin/https://gitlab.inria.fr/cado-nfs/cado-nfs/directory/?branch=refs/tags/2.3.0

swh:1:dir:da1f541c4b85fc216fbe1ca512cbd8718f8356cb;
anchor=swh:1:snp:c34b5ed5c5e737e4d2f8d6b5bd887fae92af0362;
path=refs/tags/2.3.0;
visit=swh:1:snp:c34b5ed5c5e737e4d2f8d6b5bd887fae92af0362;
origin=https://gitlab.inria.fr/cado-nfs/cado-nfs;

moranegg added a parent task: T2366: Review Persistent identifiers (PIDs) with context in deposit. Apr 16 2020, 11:26 PM

@moranegg, for the branch case the anchor will be the revision it points to. For your example, it will be

swh:1:dir:da1f541c4b85fc216fbe1ca512cbd8718f8356cb;
anchor=swh:1:rev:2acb184f417f8629946a0ca2db36dbdbd1741d11;
path=/;
visit=swh:1:snp:c34b5ed5c5e737e4d2f8d6b5bd887fae92af0362;
origin=https://gitlab.inria.fr/cado-nfs/cado-nfs;

It is then easy to match the branch from the revision by looking at the snapshot content.
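
As a rough illustration of that lookup, the sketch below scans a snapshot's branches for those targeting a given revision. The function name and the "refs/heads/master" entry are hypothetical; only the tag entry reflects the example discussed here.

```python
from typing import Dict, List


def branches_pointing_to(branches: Dict[str, str], revision: str) -> List[str]:
    """Return the names of all branches whose target is `revision`."""
    # Several branches may share a target, hence a (sorted) list.
    return sorted(name for name, target in branches.items() if target == revision)


# Toy snapshot content: the master entry is made up for the example.
SNAPSHOT_BRANCHES = {
    "refs/heads/master": "swh:1:rev:0000000000000000000000000000000000000000",
    "refs/tags/2.3.0": "swh:1:rev:2acb184f417f8629946a0ca2db36dbdbd1741d11",
}

matches = branches_pointing_to(
    SNAPSHOT_BRANCHES, "swh:1:rev:2acb184f417f8629946a0ca2db36dbdbd1741d11"
)
print(matches)
```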

It seems Phabricator reopened the task automatically with my last comment; that was not intended.

anlambert added a revision: D3129: common/identifiers: Add SWHIDs contextual information computation. May 6 2020, 4:09 PM

anlambert added a commit: rDWAPPS6d00ef0a2829: common/identifiers: Add SWHIDs contextual information computation. May 7 2020, 1:20 PM

anlambert moved this task from Backlog to Deployed on the 2019 UX audit board. Jun 9 2020, 4:26 PM

anlambert edited projects, added UX; removed 2019 UX audit. Jun 18 2020, 11:52 AM

This task has been migrated to GitLab.
