
Add full contextual information in a swh-id of an object
Closed, Migrated

Description

We need a clean, simple, elegant way to provide full contextual information as (optional) attributes in the swh-id of any object in the archive.

Event Timeline

moranegg created this task.

Glad that you created this task. I was also thinking of adding this information, because without it code highlighting does not work well in most situations
(for instance https://archive.softwareheritage.org/swh:1:cnt:8e9ea0378070a06c574fd06b3f81e49445bd10c4/: this Python file has been
detected as text/x-c++ by the file command, which is wrong, so without the file extension info it gets badly highlighted).

zack added subscribers: rdicosmo, zack.

(tagging as General, while we discuss it)

Would a single additional <filename, the_file_name.c> key/value pair in the list here be enough?
It should go first, before line numbers, both for conceptual reasons and to avoid potential ambiguities (@anlambert I know the Web UI is more flexible than that, but the spec had better remain strict).

The next difficult question is what constraints we should impose on the filename value. This is a tough one, as most characters are valid in filenames, but they will clash with our separators for key/value pairs.
So we are now at the point of having to add escaping to our PIDs, which we have managed to avoid thus far…
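
To make the escaping problem concrete, here is a minimal sketch of one possible scheme, assuming percent-encoding (nothing in this task settles on it). The helper names escape_value and unescape_value are hypothetical; only the Python standard library is used.

```python
from urllib.parse import quote, unquote


def escape_value(value: str) -> str:
    """Percent-encode a value so it cannot contain ';' or '='.

    Assumption: percent-encoding is just one candidate scheme.
    safe="" forces every reserved character to be encoded.
    """
    return quote(value, safe="")


def unescape_value(value: str) -> str:
    """Reverse escape_value()."""
    return unquote(value)


if __name__ == "__main__":
    name = "my file;v=2.c"
    escaped = escape_value(name)
    print(escaped)
    print(unescape_value(escaped) == name)
```

With such a scheme the separators never appear inside a value, at the cost of less readable identifiers for unusual file names.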

Thanks for starting this... it's an important discussion, and it goes quite beyond the need of a "filename" attribute in our family of context attributes :-)

Let me try to give here an overview of how I see this issue.

General discussion

Our goal is to add to our intrinsic identifiers attributes that provide useful pieces of context information, that can be internal or external.

Currently, we have two important attributes handled already:

  • line numbers: this is internal context, specifying a region inside an object (we use it for cnt, but it makes sense even for the inner nodes of the Merkle tree, that are text files)
  • origin: this provides the topmost context information, that is, the origin from which an object has been retrieved

Unfortunately, this is not enough contextual information to allow a user to share with another user the same "view" of the object (s)he has in front of her/him when browsing the archive.

Indeed, the first thing we remark when typing
swh:1:cnt:41ddb23118f92d7218099a5e7a990cf58f1d07fa;origin=https://github.com/chrislgarry/Apollo-11;lines=64-72

is that we see the nice lines inside the BURN_BABY_BURN source code of the Apollo 11, but we do not see the file name any more, so one feels compelled to add an
attribute "filename".
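
For reference, the identifier above can be split into its intrinsic part and its context attributes with a few lines of code. This is only an illustrative sketch: parse_swhid and the dictionary keys are hypothetical names, and it assumes values contain no ';' (which holds for this example).

```python
def parse_swhid(swhid: str) -> dict:
    """Split an identifier into its core fields and its attributes.

    Assumption: attributes are ';'-separated key=value pairs and no
    value contains ';'.
    """
    core, *pairs = swhid.split(";")
    scheme, version, object_type, object_hash = core.split(":")
    qualifiers = {}
    for pair in pairs:
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        qualifiers[key] = value
    return {
        "scheme": scheme,
        "version": version,
        "object_type": object_type,
        "object_hash": object_hash,
        "qualifiers": qualifiers,
    }


if __name__ == "__main__":
    parsed = parse_swhid(
        "swh:1:cnt:41ddb23118f92d7218099a5e7a990cf58f1d07fa"
        ";origin=https://github.com/chrislgarry/Apollo-11"
        ";lines=64-72"
    )
    print(parsed["object_type"], parsed["qualifiers"])
```

Note that nothing in the parsed result tells us the file name, which is exactly the gap discussed here.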

Now, suppose the filename attribute is there: we still have no way to know the name of the directory where the file is located, so one will feel compelled to add an attribute "directory", and, after a few iterations, we'll find out we need a "path".

But even with a path in our hands, we still do not know inside which commit we see this path, and inside which branch, and inside which snapshot...

So, yes, this is really a can of worms, and we need to take the time to think all this through, and avoid adding an endless list of other attributes :-)

A first proposal

Digging up some old ideas from the zipper approach to navigating our Merkle tree, here is a first stab at a proposal.

What we are looking for is the information needed to properly position the particular object identified by the intrinsic identifier in the context of a particular subtree rooted at a particular visit of the origin.

In the most general case we need three pieces of information:

  1. the snapshot of the origin we are visiting
  2. the path in our DAG from the snapshot to the particular commit where the tree containing the object is rooted
  3. the path inside this tree to reach the object designated by the hash

Since the DAG is immutable, we do not need to store inside our attributes the commit hashes for (2), nor the filename and the file path etc. for (3): they can be trivially recovered by walking the DAG.

What we need is a language to describe efficiently the paths to follow in the DAG from the snapshot.

An example of a language to do this kind of walkthrough in a DAG is already present in git; see for example the explanations of ^ and ~ in https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection (I am not saying we need to use it as it is, only that we can look at it for inspiration).

Instead of a single attribute conflating 1), 2) and 3), we could have one attribute for each of 1), 2) and 3), to ease parsing (but this is really a minor point :-))

  • anchor: the snapshot id (swh:1:snp:....)
  • cpath: the path in the commit graph
  • fpath: the path in the tree

Having all that, the full identification of our Apollo-11 code fragment
would be

swh:1:cnt:41ddb23118f92d7218099a5e7a990cf58f1d07fa;
origin=https://github.com/chrislgarry/Apollo-11;
lines=64-72;
anchor=swh:1:snp:.....;
cpath=1; (the first commit of the first branch of the snapshot)
fpath=2.9 ; (the 9th element of the 2nd element from the root, that is indeed Luminary099/BURN_BABY_BURN--MASTER_IGNITION_ROUTINE.agc)

When presented with such an ID, the webapp would, in this order:

  • resolve the anchor, finding the corresponding snapshot,
  • follow the cpath trail from there, reaching the designated commit
  • follow the fpath trail (building the breadcrumbs as it goes) up to the designated object
  • check that the pointed object really is swh:1:cnt:41ddb23118f92d7218099a5e7a990cf58f1d07fa
  • render it, and display the selected line numbers.
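
The resolution steps above can be sketched on a toy graph. Everything here is an assumption made for illustration: the GRAPH dictionary, the short node names, the 1-based integer steps and the functions follow and resolve are hypothetical and do not reflect the actual webapp or archive API.

```python
# Toy in-memory DAG: each node maps to its ordered list of children.
# Node names are made up; real nodes would be swh:1:... identifiers.
GRAPH = {
    "snp:A": ["rev:B"],                   # snapshot -> branch heads
    "rev:B": ["dir:root"],                # commit -> its root directory
    "dir:root": ["dir:docs", "dir:src"],  # directory -> ordered entries
    "dir:docs": [],
    "dir:src": ["cnt:x", "cnt:y"],
}


def follow(start, steps):
    """Walk from start, taking the n-th child (1-based) at each step."""
    trail = [start]
    for step in steps:
        trail.append(GRAPH[trail[-1]][step - 1])
    return trail


def resolve(target, anchor, cpath, fpath):
    """Return the breadcrumbs from the root directory down to target."""
    commit = follow(anchor, cpath)[-1]     # resolve anchor, follow cpath
    root = GRAPH[commit][0]                # a commit has one root directory
    breadcrumbs = follow(root, fpath)      # follow fpath
    if breadcrumbs[-1] != target:          # check the pointed object
        raise ValueError("paths do not lead to the expected object")
    return breadcrumbs


if __name__ == "__main__":
    print(resolve("cnt:y", "snp:A", cpath=[1], fpath=[2, 2]))
```

The final check matters: since the object hash is intrinsic, a wrong or stale path is detected rather than silently showing another object.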

Any thoughts on all this?

rdicosmo renamed this task from Add file-name as contextual information in a swh-id of a content object to Add full contextual information in a swh-id of an object. Jun 13 2018, 3:57 PM
rdicosmo updated the task description.
rdicosmo raised the priority of this task from Low to Normal. Jun 14 2018, 9:10 AM
rdicosmo added a project: Web app.

Actually, we can generalize the approach even a bit more.

Indeed, the anchor can be placed at any point in the Merkle DAG; it need not be located at a snapshot: it specifies the topmost context that is deemed pertinent for navigating/exploring the referenced content.

It just occurred to me that this works (in the sense that the paths will be resolvable) only if we have all the objects in the path from the snapshot down to the pointed object, which is not something we can guarantee in general — e.g., we might have archived a repository which had missing objects in the first place.
It is all contextual information, whose absence would not make it impossible to see the final object you're pointing to. But this issue calls into question the robustness of integer-based paths for our purposes here. For instance, an fpath based on actual file/directory names will always be displayable; one based on integers will not be.
Tough trade-off…

I see your point, but let's remember that here we want to provide a means for a user A to encode efficiently the context information necessary for another user B to be shown the same view of the archive as the one A has.

For this purpose, the only solution is the anchor/path based approach, because the context that A wants to share is exactly the position in the Merkle DAG + the path to get there, and filenames and paths are not enough.

This information is not brittle: when A gets it from the webapp (in the permalinks box), we are sure that it is in the graph, and when B uses it, it will find the same view
(unless, of course, (all copies of) the Merkle graph got corrupted in the meanwhile, but in that case, we have much greater issues to handle than a broken link :-))

Well, there are other scenarios, like us being forced to remove content for legal reasons. But note that I'm not arguing against the path-based approach. The risk exists only for paths encoded using *integers*, because they are by construction relative to the object you traverse. You can have paths that contain the full-step information (e.g., a file/directory name, or a commit identifier), and those paths would be resolvable even if you lose access to intermediate objects. The problem with those kinds of paths is that they are much longer than the integer-based ones. That robustness-vs-compactness trade-off is the tough one I was referring to.

Here is a concrete proposal for the path language:

Cpath in the commit graph: copy-paste the language used in Git (to avoid the NIH syndrome :-))

<cpath> := <empty> | <sel> <cpath>
<sel> := "~" [<count>] | "^" <branch_selector>
<count> := integer
<branch_selector> := integer

As an example, ~3^2~21 is the commit found by following the first parent 3 times, then moving to the second parent, then following the first parent 21 times.
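
A minimal parser for the grammar above might look as follows. It is a sketch, not an agreed implementation: parse_cpath is a hypothetical name, and treating a bare "~" as a count of 1 is an assumption borrowed from Git's behaviour.

```python
import re

# One selector: "~" with an optional count, or "^" with a mandatory
# branch selector, as in the grammar above.
_SELECTOR = re.compile(r"(~)(\d*)|(\^)(\d+)")


def parse_cpath(cpath: str) -> list:
    """Turn a cpath string into a list of (operator, integer) steps."""
    steps = []
    pos = 0
    while pos < len(cpath):
        match = _SELECTOR.match(cpath, pos)
        if match is None:
            raise ValueError(f"invalid cpath at offset {pos}: {cpath!r}")
        if match.group(1):
            # Assumption: a bare "~" means one step along the first parent.
            steps.append(("~", int(match.group(2) or 1)))
        else:
            steps.append(("^", int(match.group(4))))
        pos = match.end()
    return steps


if __name__ == "__main__":
    print(parse_cpath("~3^2~21"))
```

An empty string parses to an empty list of steps, matching the <empty> alternative of the grammar.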

For the fpath in the file tree we can use for uniformity exactly the same syntax, even if in this case probably the path compression allowed by ~ will have very little impact.

I completely agree that 'filename' is not enough and adding each time a new piece of context isn't a good solution.
Both path strategies (integers vs identifiers) are interesting.

But with integer counters, human readability depends on understanding the Merkle DAG, and such paths won't be easy for a user to produce
(not saying that this is a deal breaker, but the user chooses the identifier from the permalinks box, will also need to choose which pieces of context to add,
and might ask: should I just use the URL?).

As a user, I see two scenarios:

  1. I want to add the path information to the object to know the name of the source-code package in which it was found (regardless of the visit, or dev-history)
  2. I want to recreate the exact same view I'm seeing.

In the first, a path from a root directory SHA1 to the content SHA1 is enough (textual or integer form)
( https://archive.softwareheritage.org/browse/origin/https://github.com/Vaufreyd/RGBDSyncSDK/visit/2018-02-16T07:45:15/content/RecordKinect/Main.cpp/#L75-L79)
textual:

swh:1:cnt:488133a8a68211545aa6f928b10391e75439348a;
origin=https://github.com/Vaufreyd/RGBDSyncSDK;
lines=75-79;
root=swh:1:dir:880d8f1477da65a883929e28234e1df34a904626;
path=RecordKinect/Main.cpp/

integer:

swh:1:cnt:488133a8a68211545aa6f928b10391e75439348a;
origin=https://github.com/Vaufreyd/RGBDSyncSDK;
lines=75-79;
anchor=swh:1:dir:880d8f1477da65a883929e28234e1df34a904626;
fpath=2.2
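
The two forms carry the same information as long as the directory listings are available. Here is a small sketch of the conversion from the textual path to the integer form. The TREE listing is hypothetical (made-up entries, chosen only so that the result matches the fpath above), as are the 1-based indexing and the function name path_to_fpath.

```python
# Hypothetical directory listings, keyed by path from the root directory.
# The real repository content is not reproduced here.
TREE = {
    "": ["README.md", "RecordKinect"],
    "RecordKinect": ["Config.h", "Main.cpp"],
}


def path_to_fpath(path: str) -> str:
    """Convert a textual path into dot-separated 1-based entry indices."""
    indices = []
    current = ""
    for name in (part for part in path.split("/") if part):
        entries = TREE[current]  # needs the intermediate listing
        indices.append(entries.index(name) + 1)
        current = f"{current}/{name}" if current else name
    return ".".join(str(i) for i in indices)


if __name__ == "__main__":
    print(path_to_fpath("RecordKinect/Main.cpp/"))
```

The lookup into TREE at every step is where the integer form depends on intermediate objects being present, while the textual form stays displayable on its own.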

While for the second we need to define each junction from the origin:

  • snapshot
  • branch/release
  • revision (has only one target directory, so if in root, dir-id not needed)
  • directory and all sub-directories

textual:

swh:1:cnt:488133a8a68211545aa6f928b10391e75439348a;
origin=https://github.com/Vaufreyd/RGBDSyncSDK;
lines=75-79;
visit=swh:1:snp:d298ac61a47d5fc6c92e3b8f06911400f3105664;
branch=HEAD;
commit=swh:1:rev:3813416d64504178579c42bc382a1bcb6be26a3d;
path=RecordKinect/Main.cpp/                              # from the commit's target directory (root)

integer:

swh:1:cnt:488133a8a68211545aa6f928b10391e75439348a;
origin=https://github.com/Vaufreyd/RGBDSyncSDK;
lines=75-79;
anchor=swh:1:snp:d298ac61a47d5fc6c92e3b8f06911400f3105664;
cpath=~1^1                                              # first commit of HEAD branch- not sure it's right
fpath=~2^2

Both are very cumbersome and not Twitter-friendly.

zack claimed this task.

this has been done in bc30e8bc60ac3a310f91a15b5692e6b9bc6a30a3