
specify the URI scheme swh:... to point to software heritage objects
Closed, Migrated (Edits Locked)

Description

We need a URI scheme to point to Software Heritage objects from external artefacts (e.g., research papers, but many more).

It need not be a URL (locator); it can be a more general URI (now that the term "URN" is deprecated).
As such, it does not need to be heavily hierarchical, apart from the distinction by object kind (revision, content, release, etc.).

It would be a good idea to have the scheme be internally versioned, so that we can later change, e.g., the kind of checksums that are used as identifiers.

Event Timeline

As a first approximation, the URI scheme might be something like:

swh:VERSION:OBJECT_KIND:OBJECT_ID

with more specific instances like:

  • swh:1:content:SHA1 (for blobs)
  • swh:1:revision:SHA1 (commits)
  • swh:1:directory:SHA1 (directory trees)
  • swh:1:release:SHA1 (tags)
  • etc.
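To make the proposed shape concrete, here is a minimal Python sketch that splits such a URI into its three variable fields. The names `SwhUri` and `parse_swh_uri` are hypothetical, introduced only for illustration, and the sketch deliberately does not validate the object kind or the checksum, since those are still open questions at this point.

```python
from typing import NamedTuple


class SwhUri(NamedTuple):
    """Hypothetical container for the fields of swh:VERSION:OBJECT_KIND:OBJECT_ID."""
    version: int
    kind: str
    object_id: str


def parse_swh_uri(uri: str) -> SwhUri:
    """Split a URI of the proposed form into its fields.

    Only the overall shape is checked: four colon-separated fields,
    the literal "swh" prefix, and a numeric version.
    """
    parts = uri.split(":")
    if len(parts) != 4 or parts[0] != "swh":
        raise ValueError("expected swh:VERSION:OBJECT_KIND:OBJECT_ID, got %r" % uri)
    _, version, kind, object_id = parts
    if not (version.isascii() and version.isdigit()):
        raise ValueError("version must be a decimal number, got %r" % version)
    return SwhUri(int(version), kind, object_id)


if __name__ == "__main__":
    # The checksum here is a placeholder, not a real object.
    print(parse_swh_uri("swh:1:revision:" + "a" * 40))
```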

Open questions:

  • do we want to allow using different (and/or multiple at the same time) kinds of checksums to identify objects? Right now that would be useful only for blobs. The alternative is to fix a specific kind of checksum (e.g., sha1) for the time being and defer to future versions of the URI scheme ("swh:2:…") the possibility of using other checksum algorithms.
  • as a cosmetic issue: do we want to use the extended object names above or shorter versions of them ("rev", "dir", "rel", "con")? FWIW the gain in space is negligible w.r.t. the size of the checksum.
zack renamed this task from URI scheme swh:... to point to software heritage objects to speicify the URI scheme swh:... to point to software heritage objects. Mar 4 2016, 12:59 PM
zack renamed this task from speicify the URI scheme swh:... to point to software heritage objects to specify the URI scheme swh:... to point to software heritage objects.
olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:09 PM

I've been thinking about this in relation to T836.

I'm not so sure we should use version numbers for our URI scheme but rather explicit names for different versions of the identifiers:

  • swh:revision:sha1_git:<git sha1 of a revision>
  • swh:content:blake2s256:<blake2s256 of a content>

This does mean that it's harder to see at a glance whether a given identifier scheme is deprecated, but it makes it somewhat more future-proof, as you can "reverse engineer" the object identifier, at least for simple cases.

It also makes it much easier to transition gradually from one scheme to the next, without a flag day.

As for related work, Bitcoin and IPFS use Base58 to encode identifiers or addresses. This is basically Base64 stripped of non-alphanumeric and ambiguous (I and l, 0 and O) characters. I think it's an interesting scheme to consider.
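For reference, a minimal sketch of Base58 encoding with the Bitcoin alphabet, to show what such identifiers would look like. The function name `b58encode` is an assumption made for this example; the thread does not settle on any particular encoding.

```python
# Bitcoin's Base58 alphabet: digits and letters without 0, O, I and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as Base58 (Bitcoin convention for leading zero bytes)."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = ALPHABET[rem] + out
    # Each leading zero byte is represented by the first alphabet symbol, "1".
    pad = len(data) - len(data.lstrip(b"\0"))
    return "1" * pad + out


if __name__ == "__main__":
    # Placeholder 20-byte value standing in for a sha1 digest.
    digest = bytes(range(1, 21))
    print(digest.hex())       # 40 hex characters
    print(b58encode(digest))  # noticeably shorter
```

A 20-byte sha1 needs at most 28 Base58 characters instead of 40 hex characters.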

I'm not opposed to having explicit hash scheme names in the IDs—it is a good idea, only to be weighed against the cost in terms of length.
But we should also have schema version numbers, in case more radical changes are needed in the future, e.g., renaming the object types in the graph.
If we retain both suggestions, that would give:

  • swh:1:revision:sha1_git:<git sha1 of a revision>
  • swh:1:content:blake2s256:<blake2s256 of a content>

(And yes, we might just skip "1" and assume that's implicit, and hope we'll never have to introduce a "2". But given these identifiers are supposed to be stored in places we do not control, I'd rather err on the side of paranoia.)
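A rough Python sketch of how a consumer could handle the combined form, with both a scheme version and an explicit hash name. Everything here is hypothetical: the function name, the table of digest lengths, and the choice to reject unknown versions outright are assumptions for illustration, not an agreed design.

```python
import re

# Assumed hex lengths for the two hash names mentioned in this thread:
# sha1_git is a 20-byte digest, blake2s256 a 32-byte digest.
HEX_LENGTHS = {"sha1_git": 40, "blake2s256": 64}


def parse_named_hash_uri(uri: str) -> dict:
    """Parse swh:VERSION:OBJECT_TYPE:HASH_NAME:DIGEST (hypothetical form)."""
    parts = uri.split(":")
    if len(parts) != 5 or parts[0] != "swh":
        raise ValueError("not of the form swh:VERSION:TYPE:HASH:DIGEST: %r" % uri)
    _, version, obj_type, hash_name, digest = parts
    # The explicit version lets a reader reject, rather than misread,
    # identifiers written under a later and possibly incompatible scheme.
    if version != "1":
        raise ValueError("unsupported scheme version %r" % version)
    if hash_name not in HEX_LENGTHS:
        raise ValueError("unknown hash name %r" % hash_name)
    if not re.fullmatch("[0-9a-f]{%d}" % HEX_LENGTHS[hash_name], digest):
        raise ValueError("digest does not look like a %s checksum" % hash_name)
    return {"version": 1, "type": obj_type, "hash": hash_name, "digest": digest}


if __name__ == "__main__":
    # Placeholder digest, not a real revision.
    print(parse_named_hash_uri("swh:1:revision:sha1_git:" + "0" * 40))
```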

I agree with all the suggestions: the full id should definitely contain all
this information.
Nevertheless, the sheer length of the result *may* turn out to be a blocker
for adoption as a reference to software in the academic publishing
framework. We can propose this, and see whether we also need to provide a
shorter backup if there really is strong negative feedback.

We can always allow people to truncate the identifier to some arbitrary (shorter) length. The canonical URI would be the full identifier, but our URI resolver can recognize shortened identifiers and point to a disambiguation page with all the objects whose identifier starts with the given string.
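The resolver behaviour described above could look roughly like this. It is a toy sketch over an in-memory list; the identifiers are made up, and `resolve` and `describe` are hypothetical names. A real resolver would query an index rather than scan a list.

```python
from typing import Iterable, List


def resolve(identifier: str, known: Iterable[str]) -> List[str]:
    """Return every known identifier that starts with the given string, sorted."""
    return sorted(k for k in known if k.startswith(identifier))


def describe(matches: List[str]) -> str:
    """Decide what the resolver would do with the matches."""
    if not matches:
        return "not found"
    if len(matches) == 1:
        return "redirect to " + matches[0]
    return "disambiguation page with %d candidates" % len(matches)


# Made-up identifiers; two of them share the prefix "0c86".
KNOWN = [
    "swh:1:revision:0c86" + "a" * 36,
    "swh:1:revision:0c86" + "b" * 36,
    "swh:1:revision:f54e" + "c" * 36,
]

if __name__ == "__main__":
    print(describe(resolve("swh:1:revision:0c86", KNOWN)))
    print(describe(resolve("swh:1:revision:f5", KNOWN)))
```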

zack raised the priority of this task from Normal to High. Edited Jan 12 2018, 2:04 PM

concrete, tentative proposal (EBNF):

identifier = "swh" ":" scheme_version ":" obj_type ":" obj_id ;
scheme_version = "1" ;
obj_type =
    "snp"  # snapshot
  | "rel"  # release
  | "rev"  # revision
  | "dir"  # directory
  | "cnt"  # content
  ;
obj_id = object sha1, hex-encoded with (lowercase) ASCII characters ;

examples:

  • swh:1:snp:34973274ccef6ab4dfaaf86599792fa9c3fe4689
  • swh:1:rel:23e182506f4b883d8aae3d29d08e044c55b04deb
  • swh:1:rev:0c86a6bd85ff0629cd2c5141027fc1c8bb6cde9c
  • swh:1:dir:f54ee8e79bad1e592b319eb890a47c7c27fd3cae
  • swh:1:cnt:8624bcdae55baeef00cd11d5dfcfa60f68710a02

3-character object types strike a good balance between being compact, being mnemonic, and not having to support multiple names for the same object types (like allowing arbitrary non-ambiguous prefixes).
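The grammar above is small enough to check with a single regular expression. A sketch in Python, assuming that "object sha1, hex-encoded" means exactly 40 lowercase hex characters; `is_valid_identifier` and `OBJECT_TYPES` are names introduced here for illustration only.

```python
import re

# Object types from the tentative proposal.
OBJECT_TYPES = {
    "snp": "snapshot",
    "rel": "release",
    "rev": "revision",
    "dir": "directory",
    "cnt": "content",
}

# swh:1:<type>:<40 lowercase hex chars>
IDENTIFIER_RE = re.compile(
    r"swh:1:(?P<type>%s):(?P<id>[0-9a-f]{40})" % "|".join(OBJECT_TYPES)
)


def is_valid_identifier(s: str) -> bool:
    """True if s matches the tentative version-1 grammar."""
    return IDENTIFIER_RE.fullmatch(s) is not None


if __name__ == "__main__":
    example = "swh:1:rev:0c86a6bd85ff0629cd2c5141027fc1c8bb6cde9c"
    m = IDENTIFIER_RE.fullmatch(example)
    print(OBJECT_TYPES[m.group("type")], m.group("id"))
```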

in the future, if we switch to blake2/256 (or equivalent length checksums), the examples would become something like:

  • swh:1:rel:63d2032f11087bbce68982cf207c847481ffff2980c2f9b1a76c276f08a2f8918a2a7546373ebf3d01314e8c8f2b546d39a75cfe4b79be6179066796b12c2b73
  • swh:1:rel:5237f6025774c1853bc59e919e3b0d2f13d02650402c57b7aab5a0a16f3b4726a19f7d083a49ca0d74e9bc6e8c4c64ba1c17786ddfed4c5d3f0af056731a4055
  • swh:1:rel:a45a4c4883cce4b50d844fab460414cc2080ca83690e74d850a9253e757384366382625b218c8585daee80f34dc9eb2f2fde5fb959db81cd48837f9216e7b0fa
  • swh:1:rel:aa76a8c2e85c8f7c50d82252ace0b417025e01731e991ec9e6d6fd543a0183764937ea94237e23937f106d2c65333c61cb8b0eecc3e61a36bdf525f3cb8a61d5
  • swh:1:rel:ce97be6ba3ae649bfbb0891fd1e0a6b01e55fe89a6f0f39485694651abb9fb0833a67a06e3654ccf4c06850dfea7104476544db317ad2c4fd2a833812dde8ed6

but we can find a more compact encoding than hex (like base64/85) when it comes to that
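To give an idea of the savings, a small sketch comparing the encoded length of a digest in hex, URL-safe base64 (padding stripped) and base85, using only the Python standard library. The helper name `encoded_lengths` is made up for this example. Note that the alphabet of `base64.b85encode` includes characters such as `#`, `%` and `?`, which have special meaning in URIs, so base85 would need extra care.

```python
import base64


def encoded_lengths(n_bytes: int) -> dict:
    """Length in characters of an n-byte digest under each encoding."""
    digest = bytes(n_bytes)  # all-zero placeholder; only the length matters
    return {
        "hex": len(digest.hex()),
        "base64url": len(base64.urlsafe_b64encode(digest).rstrip(b"=")),
        "base85": len(base64.b85encode(digest)),
    }


if __name__ == "__main__":
    for size in (20, 32, 64):  # sha1, 256-bit and 512-bit digests
        print(size, encoded_lengths(size))
```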

zack changed the task status from Open to Work in Progress. Jan 12 2018, 6:11 PM
zack added subscribers: anlambert, ardumont.
In T335#16990, @zack wrote:
identifier = "swh" ":" scheme_version ":" obj_type ":" obj_id ;
scheme_version = "1" ;
obj_type =
    "snp"  # snapshot
  | "rel"  # release
  | "rev"  # revision
  | "dir"  # directory
  | "cnt"  # content
  ;
obj_id = object sha1, hex-encoded with (lowercase) ASCII characters ;

This proposal is now a GO, I've validated it with @rdicosmo.
I'm keeping this task open because the actual implementation will be documenting it, which I'm going to do.
But in the meantime you can already rely on it for the HAL integration (cc: @moranegg/@ardumont) and for the Web UI resolver (cc: @anlambert/T926).

When writing the documentation, please be explicit about whether a content identifier is its sha1 or its salted sha1_git, because it's not clear from this discussion which one it is :)

yeah, i was thinking about it while running earlier today :) i'm not yet sure if i'll specify the meaning of the sha1 of each object here, or just say that the sha1 is the primary key of the object and refer to swh-model, we'll see
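To illustrate why the distinction matters, here is a short sketch of the two candidate checksums for a content object, assuming "salted sha1_git" means the way Git hashes blobs (a `blob <length>\0` header prepended to the data). The function names are hypothetical; which of the two the identifier uses is exactly what the documentation has to state.

```python
import hashlib


def sha1_plain(data: bytes) -> str:
    """Plain SHA-1 of the raw content."""
    return hashlib.sha1(data).hexdigest()


def sha1_git(data: bytes) -> str:
    """SHA-1 as Git computes it for a blob: header + raw content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    data = b"hello\n"
    # Same content, two different 40-character hex identifiers.
    print("sha1:    ", sha1_plain(data))
    print("sha1_git:", sha1_git(data))
```

Both results are 40 hex characters, so the two cannot be told apart by looking at an identifier alone.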

zack mentioned this in Unknown Object (Maniphest Task). Jan 15 2018, 3:05 PM