specify the URI scheme swh:... to point to software heritage objects
Closed, Migrated

Authored By

	zack
	Mar 4 2016, 12:48 PM

Description

We need a URI scheme to point to Software Heritage objects from external artefacts (e.g., research papers, but many more).

It need not be a URL (locator); it can be a more general URI (now that the terminology "URN" is deprecated).
As such, it does not need to be heavily hierarchical, modulo the distinction on object kind (revision, content, release, etc).

It would be a good idea to have the scheme be internally versioned, so that we can later change, e.g., the kind of checksums that are used as identifiers.

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T338 manifest browser / renderer
Migrated	gitlab-migration	T337 specify a manifest format for documenting archived software
Migrated	gitlab-migration	T926 Web UI: support resolution of external pointers into the archive
Migrated	gitlab-migration	T335 specify the URI scheme swh:... to point to software heritage objects

Event Timeline

zack created this task. Mar 4 2016, 12:48 PM

As a first approximation, the URI scheme might be something like:

swh:VERSION:OBJECT_KIND:OBJECT_ID

with more specific instances like:

swh:1:content:SHA1 (for blobs)
swh:1:revision:SHA1 (commits)
swh:1:directory:SHA1 (directory trees)
swh:1:release:SHA1 (tags)
etc.

Open questions:

- Do we want to allow different kinds of checksums (and/or several at the same time) to identify objects? Right now that would be useful only for blobs. The alternative is to fix a specific kind of checksum (e.g., sha1) for the time being and leave the possibility of using other checksum algorithms to future versions of the URI scheme ("swh:2:…").
- As a cosmetic issue: do we want to use the extended object names above or shorter versions of them ("rev", "dir", "rel", "con")? FWIW the gain in space is negligible w.r.t. the size of the checksum.

zack renamed this task from URI scheme swh:... to point to software heritage objects to speicify the URI scheme swh:... to point to software heritage objects. Mar 4 2016, 12:59 PM

zack renamed this task from speicify the URI scheme swh:... to point to software heritage objects to specify the URI scheme swh:... to point to software heritage objects.

zack mentioned this in T337: specify a manifest format for documenting archived software. Mar 4 2016, 1:05 PM

zack added a parent task: T337: specify a manifest format for documenting archived software.

rdicosmo awarded a token. Mar 5 2016, 12:24 PM

rdicosmo added a subscriber: rdicosmo.

zack removed projects: Developers, Staff. Mar 10 2016, 5:51 PM

zack added a project: General. Apr 1 2016, 10:15 AM

olasd changed the visibility from "All Users" to "Public (No Login Required)". May 13 2016, 5:09 PM

I've been thinking about this in relation to T836.

I'm not so sure we should use version numbers for our URI scheme but rather explicit names for different versions of the identifiers:

swh:revision:sha1_git:<git sha1 of a revision>
swh:content:blake2s256:<blake2s256 of a content>

This does mean that it's harder to see at a glance whether a given locator scheme is deprecated, but it makes it somewhat more future proof as you can "reverse engineer" the object identifier, at least for simple cases.

It also makes it much easier to transition gradually from one scheme to the next, without a flag day.

As for related work, Bitcoin and IPFS use Base58 to encode identifiers or addresses. This is basically Base64 stripped of its non-alphanumeric and visually ambiguous (I and l, 0 and O) characters. I think it's an interesting scheme to consider.
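To make the Base58 idea concrete, here is a minimal sketch of encoding a 20-byte sha1 digest with the Bitcoin alphabet. The helper names `b58encode` and `b58decode` are made up for illustration and are not part of any Software Heritage module; the sketch also omits the checksum that Bitcoin addresses append.

```python
# Bitcoin's Base58 alphabet: digits and letters minus 0, O, I and l.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as Base58 (illustrative helper, no checksum)."""
    n = int.from_bytes(data, "big")
    out = []
    while n > 0:
        n, rem = divmod(n, 58)
        out.append(B58_ALPHABET[rem])
    # Each leading zero byte is conventionally written as one "1".
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + "".join(reversed(out))


def b58decode(text: str) -> bytes:
    """Inverse of b58encode (illustrative helper)."""
    n = 0
    for ch in text:
        n = n * 58 + B58_ALPHABET.index(ch)
    pad = len(text) - len(text.lstrip("1"))
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * pad + body


# An arbitrary 40-character hex sha1, used only as sample input.
digest = bytes.fromhex("0c86a6bd85ff0629cd2c5141027fc1c8bb6cde9c")
encoded = b58encode(digest)
```

A 160-bit digest needs at most 28 Base58 characters, against 40 in hex, so the saving is real but modest.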

I'm not opposed to having explicit hash scheme names in the IDs. It is a good idea, only to be weighed against the cost in terms of length.
But we should also have schema version numbers, in case more radical changes are needed in the future, e.g., renaming the object types in the graph.
If we retain both suggestions, that would give:

swh:1:revision:sha1_git:<git sha1 of a revision>
swh:1:content:blake2s256:<blake2s256 of a content>

(And yes, we might just skip "1" and assume that's implicit, and hope we'll never have to introduce a "2". But given these identifiers are supposed to be stored in places we do not control, I'd rather err on the side of paranoia.)
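As a rough illustration of the combined form in this comment (version number plus explicit hash name), the sketch below splits such an identifier into its fields. The class name `VersionedId`, the function `parse_versioned_id`, and the table of digest lengths are assumptions made for this example only.

```python
from dataclasses import dataclass

# Hex digest lengths for the two hash names used in the examples above.
HEX_LENGTHS = {"sha1_git": 40, "blake2s256": 64}
HEX_DIGITS = set("0123456789abcdef")


@dataclass(frozen=True)
class VersionedId:
    """Hypothetical container for the fields of one identifier."""
    version: int
    obj_type: str
    hash_name: str
    digest: str


def parse_versioned_id(text: str) -> VersionedId:
    """Parse swh:VERSION:OBJ_TYPE:HASH_NAME:DIGEST (sketch only)."""
    parts = text.split(":")
    if len(parts) != 5 or parts[0] != "swh":
        raise ValueError(f"not a swh identifier: {text!r}")
    _, version, obj_type, hash_name, digest = parts
    if version != "1":
        raise ValueError(f"unsupported scheme version: {version!r}")
    if hash_name not in HEX_LENGTHS:
        raise ValueError(f"unknown hash name: {hash_name!r}")
    if len(digest) != HEX_LENGTHS[hash_name] or not set(digest) <= HEX_DIGITS:
        raise ValueError(f"malformed {hash_name} digest: {digest!r}")
    return VersionedId(int(version), obj_type, hash_name, digest)


parsed = parse_versioned_id("swh:1:revision:sha1_git:" + "ab" * 20)
```

Because the version is checked first, a future "swh:2:..." identifier is rejected explicitly instead of being misread.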

I agree with all the suggestions: the full id should definitely contain all
this information.
Nevertheless, the sheer length of the result *may* turn out to be a blocker
for adoption as a reference to software in the academic publishing
framework. We can propose this, and see whether we also need to provide a
shorter fallback if there really is strong negative feedback.

We can always allow people to truncate the identifier to some arbitrary (shorter) length. The canonical URI would be the full identifier, but our URI resolver can recognize shortened identifiers and point to a disambiguation page with all the objects whose identifier starts with the given string.
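A minimal sketch of how such prefix resolution could work, assuming the resolver has access to a sorted list of full identifiers. The function name `resolve_prefix` and the in-memory list are illustrative; a real resolver would presumably query an index instead.

```python
from bisect import bisect_left


def resolve_prefix(sorted_ids, prefix):
    """Return every identifier in sorted_ids that starts with prefix.

    sorted_ids must be sorted; all matches are then contiguous and
    begin at the insertion point of the prefix itself.
    """
    start = bisect_left(sorted_ids, prefix)
    matches = []
    for ident in sorted_ids[start:]:
        if not ident.startswith(prefix):
            break
        matches.append(ident)
    return matches


# Sample identifiers, sorted for the binary search.
known_ids = sorted([
    "swh:1:snp:34973274ccef6ab4dfaaf86599792fa9c3fe4689",
    "swh:1:rel:23e182506f4b883d8aae3d29d08e044c55b04deb",
    "swh:1:rev:0c86a6bd85ff0629cd2c5141027fc1c8bb6cde9c",
    "swh:1:dir:f54ee8e79bad1e592b319eb890a47c7c27fd3cae",
    "swh:1:cnt:8624bcdae55baeef00cd11d5dfcfa60f68710a02",
])

unique = resolve_prefix(known_ids, "swh:1:rev:0c86")   # one match: redirect
ambiguous = resolve_prefix(known_ids, "swh:1:")        # several: disambiguate
missing = resolve_prefix(known_ids, "swh:1:rev:ffff")  # none: not found
```

One match can redirect straight to the object, several matches feed the disambiguation page, and an empty result is a plain "not found".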

moranegg added a subscriber: moranegg. Jan 12 2018, 1:38 PM

zack mentioned this in T926: Web UI: support resolution of external pointers into the archive. Jan 12 2018, 1:53 PM

zack added a parent task: T926: Web UI: support resolution of external pointers into the archive.

concrete, tentative proposal (EBNF):

identifier = "swh" ":" scheme_version ":" obj_type ":" obj_id ;
scheme_version = "1" ;
obj_type =
    "snp"  # snapshot
  | "rel"  # release
  | "rev"  # revision
  | "dir"  # directory
  | "cnt"  # content
  ;
obj_id = object sha1, hex-encoded with (lowercase) ASCII characters ;

examples:

swh:1:snp:34973274ccef6ab4dfaaf86599792fa9c3fe4689
swh:1:rel:23e182506f4b883d8aae3d29d08e044c55b04deb
swh:1:rev:0c86a6bd85ff0629cd2c5141027fc1c8bb6cde9c
swh:1:dir:f54ee8e79bad1e592b319eb890a47c7c27fd3cae
swh:1:cnt:8624bcdae55baeef00cd11d5dfcfa60f68710a02
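
The grammar above is small enough to check with a single regular expression. The following is a sketch only, assuming the sha1 is always 40 lowercase hex digits; the names `SWH_ID_RE`, `parse_swh_id` and `format_swh_id` are invented for this example and are not the swh-model API.

```python
import re

# Direct transcription of the EBNF: fixed version "1", five object types,
# and a 40-character lowercase hex sha1.
SWH_ID_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})")


def parse_swh_id(text):
    """Return (obj_type, obj_id), or raise ValueError if text is malformed."""
    match = SWH_ID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid identifier: {text!r}")
    return match.group(1), match.group(2)


def format_swh_id(obj_type, obj_id):
    """Build an identifier and validate it by parsing it back."""
    text = f"swh:1:{obj_type}:{obj_id}"
    parse_swh_id(text)
    return text


obj_type, obj_id = parse_swh_id(
    "swh:1:rev:0c86a6bd85ff0629cd2c5141027fc1c8bb6cde9c"
)
```

Using `fullmatch` rather than `match` rejects trailing garbage, and the character class rejects uppercase hex, in line with the "(lowercase)" requirement.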

3-character object types strike a good balance between being compact, being mnemonic, and not having to support multiple names for the same object types (like allowing arbitrary non-ambiguous prefixes).

zack claimed this task. Jan 12 2018, 2:05 PM

in the future, if we switch to blake2/256 (or equivalent length checksums), the examples would become something like:

swh:1:rel:63d2032f11087bbce68982cf207c847481ffff2980c2f9b1a76c276f08a2f8918a2a7546373ebf3d01314e8c8f2b546d39a75cfe4b79be6179066796b12c2b73
swh:1:rel:5237f6025774c1853bc59e919e3b0d2f13d02650402c57b7aab5a0a16f3b4726a19f7d083a49ca0d74e9bc6e8c4c64ba1c17786ddfed4c5d3f0af056731a4055
swh:1:rel:a45a4c4883cce4b50d844fab460414cc2080ca83690e74d850a9253e757384366382625b218c8585daee80f34dc9eb2f2fde5fb959db81cd48837f9216e7b0fa
swh:1:rel:aa76a8c2e85c8f7c50d82252ace0b417025e01731e991ec9e6d6fd543a0183764937ea94237e23937f106d2c65333c61cb8b0eecc3e61a36bdf525f3cb8a61d5
swh:1:rel:ce97be6ba3ae649bfbb0891fd1e0a6b01e55fe89a6f0f39485694651abb9fb0833a67a06e3654ccf4c06850dfea7104476544db317ad2c4fd2a833812dde8ed6

but we can find a more compact encoding than hex, like base64/85, when it comes to that
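
To give a feel for the possible savings, this sketch compares the encoded length of a 32-byte (256-bit) digest in hex, base64 and base85, using only the Python standard library. The choice of `blake2s` and of the sample input is an assumption made for the example.

```python
import base64
import hashlib

# blake2s produces a 32-byte digest by default; the input is arbitrary.
digest = hashlib.blake2s(b"example").digest()

lengths = {
    "hex": len(digest.hex()),
    "base64": len(base64.urlsafe_b64encode(digest)),  # includes "=" padding
    "base85": len(base64.b85encode(digest)),
}

for name, length in lengths.items():
    print(f"{name}: {length} characters")
```

For 32 bytes this gives 64, 44 and 40 characters respectively. Note that the alphabet used by `b85encode` contains characters such as `#`, `%` and `?`, which would need escaping inside a URI, so the shortest encoding is not automatically the most practical one.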

In T335#16990, @zack wrote:

identifier = "swh" ":" scheme_version ":" obj_type ":" obj_id ;
scheme_version = "1" ;
obj_type =
    "snp"  # snapshot
  | "rel"  # release
  | "rev"  # revision
  | "dir"  # directory
  | "cnt"  # content
  ;
obj_id = object sha1, hex-encoded with (lowercase) ASCII characters ;

This proposal is now a GO, I've validated it with @rdicosmo.
I'm keeping this task open because the actual implementation will be documenting it, which I'm going to do.
But in the meantime you can already rely on it for the HAL integration (cc: @moranegg/@ardumont) and for the Web UI resolver (cc: @anlambert/T926).

When writing the documentation, please be explicit about whether the content identifier is its sha1 or its salted sha1_git, because it's not clear from this discussion which one it is :)

yeah, i was thinking about it while running earlier on today :) i'm not yet sure if i'll specify the meaning of the sha1 of each object here, or just say that the sha1 is the primary key of the object and refer to swh-model, we'll see
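
For readers unfamiliar with the distinction: the two checksums differ only in a header that git prepends to the data before hashing. The sketch below shows both, using the standard git blob convention; which of the two the `cnt` identifier uses is exactly what this thread leaves open, so the code takes no position on it. The function names are illustrative.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Plain sha1 of the raw content."""
    return hashlib.sha1(data).hexdigest()


def sha1_git_hex(data: bytes) -> str:
    """'Salted' sha1 as computed by git for blobs.

    git hashes the header "blob <length>\\0" followed by the content.
    """
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


empty_plain = sha1_hex(b"")
empty_git = sha1_git_hex(b"")
```

Even for the empty file the two values differ, so an identifier cannot be resolved reliably unless the documentation pins down which one is meant.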

zack mentioned this in rDMODb61c6665661c: docs: document the naming scheme for persistent identifiers. Jan 14 2018, 10:31 PM

Closed in rDMODb61c6665661c823080192b351af4744dddb35f1e

moranegg mentioned this in T933: change the swh-id for a deposit following the new PID scheme. Jan 15 2018, 10:42 AM

zack mentioned this in Unknown Object (Maniphest Task). Jan 15 2018, 3:05 PM

ardumont mentioned this in rDMOD122326dd81f3: swh.models.hashutil: Add persistent identifier function. Jan 15 2018, 7:28 PM

ardumont mentioned this in rDDEPf48d8bbc8ac5: swh.deposit: Deposit returns persistent identifiers.

ardumont mentioned this in rDMODbdf26f5314ee: swh.model.identifiers: persistent_identifier takes object as input. Jan 17 2018, 11:02 AM

ardumont mentioned this in rDDEPdb503c0fea3f: swh.deposit.api.private: Update persistent identifier computation. Jan 17 2018, 11:32 AM

This task has been migrated to GitLab.
