
Intrinsic identifiers for origins
Open, Normal, Public

Description

We currently use an incrementing integer to uniquely identify origins.
This does not work well with a distributed database (eg. Cassandra), and is not an intrinsic identifier like most of the archive.

So we should define a new identifier for origins. Current options:

  1. A 2-tuple: (type, url). Pros: useful information can be derived from that identifier without an API request.
  2. A hash of the type and url. Pros: fixed-size and compact

Event Timeline

vlorentz triaged this task as Normal priority.
zack added a subscriber: zack.May 22 2019, 12:01 PM

Tangential, but impactful on this discussion, we have had in the past a discussion about removing origin types from our notion of origin (there might be a task about it, but I couldn't find it right now).

The idea is that an origin is just a URL; what you will find there will depend on the *loader* that visits it, so the type will become a *visit* type rather than an origin type. I.e., we can visit the same origin with a git loader or an svn loader, and obtain different snapshots.

If we go that way, the intrinsic id for URLs becomes just a hash of a URL (if we really need a shorter version).

This sounds like a good idea.

But it has some weird implications on components that use the concept of "origin head" (web UI and metadata indexers); because they'll use radically different content depending on which loader visited last.
But having two VCSs at the same URL is weird in itself, so 🤷

ardumont added a subscriber: ardumont.EditedMay 23 2019, 11:07 AM

One way to settle the hash vs. tuple (or plain URL) question is to know whether those identifiers are destined to be persistent ones or not.
If they are, the hash would be more consistent with the existing ones (swh:1:ori:<hash>?).
Also, they'd be simpler to use (read/type) in a URL (vs. a URL within a URL).

As for removing the origin's type: do we need to keep that info or not? (I also looked for the initial discussion but did not find it either.)
If we still want to keep it (I think yes), it could move down one layer to the origin_visit table.

In any case, to minimize the impact on production (and be able to move forward), this could be done in multiple steps:

  • Compute the new hashed ids, add them as a new column in the schema, and keep the old identifiers there
  • Add an adaptation layer in the (current) storage implementation to be able to work with both the old id and the new one
  • Deploy
  • Then adapt the remaining clients (scheduler, vault, webapp, deposit, loaders, listers) as soon as possible
  • Deploy
  • In the end, remove the adaptation code to only use the new ids
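
As a rough illustration of the adaptation layer in the steps above, here is a minimal in-memory sketch that accepts both the old integer id and the new hashed id during the transition. The `OriginStore` class, its method names, and the choice of SHA1 over the UTF-8 bytes of the URL are all assumptions made for this example; it is not the actual storage API.

```python
import hashlib
from typing import Dict, Optional, Tuple, Union


def hash_url(url: str) -> str:
    # Assumption: SHA1 over the UTF-8 bytes of the URL, hex-encoded.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class OriginStore:
    """Toy stand-in for a storage backend (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._next_int_id = 1
        self._int_by_url: Dict[str, int] = {}
        self._url_by_int: Dict[int, str] = {}   # old identifiers
        self._url_by_hash: Dict[str, str] = {}  # new identifiers

    def add(self, url: str) -> Tuple[int, str]:
        """Register an origin and return (old integer id, new hashed id)."""
        if url not in self._int_by_url:
            self._int_by_url[url] = self._next_int_id
            self._url_by_int[self._next_int_id] = url
            self._url_by_hash[hash_url(url)] = url
            self._next_int_id += 1
        return self._int_by_url[url], hash_url(url)

    def get_url(self, origin_id: Union[int, str]) -> Optional[str]:
        """Adaptation layer: resolve either kind of id to the origin URL."""
        if isinstance(origin_id, int):
            return self._url_by_int.get(origin_id)
        return self._url_by_hash.get(origin_id)
```

Once every client only passes hashed ids, the integer branch in `get_url` (and the integer bookkeeping) can be dropped, which corresponds to the last step of the list.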

Cheers,

zack added a comment.EditedMay 28 2019, 5:09 PM

This sounds like a good idea.
But it has some weird implications on components that use the concept of "origin head" (web UI and metadata indexers); because they'll use radically different content depending on which loader visited last.
But having two VCSs at the same URL is weird in itself, so 🤷

Right, it is weird. In practical terms, I think we'll have just one type of loader/visit for each URL, like 99.9999% of the time (and we can actually check that right now, if needed).

But even for the remaining ε% of cases, we can just have an order of priorities among visit types, show the first one by default, and allow the user to pick another one.
And it can be implemented incrementally, like hard-coding the preference list in the beginning and acting as if it were the only one (which provides an easy transition path from the status quo), then refining later.
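
A hard-coded preference list of this kind could look like the sketch below. The order in `VISIT_TYPE_PREFERENCE` and the function name are made up for the example; nothing in this discussion fixes an actual ordering.

```python
from typing import Iterable, Optional

# Hypothetical priority order, highest priority first.
VISIT_TYPE_PREFERENCE = ["git", "svn", "hg"]


def default_visit_type(available: Iterable[str]) -> Optional[str]:
    """Pick the visit type to show by default for an origin."""
    candidates = set(available)
    for visit_type in VISIT_TYPE_PREFERENCE:
        if visit_type in candidates:
            return visit_type
    # Unknown types only: fall back to a deterministic choice.
    return min(candidates) if candidates else None
```

For the common case of a single visit type per URL, this simply returns that type, so the behavior matches the status quo.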

Zooming out a bit: is this the only concern with the URL-only idea? Because if it is, the intrinsic ID for origins is then easy: swh:1:ori:SHA1, as @ardumont said.
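
For concreteness, a sketch of what computing such an identifier might look like. It assumes the hash is a hex-encoded SHA1 over the UTF-8 bytes of the URL, with no normalization; the function names are hypothetical and the exact scheme is not settled in this thread.

```python
import hashlib


def origin_hash(url: str) -> str:
    # Assumption: plain SHA1 of the UTF-8 encoded URL, no normalization.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def origin_identifier(url: str) -> str:
    """Build an identifier of the form swh:1:ori:<40 hex digits>."""
    return f"swh:1:ori:{origin_hash(url)}"


if __name__ == "__main__":
    print(origin_identifier("https://gitorious.org/parmap/parmap.git"))
```

The identifier has a fixed length regardless of how long the URL is, but the URL cannot be recovered from it without a lookup.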

show the first one by default, and allow the user to pick another one.

In this case, we'll also need to have an identifier for URL + type, if they want to cite/link to the non-default one.
We could use the "contextual information" mechanism, eg. swh:1:ori:SHA1;type=git

But using a SHA1 here is inconsistent with the ;origin= contextual info, which uses the plain URL and not a hash.
On the other hand, we can't use the plain URL instead of the hash for swh:1:ori, because there would be no way to tell whether ;type= is part of the URL.
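
The ambiguity can be shown with a small sketch. The parser below is deliberately naive and hypothetical (it is not the PID resolver): it splits on the last `;type=`, and so misreads an origin URL that happens to contain that substring.

```python
from typing import Optional, Tuple

PREFIX = "swh:1:ori:"


def naive_parse(identifier: str) -> Tuple[str, Optional[str]]:
    """Split 'swh:1:ori:URL[;type=T]' into (url, visit type or None)."""
    body = identifier[len(PREFIX):]
    url, sep, visit_type = body.rpartition(";type=")
    if not sep:
        return body, None
    return url, visit_type


# An origin whose URL itself ends with ';type=svn' (made-up example).
tricky_url = "https://example.org/repo;type=svn"
parsed = naive_parse(PREFIX + tricky_url)
# The parser cannot tell that ';type=svn' belonged to the URL.
print(parsed)
```

With a hash in place of the URL the problem disappears, since a hex digest can never contain `;type=`.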

Or, if we decide on keeping the type as part of the identifier, we could use: swh:1:ori:TYPE:URL; but again, that's not consistent with ;origin= (which does not have the type)

I don't really like any of these options... :/

zack added a comment.May 29 2019, 11:44 AM

In this case, we'll also need to have an identifier for URL + type, if they want to cite/link to the non-default one.
We could use the "contextual information" mechanism, eg. swh:1:ori:SHA1;type=git

I disagree. type/url is not an origin, I don't see why it should be referenceable at all. (Remember that you can always reference a snapshot and associate it with contextual information, including the origin URL, already with the current PIDs.)

But using a SHA1 here is inconsistent with the ;origin= contextual info, which uses the plain URL and not a hash.

Let's not conflate issues. The ;origin= parameter is meant to be a human-readable thing, so it needs to remain a URL and not a non-reversible hash.
We're creating a hash only because you need one for technical reasons, but while doing so we create it in a way that is consistent with other IDs. Nothing more, nothing less.

Or, if we decide on keeping the type as part of the identifier, we could use: swh:1:ori:TYPE:URL; but again, that's not consistent with ;origin= (which does not have the type)

Nack.

So, again, what are the remaining issues that inhibit you from just going ahead and using URI hashes as Cassandra origin IDs?

Okay then. I'll work on updating the identifier specification.

So, again, what are the remaining issues that inhibit you from just going ahead and using URI hashes as Cassandra origin IDs?

Those I listed above, which were more "philosophical" than technical. I started implementing it last Monday anyway, and it looks good.

zack added a comment.May 29 2019, 6:04 PM

Okay then. I'll work on updating the identifier specification.

So, again, what are the remaining issues that inhibit you from just going ahead and using URI hashes as Cassandra origin IDs?

Those I listed above, which were more "philosophical" than technical. I started implementing it last Monday anyway, and it looks good.

Thanks a lot !

This discussion does highlight important points, in particular: one thing is having intrinsic IDs for origins, another is deciding where and when to use them. We implicitly addressed some of it in this discussion, agreeing that the ?origin=... parameter should not use them, but clearly a lot more of these questions will arise—already in the modifications needed for the web app, but probably elsewhere too.

We'll cross that bridge when we get there :-)

I'm for the hashed origin only if we make it available as an identifier under our PID schema:

swh:1:org:<hash>

It can also work with

swh:1:org:<url>

I was under the impression that an origin was identified by url and type and identifying an origin wasn't possible with our PID resolver.

Use cases for an origin identifier:

  1. Wikidata: a reference to the archived repository, not only to an exact release (which is the only possible Wikidata property today)
  2. referencing the archived copy of a repository instead of a dead url

Example: use this: https://archive.softwareheritage.org/browse/origin/https://gitorious.org/parmap/parmap.git/directory/ instead of: https://gitorious.org/parmap/parmap.git
People will use this link, which is not persistent and is unstable; it is not an IDO, it is an entry point to the entire dev history of a repository.

After having a PID for this entry point we can choose where it lands:

  • the latest snapshot (snp)
  • the timeline view.

zack added a comment.Jun 4 2019, 2:40 PM

Thanks @moranegg.

Just a couple of comments:

  • the current proposal is ori instead of org as 3-letter stem
  • your use cases are all valid, but would equally work with a full URL and with a hashed URL

The second point is what remains to be discussed.

vlorentz added a subscriber: olasd.Jun 5 2019, 10:47 AM

Summary of IRL chat with @moranegg @zack @olasd :

Our options for origin ids are:

option                          | pros                                                      | cons
plain URL                       | straightforward                                           | may be arbitrarily large
integer sequence                | what we currently do + small size                         | doesn't work for distributed systems (eg. Cassandra)
cryptographic hash              | consistent with other swh-ids + constant-size             | (none listed)
some text encoding, eg. base64  | can be reversed into the plain URL without an API lookup  | may be arbitrarily large
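
To make the trade-off between the last two options concrete, here is a small sketch comparing a reversible text encoding with a cryptographic hash. It uses URL-safe base64 and SHA1 purely as examples of each family; the helper names are hypothetical.

```python
import base64
import hashlib


def b64_id(url: str) -> str:
    """Reversible id: grows with the length of the URL."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def b64_to_url(identifier: str) -> str:
    """Recover the URL from a base64 id, with no lookup needed."""
    return base64.urlsafe_b64decode(identifier.encode("ascii")).decode("utf-8")


def sha1_id(url: str) -> str:
    """Constant-size id: always 40 hex digits, but not reversible."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    short = "https://example.org/a.git"
    long_ = "https://example.org/" + "very/long/path/" * 20 + "repo.git"
    for url in (short, long_):
        print(len(url), len(b64_id(url)), len(sha1_id(url)))
```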

Note that origin ids are internal stuff, that will be used by the databases and internal APIs. Public APIs and user interfaces are out of scope of this discussion.

Currently, all databases that store an origin id also store the origin URL (and type). So using an origin id other than the plain URL does not provide any benefit, and we decided to go with the plain URL.

zack added a subscriber: haltode.Jul 8 2019, 11:25 AM

As it turns out, intrinsic origin identifiers are indeed handy for graph compression, so I'd like to see this task resolved.

Can we revive D1523 or is there any objection? (Which, by the way, is fine by me and fully compatible with what we are using for graph compression in T1867.)

Note that that diff will only take care of the standardization part; in addition we will also need to modify storage to store and allow retrieval of hashed origin URLs.

vlorentz closed this task as Resolved.Jul 10 2019, 4:59 PM
vlorentz claimed this task.
vlorentz reopened this task as Open.
vlorentz added a project: Storage manager.