identifiers: support optional contextual parts for line numbers and origin
Closed, MigratedEdits Locked

Description

We have discussed (yours truly, @rdicosmo, and @anlambert) how to add to our persistent identifier scheme optional parts that denote the context of interest of the identified object. In particular, we have discussed adding an optional origin and optional line numbers.

The desired syntax is the following, where square brackets denote optional "qualifiers" (piggybacking on the ARK identifier terminology):

swh:1:…[/Lnn[-mm]][/ORIGIN]

where "L" and "-" are concrete syntax, ORIGIN is a URI starting with a scheme, and "nn" and "mm" are integers.

This should be added to the persistent identifier documentation and supported by the web app resolver.

Event Timeline

As I am currently implementing this task, I am wondering whether adding optional parts to a swh identifier v1 is the right solution.

From my point of view, those optional parts only add context to the raw archived swh objects, so that the swh-web view requested by the user can be adjusted accordingly. As swh ids already contain quite a lot of characters, I think we should rely only on URL query parameters and fragments to carry this optional context information. As swh ids will mainly be used to browse the archive through swh-web, this seems more appropriate to me. Moreover, the proposed syntax for optional parts in a swh id is quite a nightmare to parse, since using '/' to separate the different parts complicates extracting the origin URL.
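
To make the parsing concern concrete, here is a small sketch. The slash-separated identifier below is only my reading of the syntax in the task description (it is not taken from any specification), and the variable names are illustrative.

```python
CORE = "swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b"
ORIGIN = "https://github.com/python/cpython"

# Hypothetical identifiers following the proposed swh:1:...[/Lnn[-mm]][/ORIGIN] syntax.
with_lines = f"{CORE}/L1-18/{ORIGIN}"
without_lines = f"{CORE}/{ORIGIN}"

# A plain split cuts the origin URL into pieces, because it contains '/' itself.
naive = with_lines.split("/")

# Bounding the split to two cuts works when the line qualifier is present...
bounded_ok = with_lines.split("/", 2)

# ...but damages the origin when it is absent, so a parser has to look at
# what follows the first '/' before deciding where to cut next.
bounded_bad = without_lines.split("/", 2)
```

With a separator that also appears in every origin URL, the parser cannot just split: it has to inspect each part to decide what it is.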

For instance, to highlight the first 18 lines of a file in the root directory of cpython, the url would be:
/swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b?origin=https://github.com/python/cpython#L1-18

Regarding content and directory objects with extra origin context, having the path information for these objects would also be interesting, as it would allow:

  • better syntax highlighting (without a file extension, the adequate language is usually not detected)
  • easy redirection to origin context views (as they use the file/directory path to resolve the object to display)

I understand that adding optional parts directly to the identifiers is also of interest, but I think that at the moment my proposal is more adequate: the URL scheme is self-explanatory and there will be no difficulty in keeping it persistent over the years.

@zack, @rdicosmo What do you think?

The problems I see with using optional URL parameters instead of modifying the identifiers themselves are the following:

  • it will be impossible to only cite "short" identifiers in non-web media such as scientific articles. That is, if you want to also include contextual information, you will have to write down in the paper a full URL like http://archive.softwareheritage.org/swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b?origin=https://github.com/python/cpython#L1-18 . Which implies that the persistence will have to be guaranteed for the entire URL (which includes stuff like a host/domain name) rather than "only" for the identifier, which is already challenging in itself. It will also mean a lot of redundancy: stuff like "swh:1:cnt" is entirely redundant if you put it into a more complete URL
  • (Yes, you can resort to short URLs to avoid some of the above problems, but that's a huge fail for long-term persistence, as we know from other identifier schemes)
  • URL parameters are not in fact as nice as in the above example. If you assume other optional parameters are present, you have to escape the content of the origin parameter. Which makes your example look more like this: /swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b?origin=https%3A%2F%2Fgithub.com%2Fpython%2Fcpython#L1-18 , which is ugly beyond repair :-)
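
The escaping described in the last bullet can be reproduced with Python's standard library. This is only an illustrative sketch: the helper name `contextual_url` is hypothetical and is not part of swh-web.

```python
from urllib.parse import parse_qs, quote, urlsplit

SWH_ID = "swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b"
ORIGIN = "https://github.com/python/cpython"


def contextual_url(swh_id, origin, first_line, last_line):
    """Build a relative URL carrying the context as query parameter and fragment."""
    # safe="" forces ':' and '/' to be percent-encoded as well, which is what
    # is needed once the origin has to coexist with other query parameters.
    encoded_origin = quote(origin, safe="")
    return f"/{swh_id}?origin={encoded_origin}#L{first_line}-{last_line}"


url = contextual_url(SWH_ID, ORIGIN, 1, 18)

# Reading the context back requires the reverse decoding step.
parts = urlsplit(url)
decoded_origin = parse_qs(parts.query)["origin"][0]
```

The round trip works, but the origin is no longer readable in the encoded form, which is the point made above.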

I sympathize with the parsing problems, but the above issues look more severe to me.
We can reconsider separators, if / is a nightmare to support, and maybe indeed use named parameters rather than just positional parameters. But if supporting contextual information in non-web media is a requirement (which, according to @rdicosmo, it is), I don't think we can say "let's just use URLs" to avoid the problem.

Thanks for the clear explanation.

So I think the best option here is to use named parameters as optional parts in the identifiers.
This will give us some flexibility for adding new ones in the future.
Regarding the separator, we could use either \ or |, as they should not interfere with
the origin URLs to extract.

For instance, for the sample cited above, this would give:

swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b\origin=https://github.com/python/cpython\lines=1-18\

or

swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b|origin=https://github.com/python/cpython|lines=1-18|

Nevertheless, this should be properly formalized in the identifiers specification, and the extraction of the optional parts should also
be handled in the swh.model.identifiers.parse_persistent_identifier function.

Keep me posted on your decision, as I need it to finish implementing the task.

So I think the best option here is to use named parameters as optional parts in the identifiers. This will give us some flexibility for adding new ones in the future. Regarding the separator, we could use either \ or |, as they should not interfere with the origin URLs to extract.

If the problem is the separator, yeah, we can choose something other than '/'. But note that if you allow the origin URL to *not* appear as the last named parameter, then almost *any* separator will be a problem. Both "|" and "\" can appear in URLs, unescaped. So if we want to avoid escaping, we should really enforce the fact that the origin URL comes last.
Considering all this, my proposal, to maximize readability while keeping parsing simple, is:

swh:1:…[;lines=NN[-MM]][;origin=URL]

Example:

swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b;lines=1-18;origin=https://github.com/python/cpython

Note that:

  • the order is *fixed*: if both are present, lines must come before origin
  • lines is plural even if a single line is provided (because I don't think it is worth supporting both "line" and "lines", honestly)
  • for implementing this, you can start by splitting on ";", but you should not split more than twice, to avoid hitting semicolons in the origin URL (as they are theoretically allowed there). Alternatively, we can specify in the identifier definition that semicolons cannot appear literally in the origin URL and must be percent-encoded there (but I'm a fan of keeping this simple, and allowing them there)
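
A minimal sketch of the fixed-order parsing described in the notes above, written as a variant of the "split at most twice" idea: it cuts once for the core identifier and once more only if a lines qualifier follows, so semicolons inside the origin URL are never touched. The function name `parse_fixed_order` is hypothetical (this is not the swh.model.identifiers code), and returning `(n, n)` for a single line is my own assumption.

```python
import re

# "lines=NN" or "lines=NN-MM", as in the syntax proposed above.
_LINES = re.compile(r"lines=(\d+)(?:-(\d+))?")


def parse_fixed_order(identifier):
    """Return (core_id, lines, origin); lines is (first, last) or None."""
    core, _, rest = identifier.partition(";")
    lines = None
    origin = None
    if rest.startswith("lines="):
        # Second and last cut: whatever remains belongs to the origin.
        lines_part, _, rest = rest.partition(";")
        match = _LINES.fullmatch(lines_part)
        if match is None:
            raise ValueError(f"invalid lines qualifier: {lines_part!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        lines = (first, last)
    if rest:
        if not rest.startswith("origin="):
            raise ValueError(f"unexpected qualifier: {rest!r}")
        # Taken verbatim, including any ';' the URL may contain.
        origin = rest[len("origin="):]
    return core, lines, origin


example = parse_fixed_order(
    "swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b"
    ";lines=1-18;origin=https://github.com/python/cpython"
)
```

Because origin is required to come last, the parser never needs to know which characters a URL may contain.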

What do you think?

I agree with your proposal.

A quick search on the origin table, to see whether we have any URL containing the ; character, gives the following result:

softwareheritage=> select * from origin where url ilike '%;%' order by id limit 100;
 id | type | url | lister | project 
----+------+-----+--------+---------
(0 rows)

softwareheritage=>

So, fortunately, there is currently no such corner case of a ';' appearing in an origin URL.
Considering that we currently have 83 801 775 archived origins, the probability that such a case appears
is really low. So, in order to simplify the parsing, we should require ; to be percent-encoded in origin URLs.

This way the order of the parameters no longer matters, and it will be easier to add new
optional parts in the future.
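
A minimal sketch of what order-independent parsing could look like once a literal ';' is forbidden in origin URLs. The function names are hypothetical and this is not the actual swh.model.identifiers code. Whether and how values get decoded after parsing is not settled in this thread, so the sketch leaves them exactly as written.

```python
def escape_semicolons(origin_url):
    """Percent-encode literal semicolons so ';' can be used as a separator."""
    return origin_url.replace(";", "%3B")


def parse_any_order(identifier):
    """Return (core_id, qualifiers) where qualifiers maps names to raw values."""
    # Safe to split everywhere: no qualifier value may contain a literal ';'.
    core, *qualifiers = identifier.split(";")
    context = {}
    for qualifier in qualifiers:
        # Cut on the first '=' only, since origin URLs may contain '=' too.
        name, sep, value = qualifier.partition("=")
        if not sep or not name or name in context:
            raise ValueError(f"malformed or repeated qualifier: {qualifier!r}")
        context[name] = value
    return core, context


lines_first = parse_any_order(
    "swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b"
    ";lines=1-18;origin=https://github.com/python/cpython"
)
origin_first = parse_any_order(
    "swh:1:cnt:9c95815d9e9d91b8dae8e05d8bbc696fe19f796b"
    ";origin=https://github.com/python/cpython;lines=1-18"
)
```

Both orderings yield the same mapping, and a new named qualifier would need no change to the parser.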

Anyway, I got what I need. Let's finish this task.


Closing, now that all sub-tasks have been completed.