Page MenuHomeSoftware Heritage

Handle gracefully trailing slashes when resolving SWHID in search box
Closed, ResolvedPublic

Description

The following URL, made up of a SWHID and a resolver works perfectly:

https://archive.softwareheritage.org/swh:1:cnt:bb0faf6919fc60636b2696f32ec9b3c2adb247fe;origin=https://github.com/id-Software/Quake-III-Arena;lines=549-572/

On the other side, entering the string obtained removing the resolver from the above URL is not accepted if used in the search box at archive.softwareheritage.org (incomplete format error)

swh:1:cnt:bb0faf6919fc60636b2696f32ec9b3c2adb247fe;origin=https://github.com/id-Software/Quake-III-Arena;lines=549-572/

The issue is with the trailing / that is not part of the SWHID grammar, and is hence properly rejected, but it can be surprising for the users.

Nonetheless, we should handle this case gracefully: silently remove any trailing / in case of parsing error, and retrying before failing.

Event Timeline

rdicosmo triaged this task as Normal priority.Apr 10 2021, 11:16 AM
rdicosmo created this task.

What about removing the trailing / from generated URLs like this one instead?

There are already many URLs in the open, so even if we remove the trailing slash now, that does not solve the problem.

The issue is to handle gracefully a SWHID which is not perfectly formed, with the usual blessed approach: "be strict in what you generate, and liberal in what you accept"

There are already many URLs in the open

Okay, fair

The issue is to handle gracefully a SWHID which is not perfectly formed, with the usual blessed approach: "be strict in what you generate, and liberal in what you accept"

My issue with this interpretation of Postel's principle is that, when we implement this, malformed SWHIDs will be accepted by archive.softwareheritage.org but not other interfaces that validate SWHIDs; and this could harm usability.

As a compromise, we could accept this trailing slash, but show a warning on the interface and/or codify in the SWHID specification an exhaustive list of "fixes" that user interfaces can/should do.

As a compromise, we could accept this trailing slash, but show a warning on the interface and/or codify in the SWHID specification an exhaustive list of "fixes" that user interfaces can/should do.

Absolutely ! :-)

I guess the simplest fix will be to automatically remove the trailing slash of such malformed SWHID frontend side after a user copied pasted it in the search box.

This could be done by checking the latest qualifier of the SWHID:

  • anchor, visit and lines cannot have a trailing slash, remove it
  • origin and path cannot be ended with double slashes, transform into single slash in that case
  • origin and path cannot be ended with double slashes, transform into single slash in that case

Why not?

There is only one origin URL in the archive ending with double slashes:

softwareheritage=> select * from origin where url ilike '%//' limit 10;
    id    |                          url                           
----------+--------------------------------------------------------
 89946580 | https://github.com/wookey-project/app-wookey-crypto///

We also have the origin URL without double slashes archived targeting the same snapshot so the proposed fix will work correctly.

For path, directory and content name do not contain any slash so a correct path cannot end with double slashes

@vlorentz , @anlambert : thanks for progressing the discussion on this issue.
After mulling over your inputs, here is my current understanding:

  1. we must make sure that the SWHIDs and the URLS we propose in the Permalinks tab are coherent, in the following sense:
  2. since there are already a number of these problematic URLs out there, we need to be able to resolve at least the SWHIDs that are obtained from these erroneous URLs
    • the current discovered issue is with trailing slashes in the lines= qualifier; there may be other cases
    • such errors may better be spotted/filtered in the frontend code
    • we may want to notify the user about the error, even if we fix it on the fly

Are we on the same page?

hence, a fix is needed in the code underlying the URL generation for the Permalinks tab (this is for @anlambert :-))

Ok got what you mean, this is an easy fix indeed plus the URL is still valid.

we may want to notify the user about the error, even if we fix it on the fly

I am not a big fan of autofixing a SWHID, validating it using the Web API and reporting errors will help users
to understand what is wrong in it. The Python validation code produces great parsing errors report so we
should take advantage of it.

For instance when the erroneous SWHID in that task is processed, the following
error will be reported: Invalid format for the lines qualifier: 549-572/
which is quite precise.

I am not a big fan of autofixing a SWHID, validating it using the Web API and reporting errors will help users
to understand what is wrong in it. The Python validation code produces great parsing errors report so we
should take advantage of it.

For instance when the erroneous SWHID in that task is processed, the following
error will be reported: Invalid format for the lines qualifier: 549-572/
which is quite precise.

If we report a precise error that helps the user fix the issue, instead of the current "Internal format error", then I'm with you :-)

If we report a precise error that helps the user fix the issue, instead of the current "Internal format error", then I'm with you :-)

The error was not properly reported due to a small bug in swh-model, fixed in D5492.

Fix has just been deployed to production, SWHID URLs have no more trailing slash.

That's not a SWHID URL but rather the resolved browse one (here the trailing slash is part of the snapshot query parameter).

SWHID resolver always produced the same browse URL without trailing / so we are good here (I mean the 404 error was already raised prior my changes to SWHID URLs).

That's not a SWHID URL but rather the resolved browse one (here the trailing slash is part of the snapshot query parameter).

SWHID resolver always produced the same browse URL without trailing / so we are good here (I mean the 404 error was already raised prior my changes to SWHID URLs).

Arf, no you are right, there is still an issue with trailing slash SWHID URL, SWHID validation fails.

I will fix that asap.

Thanks, it is indeed an urgent matter, as various journals depend on this!

Thanks, it is indeed an urgent matter, as various journals depend on this!

Fixed in D5556, will deploy it once landed.

D5556 landed and deployed to production, my bad for this, closing this again.