
canonicalize gitlab urls in origin API
Closed, Migrated (Edits Locked)

Description

Non-canonical GitLab URLs (those missing the ".git" suffix) are not matched by the origin API.

Example:
https://archive.softwareheritage.org/api/1/origin/https://gitlab.com/checkscale-gitlab/git-wtf.git/visit/latest/ >> found
https://archive.softwareheritage.org/api/1/origin/https://gitlab.com/checkscale-gitlab/git-wtf/visit/latest/ >> not found

It would be useful to canonicalize these URLs to improve matching.

For example, when using the UpdateSWH browser extension (https://www.softwareheritage.org/browser-extensions/), GitLab repositories are initially marked as not yet archived (gray tab). If you then trigger Save Code Now via the browser extension, the repository is archived under its non-canonical URL, and only then is it recognized as archived (green tab).

Event Timeline

Nice catch. Actually, the GitLab API uses the project slug (e.g. checkscale-gitlab/git-wtf) without the .git suffix, so we should make sure we can handle that same slug.

For the record, we recently disabled all origin URL processing in the Web API (we used to check an origin URL with and without
a trailing slash, but this should only be done in the Web UI, as the Web API should be as dumb as possible; see D7988).

What we could do for Save Code Now is apply the same processing as for submitted GitHub URLs: get the canonical URL using the
GitLab REST API to ensure the canonical URL is used when saving the origin.
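
A minimal sketch of that lookup, to make the idea concrete. It assumes the GitLab REST API v4 "get single project" endpoint, which accepts the URL-encoded project slug, and assumes the `http_url_to_repo` field of its response is what we would treat as the canonical URL. The function names are hypothetical and this is not the processing currently used for GitHub URLs.

```python
from urllib.parse import quote, urlsplit


def gitlab_project_api_url(origin_url: str) -> str:
    """Build the GitLab REST API URL describing the project behind origin_url."""
    parts = urlsplit(origin_url)
    # Project slug: the path without surrounding slashes or a trailing ".git".
    slug = parts.path.strip("/")
    if slug.endswith(".git"):
        slug = slug[: -len(".git")]
    # The API expects the slug fully URL-encoded, so "/" becomes "%2F".
    encoded_slug = quote(slug, safe="")
    return f"{parts.scheme}://{parts.netloc}/api/v4/projects/{encoded_slug}"


def canonical_url_from_project(project: dict) -> str:
    """Pick the HTTP clone URL out of a decoded project JSON document.

    Assumption: "http_url_to_repo" is the field to use as canonical URL.
    """
    return project["http_url_to_repo"]
```

The HTTP request itself is left out on purpose; the caller would fetch the URL returned by `gitlab_project_api_url`, decode the JSON body, and pass it to `canonical_url_from_project` before saving the origin.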

If we want to reinstate some origin URL processing in the Web API, we could also add a new query parameter to the /origin endpoint
indicating that some heuristics (lowercasing, with and without trailing slash, with and without .git) should be applied to try to find a
similar origin URL if the provided one is not found.
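
As a rough illustration of those heuristics, the sketch below generates candidate URLs to try in order when the provided one is not found. The function name is hypothetical, and lowercasing the whole URL is an assumption about what "applying lowercase" would mean; nothing here reflects an existing Web API implementation.

```python
from __future__ import annotations


def candidate_origin_urls(url: str) -> list[str]:
    """Return the given URL followed by plausible variants, without duplicates."""
    candidates: list[str] = []
    # Try the URL as provided first, then its lowercased form.
    for base in (url, url.lower()):
        stripped = base.rstrip("/")
        # Same URL with any trailing ".git" removed.
        bare = stripped[: -len(".git")] if stripped.endswith(".git") else stripped
        variants = (
            base,
            stripped,        # without trailing slash
            stripped + "/",  # with trailing slash
            bare,            # without ".git"
            bare + ".git",   # with ".git"
            bare + "/",      # without ".git", with trailing slash
        )
        for variant in variants:
            if variant not in candidates:
                candidates.append(variant)
    return candidates
```

The lookup would stop at the first candidate that matches a known origin, so the provided URL always takes precedence over any guessed variant.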

anlambert triaged this task as Normal priority. Jul 1 2022, 3:24 PM
anlambert added a project: Web app.