Origin URL duplicates due to caps and .git URL
Closed, Migrated

Description

Hi Team,

I've just stumbled across a peculiarity that I didn't expect. On seeing that the last archived version didn't include the latest releases, I have "saved code now" via my mobile, using a GitHub .git URL. This led to a duplication of the entry.

First up, sorry for not checking the origin URL of the existing record first.

Here are the results of a current search for "hexatomic":

As you can see, there are two issues, which aren't easy to separate in hindsight unfortunately.

  1. I guess the second (.git-suffixed) URL is treated as having a different "target" than the one without the suffix, although they point to the same target.
  2. Capitalization is preserved in the URL. It was first introduced by auto-completion on my phone, but I had reckoned it would be fixed on the SWH end, at least for the sensible parts (protocol, TLD).

As for 1., I'm not sure if there is an actual semantic difference in Git between a .git-suffixed URL and one without the suffix. Perhaps this is platform-dependent and changing it would threaten genericity in the back-end. As an end user, however, I'd have expected these two to be treated as the same, i.e., the .git snapshot overwriting (or adding to) the existing snapshot without the suffix.

As for 2., I know that GitHub URLs are case-sensitive with regard to at least the repository path, perhaps even the user/org path, and also that there's an awful lot of forwarding involved, e.g., when a repository name has changed. Perhaps it would be worthwhile, though, to look into unifying the interchangeable parts of the URL, which I think would be the protocol and the top-level domain.

Event Timeline

sdruskat created this object in space S1 Public.
sdruskat updated the task description.

P.S. Happy to help in any way that you may deem useful!

Hi,

Sorry for not replying earlier.

Adding .git indeed does change the URL (even if it does not on GitHub), so we can't normalize this in general.

Case in the path part also changes the URL, so we can't normalize this either.

We could however case-normalize the scheme (http://) and domain part; but this might be more trouble than having duplicated origins.
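
For illustration, case-normalizing only the scheme and domain could look like the minimal Python sketch below. It is not existing SWH code: the function name and the example URL are hypothetical, and the path, query and fragment are deliberately left untouched because their case may be significant.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_scheme_and_host(url: str) -> str:
    """Lowercase only the scheme and host of a URL; keep everything else as-is."""
    parts = urlsplit(url)
    # Split off optional credentials so that only host[:port] is lowercased.
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + sep + hostport.lower()
    # urlsplit already lowercases the scheme; .lower() just makes that explicit.
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


# Illustrative URL only; the path keeps its original capitalization.
print(normalize_scheme_and_host("HTTPS://GitHub.com/Hexatomic/Hexatomic.git"))
# https://github.com/Hexatomic/Hexatomic.git
```

Note that this alone would not merge the two origins from the report, since the path case and the .git suffix are preserved.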

vlorentz triaged this task as Normal priority. Jan 31 2020, 4:22 PM
vlorentz added a project: Data Model.

Hi @sdruskat,
Thanks for your input.
I agree with you that for the end user this looks like a problem.
On the other hand, we are an infrastructure that needs to consider many platforms for the long term, so we can't pick and choose whatever feels more comfortable for the user.

The .git origin is normally created because someone has entered a URL with the .git suffix in Save Code Now.
@vlorentz, would it be possible to check this on the Save Code Now end, for GitHub repositories only, and normalize the URL before injection?
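
A GitHub-only check of this kind might be sketched as follows in Python. This is an assumption about what such a normalization could do, not the actual Save Code Now code; the function name is hypothetical, and URLs on any other host are returned unchanged because the .git suffix can be significant there.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_git_suffix_for_github(url: str) -> str:
    """Drop a trailing ".git" from the path, but only for github.com URLs."""
    parts = urlsplit(url)
    # SplitResult.hostname is already lowercased (or None if there is no host).
    if parts.hostname != "github.com":
        return url  # other forges: leave the URL exactly as submitted
    path = parts.path
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(strip_git_suffix_for_github("https://github.com/hexatomic/hexatomic.git"))
# https://github.com/hexatomic/hexatomic
print(strip_git_suffix_for_github("https://example.org/repo.git"))
# https://example.org/repo.git
```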

By the way, if you do already have a snapshot on SWH, you can take a new snapshot on the repository page in the actions menu.

Cheers,
Morane

A recent discussion about this issue took place on the #swh-devel IRC channel. The gist of
it is that, for GitHub repositories submitted through Save Code Now [1], the webapp should
be changed to query the GitHub API to determine the canonical URL of a repository
and use it as the origin.

This should happen both client and server side (as fallback).

[1] The lister already uses the canonical urls.
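
As a rough Python sketch of the server-side fallback: the GitHub REST endpoint `GET /repos/{owner}/{repo}` returns JSON that includes an `html_url` field. Treating `html_url` as the canonical origin URL is an assumption here, and the function names and the injectable `fetch` parameter are hypothetical, introduced so the logic can be exercised without network access.

```python
import json
from typing import Callable, Optional
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_repo_metadata(owner: str, repo: str) -> dict:
    """Unauthenticated call to GET /repos/{owner}/{repo} on the GitHub REST API."""
    with urlopen(f"https://api.github.com/repos/{owner}/{repo}", timeout=10) as resp:
        return json.load(resp)


def canonical_github_url(
    url: str, fetch: Callable[[str, str], dict] = fetch_repo_metadata
) -> Optional[str]:
    """Return the URL reported by the API, or None if this is not a GitHub repo URL."""
    parts = urlsplit(url)
    if parts.hostname != "github.com":  # hostname is already lowercased
        return None
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:  # expect exactly /<owner>/<repo>
        return None
    owner, repo = segments
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    # Assumption: "html_url" is the value to use as the canonical origin.
    return fetch(owner, repo)["html_url"]
```

Called without the `fetch` argument, the function performs a real, rate-limited API request; a stub can be passed instead when testing.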

Ideally this should be performed client side, as the GitHub API endpoint we need to query is rate-limited, see below:

anlambert@carnavalet:/tmp$ curl -I https://api.github.com/repos/codemeta/codemeta
HTTP/2 200 
server: GitHub.com
date: Mon, 14 Jun 2021 11:24:55 GMT
content-type: application/json; charset=utf-8
cache-control: public, max-age=60, s-maxage=60
vary: Accept, Accept-Encoding, Accept, X-Requested-With
etag: W/"8249a4c52852c2e8b7c0ba9dd574b01f9259bb357a852831889810dbb8b72f0a"
last-modified: Thu, 13 May 2021 06:33:42 GMT
x-github-media-type: github.v3; format=json
access-control-expose-headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset
access-control-allow-origin: *
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-frame-options: deny
x-content-type-options: nosniff
x-xss-protection: 0
referrer-policy: origin-when-cross-origin, strict-origin-when-cross-origin
content-security-policy: default-src 'none'
x-ratelimit-limit: 60
x-ratelimit-remaining: 54
x-ratelimit-reset: 1623670405
x-ratelimit-resource: core
x-ratelimit-used: 6
accept-ranges: bytes
content-length: 6546
x-github-request-id: 84CA:CBF0:FDA422:10E1360:60C73C87
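
To make the rate-limit headers above concrete, here is a small Python sketch that interprets them. The function name and the returned dictionary layout are hypothetical; the header names and values are taken from the response shown above, and `x-ratelimit-reset` is read as a Unix timestamp in seconds.

```python
from datetime import datetime, timezone
from typing import Mapping


def rate_limit_status(headers: Mapping[str, str]) -> dict:
    """Summarize GitHub's x-ratelimit-* response headers."""
    # Keys are looked up in lowercase, as printed by curl for HTTP/2; a plain
    # dict is case-sensitive, unlike most HTTP client header objects.
    limit = int(headers["x-ratelimit-limit"])
    remaining = int(headers["x-ratelimit-remaining"])
    reset_at = datetime.fromtimestamp(int(headers["x-ratelimit-reset"]), tz=timezone.utc)
    return {
        "limit": limit,
        "remaining": remaining,
        "reset_at": reset_at,
        "exhausted": remaining == 0,
    }


status = rate_limit_status(
    {
        "x-ratelimit-limit": "60",
        "x-ratelimit-remaining": "54",
        "x-ratelimit-reset": "1623670405",
    }
)
print(status["remaining"], status["reset_at"].isoformat())
# 54 2021-06-14T11:33:25+00:00
```

With a limit of 60 requests per window for unauthenticated calls from one address, a shared backend would exhaust the quota far sooner than individual browsers would.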

If we also implement it as a fallback on the backend side, we should find a way to determine if an input save code now request has been created from the Web UI or through a direct call to the Web API.

Exactly.

In the meantime, we can implement the first part, client side, which sounds simpler.