Origin URL duplicates due to caps and .git URL
Open, Normal, Public

Description

Hi Team,

I've just stumbled across a peculiarity that I didn't expect. Seeing that the last archived version didn't include the latest releases, I used "Save Code Now" from my phone with a GitHub .git URL. This led to a duplicate entry.

First up, sorry for not checking the origin URL of the existing record first.

Here are the results of a current search for "hexatomic":

As you can see, there are two issues, which aren't easy to separate in hindsight unfortunately.

  1. I guess the second (.git-suffixed) URL is treated as having a different "target" than the one without the suffix, although they point to the same target.
  2. Capitalization is preserved in the URL. It was first introduced by auto-completion on my phone, but I had reckoned it would be fixed on the SWH end, at least for the sensible parts (protocol, TLD).

As for 1., I'm not sure if there is an actual semantic difference in Git between a .git-suffixed URL and one without the suffix. Perhaps this is platform-dependent and changing it would threaten genericity in the back-end. As an end user, however, I'd have expected these two to be treated as the same, i.e., the .git snapshot overwriting (or adding to) the existing snapshot without the suffix.

As for 2., I know that GitHub URLs are case-sensitive with regard to at least the repository path, perhaps even the user/org path, and also that there's an awful lot of forwarding involved, e.g., when a repository name has changed. Perhaps it would be worthwhile, though, to look into unifying the interchangeable parts of the URL, which I think would be the protocol and the top-level domain.

Event Timeline

sdruskat created this task. Jan 15 2020, 4:36 PM
sdruskat created this object in space S1 Public.
sdruskat updated the task description.

P.S. Happy to help in any way that you may deem useful!

Hi,

Sorry for not replying earlier.

Adding .git does indeed change the URL (even though both forms resolve to the same repository on GitHub), so we can't normalize this in general.

Case in the path part also changes the URL, so we can't normalize this either.

We could, however, case-normalize the scheme (http://) and domain parts, but this might be more trouble than having duplicated origins.
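
For illustration only, here is a rough sketch of what case-normalizing just the scheme and domain could look like. It uses Python's standard urllib.parse; the function name normalize_scheme_and_host is made up for this example and is not an existing Software Heritage API. The path is deliberately left untouched, since case and a .git suffix there can matter.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_scheme_and_host(url: str) -> str:
    """Lowercase the scheme and host of a URL; leave everything else as-is.

    Hypothetical helper for illustration, not part of any SWH codebase.
    """
    parts = urlsplit(url)
    # Keep any "user:password@" prefix unchanged and lowercase only host[:port].
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + sep + hostport.lower()
    # The path (including case and any ".git" suffix) is intentionally preserved.
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


print(normalize_scheme_and_host("HTTPS://GitHub.com/Hexatomic/hexatomic"))
```

Note that this would only merge origins that differ in scheme or domain case; URLs that differ in the path would still be separate origins.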

vlorentz triaged this task as Normal priority. Jan 31 2020, 4:22 PM
vlorentz added a project: Data Model.

Hi @sdruskat,
Thanks for your input.
I agree with you that for the end user this looks like a problem.
On the other hand, we are an infrastructure that needs to consider many platforms for the long term, so we can't pick and choose whatever feels more comfortable for the user.

The .git-suffixed origin is normally created because someone has entered a URL ending in .git in Save Code Now.
@vlorentz, would it be possible to check this on the Save Code Now end, for GitHub repositories only, and normalize the URL before injection?
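
To make the suggestion concrete, a minimal sketch of such a GitHub-only check might look like the following. This is an assumption about how it could be done, not how Save Code Now actually works; the function name normalize_github_origin is hypothetical, and URLs on any other host are returned unchanged.

```python
from urllib.parse import urlsplit, urlunsplit

GIT_SUFFIX = ".git"


def normalize_github_origin(url: str) -> str:
    """Strip a trailing ".git" from github.com URLs only.

    Hypothetical helper for illustration; other hosts are left untouched
    because ".git" may be significant there.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() != "github.com":
        return url
    path = parts.path.rstrip("/")
    if path.endswith(GIT_SUFFIX):
        path = path[: -len(GIT_SUFFIX)]
    # The case of the path is preserved; only the host is lowercased.
    return urlunsplit((parts.scheme, "github.com", path, parts.query, parts.fragment))


print(normalize_github_origin("https://GitHub.com/hexatomic/hexatomic.git"))
```

Restricting the rule to one known host keeps the general data model unchanged while avoiding the most common source of duplicates.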

By the way, if you do already have a snapshot on SWH, you can take a new snapshot on the repository page in the actions menu.

Cheers,
Morane