Listers: Canonicalize listed github origins
Closed, Migrated

Description

While working on the maven lister, it came to our attention that some URLs can be
listed without being the main canonical URLs. This can result in duplicated origins
for no good reason.

So let's reuse the existing URL canonicalization code (for gh origins) in listers
wherever possible. That code should exist in swh-web; it should be refactored out into
swh.core, then reused both in swh-web and in listers (starting with the maven one;
possibly the nixguix and packagist listers can be done later as well).
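
To make the idea concrete, here is a minimal sketch of what such a canonicalization helper could look like. It is not the code that landed in swh.core: the names `parse_github_url`, `canonical_github_url` and `fetch_repo_metadata` are hypothetical, and the metadata lookup is injected as a callable so the sketch runs without network access. The only outside fact it relies on is that the GitHub REST API repository object carries an `html_url` field.

```python
import re
from typing import Callable, Mapping, Optional, Tuple

# Matches plain repository URLs such as https://github.com/owner/repo(.git)(/)
_GITHUB_REPO_URL = re.compile(
    r"^(?:https?|git)://(?:www\.)?github\.com/"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$",
    re.IGNORECASE,
)

# Hypothetical signature: (owner, repo) -> repository metadata, or None if unknown.
RepoFetcher = Callable[[str, str], Optional[Mapping[str, str]]]


def parse_github_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for a GitHub repository URL, else None."""
    match = _GITHUB_REPO_URL.match(url.strip())
    if match is None:
        return None
    return match.group("owner"), match.group("repo")


def canonical_github_url(url: str, fetch_repo_metadata: RepoFetcher) -> str:
    """Return the canonical URL reported for the repository.

    Falls back to the input URL when it is not a GitHub repository URL
    or when the lookup yields nothing usable.
    """
    parsed = parse_github_url(url)
    if parsed is None:
        return url
    owner, repo = parsed
    metadata = fetch_repo_metadata(owner, repo)
    if not metadata or "html_url" not in metadata:
        return url
    return metadata["html_url"]


def fake_fetch(owner: str, repo: str) -> Optional[Mapping[str, str]]:
    """Stand-in for an API lookup: pretends one repository was renamed."""
    renamed = {
        ("old-owner", "old-name"): {
            "html_url": "https://github.com/new-owner/new-name"
        }
    }
    return renamed.get((owner.lower(), repo.lower()))
```

A lister would call `canonical_github_url(origin_url, fetcher)` on each listed origin before recording it, so that an old or alternate URL and its canonical form end up as a single origin.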

Plan:

  • D7836: Compute canonical gh urls in an exposed library function in swh.core
  • D7840: Refactor GitHubSession request management out of swh.lister in swh.core
  • Release (2.6.0)
  • Unstick the Debian build if there is a problem (new deps)
  • D7870: Use GitHubSession to make the canonical computation deal with rate limit
  • Release (2.7.0)
  • D7877: Refactor swh.lister to reuse the code moved in swh.core
  • D7880: Add missing canonical case in swh.core
  • Release (2.8.0)
  • D7879: (Goal) Adapt maven lister to list canonical gh urls if any
  • D7946: Extra work for exotic github urls (deployed on staging)
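
The rate-limit step in the plan can be illustrated with a small sketch. This is not the GitHubSession implementation; `seconds_until_reset` is a hypothetical helper. It assumes the response headers are passed as a plain mapping using the exact keys `X-RateLimit-Remaining` and `X-RateLimit-Reset` (the latter being an epoch timestamp in seconds, as GitHub documents it).

```python
from typing import Mapping


def seconds_until_reset(headers: Mapping[str, str], now: float) -> float:
    """Return how long to wait before retrying a GitHub API call.

    Returns 0.0 when the headers are missing or when some quota remains.
    `now` is the current time as epoch seconds (e.g. time.time()).
    """
    remaining = headers.get("X-RateLimit-Remaining")
    reset = headers.get("X-RateLimit-Reset")
    if remaining is None or reset is None:
        return 0.0
    if int(remaining) > 0:
        return 0.0
    # Quota exhausted: wait until the reset time, never a negative duration.
    return max(0.0, float(reset) - now)
```

A caller would sleep for the returned duration and then retry the canonical URL lookup, rather than failing the whole listing when the quota runs out.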

An extra plan was extracted out of this task [1]

[1] T4279

Note: gh refers to GitHub

Event Timeline

ardumont triaged this task as Normal priority. May 11 2022, 11:55 AM
ardumont created this task.

Note that swh-web currently does it purely on the client side, using the GitHub API. It may be good to move it server-side, though (this avoids leaking users' IP addresses to GitHub, and also fixes some failure conditions mentioned in T4055).

Yes, that's also my point of view.

ardumont updated the task description.