Page MenuHomeSoftware Heritage

Deal with GitHub removing support for git:// URLs
Open, HighPublic

Description

GitHub just announced they will remove support for the git:// protocol in mid-march (with a brownout in mid-january)

Unfortunately, we depend on it to clone efficiently.

So we need to either:

  • Ask GitHub to keep supporting it (at least for us)
  • Figure out an alternative

Related Objects

Event Timeline

vlorentz triaged this task as High priority.Sep 1 2021, 9:11 PM
vlorentz created this task.

The dulwich HTTP(s) support is implemented on top of urllib(3?).

As such, on top of it being unable to enforce a limit for pack files, it's not able to have the bidirectional conversation with the git server that's needed for the negotiation of the objects that need to be fetched.

I can see a few alternatives to using git:// over tcp:

  • Give our swh bot accounts SSH keys, and use that to clone from GitHub over ssh.
  • Improve the Dulwich git-over-http(s) implementation.
  • Switch over to another implementation of the git remote protocol (which one?) to retrieve packfiles.

Obviously last two options would benefit all origins, not just GitHub.

In T3544#69746, @olasd wrote:

I can see a few alternatives to using git:// over tcp:

  • Give our swh bot accounts SSH keys, and use that to clone from GitHub over ssh.

Unless I'm missing something (complicated key management from an operations perspective, perhaps?) this one seems to be both easier to implement and less risky (e.g., in terms of unforeseen impact on the implementation side). And hence the way to go.

  • Improve the Dulwich git-over-http(s) implementation.

If this is what we want, we should probably consider contracting dulwich maintainer(s) directly, rather then doing it ourselves.

The first github milestone was reached today.
All the github loading are failing since 00:00 UTC.

Jan 11 07:42:13 worker01 python3[2189804]: [2022-01-11 07:42:13,599: INFO/ForkPoolWorker-1491] Load origin 'https://github.com/Joceilso/feitosa' with type 'git'
Jan 11 07:42:13 worker01 python3[2189804]: [2022-01-11 07:42:13,819: ERROR/ForkPoolWorker-1491] Loading failure, updating to `failed` status
                                           Traceback (most recent call last):
                                             File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 338, in load
                                               more_data_to_fetch = self.fetch_data()
                                             File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 273, in fetch_data
                                               self.origin.url, base_repo, do_progress
                                             File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 209, in fetch_pack_from_origin
                                               progress=do_activity,
                                             File "/usr/lib/python3/dist-packages/dulwich/client.py", line 1021, in fetch_pack
                                               refs, server_capabilities = read_pkt_refs(proto)
                                             File "/usr/lib/python3/dist-packages/dulwich/client.py", line 235, in read_pkt_refs
                                               raise GitProtocolError(ref.decode("utf-8", "replace"))
                                           dulwich.errors.GitProtocolError: The unauthenticated git protocol on port 9418 is no longer supported.
                                           Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.

For now, I've disabled our hardcoding of the TCP transport for GitHub origins.

Just to recap, the original (and pretty much only) problem that this hardcoding of the TCP transport was working around, is that the dulwich http(s) client is fetching the full packfile in memory before streaming it to the user-provided do_pack function. This function is what enforces our pack file size limit.

We're not actually using any negotiation of objects to fetch in the current implementation of the git loader. We only send a static list of known object ids.

I think we should go forward with a solution that would work on all repos, not only GitHub (so the ssh transport is pretty much out). We only did this workaround on github because it's our (very) biggest source of git repos.

Considering that there's now a number of repositories with legitimate multi-gigabyte packfiles (e.g. chromium), our best way forward is probably working on a strategy to fetch multiple, smaller, packfiles instead of a big one when loading such a repo. We should also work on doing cross-repo deduplication again to avoid fetching large packfiles for obvious forks.

Oh, and now that we've moved workers to have a large swap space, the issue of downloading the full packfile in ram before rejecting it should be less disruptive than it's been in the past (where the whole worker would get killed because it ran out of its memory allocation).

I guess this is then related to T3653 somehow