URLs
Closed, MigratedEdits Locked
Actions

Assigned To

Authored By

	vlorentz
	Sep 1 2021, 9:11 PM

Description

GitHub just announced they will remove support for the git:// protocol in mid-march (with a brownout in mid-january)

Unfortunately, we depend on it to clone efficiently.

So we need to either:

Ask GitHub to keep supporting it (at least for us)
Figure out an alternative

Revisions and Commits

rDLDG Git loader
	rDLDGdb5b80b47cbe Disable hardcoding the TCP transport on GitHub

Related Objects
Search...

Status	Assigned	Task
Migrated	gitlab-migration	T2206 Quality of Service
Migrated	gitlab-migration	T4080 Minimize archival lag w.r.t. upstream code hosting platforms
Migrated	gitlab-migration	T2207 Improve ingestion efficiency
Migrated	gitlab-migration	T3544 Deal with GitHub removing support for git:// URLs

Event Timeline

vlorentz triaged this task as High priority.Sep 1 2021, 9:11 PM

vlorentz created this task.

The dulwich HTTP(s) support is implemented on top of urllib(3?).

As such, on top of it being unable to enforce a limit for pack files, it's not able to have the bidirectional conversation with the git server that's needed for the negotiation of the objects that need to be fetched.

I can see a few alternatives to using git:// over tcp:

Give our swh bot accounts SSH keys, and use that to clone from GitHub over ssh.
Improve the Dulwich git-over-http(s) implementation.
Switch over to another implementation of the git remote protocol (which one?) to retrieve packfiles.

Obviously last two options would benefit all origins, not just GitHub.

In T3544#69746, @olasd wrote:

I can see a few alternatives to using git:// over tcp:

Give our swh bot accounts SSH keys, and use that to clone from GitHub over ssh.

Unless I'm missing something (complicated key management from an operations perspective, perhaps?) this one seems to be both easier to implement and less risky (e.g., in terms of unforeseen impact on the implementation side). And hence the way to go.

Improve the Dulwich git-over-http(s) implementation.

If this is what we want, we should probably consider contracting dulwich maintainer(s) directly, rather then doing it ourselves.

vsellier added a subscriber: vsellier.Sep 6 2021, 10:49 AM

vlorentz updated the task description. (Show Details)Nov 15 2021, 3:14 PM

The first github milestone was reached today.
All the github loading are failing since 00:00 UTC.

Jan 11 07:42:13 worker01 python3[2189804]: [2022-01-11 07:42:13,599: INFO/ForkPoolWorker-1491] Load origin 'https://github.com/Joceilso/feitosa' with type 'git'
Jan 11 07:42:13 worker01 python3[2189804]: [2022-01-11 07:42:13,819: ERROR/ForkPoolWorker-1491] Loading failure, updating to `failed` status
                                           Traceback (most recent call last):
                                             File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 338, in load
                                               more_data_to_fetch = self.fetch_data()
                                             File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 273, in fetch_data
                                               self.origin.url, base_repo, do_progress
                                             File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 209, in fetch_pack_from_origin
                                               progress=do_activity,
                                             File "/usr/lib/python3/dist-packages/dulwich/client.py", line 1021, in fetch_pack
                                               refs, server_capabilities = read_pkt_refs(proto)
                                             File "/usr/lib/python3/dist-packages/dulwich/client.py", line 235, in read_pkt_refs
                                               raise GitProtocolError(ref.decode("utf-8", "replace"))
                                           dulwich.errors.GitProtocolError: The unauthenticated git protocol on port 9418 is no longer supported.
                                           Please see https://github.blog/2021-09-01-improving-git-protocol-security-github/ for more information.

olasd added a commit: rDLDGdb5b80b47cbe: Disable hardcoding the TCP transport on GitHub.Jan 11 2022, 9:33 AM

For now, I've disabled our hardcoding of the TCP transport for GitHub origins.

Just to recap, the original (and pretty much only) problem that this hardcoding of the TCP transport was working around, is that the dulwich http(s) client is fetching the full packfile in memory before streaming it to the user-provided do_pack function. This function is what enforces our pack file size limit.

We're not actually using any negotiation of objects to fetch in the current implementation of the git loader. We only send a static list of known object ids.

I think we should go forward with a solution that would work on all repos, not only GitHub (so the ssh transport is pretty much out). We only did this workaround on github because it's our (very) biggest source of git repos.

Considering that there's now a number of repositories with legitimate multi-gigabyte packfiles (e.g. chromium), our best way forward is probably working on a strategy to fetch multiple, smaller, packfiles instead of a big one when loading such a repo. We should also work on doing cross-repo deduplication again to avoid fetching large packfiles for obvious forks.

Oh, and now that we've moved workers to have a large swap space, the issue of downloading the full packfile in ram before rejecting it should be less disruptive than it's been in the past (where the whole worker would get killed because it ran out of its memory allocation).

I guess this is then related to T3653 somehow

douardda added a parent task: T2207: Improve ingestion efficiency .Feb 18 2022, 11:21 AM

dealt with (at least in terms of only using https for clones)

bchauvet lowered the priority of this task from High to Normal.Apr 11 2022, 11:57 AM

vlorentz mentioned this in T1739: Define an architecture to fetch extrinsic metadata outside listers and loaders.Apr 11 2022, 1:45 PM

I'm closing this. I've submitted T4216 to track the actual packfile limit issue.

This task has been migrated to GitLab.

Deal with GitHub removing support for git:// URLsClosed, MigratedEdits LockedActions

Description

Revisions and Commits

Related ObjectsSearch...

Event Timeline

Deal with GitHub removing support for git:// URLs
Closed, MigratedEdits Locked
Actions

Related Objects
Search...