Hardcode the use of the tcp transport for GitHub origins

Description

This change is necessary because of a shortcoming in the Dulwich HTTP
transport: even though the Dulwich API lets us process the packfile in
chunks as it is received, the HTTP transport implementation has to
allocate the entire packfile in memory *twice*, once in the HTTP
library and once in a BytesIO managed by Dulwich, before passing it on
to us as a chunked reader. Overall this triples the memory usage, and
all of it happens before we get any chance to interrupt the loader for
overrunning its memory limit.

In contrast, the Dulwich TCP transport just gives us the read handle on
the underlying socket, doing no processing or copying of the bytes. We
can interrupt it as soon as we've received too many bytes.

Details

Provenance
olasd Authored on Feb 25 2021, 3:59 PM
olasd Pushed on Feb 25 2021, 6:46 PM
Differential Revision
D5148: Hardcode the use of the tcp transport for GitHub origins
Parents
rDLDG61afbc56b035: Stop processing packfiles before sending objects
Branches
Unknown
Tags
Unknown
References
tag: v0.9.0
Build Status
Buildable 19501
Build 30252: test-and-build (Jenkins)