
Hardcode the use of the tcp transport for GitHub origins
ClosedPublic

Authored by olasd on Feb 25 2021, 4:19 PM.

Details

Summary

This change is necessary because of a shortcoming in the Dulwich HTTP
transport: even if the Dulwich API lets us process the packfile in
chunks as it's received, the HTTP transport implementation needs to
entirely allocate the packfile in memory *twice*, once in the HTTP
library, and once in a BytesIO managed by Dulwich, before passing it on
to us as a chunked reader. Overall this triples the memory usage before
we even get a chance to interrupt the loader as it overruns its memory limit.

In contrast, the Dulwich TCP transport just gives us the read handle on
the underlying socket, doing no processing or copying of the bytes. We
can interrupt it as soon as we've received too many bytes.
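The streaming advantage described above can be sketched as a byte-budget wrapper around a raw read handle: because the TCP transport hands us the socket directly, we can count bytes as they arrive and bail out early. The class and limit names below are illustrative only, not the loader's actual code.

```python
import io


class SizeCappedReader:
    """Wrap a raw read handle and fail as soon as a byte budget is exceeded.

    Illustrative sketch: the real loader integrates with Dulwich's fetch
    machinery rather than wrapping a reader like this.
    """

    def __init__(self, raw, max_bytes):
        self.raw = raw
        self.max_bytes = max_bytes
        self.read_bytes = 0

    def read(self, size=-1):
        data = self.raw.read(size)
        self.read_bytes += len(data)
        if self.read_bytes > self.max_bytes:
            # With a socket handle we can stop here, before the whole
            # packfile is buffered in memory.
            raise IOError(f"packfile exceeded {self.max_bytes} bytes")
        return data


# Simulate a 1 KiB "packfile" with a 512-byte budget.
capped = SizeCappedReader(io.BytesIO(b"x" * 1024), max_bytes=512)
try:
    capped.read()
except IOError as exc:
    print("aborted:", exc)
```

With the HTTP transport, by contrast, the budget check could only run after the packfile had already been materialized in memory twice, defeating the point of the limit.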

Test Plan

This used to be the behavior of this loader, and has been tested on
some GitHub origins.

Diff Detail

Repository
rDLDG Git loader
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D5148 (id=18410)

Could not rebase; Attempt merge onto 11fe5b2770...

Updating 11fe5b2..50cefb7
Fast-forward
 swh/loader/git/from_disk.py |  11 +---
 swh/loader/git/loader.py    | 147 +++++++++++---------------------------------
 2 files changed, 36 insertions(+), 122 deletions(-)
Changes applied before test
commit 50cefb7fed5cb104f4daeb688188fd2999f61879
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:59:43 2021 +0100

    Hardcode the use of the tcp transport for GitHub origins
    
    This change is necessary because of a shortcoming in the Dulwich HTTP
    transport: even if the Dulwich API lets us process the packfile in
    chunks as it's received, the HTTP transport implementation needs to
    entirely allocate the packfile in memory *twice*, once in the HTTP
    library, and once in a BytesIO managed by Dulwich, before passing it on
    to us as a chunked reader. Overall this triples the memory usage before
    we can even try to interrupt the loader before it overruns its memory limit.
    
    In contrast, the Dulwich TCP transport just gives us the read handle on
    the underlying socket, doing no processing or copying of the bytes. We
    can interrupt it as soon as we've received too many bytes.

commit 076abc14fff85aa5ab687820932ee5c6332c953c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:42:21 2021 +0100

    Stop processing packfiles before sending objects
    
    Since its creation, the git loader would process the packfile downloaded
    from the remote repository, to make an index of all objects, filtering
    them before sending them on to the storage. Since this functionality has
    been implemented as a filter proxy in the storage API itself, the
    built-in filtering by the git loader is now redundant.
    
    The way the filtering was implemented in the loader would run through
    the packfile six times: once for the basic object id indexing, once to
    get content ids, then once for each object type. This change removes the
    first two runs. By eschewing the double filtering, we should also reduce
    the load on the backend storage (we would call the <object_type>_missing
    endpoints twice).
    
    Finally, as this change removes the global index of objects, and sends
    the converted objects to the storage as soon as they're read, the memory
    usage decreases substantially for large loads.

commit e66c0927023fcb8eb098e5f1f2ea1705aa11f9d7
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 4 16:54:15 2021 +0100

    Drop unused get_fetch_history_result methods

commit 7baa8117050ed4eb1f616422351851866b4252f1
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 16:16:01 2021 +0100

    Drop unused type: ignore annotation

Link to build: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/88/
See console output for more information: https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/88/console

Harbormaster returned this revision to the author for changes because remote builds failed. Feb 25 2021, 4:20 PM
Harbormaster failed remote builds in B19488: Diff 18410!

Build is green

Patch application report for D5148 (id=18416)

Could not rebase; Attempt merge onto 11fe5b2770...

Updating 11fe5b2..cb620b0
Fast-forward
 swh/loader/git/from_disk.py |   9 ---
 swh/loader/git/loader.py    | 147 +++++++++++---------------------------------
 2 files changed, 35 insertions(+), 121 deletions(-)
Changes applied before test
commit cb620b0aecf8d660ba70897435df7642161946ce
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:59:43 2021 +0100

    Hardcode the use of the tcp transport for GitHub origins
    
    This change is necessary because of a shortcoming in the Dulwich HTTP
    transport: even if the Dulwich API lets us process the packfile in
    chunks as it's received, the HTTP transport implementation needs to
    entirely allocate the packfile in memory *twice*, once in the HTTP
    library, and once in a BytesIO managed by Dulwich, before passing it on
    to us as a chunked reader. Overall this triples the memory usage before
    we can even try to interrupt the loader before it overruns its memory limit.
    
    In contrast, the Dulwich TCP transport just gives us the read handle on
    the underlying socket, doing no processing or copying of the bytes. We
    can interrupt it as soon as we've received too many bytes.

commit 61afbc56b0356f31fd12a6c44713d16617db48da
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:42:21 2021 +0100

    Stop processing packfiles before sending objects
    
    Since its creation, the git loader would process the packfile downloaded
    from the remote repository, to make an index of all objects, filtering
    them before sending them on to the storage. Since this functionality has
    been implemented as a filter proxy in the storage API itself, the
    built-in filtering by the git loader is now redundant.
    
    The way the filtering was implemented in the loader would run through
    the packfile six times: once for the basic object id indexing, once to
    get content ids, then once for each object type. This change removes the
    first two runs. By eschewing the double filtering, we should also reduce
    the load on the backend storage (we would call the <object_type>_missing
    endpoints twice).
    
    Finally, as this change removes the global index of objects, and sends
    the converted objects to the storage as soon as they're read, the memory
    usage decreases substantially for large loads.

commit 5e434d6f6e1c2f072477deb50f073333a0e9f45b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 4 16:54:15 2021 +0100

    Drop unused get_fetch_history_result methods

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/90/ for more details.

olasd requested review of this revision. Feb 25 2021, 5:54 PM

I guess we want this for GitHub as a first approximation, and maybe on other forges as well at some point?

swh/loader/git/loader.py
157

Shouldn't it be git+?

This deserves a comment IMO (you could copy-paste the diff description)

swh/loader/git/loader.py
157

no, it replaces https://github.com with git://github.com.

git+https://github.com wouldn't change anything
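The substitution being discussed can be sketched as a small helper: rewrite an `https://github.com/...` origin URL to its `git://` equivalent so Dulwich selects the TCP transport, and leave everything else untouched. `as_git_url` is a hypothetical name for illustration, not the code in `swh/loader/git/loader.py`.

```python
def as_git_url(origin_url: str) -> str:
    """Rewrite a GitHub HTTPS origin to the git:// (TCP) transport.

    Hypothetical helper mirroring the substitution discussed above;
    non-GitHub origins are returned unchanged.
    """
    prefix = "https://github.com/"
    if origin_url.startswith(prefix):
        return "git://github.com/" + origin_url[len(prefix):]
    return origin_url


print(as_git_url("https://github.com/SoftwareHeritage/swh-loader-git"))
# git://github.com/SoftwareHeritage/swh-loader-git
```

Note that this changes the scheme *and* the transport, whereas prefixing `git+` onto the HTTPS URL would still go through the HTTP transport, which is why it wouldn't help here.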

swh/loader/git/loader.py
157

d'oh, i misread!

Thanks

I'm fine with this change.

Maybe add a comment inline with the code as val suggested.

This revision is now accepted and ready to land. Feb 25 2021, 6:12 PM

add inline comment suggested by @vlorentz

Build is green

Patch application report for D5148 (id=18419)

Could not rebase; Attempt merge onto 11fe5b2770...

Updating 11fe5b2..342f8fd
Fast-forward
 swh/loader/git/from_disk.py |   9 ---
 swh/loader/git/loader.py    | 160 +++++++++++++-------------------------------
 2 files changed, 48 insertions(+), 121 deletions(-)
Changes applied before test
commit 342f8fde25605a68174aba4b971d45cbd95db5d3
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:59:43 2021 +0100

    Hardcode the use of the tcp transport for GitHub origins
    
    This change is necessary because of a shortcoming in the Dulwich HTTP
    transport: even if the Dulwich API lets us process the packfile in
    chunks as it's received, the HTTP transport implementation needs to
    entirely allocate the packfile in memory *twice*, once in the HTTP
    library, and once in a BytesIO managed by Dulwich, before passing it on
    to us as a chunked reader. Overall this triples the memory usage before
    we can even try to interrupt the loader before it overruns its memory limit.
    
    In contrast, the Dulwich TCP transport just gives us the read handle on
    the underlying socket, doing no processing or copying of the bytes. We
    can interrupt it as soon as we've received too many bytes.

commit 61afbc56b0356f31fd12a6c44713d16617db48da
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 25 15:42:21 2021 +0100

    Stop processing packfiles before sending objects
    
    Since its creation, the git loader would process the packfile downloaded
    from the remote repository, to make an index of all objects, filtering
    them before sending them on to the storage. Since this functionality has
    been implemented as a filter proxy in the storage API itself, the
    built-in filtering by the git loader is now redundant.
    
    The way the filtering was implemented in the loader would run through
    the packfile six times: once for the basic object id indexing, once to
    get content ids, then once for each object type. This change removes the
    first two runs. By eschewing the double filtering, we should also reduce
    the load on the backend storage (we would call the <object_type>_missing
    endpoints twice).
    
    Finally, as this change removes the global index of objects, and sends
    the converted objects to the storage as soon as they're read, the memory
    usage decreases substantially for large loads.

commit 5e434d6f6e1c2f072477deb50f073333a0e9f45b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Feb 4 16:54:15 2021 +0100

    Drop unused get_fetch_history_result methods

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/91/ for more details.