Page MenuHomeSoftware Heritage

loader: Add support for dumb HTTP transfer protocol
ClosedPublic

Authored by anlambert on Sep 24 2021, 2:42 PM.

Details

Summary

Git supports two HTTP based transfer protocols to exchange data
between two repositories: the dumb protocol and the smart protocol.

Nowadays, the smart protocol is a common method of transferring
data because it is more efficient but there is still some git
servers in the wild (mostly cgit) that only support the dumb protocol.

Unfortunately the dulwich package does not support such protocol
so this kind of git repository could not be loaded into the archive.

That commit adds support to load such git repository by fetching
objects according to the dumb HTTP transfer protocol specification.

Below are some useful resources on the subject:

I have successfully tested the loading with numerous git repositories
only supporting dumb protocol, below are some of them found by inspecting
the failed save code now requests for git in production:

Debian packaging has also been successfully tested, git must added
as build dependency though as it is used in newly added test suite.

Related to T2489

Test Plan

A new test suite for the dumb protocol support has been added.
The sample repository used in tests is bare cloned and served
by a local HTTP server as specified by the dumb protocol.

Diff Detail

Repository
rDLDG Git loader
Branch
add-git-dumb-protocol-support-alternate
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24041
Build 37512: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 37511: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6342 (id=23057)

Rebasing onto 05d6d762d0...

Current branch diff-target is up to date.
Changes applied before test
commit 6fa2636f9f99890f83b57baedd0cbb9495632c1c
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Sep 16 16:00:15 2021 +0200

    loader: Add support for dumb HTTP transfer protocol
    
    Git supports two HTTP based transfer protocols to exchange data
    between two repositories: the dumb protocol and the smart protocol.
    
    Nowadays, the smart protocol is a common method of transferring
    data because it is more efficient but there is still some git
    servers in the wild that only support the dumb protocol.
    
    Unfortunately the dulwich package does not support such protocol
    so this kind of git repository could not be loaded into the archive.
    
    That commit adds support to load such git repository by fetching
    objects according to the dumb HTTP transfer protocol specification.
    
    Related to T2489

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/113/ for more details.

vlorentz added inline comments.
swh/loader/git/loader.py
256–258

Why move it in the try block?

259–262

Can you remove this from the try block?

swh/loader/git/tests/test_loader.py
161–170

I still believe it would be simpler to patch DumbHttpGitClient to return objects directly instead of spawning an HTTP server

swh/loader/git/loader.py
256–258

ack, will move above

259–262

sure, will move it after the exceptions handling

swh/loader/git/tests/test_loader.py
161–170

I think it is simpler to proceed like this, I prefer to avoid mocking things when it is possible.
Plus tests performance are more than acceptable.

LGTM overall

swh/loader/git/dumb.py
97

Any reason for it to be a BFS rather than DFS? deque is a linked list, so not very efficient; you could use a simple list as a stack instead.
(plus, you can spare the casts)

Probably doesn't matter in practice though.

103–107
121–122

I'd do it like this, but it doesn't really matter

163

is it always UTF-8?

194–195

what is this?

swh/loader/git/loader.py
272–285
swh/loader/git/tests/test_loader.py
161–170

did you check it would work in Debian tests? They block networking so it might affect it.

swh/loader/git/dumb.py
97

No particular reason, I pick the first graph traversal I had in mind.
I will check the DFS approach then.

103–107

better indeed, thanks

121–122

Ack, will update

163

Yes, this is the kind of response we get for the packs list:

anlambert@carnavalet:/tmp$ curl -i https://git.openembedded.org/openembedded-core/objects/info/packs
HTTP/1.1 200 OK
Server: nginx/1.10.3 (Ubuntu)
Date: Fri, 24 Sep 2021 13:15:23 GMT
Content-Type: text/plain; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Content-Disposition: inline; filename="objects/info/packs"
Last-Modified: Fri, 24 Sep 2021 13:15:23 GMT
Expires: Fri, 24 Sep 2021 13:16:23 GMT

P pack-5422adfb2f751e0a68908e78bcbc2b64da05aeee.pack
P pack-6eeecf497baac9fbf038439a4d6d744239f46c5d.pack
P pack-8cc751bf481805b703aa952a1895188d1f17ab76.pack
P pack-94073fd7b717046e62ee98407afec2d7053df264.pack
P pack-c95f4d0ab59a40084209281317356f7b32aca171.pack
P pack-c97afddc6240cc65acf0bb0b8deed553346c263e.pack
P pack-7eacd96820ba5e7d59e12b2cd5bb6ad8cefed593.pack
P pack-46515b894bd9e84e83ccb6f99ef29b0a7bf41611.pack
P pack-dada3bb0e007956d6c91c6bd8d037a2f090d7a36.pack
P pack-1195ec9c538bff2fe8f09dcd2c7a5d8e1663725b.pack
P pack-1cd35da490cef9ec46dfe9a9696e4aded2a3e383.pack
P pack-fe0be52545a2f95de32dc0fe613bc939cc75cca0.pack
P pack-f585f3a38ce7836bbe916fa92c16433812c0470d.pack
P pack-e51701de8cf701947763209e24bbf104e1bc135c.pack
P pack-1f3a65f2b481b6040978160e97739bac43b439cf.pack
P pack-c10b904d5c4c2ceeda4d2683e8f98090ae857c0f.pack
P pack-587bc9e2864af9411e0938ffe57fa1cb3bf332e9.pack
P pack-c0b1099e7faf2b4604b9ed09cb40d1b288aff562.pack
P pack-21ae990c3753469aca8531b66a5763d50a691382.pack
P pack-5a229741a8da560a00d713498ac3597d033b3f19.pack
P pack-b7d73c1882ff5c82d23906a6ff75b81fe54f96e3.pack
P pack-1027a60b270b168c29f12fc0041b840c29aaec46.pack
P pack-46e9b109657fbd4eca28bd5a4538f81d10f63257.pack
P pack-ff88065963b0bce2822f46913a7e42fb7beb3f7d.pack
P pack-da9b86e62250f1025d03bd987d7cea8036660ec4.pack
P pack-1c6de8793c2074b4429984d239cdc0533df19484.pack
P pack-a667d8977a2a4b2ce23117e156c3fd4a8a8adc5a.pack
P pack-04a9da669b4187a5bbe276434d69275f77a4b00f.pack
P pack-7061abeaab5c6bd5befcbe0e5b9456d98d57bae0.pack
P pack-018260388c9cdd5fbb1234212a6d0f100ea1e2c2.pack
P pack-a7a8db6093ca339ff83ff37782d9ffdff8e20a42.pack
P pack-7104febf3aa00b6f0b59aacb7ea9c2a54c87748e.pack
P pack-9a95f313831680977128ddfc5215e5f0c691e497.pack
P pack-dae1ea96cdd5fd421da7aa4bfc9dfdc187503c42.pack
P pack-2e28a3dbb0d35b9d766549dd45503c9b8a4a09ef.pack
P pack-8f94e403e4588588c8c04a9553e65b5663d3419b.pack
P pack-4fb16e27ea809417b3f125bbd0b9466f7e16a391.pack
194–195

A reference to an external tree object (git submodule), we must skip it as it is not hosted on the git server.

swh/loader/git/loader.py
272–285

No this must be kept that way.

Depending on dulwich version the dumb protocol can be detected during the call to fetch_pack without errors (upstream) or exceptions can be raised (dulwich version used in production, see sentry reports), so we must handle both cases.

swh/loader/git/tests/test_loader.py
161–170

Yes, of course.

swh/loader/git/dumb.py
163

alright

swh/loader/git/loader.py
272–285

Oh, I missed that. Could you add a comment to explain this?

swh/loader/git/loader.py
272–285

There is already one in the exception block above, I will make it clearer.

Update:

  • address most of @vlorentz comments
  • use a DFS to walk on the commits graph instead of a BFS
  • improve some comments
  • add missing docstring for new test suite

Build is green

Patch application report for D6342 (id=23060)

Rebasing onto 05d6d762d0...

Current branch diff-target is up to date.
Changes applied before test
commit 0d89e6c66ad4a31fc0b75bec7a9bcb881cbf32d6
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Sep 16 16:00:15 2021 +0200

    loader: Add support for dumb HTTP transfer protocol
    
    Git supports two HTTP based transfer protocols to exchange data
    between two repositories: the dumb protocol and the smart protocol.
    
    Nowadays, the smart protocol is a common method of transferring
    data because it is more efficient but there is still some git
    servers in the wild that only support the dumb protocol.
    
    Unfortunately the dulwich package does not support such protocol
    so this kind of git repository could not be loaded into the archive.
    
    That commit adds support to load such git repository by fetching
    objects according to the dumb HTTP transfer protocol specification.
    
    Related to T2489

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/114/ for more details.

ardumont added 1 blocking reviewer(s): Reviewers.
ardumont added a subscriber: ardumont.

lgtm

swh/loader/git/dumb.py
181–184

Can you add a test for this?

194–195

could you add a comment?

swh/loader/git/dumb.py
181–184

It seems the bare clone of the test repository put all objects in a single pack file so that code is not covered.
I think I can serve the .git folder of the test repository, need to check.

194–195

sure

Update:

  • Increase tests coverage: handle bare repo without pack files and check dumb repo can still be loaded if dulwich raises an exception in fetch_pack
  • Improve comment

Build is green

Patch application report for D6342 (id=23125)

Rebasing onto 05d6d762d0...

Current branch diff-target is up to date.
Changes applied before test
commit 87ff9be90567beec0bfcbefe39ed63907f1cbfbd
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Sep 16 16:00:15 2021 +0200

    loader: Add support for dumb HTTP transfer protocol
    
    Git supports two HTTP based transfer protocols to exchange data
    between two repositories: the dumb protocol and the smart protocol.
    
    Nowadays, the smart protocol is a common method of transferring
    data because it is more efficient but there is still some git
    servers in the wild that only support the dumb protocol.
    
    Unfortunately the dulwich package does not support such protocol
    so this kind of git repository could not be loaded into the archive.
    
    That commit adds support to load such git repository by fetching
    objects according to the dumb HTTP transfer protocol specification.
    
    Related to T2489

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/120/ for more details.

swh/loader/git/dumb.py
181–184

hmm you could try moving the packfile out of the repository, and use git unpack-objects

swh/loader/git/dumb.py
181–184

I have implemented my first idea and it worked, so we now have a test suite processing pack files only and another one fetching objects from the objects/ folder.

vlorentz added inline comments.
swh/loader/git/dumb.py
191–193

sorry, last nitpick

This revision is now accepted and ready to land.Sep 28 2021, 2:39 PM

Update: move comment inside if block

Build is green

Patch application report for D6342 (id=23132)

Rebasing onto 05d6d762d0...

Current branch diff-target is up to date.
Changes applied before test
commit d3976ca6504c57d35467daf443d435f9d329f9ec
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Sep 16 16:00:15 2021 +0200

    loader: Add support for dumb HTTP transfer protocol
    
    Git supports two HTTP based transfer protocols to exchange data
    between two repositories: the dumb protocol and the smart protocol.
    
    Nowadays, the smart protocol is a common method of transferring
    data because it is more efficient but there is still some git
    servers in the wild that only support the dumb protocol.
    
    Unfortunately the dulwich package does not support such protocol
    so this kind of git repository could not be loaded into the archive.
    
    That commit adds support to load such git repository by fetching
    objects according to the dumb HTTP transfer protocol specification.
    
    Related to T2489

See https://jenkins.softwareheritage.org/job/DLDG/job/tests-on-diff/121/ for more details.