
git loader: fails to ingest our own hello world repository
Closed, Migrated

Description

In the dogfooding category, it would be nice if we could ingest our self-hosted Git repositories without relying on the fact that they are also on GitHub :-)

Unfortunately, trying to run the git loader on, e.g., the hello world repo, fails like this:

2018-09-14 17:00:54,400 25707 Creating git origin for https://forge.softwareheritage.org/source/helloworld.git
2018-09-14 17:00:54,404 25707 Starting new HTTP connection (1): localhost
2018-09-14 17:00:54,408 25707 http://localhost:5002 "POST /origin/add HTTP/1.1" 200 1
2018-09-14 17:00:54,408 25707 Done creating git origin for https://forge.softwareheritage.org/source/helloworld.git
2018-09-14 17:00:54,409 25707 Creating origin_visit for origin 2 at time 2018-09-14 15:00:54.400801+00:00
2018-09-14 17:00:54,411 25707 Resetting dropped connection: localhost
2018-09-14 17:00:54,415 25707 http://localhost:5002 "POST /origin/visit/add HTTP/1.1" 200 16
2018-09-14 17:00:54,415 25707 Done Creating origin_visit for origin 2 at time 2018-09-14 15:00:54.400801+00:00
2018-09-14 17:00:54,417 25707 Resetting dropped connection: localhost
2018-09-14 17:00:54,420 25707 http://localhost:5002 "POST /fetch_history/start HTTP/1.1" 200 1
2018-09-14 17:00:54,422 25707 Resetting dropped connection: localhost
2018-09-14 17:00:54,425 25707 http://localhost:5002 "POST /snapshot/latest HTTP/1.1" 200 1
2018-09-14 17:00:54,427 25707 Resetting dropped connection: localhost
2018-09-14 17:00:54,431 25707 http://localhost:5002 "POST /snapshot/latest HTTP/1.1" 200 1
2018-09-14 17:00:54,432 25707 Starting new HTTPS connection (1): forge.softwareheritage.org
2018-09-14 17:00:54,760 25707 https://forge.softwareheritage.org:443 "GET /source/helloworld.git/info/refs?service=git-upload-pack HTTP/1.1" 200 None
2018-09-14 17:00:54,762 25707 Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/dulwich/protocol.py", line 200, in read_pkt_line
    sizestr = read(4)
  File "/usr/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.6/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/usr/lib/python3.6/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'00')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 889, in load
    more_data_to_fetch = self.fetch_data()
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-git/swh/loader/git/updater.py", line 260, in fetch_data
    do_progress)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-git/swh/loader/git/updater.py", line 202, in fetch_pack_from_origin
    progress=do_activity)
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 1544, in fetch_pack
    b"git-upload-pack", url)
  File "/usr/lib/python3/dist-packages/dulwich/client.py", line 1449, in _discover_references
    [pkt] = list(proto.read_pkt_seq())
  File "/usr/lib/python3/dist-packages/dulwich/protocol.py", line 254, in read_pkt_seq
    pkt = self.read_pkt_line()
  File "/usr/lib/python3/dist-packages/dulwich/protocol.py", line 212, in read_pkt_line
    raise GitProtocolError(e)
dulwich.errors.GitProtocolError: Not a gzipped file (b'00')
2018-09-14 17:00:54,771 25707 Resetting dropped connection: localhost
2018-09-14 17:00:54,779 25707 http://localhost:5002 "POST /fetch_history/end HTTP/1.1" 200 1
2018-09-14 17:00:54,781 25707 Updating origin_visit for origin 2 with status partial
2018-09-14 17:00:54,785 25707 Resetting dropped connection: localhost
2018-09-14 17:00:54,793 25707 http://localhost:5002 "POST /origin/visit/update HTTP/1.1" 200 1
2018-09-14 17:00:54,795 25707 Done updating origin_visit for origin 2 with status partial

git clone on the same URL works just fine. I suspect this affects all our repos hosted on forge.softwareheritage.org, but haven't tried.

Event Timeline

zack triaged this task as Normal priority. Sep 14 2018, 5:01 PM
zack created this task.

With the following git tracing variables set:

export GIT_TRACE_PACKET=1
export GIT_TRACE=1
export GIT_CURL_VERBOSE=1

I've compared the traces of an actual git clone against our forge with those of a git clone against other repositories.
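
For repeated comparisons, the same three variables can be set from Python. This is only a sketch: the helper names `git_trace_env` and `trace_ls_remote` are hypothetical, and `git ls-remote` is used here as a lighter stand-in for a full clone, on the assumption that the reference discovery request is the part of interest.

```python
import os
import subprocess

# The three tracing variables quoted above.
TRACE_VARS = {
    "GIT_TRACE_PACKET": "1",
    "GIT_TRACE": "1",
    "GIT_CURL_VERBOSE": "1",
}


def git_trace_env(base=None):
    """Return a copy of the environment with git tracing switched on."""
    env = dict(os.environ if base is None else base)
    env.update(TRACE_VARS)
    return env


def trace_ls_remote(url):
    """Run `git ls-remote` on url and return what git wrote to stderr.

    Git emits its trace output on stderr, so that is the stream to compare
    between remotes.
    """
    result = subprocess.run(
        ["git", "ls-remote", url],
        env=git_trace_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stderr
```

Calling `trace_ls_remote()` on the forge URL and on another remote, then diffing the two outputs, should surface header differences such as Content-Encoding.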

The only difference I can see between our forge and the rest of the world is that we honor Accept-Encoding: gzip requests with gzipped responses (Content-Encoding: gzip). Git is happy with that but dulwich gets confused for some reason.
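
The error message itself can be reproduced without dulwich or the network. The sketch below assumes (this is a guess from the `b'00'` in the traceback, not something confirmed in this task) that the bytes reaching the gzip wrapper were already plain pkt-line data, whose 4-digit hex length prefix starts with `00`. The sample body and the helper names are illustrative only.

```python
import gzip
import io

# Start of an uncompressed smart-HTTP ref advertisement: one pkt-line with
# the length prefix "001e" (30 bytes including the prefix), then a flush.
PLAIN_BODY = b"001e# service=git-upload-pack\n0000"


def read_pkt_len_via_gzip(body):
    """Read the 4-byte pkt-line length prefix through a gzip wrapper."""
    return gzip.GzipFile(fileobj=io.BytesIO(body)).read(4)


def reproduce():
    """Return the error text raised when plain data is treated as gzip."""
    try:
        read_pkt_len_via_gzip(PLAIN_BODY)
    except OSError as exc:
        return str(exc)
    return "no error"


if __name__ == "__main__":
    print(reproduce())
    # A body that really is gzipped reads back fine:
    print(read_pkt_len_via_gzip(gzip.compress(PLAIN_BODY)))
```

If that assumption holds, the failure would come from decoding the response one time too many rather than from the compression itself, which would also explain why git is unaffected.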

The hunt for another git remote which gzips HTTP responses begins...
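
A small probe can help with that hunt. The sketch below uses only the standard library and the `info/refs?service=git-upload-pack` path visible in the log above; the function names are hypothetical, and it assumes the remote answers this request anonymously over HTTP(S).

```python
import urllib.request


def info_refs_request(repo_url):
    """Build the ref discovery request, asking for a gzipped answer."""
    url = repo_url.rstrip("/") + "/info/refs?service=git-upload-pack"
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def is_gzipped(headers):
    """True if the response headers declare Content-Encoding: gzip."""
    return (headers.get("Content-Encoding") or "").strip().lower() == "gzip"


def remote_gzips(repo_url, timeout=10.0):
    """Query the remote and report whether it gzips the ref advertisement."""
    request = info_refs_request(repo_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_gzipped(response.headers)
```

`urllib.request` does not decompress bodies on its own, so the Content-Encoding header it reports is the one the server actually sent.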

Tracked down the dulwich bug. Pull request: https://github.com/dulwich/dulwich/pull/659

I'm still tempted to configure our Apache (or whichever component does the compression) to stop doing it as I couldn't find a single other git HTTP remote doing this.

zack claimed this task.