Page MenuHomeSoftware Heritage

gitorious import: UnicodeDecodeError when reading references
Closed, ResolvedPublic

Description

When dealing with refs, dulwich expects utf-8 and it's visibly not always the case.
This fails ungracefully.

Steps to reproduce with latest swh-loader-git:

repo = 'test-project2009.git'
origin_url = 'http://foo/bar/git/%s' % repo

import logging
logging.basicConfig(level=logging.DEBUG)

from swh.loader.git.tasks import LoadDiskGitRepository

t = LoadDiskGitRepository()
t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')

source: uffizi:/srv/storage/space/mirrors/gitorious.org/mnt/repositories/test-project2009/test-project2009.git

Full stack trace:

python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> repo = 'test-project2009.git'
repo, date='2016-05-03T15:16:32+00:00')
>>> origin_url = 'http://foo/bar/git/%s' % repo
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository
>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 revisions
ERROR:swh.scheduler.task.LoadDiskGitRepository:Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 896, in load
    self.store_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 1005, in store_data
    self.send_batch_occurrences(self.get_occurrences())
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 693, in send_batch_occurrences
    send_in_packets(occurrences, self.send_occurrences, packet_size)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 35, in send_in_packets
    for obj in objects:
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/loader.py", line 218, in get_occurrences
    for refs, target in self.repo.refs.as_dict().items()
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 164, in as_dict
    keys = self.keys(base)
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 143, in keys
    return self.allkeys()
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 470, in allkeys
    sys.getfilesystemencoding())
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccd' in position 11: surrogates not allowed
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Updating origin_visit for origin 2 with status partial
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done updating origin_visit for origin 2 with status partial
DEBUG:amqp:Start from server, version: 0.9, properties: {'platform': 'Erlang/OTP', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'version': '3.6.6', 'product': 'RabbitMQ', 'cluster_name': 'rabbit@corellia.lan', 'capabilities': {'connection.blocked': True, 'per_consumer_qos': True, 'direct_reply_to': True, 'exchange_exchange_bindings': True, 'publisher_confirms': True, 'consumer_cancel_notify': True, 'basic.nack': True, 'consumer_priorities': True, 'authentication_failure_close': True}, 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/'}, mechanisms: ['AMQPLAIN', 'PLAIN'], locales: ['en_US']
DEBUG:amqp:Open OK!
DEBUG:amqp:using channel_id: 1
DEBUG:amqp:Channel open
{'status': 'failed'}

Event Timeline

Trying to analyze a bit further that repository, we can see this:

$ tree test-project2009.git/refs
test-project2009.git/refs
├── heads
│   ├── \32104411000翍\32104411000翍\21353075460嗻\21353075460嗻\21353073041嗻\21353073055嗻\21353073071嗻\5021
│   └── master
└── tags

2 directories, 2 files

When cloning that repository, the unreadable ref file disappeared:

$ git clone test-project2009.git ./test-project2009-wc
Cloning into './test-project2009-wc'...
done.
$ tree ./test-project2009-wc/.git/refs
./test-project2009-wc/.git/refs
├── heads
│   └── master
├── remotes
│   └── origin
│       └── HEAD
└── tags

4 directories, 2 files

That's still mentioned in logs though (with another look)

$ git log --all
commit fcf492990b9e83452491e000ed56aacad56a6681 (<CD><EE><E2><E0><FF><E2><E5><F2><EA><E0>1)
Author: Tester <only_for_del@mail.ru>
Date:   Wed Dec 16 04:08:41 2009 +0300

    Коммит 2 новой ветки1

commit b8218970b1b25035fb5b29c46a0fc4a773e8599e
Author: Tester <only_for_del@mail.ru>
Date:   Wed Dec 16 04:07:06 2009 +0300

    Коммит1 новой ветки1

commit 01893098d620ff74302c5e2c357b979a51074682 (HEAD -> master)
Author: Tester <only_for_del@mail.ru>
Date:   Wed Dec 16 03:40:20 2009 +0300

    edit2

commit e34446d87b71cba9bd5e0029ece446eb06cba063
Author: Tester <only_for_del@mail.ru>
Date:   Wed Dec 16 03:34:46 2009 +0300

    edit1

commit 5a505c839ec253d1ff6f0974317be2a284a3751e
Author: Vitaly <ghjcnjdbnfkbq@gmail.com>
Date:   Wed Dec 16 03:15:13 2009 +0300

    new

The first commit in the git log output is indeed the content of the unreadable ref:

$ cat refs/heads/$'\315'$'\356'$'\342'$'\340'$'\377'$'\342'$'\345'$'\362'$'\352'$'\340'1
fcf492990b9e83452491e000ed56aacad56a6681

The russian commit message means 'commit 1 new branch'

After some digging, it seems an encoding problem:

python3
Python 3.6.4 (default, Jan  5 2018, 02:13:53)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> list(os.walk(b'refs/heads'))
[(b'refs/heads', [], [b'\xcd\xee\xe2\xe0\xff\xe2\xe5\xf2\xea\xe01', b'master'])]
>>> s = b'\xcd\xee\xe2\xe0\xff\xe2\xe5\xf2\xea\xe01'
>>> s.decode('latin1')
'Íîâàÿâåòêà1'  # rubbish
>>> s.decode('cp1251')
'Новаяветка1'  # looks like russian

which means: newlight1
so sounds like a good guess.

Translation

I was initially opened to clean up the repository because i thought it was some form of corruption.
But now, i no longer think that's the case. And don't want to tamper with sources.

I opened a discussion at at https://github.com/jelmer/dulwich/issues/608 about this case.

With latest dulwich (> 0.19.1, current head) we break somewhere else now, still encoding related:

$ python3
Python 3.6.5rc1 (default, Mar 14 2018, 06:54:23)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> repo = 'test-project2009.git'
>>> origin_url = 'http://foo/bar/git/%s' % repo
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository
>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
ERROR:swh.scheduler.task.LoadDiskGitRepository:Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 890, in load
    self.store_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 1005, in store_data
    self.send_snapshot(self.get_snapshot())
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/loader.py", line 215, in get_snapshot
    for ref, target in self.repo.refs.as_dict().items():
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 188, in as_dict
    ret[key] = self[(base + b'/' + key).strip(b'/')]
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 271, in __getitem__
    _, sha = self.follow(name)
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 243, in follow
    contents = self.read_ref(refname)
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 217, in read_ref
    contents = self.read_loose_ref(refname)
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 604, in read_loose_ref
    filename = self.refpath(name)
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 530, in refpath
    name = name.decode(sys.getfilesystemencoding())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 11: invalid continuation byte
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Updating origin_visit for origin 2 with status partial
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done updating origin_visit for origin 2 with status partial
{'status': 'failed'}

Patching dulwich to try and detect the encoding (when the problem arose) seems to do the trick:

From /usr/lib/python3/dist-packages/dulwich/refs.py:

def refpath(self, name):
        """Return the disk path of a ref.

        """
        if (getattr(self.path, "encode", None) and
                getattr(name, "decode", None)):
            name = name.decode(sys.getfilesystemencoding())
        if os.path.sep != "/":
            name = name.replace("/", os.path.sep)
        # TODO: as the 'HEAD' reference is working tree specific, it
        # should actually not be a part of RefsContainer
        if name == 'HEAD':
            return os.path.join(self.worktree_path, name)
        else:
            return os.path.join(self.path, name)

to:

def refpath(self, name):
    """Return the disk path of a ref.

    """
    if (getattr(self.path, "encode", None) and
            getattr(name, "decode", None)):
        try:
            name = name.decode(sys.getfilesystemencoding())
        except UnicodeDecodeError:
            from chardet import detect
            name = name.decode(detect(name)['encoding'])
    if os.path.sep != "/":
        name = name.replace("/", os.path.sep)
    # TODO: as the 'HEAD' reference is working tree specific, it
    # should actually not be a part of RefsContainer
    if name == 'HEAD':
        return os.path.join(self.worktree_path, name)
    else:
        return os.path.join(self.path, name)
Python 3.6.5rc1 (default, Mar 14 2018, 06:54:23)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> repo = 'test-project2009.git'
>>> origin_url = 'http://foo/bar/git/%s' % repo
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository
>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:chardet.charsetprober:SHIFT_JIS Japanese prober hit error at byte 12
DEBUG:chardet.charsetprober:EUC-JP Japanese prober hit error at byte 15
DEBUG:chardet.charsetprober:GB2312 Chinese prober hit error at byte 15
DEBUG:chardet.charsetprober:EUC-KR Korean prober hit error at byte 15
DEBUG:chardet.charsetprober:CP949 Korean prober hit error at byte 15
DEBUG:chardet.charsetprober:Big5 Chinese prober hit error at byte 15
DEBUG:chardet.charsetprober:EUC-TW Taiwan prober hit error at byte 15
DEBUG:chardet.charsetprober:windows-1251 Russian confidence = 0.99
DEBUG:chardet.charsetprober:KOI8-R Russian confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-5 Russian confidence = 0.36407453561439684
DEBUG:chardet.charsetprober:MacCyrillic Russian confidence = 0.9101863390359921
DEBUG:chardet.charsetprober:IBM866 Russian confidence = 0.3982065233282465
DEBUG:chardet.charsetprober:IBM855 Russian confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-7 Greek confidence = 0.31653944607179635
DEBUG:chardet.charsetprober:windows-1253 Greek confidence = 0.31653944607179635
DEBUG:chardet.charsetprober:ISO-8859-5 Bulgairan confidence = 0.4011678339504441
DEBUG:chardet.charsetprober:windows-1251 Bulgarian confidence = 0.99
DEBUG:chardet.charsetprober:TIS-620 Thai confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-9 Turkish confidence = 0.48818828445754964
DEBUG:chardet.charsetprober:ISO-8859-8 Hebrew confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-8 Hebrew confidence = 0.451669347324243
DEBUG:chardet.charsetprober:ISO-8859-8 Hebrew confidence = 0.7527822455404052
DEBUG:chardet.charsetprober:windows-1251 Russian confidence = 0.99
DEBUG:chardet.charsetprober:KOI8-R Russian confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-5 Russian confidence = 0.36407453561439684
DEBUG:chardet.charsetprober:MacCyrillic Russian confidence = 0.9101863390359921
DEBUG:chardet.charsetprober:IBM866 Russian confidence = 0.3982065233282465
DEBUG:chardet.charsetprober:IBM855 Russian confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-7 Greek confidence = 0.31653944607179635
DEBUG:chardet.charsetprober:windows-1253 Greek confidence = 0.31653944607179635
DEBUG:chardet.charsetprober:ISO-8859-5 Bulgairan confidence = 0.4011678339504441
DEBUG:chardet.charsetprober:windows-1251 Bulgarian confidence = 0.99
DEBUG:chardet.charsetprober:TIS-620 Thai confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-9 Turkish confidence = 0.48818828445754964
DEBUG:chardet.charsetprober:ISO-8859-8 Hebrew confidence = 0.0
DEBUG:chardet.charsetprober:ISO-8859-8 Hebrew confidence = 0.451669347324243
DEBUG:chardet.charsetprober:ISO-8859-8 Hebrew confidence = 0.7527822455404052
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Updating origin_visit for origin 2 with status full
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done updating origin_visit for origin 2 with status full
{'status': 'eventful'}

After discussion with jelmer (dulwich's author), he proposed and implemented the real solution, deal with bytes (avoiding altogether encoding water mudding ;)
It's landed in dulwich/dulwich's master branch \m/.

And now, that bug is potentially fixed:

$ make -C $SWH_ENVIRONMENT_HOME rebuild-testdata 
...
$ python3
Python 3.6.5rc1 (default, Mar 14 2018, 06:54:23)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>>
>>> repo = 'test-project2009.git'
>>> origin_url = 'http://foo/bar/git/%s' % repo
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository
>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Updating origin_visit for origin 1 with status full
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done updating origin_visit for origin 1 with status full
{'status': 'eventful'}

Potentially since we still need to either wait for that new version or packaging it ourselves ;)

ardumont claimed this task.

python3-dulwich (fix included) packaged and pushed to our debian repository.