When dealing with refs, dulwich expects ref names to be valid UTF-8, which is evidently not always the case.
When a ref name is not valid UTF-8, the load fails ungracefully with a UnicodeEncodeError.
Steps to reproduce with the latest swh-loader-git:
repo = 'test-project2009.git'
origin_url = 'http://foo/bar/git/%s' % repo

import logging
logging.basicConfig(level=logging.DEBUG)

from swh.loader.git.tasks import LoadDiskGitRepository

t = LoadDiskGitRepository()
t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
source: uffizi:/srv/storage/space/mirrors/gitorious.org/mnt/repositories/test-project2009/test-project2009.git
Full stack trace:
python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> repo = 'test-project2009.git'
>>> origin_url = 'http://foo/bar/git/%s' % repo
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository
>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/git/test-project2009.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done Creating origin_visit for origin 2 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 5 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 5 revisions
ERROR:swh.scheduler.task.LoadDiskGitRepository:Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 896, in load
    self.store_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 1005, in store_data
    self.send_batch_occurrences(self.get_occurrences())
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 693, in send_batch_occurrences
    send_in_packets(occurrences, self.send_occurrences, packet_size)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 35, in send_in_packets
    for obj in objects:
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/loader.py", line 218, in get_occurrences
    for refs, target in self.repo.refs.as_dict().items()
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 164, in as_dict
    keys = self.keys(base)
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 143, in keys
    return self.allkeys()
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 470, in allkeys
    sys.getfilesystemencoding())
UnicodeEncodeError: 'utf-8' codec can't encode character '\udccd' in position 11: surrogates not allowed
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Updating origin_visit for origin 2 with status partial
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done updating origin_visit for origin 2 with status partial
DEBUG:amqp:Start from server, version: 0.9, properties: {'platform': 'Erlang/OTP', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'version': '3.6.6', 'product': 'RabbitMQ', 'cluster_name': 'rabbit@corellia.lan', 'capabilities': {'connection.blocked': True, 'per_consumer_qos': True, 'direct_reply_to': True, 'exchange_exchange_bindings': True, 'publisher_confirms': True, 'consumer_cancel_notify': True, 'basic.nack': True, 'consumer_priorities': True, 'authentication_failure_close': True}, 'information': 'Licensed under the MPL. See http://www.rabbitmq.com/'}, mechanisms: ['AMQPLAIN', 'PLAIN'], locales: ['en_US']
DEBUG:amqp:Open OK!
DEBUG:amqp:using channel_id: 1
DEBUG:amqp:Channel open
{'status': 'failed'}
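
The UnicodeEncodeError at the end of the trace can probably be reproduced without dulwich or the repository at hand. Below is a minimal sketch, assuming the offending ref file name contains a byte (here 0xCD) that is not valid UTF-8. The ref name `refs/heads/\xcdfoo` and the helper `describe_encode_failure` are made up for illustration; the actual ref name in test-project2009.git is not known from the trace. Python decodes such file names with the `surrogateescape` error handler, which maps the byte to the lone surrogate `\udccd`, and a later strict encode rejects it:

```python
def describe_encode_failure(raw_name: bytes) -> str:
    """Decode bytes the way Python decodes POSIX file names, then re-encode strictly.

    Returns the error message of the strict encode, or "ok" if it succeeds.
    """
    # Undecodable bytes become lone surrogates (0xCD -> U+DCCD).
    as_text = raw_name.decode("utf-8", errors="surrogateescape")
    try:
        # A strict encode (no error handler) refuses lone surrogates.
        as_text.encode("utf-8")
    except UnicodeEncodeError as exc:
        return str(exc)
    return "ok"


# Hypothetical ref name: the 0xCD byte sits at index 11, right after "refs/heads/".
raw_ref = b"refs/heads/\xcdfoo"
message = describe_encode_failure(raw_ref)
print(message)

# Encoding with the same handler that did the decoding restores the original bytes.
round_trip = raw_ref.decode("utf-8", errors="surrogateescape").encode(
    "utf-8", errors="surrogateescape"
)
```

If this is indeed what happens, the sketch suggests that the ref names would need to be handled as bytes throughout, or re-encoded with `surrogateescape`, for the load to get past such refs.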