Page MenuHomeSoftware Heritage

Gitorious import: loose object parsing error with corrupted file as empty one
Closed, ResolvedPublic

Description

This error seems to be relative to parsing error of legacy object within the dulwich dependency.

To reproduce, use the following repository /srv/storage/space/mirrors/gitorious.org/mnt/repositories/sortix/sortix-gitorious-wiki.git:

repo = 'sortix-gitorious-wiki.git'
origin_url = 'http://foo/bar/git'

import logging
logging.basicConfig(level=logging.DEBUG)

from swh.loader.git.tasks import LoadDiskGitRepository

t = LoadDiskGitRepository()
t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')

output:

DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/sortix-gitorious-wiki.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/sortix-gitorious-wiki.git
Traceback (most recent call last):
  File "./load-git-disk.py", line 20, in <module>
    main()
  File "/usr/lib/python3/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "./load-git-disk.py", line 17, in main
    t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 35, in run
    raise e from None
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 32, in run
    result = self.run_task(*args, **kwargs)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/tasks.py", line 39, in run_task
    return loader.load(origin_url, directory, dateutil.parser.parse(date))
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/base.py", line 422, in load
    self.fetch_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/loader.py", line 48, in fetch_data
    type_name = self.repo[oid].type_name
  File "/usr/lib/python3/dist-packages/dulwich/repo.py", line 470, in __getitem__
    return self.object_store[name]
  File "/usr/lib/python3/dist-packages/dulwich/object_store.py", line 118, in __getitem__
    type_num, uncomp = self.get_raw(sha)
  File "/usr/lib/python3/dist-packages/dulwich/object_store.py", line 372, in get_raw
    ret = self._get_loose_object(hexsha)
  File "/usr/lib/python3/dist-packages/dulwich/object_store.py", line 521, in _get_loose_object
    return ShaFile.from_path(path)
  File "/usr/lib/python3/dist-packages/dulwich/objects.py", line 370, in from_path
    return cls.from_file(f)
  File "/usr/lib/python3/dist-packages/dulwich/objects.py", line 376, in from_file
    obj = cls._parse_file(f)
  File "/usr/lib/python3/dist-packages/dulwich/objects.py", line 346, in _parse_file
    if cls._is_legacy_object(map):
  File "/usr/lib/python3/dist-packages/dulwich/objects.py", line 338, in _is_legacy_object
    b0 = ord(magic[0:1])
TypeError: ord() expected a character, but string of length 0 found

Note: load-git-disk.py is a wrapper around the scenario described (cf. P185)

Event Timeline

ardumont renamed this task from Gitorious import: to Gitorious import: loose object parsing error.Oct 26 2017, 11:20 AM
ardumont triaged this task as Normal priority.
ardumont updated the task description. (Show Details)
ardumont updated the task description. (Show Details)

Tweaking the loader git to print the actual sha1:

swh.loader.git.GitLoader.iter_objects from:

def iter_objects(self):
    object_store = self.repo.object_store

    for pack in object_store.packs:
        objs = list(pack.index.iterentries())
        objs.sort(key=lambda x: x[1])
        for sha, offset, crc32 in objs:
            yield hashutil.hash_to_bytehex(sha)

    yield from object_store._iter_loose_objects():
    yield from object_store._iter_alternate_objects()

to

def iter_objects(self):
    object_store = self.repo.object_store

    for pack in object_store.packs:
        objs = list(pack.index.iterentries())
        objs.sort(key=lambda x: x[1])
        for sha, offset, crc32 in objs:
            yield hashutil.hash_to_bytehex(sha)

    for obj in object_store._iter_loose_objects():
        print(obj)
        yield obj
    yield from object_store._iter_alternate_objects()

Running again what's described in the task:

b'99ea0c61b44c16ddd9956f8bb58749d707d14b47'
Traceback (most recent call last):
  File "./load-git-disk.py", line 20, in <module>
    main()
  File "/usr/lib/python3/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 534, in invoke
...

99ea0c61b44c16ddd9956f8bb58749d707d14b47 is the culprit.

Checking on disk:

$ cd sortix-gitorious-wiki.git
$ find -type f -iname *ea0c61b44c16ddd9956f8bb58749d707d14b47
./sortix-gitorious-wiki.git/objects/99/ea0c61b44c16ddd9956f8bb58749d707d14b47
# that file is empty alright
$ stat ./sortix-gitorious-wiki.git/objects/99/ea0c61b44c16ddd9956f8bb58749d707d14b47
  File: ./sortix-gitorious-wiki.git/objects/99/ea0c61b44c16ddd9956f8bb58749d707d14b47
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: fe02h/65026d    Inode: 18918637    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    tony)   Gid: ( 1000/    tony)
Access: 2017-10-26 11:17:18.011739954 +0200
Modify: 2017-10-26 10:22:47.398426430 +0200
Change: 2017-10-26 10:22:47.398426430 +0200
 Birth: -

git fsck agrees with us:

$ git fsck
error: object file ./objects/99/ea0c61b44c16ddd9956f8bb58749d707d14b47 is empty
error: object file ./objects/99/ea0c61b44c16ddd9956f8bb58749d707d14b47 is empty
fatal: loose object 99ea0c61b44c16ddd9956f8bb58749d707d14b47 (stored in ./objects/99/ea0c61b44c16ddd9956f8bb58749d707d14b47) is corrupt
ardumont renamed this task from Gitorious import: loose object parsing error to Gitorious import: loose object parsing error with the empty file.Oct 26 2017, 11:55 AM
ardumont renamed this task from Gitorious import: loose object parsing error with the empty file to Gitorious import: loose object parsing error with corrupted file as empty one.Oct 26 2017, 4:14 PM

This error slipped under my radar last week.
I opened a related issue in dulwich since it should be handled upstream.

Update on this:

  • Issue opened.
  • Pull Request (PR) proposed and merged.

No new dulwich release yet though.

Packaged it and pushed to our own repository.

Fixed with that latest version package:

$ python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> repo = 'sortix-gitorious-wiki.git'
>>> origin_url = 'http://foo/bar/git'
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository
>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
WARNING:swh.scheduler.task.LoadDiskGitRepository:object 99ea0c61b44c16ddd9956f8bb58749d707d14b47 corrupted (empty file), skipping
WARNING:swh.scheduler.task.LoadDiskGitRepository:object deedbadc6a4f04c99339a730f0e5cf1ed1bfb211 corrupted (empty file), skipping
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 1000 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 1000 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 792 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 792 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 1793 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 1793 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 1000 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 1000 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 821 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 821 revisions
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 2 occurrences
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 2 occurrences
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Updating origin_visit for origin 1 with status full
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done updating origin_visit for origin 1 with status full
DEBUG:amqp:Start from server, version: 0.9, properties: {'platform': 'Erlang/OTP', 'version': '3.6.6', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'connection.blocked': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'publisher_confirms': True, 'per_consumer_qos': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'consumer_priorities': True, 'consumer_cancel_notify': True}, 'cluster_name': 'rabbit@corellia.lan', 'product': 'RabbitMQ'}, mechanisms: ['AMQPLAIN', 'PLAIN'], locales: ['en_US']
DEBUG:amqp:Open OK!
DEBUG:amqp:using channel_id: 1
DEBUG:amqp:Channel open
{'status': 'eventful'}
ardumont claimed this task.