
Gitorious import: retrieving a nonexistent object makes the loading fail
Closed, Resolved, Public

Description

On some on-disk repositories, loading fails when the loader tries to retrieve an object that does not exist.

Steps to reproduce, on local storage with the latest swh-loader-git:

Use /srv/storage/space/mirrors/gitorious.org/mnt/repositories/fe6/441/641fb6e08ddb2e4fd096dcf18e80b894bf.git:

repo = '641fb6e08ddb2e4fd096dcf18e80b894bf.git'
origin_url = 'http://foo/bar/git'

import logging
logging.basicConfig(level=logging.DEBUG)

from swh.loader.git.tasks import LoadDiskGitRepository

t = LoadDiskGitRepository()
t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')

Output:

>>> repo = '641fb6e08ddb2e4fd096dcf18e80b894bf.git'
>>> origin_url = 'http://foo/bar/git'
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository
>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://foo/bar/git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://foo/bar/git
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 35, in run
    raise e from None
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 32, in run
    result = self.run_task(*args, **kwargs)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/tasks.py", line 39, in run_task
    return loader.load(origin_url, directory, dateutil.parser.parse(date))
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/base.py", line 422, in load
    self.fetch_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/loader.py", line 48, in fetch_data
    type_name = self.repo[oid].type_name
  File "/usr/lib/python3/dist-packages/dulwich/repo.py", line 474, in __getitem__
    return self.object_store[self.refs[name]]
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 244, in __getitem__
    raise KeyError(name)
KeyError: b'f2bd56289283900cd1dcf2c52b93952f98e45144.deleted'
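
The key in the KeyError looks like a loose-object path glued back together: fan-out directory name plus file name. Below is a minimal sketch of that reading. It assumes the object store enumerates loose objects by concatenating the two names, which is consistent with the traceback but not confirmed here; loose_object_key is a hypothetical helper, not a dulwich function.

```python
def loose_object_key(fanout_dir: str, filename: str) -> bytes:
    """Join a fan-out directory name and a file name into one candidate key.

    Hypothetical helper: it mimics how a loose-object scan could build
    an object name from objects/<fanout_dir>/<filename>.
    """
    return (fanout_dir + filename).encode("ascii")


# The stray file from this repository, split as it would sit on disk.
key = loose_object_key("f2", "bd56289283900cd1dcf2c52b93952f98e45144.deleted")

# A real SHA-1 hex name is exactly 40 characters; this one is longer.
is_plausible_sha1 = len(key) == 40
```

Since the name is not 40 characters long, it cannot be an object id. Judging from the last two frames of the traceback, it is then looked up as a ref name, which is where the KeyError comes from.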

Event Timeline

Running git fsck on that repository shows that this entry is indeed invalid:

$ git fsck
bad sha1 file: ./objects/f2/bd56289283900cd1dcf2c52b93952f98e45144.deleted
Checking object directories: 100% (256/256), done.
Checking objects: 100% (7040/7040), done.
dangling commit caa4103a80ef90db5eb9836f6b6028b7ce36c73a
dangling commit 83b615ce84fd333a4d1c107105ebd788657089b1
dangling commit 74baf8bff5d466fbc02b2e305f7cc7788fbade97
dangling commit 7adddddd79a50a4deae16ba4f5fda736ff286875
dangling commit 7865efa71e8dba289c9b1ac19454f0a9311c8328
dangling commit e6f075b2443b1ed5a2c074eed3edcff02fba1b25
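
One possible way to keep the loader going on such repositories is to skip names that cannot be SHA-1 hex digests before looking them up. This is only a sketch of the idea, not the fix that went into swh-loader-git; is_sha1_hex and valid_object_ids are hypothetical names.

```python
import re
from typing import Iterable, Iterator

# A SHA-1 object id in hex form: exactly 40 lowercase hex digits.
_SHA1_HEX = re.compile(rb"[0-9a-f]{40}")


def is_sha1_hex(name: bytes) -> bool:
    """Return True if name has the shape of a SHA-1 hex object id."""
    return _SHA1_HEX.fullmatch(name) is not None


def valid_object_ids(names: Iterable[bytes]) -> Iterator[bytes]:
    """Yield only the names that can be object ids, dropping stray entries."""
    for name in names:
        if is_sha1_hex(name):
            yield name


candidates = [
    b"caa4103a80ef90db5eb9836f6b6028b7ce36c73a",
    b"f2bd56289283900cd1dcf2c52b93952f98e45144.deleted",
]
kept = list(valid_object_ids(candidates))
```

Filtering on shape only says that a name could be an object id. The lookup can still fail for other reasons, such as a corrupt object, so the caller would probably want to handle that separately.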

A possibly related error:

Repo: /srv/storage/space/mirrors/gitorious.org/mnt/repositories/webkit/qt-haiku-webkit.git

DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://webkit/qt-haiku-webkit.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://webkit/qt-haiku-webkit.git
Traceback (most recent call last):
  File "./load-git-disk.py", line 26, in <module>
    main()
  File "/usr/lib/python3/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "./load-git-disk.py", line 22, in main
    date='2016-05-03T15:16:32+00:00')
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 35, in run
    raise e from None
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 32, in run
    result = self.run_task(*args, **kwargs)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/tasks.py", line 39, in run_task
    return loader.load(origin_url, directory, dateutil.parser.parse(date))
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/base.py", line 422, in load
    self.fetch_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/loader.py", line 48, in fetch_data
    type_name = self.repo[oid].type_name
  File "/usr/lib/python3/dist-packages/dulwich/repo.py", line 476, in __getitem__
    return self.object_store[self.refs[name]]
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 256, in __getitem__
    raise KeyError(name)
KeyError: b'b2tmp_obj_SfIbZp'

However, git fsck reports nothing wrong with this repository:

$ git fsck
Checking object directories: 100% (256/256), done.
Checking objects: 100% (809483/809483), done.
Checking connectivity: 928500, done.
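
The silence of git fsck here is likely because files prefixed with tmp_obj_ are temporary files left by an interrupted write, and as far as I can tell git fsck deliberately does not report them. To see what is lying around in a repository, a small scan of the objects directory can help. The sketch below is stdlib only; stray_loose_files is a hypothetical helper and the two categories are my own labels.

```python
import os
import re
from typing import Dict, List

_FANOUT = re.compile(r"[0-9a-f]{2}")   # objects/<2 hex digits>/
_REST = re.compile(r"[0-9a-f]{38}")    # remaining 38 hex digits of a SHA-1


def stray_loose_files(objects_dir: str) -> Dict[str, List[str]]:
    """List files in the fan-out directories that are not loose objects.

    Entries starting with "tmp_obj_" are reported as "temporary",
    everything else that is not a 38-hex-digit name as "other".
    """
    report: Dict[str, List[str]] = {"temporary": [], "other": []}
    for fanout in sorted(os.listdir(objects_dir)):
        subdir = os.path.join(objects_dir, fanout)
        # Skip pack/, info/ and anything else that is not a fan-out directory.
        if not _FANOUT.fullmatch(fanout) or not os.path.isdir(subdir):
            continue
        for name in sorted(os.listdir(subdir)):
            if _REST.fullmatch(name):
                continue  # well-formed loose object name
            kind = "temporary" if name.startswith("tmp_obj_") else "other"
            report[kind].append(os.path.join(fanout, name))
    return report
```

Called as stray_loose_files('<repo>.git/objects'), it should list b2/tmp_obj_SfIbZp under "temporary" for the webkit repository and f2/bd56289283900cd1dcf2c52b93952f98e45144.deleted under "other" for the first one.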