
Gitorious import: Overflow error in revision time
Closed, Migrated

Description

Error with the Gitorious import task:
{'args': {'directory': '/srv/storage/space/mirrors/gitorious.org/mnt/repositories/nginx-catap/mainline.git', 'date': 'Wed, 30 Mar 2016 09:40:04 +0200', 'origin_url': 'https://gitorious.org/nginx-catap/mainline.git'}, 'exception': "OverflowError('timestamp out of range for platform time_t',)"}

Trying to reproduce, I do not get exactly that error though.

Repository: /srv/storage/space/mirrors/gitorious.org/mnt/repositories/nginx-catap/mainline.git

repo = 'mainline.git'
origin_url = 'http://nginx-catap/mainline.git'

import logging
logging.basicConfig(level=logging.DEBUG)

from swh.loader.git.tasks import LoadDiskGitRepository

t = LoadDiskGitRepository()
t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')

Output:

python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> repo = 'mainline.git'
>>> origin_url = 'http://nginx-catap/mainline.git'
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.git.tasks import LoadDiskGitRepository

>>>
>>> t = LoadDiskGitRepository()
>>> t.run(origin_url=origin_url, directory=repo, date='2016-05-03T15:16:32+00:00')
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Creating git origin for http://nginx-catap/mainline.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done creating git origin for http://nginx-catap/mainline.git
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 8230 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 8230 contents
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 4751 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Done sending 4751 directories
DEBUG:swh.scheduler.task.LoadDiskGitRepository:Sending 847 revisions
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 35, in run
    raise e from None
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-scheduler/swh/scheduler/task.py", line 32, in run
    result = self.run_task(*args, **kwargs)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/tasks.py", line 39, in run_task
    return loader.load(origin_url, directory, dateutil.parser.parse(date))
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/base.py", line 434, in load
    self.send_all_revisions(self.get_revisions())
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/base.py", line 370, in send_all_revisions
    send_in_packets(revisions, self.send_revisions, packet_size)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/base.py", line 40, in send_in_packets
    sender(formatted_objects)
  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-git/swh/loader/git/base.py", line 279, in send_revisions
    self.storage.revision_add(revision_list)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-storage/swh/storage/storage.py", line 640, in revision_add
    lambda rev: parents_filtered.extend(rev['parents']))
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-storage/swh/storage/db.py", line 178, in copy_to
    for d in items:
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-storage/swh/storage/storage.py", line 633, in <genexpr>
    if revision['id'] in revisions_missing)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-storage/swh/storage/converters.py", line 166, in revision_to_db
    date = date_to_db(revision['date'])
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-storage/swh/storage/converters.py", line 150, in date_to_db
    timestamp = datetime.datetime.fromtimestamp(seconds, datetime.timezone.utc)
ValueError: year is out of range

Event Timeline

Debugging some more, the date triggering this error is the following, which does raise the initial overflow error:

>>> date = {'offset': 25624204, 'timestamp': 18446743887488505614, 'negative_utc': None}
>>> from swh.storage import converters
>>> converters.date_to_db(date)
date offset {'offset': 25624204, 'timestamp': 18446743887488505614, 'negative_utc': None}
normalized {'offset': 25624204, 'timestamp': {'microseconds': 0, 'seconds': 18446743887488505614}, 'negative_utc': None}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-storage/swh/storage/converters.py", line 152, in date_to_db
    timestamp = datetime.datetime.fromtimestamp(seconds, datetime.timezone.utc)
OverflowError: timestamp out of range for platform time_t

The revision in question is:

$ python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> id = b'$\xb0<S\xf9\x1e\xc7q\x99(\xab\xfe\xeew\xd6\x1a\xd9\x95\xf5\xed'
>>> from swh.model import hashutil
>>> hashutil.hash_to_hex(id)
'24b03c53f91ec7719928abfeee77d61ad995f5ed'

git show 24b03c53f91ec7719928abfeee77d61ad995f5ed displays that commit's date as the Unix epoch (which seems off):

commit 24b03c53f91ec7719928abfeee77d61ad995f5ed
Author: Igor Sysoev <igor@sysoev.ru>
Date:   Thu Jan 1 00:00:00 1970 +0000

Better, with git show in raw format, the issue with the date shows more clearly:

$ git show --format=raw 24b03c53f91ec7719928abfeee77d61ad995f5ed | head -5
commit 24b03c53f91ec7719928abfeee77d61ad995f5ed
tree 7b6189adcaa2f420f4aa44e1550c3a7de659ad2f
parent 416c46f96c648c0effb9f6adccae3e10d2d7b9f9
author Igor Sysoev <igor@sysoev.ru> 18446743887488505614 +42707004
committer Kirill A. Korinskiy <catap@catap.ru> 18446743887488505614 +42707004
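
To relate the raw author line above to the date dict passed to date_to_db, here is a minimal sketch that parses it by hand. The helper name parse_ident_date is hypothetical and this is not how dulwich or the loader parse it; it only illustrates that the +42707004 time zone, read as hours followed by two minute digits, gives the 25624204 minute offset seen earlier. The last line reinterprets the timestamp as a signed 64-bit value, which is only a guess at how such a number could have been produced.

```python
import re

RAW_LINE = "author Igor Sysoev <igor@sysoev.ru> 18446743887488505614 +42707004"


def parse_ident_date(line):
    """Return (timestamp, offset_minutes) from a raw author/committer line.

    Hypothetical helper for illustration only.
    """
    m = re.search(r"> (\d+) ([+-])(\d+)$", line)
    if m is None:
        raise ValueError("no date found in %r" % line)
    timestamp = int(m.group(1))
    sign = -1 if m.group(2) == "-" else 1
    hhmm = int(m.group(3))
    # The last two digits are minutes, everything before them is hours.
    offset_minutes = sign * ((hhmm // 100) * 60 + hhmm % 100)
    return timestamp, offset_minutes


timestamp, offset = parse_ident_date(RAW_LINE)
print(timestamp, offset)       # timestamp and offset in minutes
print(timestamp > 2**63 - 1)   # does not fit in a signed 64-bit integer
print(timestamp - 2**64)       # same bits read as a signed 64-bit value (assumption)
```

The offset matches the 'offset' field of the failing date, so the loader read the object faithfully; the commit itself carries the bad values.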

And git fsck reports that commit as faulty because of a date overflow:

$ git fsck
Checking object directories: 100% (256/256), done.
error in commit f08bd742418894b02313f44b7eaaf0d85fe53271: badTimezone: invalid author/committer line - bad time zone
error in commit c9c0bdaf1572337bbe0b19800997d3637a43a335: badTimezone: invalid author/committer line - bad time zone
error in commit 24b03c53f91ec7719928abfeee77d61ad995f5ed: badDateOverflow: invalid author/committer line - date causes integer overflow
Checking objects: 100% (14259/14259), done.
ardumont triaged this task as Normal priority. Oct 27 2017, 2:26 PM

PR got merged \m/

There is still a limitation on the loader-git though.

The patch in dulwich is git fsck compliant (as the author wished, which is completely reasonable).
So it will start rejecting dates whose timestamp is above 2^63 - 1.
So the bug identified in this task will be caught.

But for that repository, the load will still fail later on (because of dates overflowing another, lower limit).
In storage, we fail when a timestamp is above ~2^38 (due to datetime's current implementation).
So, this bug is still open.
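
For reference, a small sketch of where the ~2^38 figure comes from, assuming the limit is simply datetime's maximum year (9999). It uses only the standard library; the variable names are mine.

```python
import datetime

utc = datetime.timezone.utc
epoch = datetime.datetime(1970, 1, 1, tzinfo=utc)
# Last whole second that datetime can represent.
last = datetime.datetime(datetime.MAXYEAR, 12, 31, 23, 59, 59, tzinfo=utc)

# Integer division of timedeltas avoids float rounding.
max_seconds = (last - epoch) // datetime.timedelta(seconds=1)
print(max_seconds)
print(2**37 < max_seconds < 2**38)

# One second further no longer fits in a datetime.
try:
    datetime.datetime.fromtimestamp(max_seconds + 1, utc)
    rejected = False
except (ValueError, OverflowError, OSError):
    rejected = True
print(rejected)
```

So a timestamp can pass a 2^63 - 1 check and still be far beyond what date_to_db can convert.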

I'm not so sure of the fix yet.
What's sure is that we must not break in the storage layer.
So the only fix I see for now is to explicitly check those objects with date fields (commit, tag) in the loader-git,
and skip those objects if they fail the check.
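
A rough sketch of what such a check could look like, not the actual fix. The function names date_is_storable and filter_storable are hypothetical, and so is the assumption that the objects are dicts carrying their dates under 'date' and 'committer_date' in the {'timestamp': ..., 'offset': ...} shape shown earlier.

```python
import datetime


def date_is_storable(date):
    """Return True if the date's timestamp can be converted to a datetime.

    Accepts both the raw form (timestamp as int) and the normalized form
    (timestamp as {'seconds': ..., 'microseconds': ...}).
    """
    timestamp = date["timestamp"]
    seconds = timestamp["seconds"] if isinstance(timestamp, dict) else timestamp
    try:
        datetime.datetime.fromtimestamp(seconds, datetime.timezone.utc)
    except (ValueError, OverflowError, OSError):
        return False
    return True


def filter_storable(objects, date_keys=("date", "committer_date")):
    """Split objects into (kept, skipped) according to their date fields."""
    kept, skipped = [], []
    for obj in objects:
        dates = [obj[k] for k in date_keys if obj.get(k) is not None]
        if all(date_is_storable(d) for d in dates):
            kept.append(obj)
        else:
            skipped.append(obj)
    return kept, skipped


# Made-up revisions: one with a sane date, one with the date from this task.
good = {"id": "good", "date": {"timestamp": 1459323604, "offset": 120}}
bad = {"id": "bad", "date": {"timestamp": 18446743887488505614,
                             "offset": 25624204, "negative_utc": None}}
kept, skipped = filter_storable([good, bad])
print([r["id"] for r in kept], [r["id"] for r in skipped])
```

The loader would then log the skipped ids and only send the kept ones to storage, so the storage layer never sees a date it cannot convert.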