
mercurial loader: What to do in case of a corrupted .hgtags?
Closed, Resolved (Public)

Description

Some dumps have a corrupted .hgtags file.
The mercurial loader relies on that file to create the releases.

The question is: what should we do when this kind of error arises?

Do we:

  1. trap that kind of exception and flag a corruption state (this won't be the last corruption we see, cf. T955)
  2. try to interpret the diff (the file is in a diff format, since manual editing is possible) -> no
  3. ignore the error and continue parsing the remaining tags (by that point we have already injected information about that repository...)

    -> the wiki states 'In case of a merge conflict on your tags, the safest option is to take both sides.', so that sounds like the most reasonable approach...

sample archive: /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/johnbrugge-pipeline-modules/johnbrugge-pipeline-modules-source-archive.zip

Corrupted file output:

d912bc55aaf5ab6bd0c0df1f5992d96715f7a8ab alpha-preview
91b816cabbf0e74ee57868ad0b261ed6d92a0ce9 1.0-beta1
bb5421567f337262e4dde8ccb2798ba28589b859 1.0-beta2
8546862ccc64f558d515383eac9892b6e8ff7cd0 1.0-beta3
aa2bf052410260b28119d4570fcea117e3005c23 1.0-rc
bdabdf77108a73fdf5112a30e17fca6ed66b9ebc 1.0
67d273e73e67961939f5f477fa47a99e48b2cbe9 1.0.1
5a1ee169c65625d69a047ddd11689b069cdc8e8e 1.1
bbc888b43fb16fae9ac07a495180293d39eeed84 1.2
015c906bbea25fedb250164acc76fed6635254f0 1.3-beta
<<<<<<< local
bc59558bb7053b68c57a16bcfde98fe1e6d403a1 bks-epub3-beta
4571d156ffd8246314e51269fc02ca60ded15b49 bks-epub3-beta2
bfc11100e4a48073cb2571b42a8a415daa8da163 bks-epub3-1.3
=======
ca2d0d49533e7b3a6309f62417fd042e060fbfa0 1.3
>>>>>>> other
99bca97d1bcc9b2c243d2d0c555519fe1f6f2f12 bks-epub3-1.3.1

Reproduction:

archive_name = 'johnbrugge-pipeline-modules-source-archive.zip'  # not hex found
rootpath = '/home/storage/hg/repo'
origin_url = 'https://%s/googlecode/hg' % archive_name

import logging
logging.basicConfig(level=logging.DEBUG)

from swh.loader.mercurial.tasks import LoadArchiveMercurialTsk

archive_path = '%s/%s' % (rootpath, archive_name)
t = LoadArchiveMercurialTsk()
t.run(origin_url=origin_url, archive_path=archive_path, visit_date='2016-05-03T15:16:32+00:00')

Output:

python3
Python 3.6.4 (default, Jan  5 2018, 02:13:53)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> archive_name = 'johnbrugge-pipeline-modules-source-archive.zip'  # not hex found
>>> rootpath = '/home/storage/hg/repo'
>>> origin_url = 'https://%s/googlecode/hg' % archive_name
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.mercurial.tasks import LoadArchiveMercurialTsk
>>>
>>> archive_path = '%s/%s' % (rootpath, archive_name)
>>> t = LoadArchiveMercurialTsk()
>>> t.run(origin_url=origin_url, archive_path=archive_path, visit_date='2016-05-03T15:16:32+00:00')
patool: Extracting /home/storage/hg/repo/johnbrugge-pipeline-modules-source-archive.zip ...
patool: running /usr/bin/7z x -y -o/tmp/swh.loader.mercurial.wsy_4dcj -- /home/storage/hg/repo/johnbrugge-pipeline-modules-source-archive.zip
patool: ... /home/storage/hg/repo/johnbrugge-pipeline-modules-source-archive.zip extracted to `/tmp/swh.loader.mercurial.wsy_4dcj'.
INFO:swh.scheduler.task.LoadArchiveMercurialTsk:From https://johnbrugge-pipeline-modules-source-archive.zip/googlecode/hg - Uncompressing archive johnbrugge-pipeline-modules-source-archive.zip at /tmp/swh.loader.mercurial.wsy_4dcj/johnbrugge-pipeline-modules-source-archive
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Bundling at /tmp/swh.loader.mercurial.wsy_4dcj/johnbrugge-pipeline-modules/HG20_none_bundle
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Creating hg origin for https://johnbrugge-pipeline-modules-source-archive.zip/googlecode/hg
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Done creating hg origin for https://johnbrugge-pipeline-modules-source-archive.zip/googlecode/hg
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Done Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
##### b'd912bc55aaf5ab6bd0c0df1f5992d96715f7a8ab alpha-preview' b'd912bc55aaf5ab6bd0c0df1f5992d96715f7a8ab' b'alpha-preview'
##### b'91b816cabbf0e74ee57868ad0b261ed6d92a0ce9 1.0-beta1' b'91b816cabbf0e74ee57868ad0b261ed6d92a0ce9' b'1.0-beta1'
##### b'bb5421567f337262e4dde8ccb2798ba28589b859 1.0-beta2' b'bb5421567f337262e4dde8ccb2798ba28589b859' b'1.0-beta2'
##### b'8546862ccc64f558d515383eac9892b6e8ff7cd0 1.0-beta3' b'8546862ccc64f558d515383eac9892b6e8ff7cd0' b'1.0-beta3'
##### b'aa2bf052410260b28119d4570fcea117e3005c23 1.0-rc' b'aa2bf052410260b28119d4570fcea117e3005c23' b'1.0-rc'
##### b'bdabdf77108a73fdf5112a30e17fca6ed66b9ebc 1.0' b'bdabdf77108a73fdf5112a30e17fca6ed66b9ebc' b'1.0'
##### b'67d273e73e67961939f5f477fa47a99e48b2cbe9 1.0.1' b'67d273e73e67961939f5f477fa47a99e48b2cbe9' b'1.0.1'
##### b'5a1ee169c65625d69a047ddd11689b069cdc8e8e 1.1' b'5a1ee169c65625d69a047ddd11689b069cdc8e8e' b'1.1'
##### b'bbc888b43fb16fae9ac07a495180293d39eeed84 1.2' b'bbc888b43fb16fae9ac07a495180293d39eeed84' b'1.2'
##### b'015c906bbea25fedb250164acc76fed6635254f0 1.3-beta' b'015c906bbea25fedb250164acc76fed6635254f0' b'1.3-beta'
##### b'<<<<<<< local' b'<<<<<<<' b'local'
ERROR:swh.scheduler.task.LoadArchiveMercurialTsk:Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 861, in load
    self.store_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 974, in store_data
    self.send_batch_releases(self.get_releases())
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 660, in send_batch_releases
    send_in_packets(releases, self.send_releases, packet_size)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 34, in send_in_packets
    for obj in objects:
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-mercurial/swh/loader/mercurial/bundle20_loader.py", line 423, in get_releases
    'target': hashutil.hash_to_bytes(node.decode()),
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-model/swh/model/hashutil.py", line 261, in hash_to_bytes
    return bytes.fromhex(hash)
ValueError: non-hexadecimal number found in fromhex() arg at position 0
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Updating origin_visit for origin 1 with status partial
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Done updating origin_visit for origin 1 with status partial
{'status': 'failed'}
>>>
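
For reference, the failure can be isolated to the conflict marker line alone. This is a minimal sketch, assuming the loader splits each .hgtags line on the first space (which is what the `#####` debug lines above suggest) and then hex-decodes the first field, as `hash_to_bytes` does in the traceback. The variable names are illustrative, not the loader's own.

```python
# Assumption: each .hgtags line is split into (node, name) on the first space.
line = b'<<<<<<< local'
node, name = line.split(b' ', 1)

try:
    # Same call as in hashutil.hash_to_bytes from the traceback above.
    bytes.fromhex(node.decode())
    failed = False
    message = ''
except ValueError as exc:
    # b'<<<<<<<' is not a hexadecimal string, so decoding fails right away.
    failed = True
    message = str(exc)

print(failed, message)
```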

Event Timeline

ardumont created this task. Feb 16 2018, 3:43 PM
ardumont updated the task description. Feb 16 2018, 3:58 PM

I've chosen option 3 to comply with the doc's suggestion.
As usual, nothing is set in stone.

We can always change this later.

I agree with taking tags from both sides and discarding all lines that don't fit the pattern.
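
A minimal sketch of that approach (tags from both sides, non-matching lines discarded). It is not the loader's actual code: the `parse_hgtags` helper, the `HGTAGS_LINE` pattern and the shortened sample are hypothetical, and the pattern assumes a well-formed line is 40 lowercase hex digits, one space, then the tag name.

```python
import re

# Assumed shape of a well-formed .hgtags line:
# 40 lowercase hex digits, one space, then the tag name.
HGTAGS_LINE = re.compile(rb'^(?P<node>[0-9a-f]{40}) (?P<name>.+)$')


def parse_hgtags(data: bytes):
    """Yield (node, name) pairs, skipping lines that do not fit the pattern."""
    for line in data.splitlines():
        match = HGTAGS_LINE.match(line.strip())
        if match is None:
            # Conflict markers ('<<<<<<< local', '=======', '>>>>>>> other')
            # and any other malformed line are dropped here.
            continue
        yield bytes.fromhex(match.group('node').decode()), match.group('name')


# Shortened excerpt of the corrupted file from the description.
sample = b"""\
015c906bbea25fedb250164acc76fed6635254f0 1.3-beta
<<<<<<< local
bfc11100e4a48073cb2571b42a8a415daa8da163 bks-epub3-1.3
=======
ca2d0d49533e7b3a6309f62417fd042e060fbfa0 1.3
>>>>>>> other
99bca97d1bcc9b2c243d2d0c555519fe1f6f2f12 bks-epub3-1.3.1
"""

tags = [name for _, name in parse_hgtags(sample)]
print(tags)
```

With this filter, the tags from both the `local` and `other` sides are kept, and only the three marker lines are lost.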