Some [[ https://forge.softwareheritage.org/T965#17949 | dumps have a corrupted ]] .hgtags file.
The mercurial loader needs that file to create the releases.
The question is: what should we do when this kind of error arises?
Do we:
- trap that kind of exception and trigger a corruption state (this won't be the last corruption we see, cf. T955)
- try to interpret the merge conflict markers (the file can contain them since [[ https://www.mercurial-scm.org/wiki/Tag | manual editing is possible ]]) -> no
- ignore the error and continue parsing the remaining tags (by that point we have already injected information about that repository...)
-> the [[ https://www.mercurial-scm.org/wiki/Tag#My_tags_had_a_conflict_when_I_was_merging.__Why.3F__How_should_I_merge_them.3F | wiki ]] states 'In case of a merge conflict on your tags, the safest option is to take both sides.', so that sounds like the most reasonable approach...
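To make the "take both sides" option concrete, here is a minimal sketch of a lenient .hgtags parser. It is not the loader's actual code: `parse_hgtags` and `HGTAGS_LINE` are hypothetical names, and it assumes every valid line has the form `<40 hex characters> <tag name>`. Lines that do not match (conflict markers such as `<<<<<<< local`, `=======`, `>>>>>>> other`) are set aside instead of raising, so the tags from both sides of the conflict are kept.

```python
import re

# Assumption: a valid .hgtags line is "<40 hex chars> <tag name>".
HGTAGS_LINE = re.compile(rb'^([0-9a-fA-F]{40}) (.+)$')


def parse_hgtags(content: bytes):
    """Parse raw .hgtags content leniently (hypothetical helper).

    Returns a pair (tags, skipped):
    - tags: list of (node as 20 raw bytes, tag name as bytes)
    - skipped: list of lines that are not valid tag entries
    """
    tags, skipped = [], []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        match = HGTAGS_LINE.match(line)
        if match:
            node = bytes.fromhex(match.group(1).decode())
            tags.append((node, match.group(2)))
        else:
            # Conflict markers or any other garbage: keep for reporting
            skipped.append(line)
    return tags, skipped
```

The `skipped` list could be logged, so the corruption stays visible even though loading continues.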
Sample archive: /srv/storage/space/mirrors/code.google.com/sources/v2/code.google.com/j/johnbrugge-pipeline-modules/johnbrugge-pipeline-modules-source-archive.zip
Content of the corrupted .hgtags file:
```
d912bc55aaf5ab6bd0c0df1f5992d96715f7a8ab alpha-preview
91b816cabbf0e74ee57868ad0b261ed6d92a0ce9 1.0-beta1
bb5421567f337262e4dde8ccb2798ba28589b859 1.0-beta2
8546862ccc64f558d515383eac9892b6e8ff7cd0 1.0-beta3
aa2bf052410260b28119d4570fcea117e3005c23 1.0-rc
bdabdf77108a73fdf5112a30e17fca6ed66b9ebc 1.0
67d273e73e67961939f5f477fa47a99e48b2cbe9 1.0.1
5a1ee169c65625d69a047ddd11689b069cdc8e8e 1.1
bbc888b43fb16fae9ac07a495180293d39eeed84 1.2
015c906bbea25fedb250164acc76fed6635254f0 1.3-beta
<<<<<<< local
bc59558bb7053b68c57a16bcfde98fe1e6d403a1 bks-epub3-beta
4571d156ffd8246314e51269fc02ca60ded15b49 bks-epub3-beta2
bfc11100e4a48073cb2571b42a8a415daa8da163 bks-epub3-1.3
=======
ca2d0d49533e7b3a6309f62417fd042e060fbfa0 1.3
>>>>>>> other
99bca97d1bcc9b2c243d2d0c555519fe1f6f2f12 bks-epub3-1.3.1
```
Reproduction:
```
import logging

from swh.loader.mercurial.tasks import LoadArchiveMercurialTsk

logging.basicConfig(level=logging.DEBUG)

# Archive whose .hgtags contains merge conflict markers
# (loading fails with "non-hexadecimal number found")
archive_name = 'johnbrugge-pipeline-modules-source-archive.zip'
rootpath = '/home/storage/hg/repo'
origin_url = 'https://%s/googlecode/hg' % archive_name
archive_path = '%s/%s' % (rootpath, archive_name)

t = LoadArchiveMercurialTsk()
t.run(origin_url=origin_url, archive_path=archive_path, visit_date='2016-05-03T15:16:32+00:00')
```
Output:
```
python3
Python 3.6.4 (default, Jan 5 2018, 02:13:53)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> archive_name = 'johnbrugge-pipeline-modules-source-archive.zip' # not hex found
>>> rootpath = '/home/storage/hg/repo'
>>> origin_url = 'https://%s/googlecode/hg' % archive_name
>>>
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>>
>>> from swh.loader.mercurial.tasks import LoadArchiveMercurialTsk
>>>
>>> archive_path = '%s/%s' % (rootpath, archive_name)
>>> t = LoadArchiveMercurialTsk()
>>> t.run(origin_url=origin_url, archive_path=archive_path, visit_date='2016-05-03T15:16:32+00:00')
patool: Extracting /home/storage/hg/repo/johnbrugge-pipeline-modules-source-archive.zip ...
patool: running /usr/bin/7z x -y -o/tmp/swh.loader.mercurial.wsy_4dcj -- /home/storage/hg/repo/johnbrugge-pipeline-modules-source-archive.zip
patool: ... /home/storage/hg/repo/johnbrugge-pipeline-modules-source-archive.zip extracted to `/tmp/swh.loader.mercurial.wsy_4dcj'.
INFO:swh.scheduler.task.LoadArchiveMercurialTsk:From https://johnbrugge-pipeline-modules-source-archive.zip/googlecode/hg - Uncompressing archive johnbrugge-pipeline-modules-source-archive.zip at /tmp/swh.loader.mercurial.wsy_4dcj/johnbrugge-pipeline-modules-source-archive
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Bundling at /tmp/swh.loader.mercurial.wsy_4dcj/johnbrugge-pipeline-modules/HG20_none_bundle
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Creating hg origin for https://johnbrugge-pipeline-modules-source-archive.zip/googlecode/hg
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Done creating hg origin for https://johnbrugge-pipeline-modules-source-archive.zip/googlecode/hg
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Done Creating origin_visit for origin 1 at time 2016-05-03 15:16:32+00:00
##### b'd912bc55aaf5ab6bd0c0df1f5992d96715f7a8ab alpha-preview' b'd912bc55aaf5ab6bd0c0df1f5992d96715f7a8ab' b'alpha-preview'
##### b'91b816cabbf0e74ee57868ad0b261ed6d92a0ce9 1.0-beta1' b'91b816cabbf0e74ee57868ad0b261ed6d92a0ce9' b'1.0-beta1'
##### b'bb5421567f337262e4dde8ccb2798ba28589b859 1.0-beta2' b'bb5421567f337262e4dde8ccb2798ba28589b859' b'1.0-beta2'
##### b'8546862ccc64f558d515383eac9892b6e8ff7cd0 1.0-beta3' b'8546862ccc64f558d515383eac9892b6e8ff7cd0' b'1.0-beta3'
##### b'aa2bf052410260b28119d4570fcea117e3005c23 1.0-rc' b'aa2bf052410260b28119d4570fcea117e3005c23' b'1.0-rc'
##### b'bdabdf77108a73fdf5112a30e17fca6ed66b9ebc 1.0' b'bdabdf77108a73fdf5112a30e17fca6ed66b9ebc' b'1.0'
##### b'67d273e73e67961939f5f477fa47a99e48b2cbe9 1.0.1' b'67d273e73e67961939f5f477fa47a99e48b2cbe9' b'1.0.1'
##### b'5a1ee169c65625d69a047ddd11689b069cdc8e8e 1.1' b'5a1ee169c65625d69a047ddd11689b069cdc8e8e' b'1.1'
##### b'bbc888b43fb16fae9ac07a495180293d39eeed84 1.2' b'bbc888b43fb16fae9ac07a495180293d39eeed84' b'1.2'
##### b'015c906bbea25fedb250164acc76fed6635254f0 1.3-beta' b'015c906bbea25fedb250164acc76fed6635254f0' b'1.3-beta'
##### b'<<<<<<< local' b'<<<<<<<' b'local'
ERROR:swh.scheduler.task.LoadArchiveMercurialTsk:Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 861, in load
    self.store_data()
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 974, in store_data
    self.send_batch_releases(self.get_releases())
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 660, in send_batch_releases
    send_in_packets(releases, self.send_releases, packet_size)
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 34, in send_in_packets
    for obj in objects:
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-loader-mercurial/swh/loader/mercurial/bundle20_loader.py", line 423, in get_releases
    'target': hashutil.hash_to_bytes(node.decode()),
  File "/home/tony/work/inria/repo/swh/swh-environment/swh-model/swh/model/hashutil.py", line 261, in hash_to_bytes
    return bytes.fromhex(hash)
ValueError: non-hexadecimal number found in fromhex() arg at position 0
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Updating origin_visit for origin 1 with status partial
DEBUG:swh.scheduler.task.LoadArchiveMercurialTsk:Done updating origin_visit for origin 1 with status partial
{'status': 'failed'}
>>>
```
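The traceback comes down to `bytes.fromhex` receiving the conflict marker `<<<<<<<` where a changeset id was expected. For the first option (trap the error and flag a corruption), a rough sketch could look like the following. `HgtagsCorruption` and `node_to_bytes` are hypothetical names used for illustration only; they are not part of swh.loader.mercurial or swh.model.

```python
class HgtagsCorruption(Exception):
    """Hypothetical exception signalling a malformed .hgtags entry."""


def node_to_bytes(node: bytes) -> bytes:
    """Convert a hex changeset id read from .hgtags into raw bytes.

    Raises HgtagsCorruption instead of a bare ValueError, so the caller
    can tell a corrupted tags file apart from other failures.
    """
    try:
        # UnicodeDecodeError is a subclass of ValueError, so both a
        # non-text node and a non-hex node end up in the except branch.
        return bytes.fromhex(node.decode())
    except ValueError as exc:
        raise HgtagsCorruption('invalid node in .hgtags: %r' % node) from exc
```

A caller could catch `HgtagsCorruption` either to mark the visit as corrupted or to skip the offending line and continue with the remaining tags.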