
Fix UnicodeDecodeError in revision metadata conversion
Closed, Resolved (Public)

Description

Browsing https://archive.softwareheritage.org/browse/origin/https://www.mercurial-scm.org/repo/hg/ currently raises the following error:

Traceback (most recent call last):
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/browse/views/utils/snapshot_context.py", line 239, in browse_snapshot_directory
    browse_context='directory') # noqa
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/browse/views/utils/snapshot_context.py", line 135, in _process_snapshot_request
    origin_url, timestamp, visit_id)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/browse/utils.py", line 938, in get_snapshot_context
    snapshot_id)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/browse/utils.py", line 468, in get_origin_visit_snapshot
    return get_snapshot_content(visit_info['snapshot'])
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/browse/utils.py", line 425, in get_snapshot_content
    branches, releases = process_snapshot_branches(snapshot)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/browse/utils.py", line 356, in process_snapshot_branches
    for revision in revisions:
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/common/service.py", line 453, in 
    return (converters.from_revision(r) for r in revisions)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/common/converters.py", line 281, in from_revision
    dates={'date', 'committer_date'})
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/common/converters.py", line 149, in from_swh
    new_dict[key] = convert_fn(value)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/common/converters.py", line 242, in convert_revision_metadata
    return json.loads(json.dumps(metadata, cls=SWHMetadataEncoder))
  File "/usr/local/lib/python3.7/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/web/common/converters.py", line 230, in default
    return obj.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 4: invalid start byte

This needs to be fixed.

Event Timeline

anlambert triaged this task as Normal priority. May 21 2019, 1:59 PM
anlambert created this task.
anlambert added a comment (edited). May 21 2019, 2:50 PM

The error comes from the decoding of the following revision metadata:

{'extra_headers': [['time_offset_seconds', b'-32400'], ['branch', b'stable'], ['transplant_source', b't>\x03\x1a\x86\xaa\xdfAS\xffM\x94N\xd7\x196nV\xc7\xb5']], 'node': '3fee7f7d2da04226914c2258cc2884dc27384fd7'}

The transplant_source entry stores a reference to a Mercurial nodeid as raw binary (20 bytes here) rather than as hex text [1].
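
The failure can be reproduced outside swh-web with the value quoted above. This is only a minimal sketch: the helper name try_decode is hypothetical, and the bytes literal is copied from the metadata dump.

```python
# Raw transplant_source value copied from the revision metadata above.
raw = b't>\x03\x1a\x86\xaa\xdfAS\xffM\x94N\xd7\x196nV\xc7\xb5'


def try_decode(value: bytes):
    """Hypothetical helper: return (text, None) on success, (None, error) on failure."""
    try:
        return value.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, exc


decoded, error = try_decode(raw)
print(len(raw))   # length of the binary nodeid in bytes
print(error)      # same message as the last line of the traceback
```

The value is 20 bytes long, and decoding stops at position 4 on byte 0x86, which matches the traceback.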

I think it is up to the Mercurial loader to convert that binary nodeid to hex format and encode it back to UTF-8 before
generating a swh revision.
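
A rough sketch of what such a loader-side conversion could look like. The function normalize_extra_headers and the BINARY_NODEID_KEYS set are hypothetical names used for illustration; they are not part of the actual Mercurial loader, and the set of keys holding binary nodeids is an assumption.

```python
from binascii import hexlify

# Assumption: extra header keys whose values are raw binary nodeids.
BINARY_NODEID_KEYS = {'transplant_source'}


def normalize_extra_headers(extra_headers):
    """Hypothetical helper: hex-encode binary nodeid values, leave others untouched."""
    normalized = []
    for key, value in extra_headers:
        if key in BINARY_NODEID_KEYS and isinstance(value, bytes):
            # hexlify returns ASCII bytes, which are always valid UTF-8.
            value = hexlify(value)
        normalized.append([key, value])
    return normalized


extra_headers = [
    ['time_offset_seconds', b'-32400'],
    ['branch', b'stable'],
    ['transplant_source',
     b't>\x03\x1a\x86\xaa\xdfAS\xffM\x94N\xd7\x196nV\xc7\xb5'],
]
normalized = normalize_extra_headers(extra_headers)
```

With this conversion every value stays a bytes object, like the other headers, but can be decoded as UTF-8 by downstream consumers.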

Nevertheless, I should handle UnicodeDecodeError better in the convert_revision_metadata function.
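
One possible shape for that handling, as a sketch only: the class SafeMetadataEncoder and the function convert_metadata_safely are hypothetical stand-ins, not the real SWHMetadataEncoder or convert_revision_metadata, and falling back to the backslashreplace error handler is an assumption about what a reasonable fallback might be.

```python
import json


class SafeMetadataEncoder(json.JSONEncoder):
    """Hypothetical encoder: decode bytes as UTF-8, escaping undecodable bytes."""

    def default(self, obj):
        if isinstance(obj, bytes):
            try:
                return obj.decode('utf-8')
            except UnicodeDecodeError:
                # Assumed fallback: keep decodable parts, escape the rest as \xNN.
                return obj.decode('utf-8', errors='backslashreplace')
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)


def convert_metadata_safely(metadata):
    """Hypothetical counterpart of the failing json round trip in the traceback."""
    return json.loads(json.dumps(metadata, cls=SafeMetadataEncoder))


metadata = {
    'extra_headers': [
        ['time_offset_seconds', b'-32400'],
        ['branch', b'stable'],
        ['transplant_source',
         b't>\x03\x1a\x86\xaa\xdfAS\xffM\x94N\xd7\x196nV\xc7\xb5'],
    ],
    'node': '3fee7f7d2da04226914c2258cc2884dc27384fd7',
}
converted = convert_metadata_safely(metadata)
```

This keeps the page from failing on such revisions, but the escaped string is lossy for display purposes, so it complements rather than replaces a fix on the loader side.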

[1] https://www.mercurial-scm.org/repo/hg/file/tip/hgext/convert/hg.py#l299