
mimetype indexer: edge case makes the indexer fail miserably
Closed, Migrated, Edits Locked

Description

Some particular raw_content values make the mimetype indexer fail with an unclear error message.
We must understand this case and handle it properly.

Stacktrace:

Nov 29 08:00:42 worker01.euwest.azure python3[88934]: [2017-11-29 08:00:42,355: ERROR/Worker-5723] Problem when reading contents metadata.
                                                      Traceback (most recent call last):
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 364, in run
                                                          res = self.index(sha1, raw_content)
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/mimetype.py", line 90, in index
                                                          properties = compute_mimetype_encoding(data)
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/mimetype.py", line 25, in compute_mimetype_encoding
                                                          r = magic.detect_from_content(raw_content)
                                                        File "/usr/lib/python3/dist-packages/magic.py", line 277, in detect_from_content
                                                          none_magic.buffer(byte_content))
                                                        File "/usr/lib/python3/dist-packages/magic.py", line 155, in buffer
                                                          return str(r, 'utf-8')
                                                      TypeError: coercing to str: need a bytes-like object, NoneType found

Event Timeline

ardumont renamed this task from "mimetype indexer: when no result is returned, indexer fails miserably" to "mimetype indexer: edge case makes the indexer fails miserably". Nov 29 2017, 9:31 AM
ardumont updated the task description.
ardumont renamed this task from "mimetype indexer: edge case makes the indexer fails miserably" to "mimetype indexer: edge case makes the indexer fail miserably". Nov 29 2017, 9:36 AM

Example Sha1 with that error is '099c7254742e2be54a86d03a3a1826a7b8e757d0':

$ python3
>>> from swh.objstorage import get_objstorage
>>> objstorage = get_objstorage(cls='remote', args={'url': 'http://uffizi.internal.sofwareheritage.org:5003'})
>>> h = '099c7254742e2be54a86d03a3a1826a7b8e757d0'
>>> r = objstorage.get(h)
>>> with open('blah', 'wb') as f: f.write(r)
...
1904315
>>> exit
$ file blah
blah: JPEG image data, Exif standard: [TIFF image data, little-endian, direntries=12, height=1208, bps=158, PhotometricIntepretation=RGB, orientation=upper-left, width=640]
$ python3
>>> with open('blah', 'rb') as f: raw_content = f.read()
...
>>> len(raw_content) > 0
True
>>> import magic
>>> magic.detect_from_content(raw_content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/magic.py", line 277, in detect_from_content
    none_magic.buffer(byte_content))
  File "/usr/lib/python3/dist-packages/magic.py", line 155, in buffer
    return str(r, 'utf-8')
TypeError: coercing to str: need a bytes-like object, NoneType found

Sounds like a bug against python3-magic.
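
To make the failure mode concrete: the last frame of the traceback is `return str(r, 'utf-8')`, and the error says `r` was None. The sketch below reproduces that decode step in isolation and shows what a guarded variant could look like. `decode_magic_result` is a hypothetical helper, not something that exists in magic.py, and why the underlying library call produced None for this content is not established here.

```python
def decode_magic_result(r):
    """Hypothetical guarded variant of the decode step seen in the traceback.

    Returns None instead of raising when no result was produced.
    """
    if r is None:
        return None
    return str(r, 'utf-8')


# The unguarded decode from the traceback fails as soon as r is None.
try:
    str(None, 'utf-8')
except TypeError as exc:
    unguarded_error = exc
else:
    unguarded_error = None
```

The exact wording of the TypeError message varies between Python versions, so only the exception type should be relied on.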

Following the breadcrumbs here to find the bug tracker is somewhat tedious.

$ apt-cache show python3-magic | grep Homepage
Homepage: http://www.darwinsys.com/file/
Homepage: http://www.darwinsys.com/file/

Homepage not responding...

So I looked on PyPI instead: https://pypi.python.org/pypi/file-magic/0.3.0.
There I found another homepage, https://github.com/file/file, which is a read-only fork (its issue tracker is not activated).

And according to its README, the maintenance sites are down:

Mailing List: file@mx.gw.com  [currently down]
Mailing List archives: http://mx.gw.com/pipermail/file/  [currently down]
Bug tracker: http://bugs.gw.com/  [currently down]

Note: I no longer trust only what I read, so I checked: they are indeed not responding.

All maintenance channels seem to be down for that package, and the last package upload dates from 2016-02-02 (almost 2 years ago).
There is still hope though, as the last commit (in the mirror) was made 19 hours ago (today: Wed Nov 29 10:03:54 CET 2017).

Which means:

  1. I need to check whether the bug still happens with the latest version
  2. I need to find some way to report the error and discuss it with the maintainer (either to propose a fix, or to ask for a new release)

In the meantime, as the indexer is not my main concern, I'll work around this so that the indexers are not stopped entirely (otherwise some batches can get stuck in the rescheduling loop).
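
A minimal sketch of what such a workaround could look like, assuming the detection call is wrapped so that a TypeError on one content does not abort the whole batch. The name `compute_mimetype_encoding_safe`, the `detect` parameter, and the returned dict keys are illustrative assumptions, not the actual swh.indexer code; the `mime_type` and `encoding` attributes are assumed to be what `magic.detect_from_content` exposes on its result.

```python
import logging

logger = logging.getLogger(__name__)


def compute_mimetype_encoding_safe(raw_content, detect):
    """Hypothetical wrapper around the mimetype detection step.

    `detect` is the detection callable (assumed to be
    magic.detect_from_content in the indexer). Returns None when
    detection fails with the TypeError seen in the traceback, so the
    caller can skip that content instead of failing the whole batch.
    """
    try:
        r = detect(raw_content)
    except TypeError as exc:
        # Log and give up on this content only.
        logger.warning('mimetype detection failed: %s', exc)
        return None
    return {'mimetype': r.mime_type, 'encoding': r.encoding}
```

In the indexer this would be called as `compute_mimetype_encoding_safe(raw_content, magic.detect_from_content)`. What to do with a None result (skip the content, or record a placeholder) is left to the caller.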

vlorentz lowered the priority of this task from Normal to Low. Jan 11 2019, 11:06 AM