Page MenuHomeSoftware Heritage

Indexer mimetype - Fix parsing error
Closed, MigratedEdits Locked

Description

Once in a while, the indexer mimetype raises an error when it tries to parse the output of file --mime error.

Oct 10 07:49:13 worker01.euwest.azure python3[48818]: Traceback (most recent call last):
Oct 10 07:49:13 worker01.euwest.azure python3[48818]:   File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 220, in run
Oct 10 07:49:13 worker01.euwest.azure python3[48818]:     res = self.index_content(sha1, raw_content)
Oct 10 07:49:13 worker01.euwest.azure python3[48818]:   File "/usr/lib/python3/dist-packages/swh/indexer/mimetype.py", line 97, in index_content
Oct 10 07:49:13 worker01.euwest.azure python3[48818]:     properties = compute_mimetype_encoding(raw_content, log=self.log)
Oct 10 07:49:13 worker01.euwest.azure python3[48818]:   File "/usr/lib/python3/dist-packages/swh/indexer/mimetype.py", line 33, in compute_mimetype_encoding
Oct 10 07:49:13 worker01.euwest.azure python3[48818]:     encoding = res[1].split(b'=')[1]
Oct 10 07:49:13 worker01.euwest.azure python3[48818]: IndexError: list index out of range
Oct 10 07:49:13 worker01.euwest.azure python3[49118]: [2017-10-10 07:49:13,252: ERROR/Worker-2] Problem when reading contents metadata.
                                                      Traceback (most recent call last):
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 220, in run
                                                          res = self.index_content(sha1, raw_content)
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/mimetype.py", line 97, in index_content
                                                          properties = compute_mimetype_encoding(raw_content)
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/mimetype.py", line 33, in compute_mimetype_encoding
                                                          encoding = res[1].split(b'=')[1]
                                                      IndexError: list index out of range
Oct 10 07:49:13 worker01.euwest.azure python3[49118]: [2017-10-10 07:49:13,364: WARNING/Worker-2] Rescheduling batch

Event Timeline

Adding some log, this is the response from the output.

Oct 10 08:02:34 worker01.euwest.azure python3[52800]: [2017-10-10 08:02:34,061: INFO/Worker-1] res: [b'no read permission'], mimetype: b'no read permission', encoding: None

This indeed cannot be parsed as expected, thus the failure.

Now as to why this happened...

Forcing the encoding to None expectedly raises an integrity error which reveals the object's id.

Oct 10 08:07:21 worker01.euwest.azure python3[54698]: [2017-10-10 08:07:21,590: INFO/Worker-3] res: [b'no read permission'], mimetype: b'no read permission', encoding: None
Oct 10 08:07:33 worker01.euwest.azure python3[52793]: ERROR:swh.indexer:Problem when reading contents metadata.
Oct 10 08:07:33 worker01.euwest.azure python3[52793]: Traceback (most recent call last):
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:   File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 224, in run
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:     self.persist_index_computations(results, policy_update)
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:   File "/usr/lib/python3/dist-packages/swh/indexer/mimetype.py", line 123, in persist_index_computations
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:     results, conflict_update=(policy_update == 'update-dups'))
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:   File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 185, in content_mimetype_add
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:     'conflict_update': conflict_update,
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:   File "/usr/lib/python3/dist-packages/swh/core/api.py", line 62, in post
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:     return self._decode_response(response)
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:   File "/usr/lib/python3/dist-packages/swh/core/api.py", line 91, in _decode_response
Oct 10 08:07:33 worker01.euwest.azure python3[52793]:     raise pickle.loads(decode_response(response))
Oct 10 08:07:33 worker01.euwest.azure python3[52793]: psycopg2.IntegrityError: null value in column "encoding" violates not-null constraint
Oct 10 08:07:33 worker01.euwest.azure python3[52793]: DETAIL:  Failing row contains (\x299745b422fabc0dd504694fc8235ceed3e24da0, \x6e6f2072656164207065726d697373696f6e, null, 7).
Oct 10 08:07:33 worker01.euwest.azure python3[52793]: CONTEXT:  COPY tmp_content_mimetype, line 32: "\x299745b422fabc0dd504694fc8235ceed3e24da0,\x6e6f2072656164207065726d697373696f6e,,7"
Oct 10 08:07:33 worker01.euwest.azure python3[52793]: WARNING:swh.indexer:Rescheduling batch
Oct 10 08:07:33 worker01.euwest.azure python3[54698]: [2017-10-10 08:07:33,302: ERROR/Worker-3] Problem when reading contents metadata.
                                                      Traceback (most recent call last):
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 224, in run
                                                          self.persist_index_computations(results, policy_update)
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/mimetype.py", line 123, in persist_index_computations
                                                          results, conflict_update=(policy_update == 'update-dups'))
                                                        File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 185, in content_mimetype_add
                                                          'conflict_update': conflict_update,
                                                        File "/usr/lib/python3/dist-packages/swh/core/api.py", line 62, in post
                                                          return self._decode_response(response)
                                                        File "/usr/lib/python3/dist-packages/swh/core/api.py", line 91, in _decode_response
                                                          raise pickle.loads(decode_response(response))
                                                      psycopg2.IntegrityError: null value in column "encoding" violates not-null constraint
                                                      DETAIL:  Failing row contains (\x299745b422fabc0dd504694fc8235ceed3e24da0, \x6e6f2072656164207065726d697373696f6e, null, 7).
                                                      CONTEXT:  COPY tmp_content_mimetype, line 32: "\x299745b422fabc0dd504694fc8235ceed3e24da0,\x6e6f2072656164207065726d697373696f6e,,7"

Ok, this is the empty file:

In [10]: from swh.objstorage import get_objstorage

In [11]: o = get_objstorage(**c['objstorage'])

In [17]: o.get('299745b422fabc0dd504694fc8235ceed3e24da0')
Out[17]: b''

In [19]: from swh.indexer import mimetype

In [20]: mimetype.compute_mimetype_encoding(r)
Out[20]: {'encoding': None, 'mimetype': b'no read permission'}

This is not reproduced on a stretch node (python3.5).

$ touch blah
$ stat blah
  File: blah
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: fe02h/65026d    Inode: 22324741    Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    tony)   Gid: ( 1000/    tony)
Access: 2017-10-10 13:40:55.982667021 +0200
Modify: 2017-10-10 13:40:55.982667021 +0200
Change: 2017-10-10 13:40:55.982667021 +0200
 Birth: -
$ python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> c = open('blah', 'rb').read()
>>> c
b''
>>> from swh.indexer import mimetype
>>> mimetype.compute_mimetype_encoding(c)
{'mimetype': b'application/x-empty', 'encoding': b'binary'}

whereas on oldstable (python3.4):

ardumont@worker01:~% touch blah
ardumont@worker01:~% stat blah
  File: ‘blah’
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 801h/2049d      Inode: 1842712     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1003/ardumont)   Gid: ( 1003/ardumont)
Access: 2017-10-10 11:38:56.524767468 +0000
Modify: 2017-10-10 11:38:49.988716250 +0000
Change: 2017-10-10 11:38:49.988716250 +0000
 Birth: -
ardumont@worker01:~% python3
Python 3.4.2 (default, Oct  8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> c = open('blah', 'rb').read()
>>> from swh.indexer import mimetype
>>> mimetype.compute_mimetype_encoding(c)
{'mimetype': b'no read permission', 'encoding': None}
>>>