
Indexer - Retrieval error when a content is too big
Open, Normal, Public

Description

When the content size is large, presumably above the 100*1024*1024 bytes limit imposed on our loaders, the objstorage retrieval fails, as the log below shows:

Oct 10 12:49:26 worker01.euwest.azure python3[15204]: [2017-10-10 12:49:26,600: INFO/Worker-1] sha1: b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'
Oct 10 12:51:06 worker01.euwest.azure python3[15204]: [2017-10-10 12:51:06,094: ERROR/Worker-1] Problem when reading contents metadata.
                                                      Traceback (most recent call last):
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 216, in run
                                                          raw_content = self.objstorage.get(sha1)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/multiplexer_objstorage.py", line 134, in get
                                                          return storage.get(obj_id)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/filter/filter.py", line 69, in get
                                                          return self.storage.get(obj_id, *args, **kwargs)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/multiplexer_objstorage.py", line 134, in get
                                                          return storage.get(obj_id)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/filter/id_filter.py", line 56, in get
                                                          return self.storage.get(*args, obj_id=obj_id, **kwargs)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/cloud/objstorage_azure.py", line 105, in get
                                                          return gzip.decompress(blob.content)
                                                        File "/usr/lib/python3.4/gzip.py", line 632, in decompress
                                                          return f.read()
                                                        File "/usr/lib/python3.4/gzip.py", line 360, in read
                                                          while self._read(readsize):
                                                        File "/usr/lib/python3.4/gzip.py", line 454, in _read
                                                          self._add_read_data( uncompress )
                                                        File "/usr/lib/python3.4/gzip.py", line 472, in _add_read_data
                                                          self.extrabuf = self.extrabuf[offset:] + data
                                                      MemoryError
Oct 10 12:51:06 worker01.euwest.azure python3[15204]: [2017-10-10 12:51:06,099: WARNING/Worker-1] Rescheduling batch
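
The MemoryError comes from objstorage_azure.py calling gzip.decompress(blob.content), which holds both the compressed blob and the entire decompressed content in memory at once. A minimal sketch of a streaming alternative, assuming the compressed bytes can be iterated in chunks (the chunk iterator is an assumption; the Azure blob API used here may only hand back the full payload):

import zlib

def gunzip_stream_to_file(blob_chunks, out_path):
    """Decompress a gzip stream chunk by chunk instead of all at once.

    blob_chunks: an iterable of compressed bytes (assumed interface).
    """
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    with open(out_path, 'wb') as out:
        for chunk in blob_chunks:
            out.write(decompressor.decompress(chunk))
        out.write(decompressor.flush())

This keeps memory usage bounded by the chunk size rather than by the content size.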

Here, the failing hash is b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'.

Converting it to a readable hex digest:

$ python3
>>> h = b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'
>>> from swh.model import hashutil
>>> hashutil.hash_to_hex(h)
'0d357ee6b90d860a7ab1a7530403b32bbc977f60'
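
For the record, the conversion only needs the standard library; a quick cross-check without swh.model:

>>> import binascii
>>> binascii.hexlify(h).decode()
'0d357ee6b90d860a7ab1a7530403b32bbc977f60'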

Checking its length in the storage, we see that it is indeed a huge file (about 1.6 GiB):

curl https://archive.softwareheritage.org/api/1/content/0d357ee6b90d860a7ab1a7530403b32bbc977f60/?fields=length
{"length":1707673600}

Event Timeline

ardumont created this task. Oct 10 2017, 3:04 PM