
Indexer - Retrieval error when contents is too big
Open, Normal, Public


When the content size is large, presumably above the 100*1024*1024-byte limit (the limit imposed on our loaders), the objstorage retrieval fails:

Oct 10 12:49:26 python3[15204]: [2017-10-10 12:49:26,600: INFO/Worker-1] sha1: b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'
Oct 10 12:51:06 python3[15204]: [2017-10-10 12:51:06,094: ERROR/Worker-1] Problem when reading contents metadata.
                                                      Traceback (most recent call last):
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/", line 216, in run
                                                          raw_content = self.objstorage.get(sha1)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/", line 134, in get
                                                          return storage.get(obj_id)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/filter/", line 69, in get
                                                          return, *args, **kwargs)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/", line 134, in get
                                                          return storage.get(obj_id)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/filter/", line 56, in get
                                                          return*args, obj_id=obj_id, **kwargs)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/cloud/", line 105, in get
                                                          return gzip.decompress(blob.content)
                                                        File "/usr/lib/python3.4/", line 632, in decompress
                                                        File "/usr/lib/python3.4/", line 360, in read
                                                          while self._read(readsize):
                                                        File "/usr/lib/python3.4/", line 454, in _read
                                                          self._add_read_data( uncompress )
                                                        File "/usr/lib/python3.4/", line 472, in _add_read_data
                                                          self.extrabuf = self.extrabuf[offset:] + data
Oct 10 12:51:06 python3[15204]: [2017-10-10 12:51:06,099: WARNING/Worker-1] Rescheduling batch
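
One way to avoid this failure would be to apply the same size limit on the indexer side, before retrieval. The sketch below is hypothetical: the names `safe_get`, `ContentTooLarge`, and the `length` parameter are illustrative, not part of the actual swh API.

```python
# Hypothetical guard (illustrative names, not the swh API): refuse to
# fetch a content whose known length exceeds the loader limit, instead
# of letting gzip.decompress blow up in memory.
MAX_CONTENT_SIZE = 100 * 1024 * 1024  # 100 MiB, the loader limit

class ContentTooLarge(Exception):
    """Raised instead of attempting an oversized retrieval."""

def safe_get(objstorage, sha1, length=None):
    """Fetch a content only if its known length fits under the limit."""
    if length is not None and length > MAX_CONTENT_SIZE:
        raise ContentTooLarge('%r: %d bytes' % (sha1, length))
    return objstorage.get(sha1)
```

With such a guard, oversized contents could be skipped (or flagged) instead of being rescheduled forever.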

Here, the failing hash is b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'.

Converting it to a readable hex form:

$ python3
>>> h = b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'
>>> from swh.model import hashutil
>>> hashutil.hash_to_hex(h)

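The output of the session above is not captured. A dependency-free equivalent using only the standard library (`binascii.hexlify`, available on the Python 3.4 seen in the traceback) should produce the same readable form that `hash_to_hex` returns for raw bytes:

```python
import binascii

# The failing hash, copied from the log above.
h = b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'

# hexlify yields the readable 40-character sha1.
print(binascii.hexlify(h).decode())  # 0d357ee6b90d860a7ab1a7530403b32bbc977f60
```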
Checking its length in the storage confirms that it is indeed a very large file:


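The storage query result is likewise not reproduced above. For reference, the loader limit quoted in the description works out as follows; any content whose stored length exceeds this value would hit the failing code path:

```python
# The loader limit from the task description, expressed in bytes.
limit = 100 * 1024 * 1024
print(limit)           # 104857600 bytes
print(limit // 2**20)  # 100 (MiB)
```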
Event Timeline

ardumont created this task. Oct 10 2017, 3:04 PM