
Indexer - Retrieval error when contents is too big
Open, Normal, Public

Description

When a content's size is large, presumably more than the 100*1024*1024-byte limit imposed on our loaders, the objstorage retrieval fails:

Oct 10 12:49:26 worker01.euwest.azure python3[15204]: [2017-10-10 12:49:26,600: INFO/Worker-1] sha1: b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'
Oct 10 12:51:06 worker01.euwest.azure python3[15204]: [2017-10-10 12:51:06,094: ERROR/Worker-1] Problem when reading contents metadata.
                                                      Traceback (most recent call last):
                                                        File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 216, in run
                                                          raw_content = self.objstorage.get(sha1)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/multiplexer_objstorage.py", line 134, in get
                                                          return storage.get(obj_id)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/filter/filter.py", line 69, in get
                                                          return self.storage.get(obj_id, *args, **kwargs)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/multiplexer_objstorage.py", line 134, in get
                                                          return storage.get(obj_id)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/filter/id_filter.py", line 56, in get
                                                          return self.storage.get(*args, obj_id=obj_id, **kwargs)
                                                        File "/usr/lib/python3/dist-packages/swh/objstorage/cloud/objstorage_azure.py", line 105, in get
                                                          return gzip.decompress(blob.content)
                                                        File "/usr/lib/python3.4/gzip.py", line 632, in decompress
                                                          return f.read()
                                                        File "/usr/lib/python3.4/gzip.py", line 360, in read
                                                          while self._read(readsize):
                                                        File "/usr/lib/python3.4/gzip.py", line 454, in _read
                                                          self._add_read_data( uncompress )
                                                        File "/usr/lib/python3.4/gzip.py", line 472, in _add_read_data
                                                          self.extrabuf = self.extrabuf[offset:] + data
                                                      MemoryError
Oct 10 12:51:06 worker01.euwest.azure python3[15204]: [2017-10-10 12:51:06,099: WARNING/Worker-1] Rescheduling batch
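The MemoryError arises because gzip.decompress materializes the entire decompressed content (here around 1.7 GB) in memory at once. As a sketch of the alternative, the standard library's zlib module can decompress gzip-framed data incrementally with bounded memory; this is illustrative code, not the swh.objstorage implementation (the wbits value 16 + zlib.MAX_WBITS tells zlib to expect a gzip header, and the chunk size is an arbitrary choice):

```python
import zlib

def iter_decompress(chunks, chunk_hint=1024 * 1024):
    """Incrementally decompress gzip-framed data, yielding pieces of
    decompressed output instead of building one huge bytes object."""
    # 16 + MAX_WBITS makes zlib accept the gzip header and trailer
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = d.decompress(chunk, chunk_hint)
        if out:
            yield out
        # drain input held back by the max_length cap, chunk by chunk
        while d.unconsumed_tail:
            out = d.decompress(d.unconsumed_tail, chunk_hint)
            if out:
                yield out
    tail = d.flush()
    if tail:
        yield tail
```

A consumer can then process each yielded piece (hash it, write it to disk, feed it to an indexer) without ever holding the full 1.7 GB in memory.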

Here, the hash b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`' is the one failing.

Converting it to a readable hexadecimal form:

$ python3
>>> h = b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'
>>> from swh.model import hashutil
>>> hashutil.hash_to_hex(h)
'0d357ee6b90d860a7ab1a7530403b32bbc977f60'

Checking its length in the storage, we see that it is indeed quite a huge file:

curl https://archive.softwareheritage.org/api/1/content/0d357ee6b90d860a7ab1a7530403b32bbc977f60/?fields=length
{"length":1707673600}

Event Timeline

ardumont created this task. Oct 10 2017, 3:04 PM

In the objstorage's pathslicing implementation, there is a get_stream implementation which is currently unused [1].

That might help.

I suppose it all depends on the current storage's configuration.
And the fact that that method's interface may not be completely implemented everywhere (in all objstorage implementations, I mean).

[1] https://forge.softwareheritage.org/source/swh-objstorage/browse/master/swh/objstorage/objstorage_pathslicing.py$333-340
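To illustrate why get_stream would help: assuming an objstorage backend whose get_stream(obj_id) yields byte chunks (as the pathslicing backend linked above does), an indexer could process arbitrarily large contents with bounded memory. The consumer below is a toy sketch under that assumption, not actual swh.indexer code:

```python
import hashlib

def index_streamed(chunks):
    """Toy indexer: compute total size and sha1 of a content delivered as
    an iterable of byte chunks, without holding the whole content at once."""
    h = hashlib.sha1()
    size = 0
    for chunk in chunks:
        h.update(chunk)
        size += len(chunk)
    return size, h.hexdigest()
```

For backends like objstorage-azure that only expose get(), supporting this pattern would additionally require the cloud SDK's ranged or streaming download facilities on the server side.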

> I suppose it all depends on the current storage's configuration.

No, it is the objstorage's configuration that is relevant here.

> And the fact that that method's interface may not be completely implemented everywhere (in all objstorage implementations, I mean).

Indeed.

But the current objstorage-azure does not implement this.

Thanks to olasd for reminding me of this [1].

[1] https://forge.softwareheritage.org/T1447#26691