
race condition during concurrent loading of the same objects from multiple origins
Closed, Migrated (edits locked)

Description

Looking through the Kibana logs, we found the following error happening quite often (in the storage):

[2019-08-20 00:39:25,728: ERROR/ForkPoolWorker-88373] Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 896, in load
    self.store_data()
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 1003, in store_data
    self.send_batch_contents(self.get_contents())
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 649, in send_batch_contents
    packet_size_bytes=packet_size_bytes)
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 41, in send_in_packets
    sender(formatted_objects)
  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 400, in send_contents
    result = self.storage.content_add(content_list)
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 24, in content_add
    return self.post('content/add', {'content': content})
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 198, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 230, in _decode_response
    raise pickle.loads(decode_response(response))
swh.storage.HashCollision: sha1

Occurrences:

root@uffizi:~# zgrep -c "swh.storage.HashCollision" /var/log/syslog.*
/var/log/syslog.1:168
/var/log/syslog.2.gz:152
/var/log/syslog.3.gz:136
/var/log/syslog.4.gz:127
/var/log/syslog.5.gz:168
/var/log/syslog.6.gz:112
/var/log/syslog.7.gz:137
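
For illustration, this is what two loaders racing on the same blob boil down to at the storage API level: both compute the same hashes and push the same content at roughly the same time. The sketch below is hypothetical: make_storage stands in for however a worker obtains its storage client, and the dict-based content layout (the usual hash names plus data, length and status) is assumed to match what the loaders were sending at the time.

import threading

from swh.model import hashutil


def make_content(data: bytes) -> dict:
    # Hash the blob with the usual swh algorithms (sha1, sha1_git,
    # sha256, blake2s256) and fill in the remaining content fields.
    content = hashutil.MultiHash.from_data(data).digest()
    content.update({'data': data, 'length': len(data), 'status': 'visible'})
    return content


def race(make_storage, data: bytes = b'hello world') -> None:
    # make_storage is a placeholder factory returning a configured
    # storage client, one per "worker" (real loaders are separate
    # processes; threads only approximate the concurrency).  Both
    # workers add the exact same content; with the pre-fix storage,
    # the loser of the race can get HashCollision instead of a no-op.
    workers = [
        threading.Thread(target=make_storage().content_add,
                         args=([make_content(data)],))
        for _ in range(2)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()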

Event Timeline

ardumont created this task.
ardumont created this object in space Restricted Space.
ardumont created this object with visibility "Developers (Project)".
olasd shifted this object from the Restricted Space space to the S1 Public space. Sep 30 2019, 1:32 PM
olasd changed the visibility from "Developers (Project)" to "Public (No Login Required)".

This is a race condition that happens when two different workers are loading the exact same content in parallel transactions.

I've added a diff with a minimal reproducer.
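
The attached diff isn't reproduced in this export; as a stand-in, the same race can be shown directly against PostgreSQL, where the unique-constraint violation that the storage re-raises as HashCollision originates. This is a minimal sketch under stated assumptions: a throwaway test database, psycopg2 >= 2.8 (for psycopg2.errors), and a toy content_sketch table that merely stands in for the real swh-storage schema.

import threading

import psycopg2

DSN = 'dbname=swh_race_sketch'  # throwaway test database, not production


def setup() -> None:
    with psycopg2.connect(DSN) as db:
        with db.cursor() as cur:
            cur.execute(
                'CREATE TABLE IF NOT EXISTS content_sketch ('
                ' sha1 bytea PRIMARY KEY, length bigint)')


def insert(sha1: bytes) -> None:
    # Each "worker" runs its own transaction, like a loader would.
    try:
        with psycopg2.connect(DSN) as db:
            with db.cursor() as cur:
                cur.execute(
                    'INSERT INTO content_sketch (sha1, length)'
                    ' VALUES (%s, %s)', (sha1, 42))
                # The slower transaction blocks on the unique index until
                # the faster one commits, then errors out.
    except psycopg2.errors.UniqueViolation:
        print('duplicate insert rejected -- the error the storage'
              ' re-raises as HashCollision')


if __name__ == '__main__':
    setup()
    sha1 = bytes(20)
    workers = [threading.Thread(target=insert, args=(sha1,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Whichever transaction commits second gets the unique violation; that is exactly the window two workers fall into when they ingest the same content from different origins.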

zack renamed this task from "Investigate hash collision error" to "race condition during concurrent loading of the same objects from multiple origins". Oct 1 2019, 10:58 AM

Tagged and deployed (loaders are mostly restarted or in the process of restarting).

ardumont claimed this task.

This can be closed now thanks to D2977.
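
D2977 itself isn't quoted here; conceptually, closing the race means treating a concurrent duplicate insert of identical content as a no-op rather than a fatal HashCollision, and only flagging a collision when the stored row genuinely differs. A hedged sketch of that idea, reusing the toy content_sketch table from the sketch above (illustrative only, not the actual contents of D2977):

def content_add_tolerant(cur, sha1: bytes, length: int) -> None:
    # Let PostgreSQL skip the duplicate row instead of raising.
    cur.execute(
        'INSERT INTO content_sketch (sha1, length) VALUES (%s, %s)'
        ' ON CONFLICT (sha1) DO NOTHING', (sha1, length))
    # Then check what is actually stored: a concurrent insert of the
    # *same* content is harmless, while different metadata behind the
    # same sha1 would be a genuine collision worth reporting.
    cur.execute('SELECT length FROM content_sketch WHERE sha1 = %s', (sha1,))
    row = cur.fetchone()
    if row and row[0] != length:
        raise RuntimeError('genuine hash collision: same sha1, different content')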