Page MenuHomeSoftware Heritage

cassandra - Timeouts during revision import
Closed, ResolvedPublic

Description

It seems the revision replayer is easily prone to fail due to a timeout:

Jun 18 00:47:00 parasilo-25 swh[10598]: Traceback (most recent call last):
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/bin/swh", line 11, in <module>
Jun 18 00:47:00 parasilo-25 swh[10598]:     load_entry_point('swh.core==0.14.3', 'console_scripts', 'swh')()
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
Jun 18 00:47:00 parasilo-25 swh[10598]:     return swh(auto_envvar_prefix="SWH")
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
Jun 18 00:47:00 parasilo-25 swh[10598]:     return self.main(*args, **kwargs)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
Jun 18 00:47:00 parasilo-25 swh[10598]:     rv = self.invoke(ctx)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
Jun 18 00:47:00 parasilo-25 swh[10598]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
Jun 18 00:47:00 parasilo-25 swh[10598]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
Jun 18 00:47:00 parasilo-25 swh[10598]:     return ctx.invoke(self.callback, **ctx.params)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
Jun 18 00:47:00 parasilo-25 swh[10598]:     return callback(*args, **kwargs)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
Jun 18 00:47:00 parasilo-25 swh[10598]:     return f(get_current_context(), *args, **kwargs)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/cli.py", line 194, in replay
Jun 18 00:47:00 parasilo-25 swh[10598]:     client.process(worker_fn)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/journal/client.py", line 265, in process
Jun 18 00:47:00 parasilo-25 swh[10598]:     batch_processed, at_eof = self.handle_messages(messages, worker_fn)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/journal/client.py", line 292, in handle_messages
Jun 18 00:47:00 parasilo-25 swh[10598]:     worker_fn(dict(objects))
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/replay.py", line 62, in process_replay_objects
Jun 18 00:47:00 parasilo-25 swh[10598]:     _insert_objects(object_type, objects, storage)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/replay.py", line 163, in _insert_objects
Jun 18 00:47:00 parasilo-25 swh[10598]:     for o in objects
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 563, in revision_add
Jun 18 00:47:00 parasilo-25 swh[10598]:     missing = self.revision_missing([rev.id for rev in to_add])
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/storage.py", line 587, in revision_missing
Jun 18 00:47:00 parasilo-25 swh[10598]:     return self._cql_runner.revision_missing(revisions)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 133, in newf
Jun 18 00:47:00 parasilo-25 swh[10598]:     self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 526, in revision_missing
Jun 18 00:47:00 parasilo-25 swh[10598]:     return self._missing(statement, ids)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 289, in _missing
Jun 18 00:47:00 parasilo-25 swh[10598]:     rows = self._execute_with_retries(statement, [ids])
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 329, in wrapped_f
Jun 18 00:47:00 parasilo-25 swh[10598]:     return self.call(f, *args, **kw)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 409, in call
Jun 18 00:47:00 parasilo-25 swh[10598]:     do = self.iter(retry_state=retry_state)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 356, in iter
Jun 18 00:47:00 parasilo-25 swh[10598]:     return fut.result()
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 425, in result
Jun 18 00:47:00 parasilo-25 swh[10598]:     return self.__get_result()
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
Jun 18 00:47:00 parasilo-25 swh[10598]:     raise self._exception
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 412, in call
Jun 18 00:47:00 parasilo-25 swh[10598]:     result = fn(*args, **kwargs)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "/usr/lib/python3/dist-packages/swh/storage/cassandra/cql.py", line 256, in _execute_with_retries
Jun 18 00:47:00 parasilo-25 swh[10598]:     return self._session.execute(statement, args, timeout=1000.0)
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "cassandra/cluster.py", line 2345, in cassandra.cluster.Session.execute
Jun 18 00:47:00 parasilo-25 swh[10598]:   File "cassandra/cluster.py", line 4304, in cassandra.cluster.ResponseFuture.result
Jun 18 00:47:00 parasilo-25 swh[10598]: cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0}

Event Timeline

vsellier triaged this task as Normal priority.Jun 18 2021, 4:57 PM
vsellier created this task.
vsellier updated the task description. (Show Details)

You can try changing the constants in _execute_with_retries.

But if a single select takes over 1s three times in a row, it means either the Cassandra cluster is too slow, or the replayer batches too many revisions together

vsellier moved this task from Backlog to done on the System administration board.

The problem was solved with an increase of the default timeout in the cassandra configuration and by reducing the journal_client batch size.