
The indexer journal client is unstable
Closed, Migrated; Edits Locked

Description

I often get this error in a Docker environment:

swh-indexer-journal-client_1  | 2019-02-04 20:12:29,441 26 WARNING Heartbeat session expired, marking coordinator dead
swh-indexer-journal-client_1  | 2019-02-04 20:12:29,441 26 WARNING Marking the coordinator dead (node 1001) for group swh.journal.client: Heartbeat session expired.
swh-indexer-journal-client_1  | 2019-02-04 20:12:29,449 26 INFO Group coordinator for swh.journal.client is BrokerMetadata(nodeId=1001, host='kafka', port=9092, rack=None)
swh-indexer-journal-client_1  | 2019-02-04 20:12:29,449 26 INFO Discovered coordinator 1001 for group swh.journal.client
swh-indexer-journal-client_1  | 2019-02-04 20:12:29,453 26 INFO Group coordinator for swh.journal.client is BrokerMetadata(nodeId=1001, host='kafka', port=9092, rack=None)
swh-indexer-journal-client_1  | 2019-02-04 20:12:29,453 26 INFO Discovered coordinator 1001 for group swh.journal.client
swh-indexer-journal-client_1  | 2019-02-04 20:12:29,461 26 INFO Scheduling indexer_origin_metadata for visit of origin 766
swh-indexer-journal-client_1  | Traceback (most recent call last):
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
swh-indexer-journal-client_1  |     "__main__", mod_spec)
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
swh-indexer-journal-client_1  |     exec(code, run_globals)
swh-indexer-journal-client_1  |   File "/src/swh-indexer/swh/indexer/journal_client.py", line 88, in <module>
swh-indexer-journal-client_1  |     main()
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
swh-indexer-journal-client_1  |     return self.main(*args, **kwargs)
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 717, in main
swh-indexer-journal-client_1  |     rv = self.invoke(ctx)
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
swh-indexer-journal-client_1  |     return ctx.invoke(self.callback, **ctx.params)
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
swh-indexer-journal-client_1  |     return callback(*args, **kwargs)
swh-indexer-journal-client_1  |   File "/src/swh-indexer/swh/indexer/journal_client.py", line 86, in main
swh-indexer-journal-client_1  |     IndexerJournalClient().process()
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/site-packages/swh/journal/client.py", line 121, in process
swh-indexer-journal-client_1  |     self.consumer.commit()
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/site-packages/kafka/consumer/group.py", line 515, in commit
swh-indexer-journal-client_1  |     self._coordinator.commit_offsets_sync(offsets)
swh-indexer-journal-client_1  |   File "/usr/local/lib/python3.6/site-packages/kafka/coordinator/consumer.py", line 513, in commit_offsets_sync
swh-indexer-journal-client_1  |     raise future.exception # pylint: disable-msg=raising-bad-type
swh-indexer-journal-client_1  | kafka.errors.CommitFailedError: CommitFailedError: Commit cannot be completed since the group has already
swh-indexer-journal-client_1  |             rebalanced and assigned the partitions to another member.
swh-indexer-journal-client_1  |             This means that the time between subsequent calls to poll()
swh-indexer-journal-client_1  |             was longer than the configured max_poll_interval_ms, which
swh-indexer-journal-client_1  |             typically implies that the poll loop is spending too much
swh-indexer-journal-client_1  |             time message processing. You can address this either by
swh-indexer-journal-client_1  |             increasing the rebalance timeout with max_poll_interval_ms,
swh-indexer-journal-client_1  |             or by reducing the maximum size of batches returned in poll()
swh-indexer-journal-client_1  |             with max_poll_records.
swh-indexer-journal-client_1  |             
swh-docker-dev_swh-indexer-journal-client_1 exited with code 1
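
The error message above names two consumer settings, max_poll_interval_ms and max_poll_records. As a rough sketch of how they relate (not the fix that was applied to swh-journal), the helper below picks a batch size so that one batch should be processed within a fraction of the poll interval. The names poll_settings and make_consumer, the safety_factor parameter, and the per-message timing are hypothetical; the broker address and group id are taken from the log. max_poll_interval_ms is only accepted by kafka-python releases that implement it, so an older release such as 1.3.3 may reject that keyword.

```python
def poll_settings(seconds_per_message, max_poll_interval_ms=300000,
                  safety_factor=0.5):
    """Return KafkaConsumer keyword arguments sized so that one batch
    should be processed well within the poll interval.

    seconds_per_message is an estimate supplied by the caller (assumption).
    """
    if seconds_per_message <= 0:
        raise ValueError("seconds_per_message must be positive")
    # Time budget for one batch, leaving headroom below the rebalance timeout.
    budget_seconds = max_poll_interval_ms / 1000 * safety_factor
    # Always fetch at least one record, even if a single one is slow.
    max_records = max(1, int(budget_seconds / seconds_per_message))
    return {
        "max_poll_interval_ms": max_poll_interval_ms,
        "max_poll_records": max_records,
    }


def make_consumer(topic, seconds_per_message):
    """Build a consumer with the settings above (hypothetical helper).

    kafka-python is imported here so that poll_settings stays usable
    without the library installed.
    """
    from kafka import KafkaConsumer

    return KafkaConsumer(
        topic,
        bootstrap_servers="kafka:9092",   # broker address seen in the log
        group_id="swh.journal.client",    # group id seen in the log
        **poll_settings(seconds_per_message)
    )
```

For example, if scheduling one origin takes about 2 seconds, poll_settings(2.0) keeps the default 300000 ms interval and limits each poll() to 75 records. Raising max_poll_interval_ms instead trades slower detection of a stuck consumer for fewer rebalances.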

Event Timeline

douardda triaged this task as High priority. Feb 5 2019, 9:43 AM
douardda created this task.

What's the python3-kafka version?

As a heads up, so far, production is fine. It's running:

ii  python3-kafka                             1.3.3-1~swh+1~bpo9+1      all                       Pure Python client for Apache Kafka - Python 3.x

Does it still happen? The journal client has changed a lot since this task was opened, including switching its backend library.

vlorentz claimed this task.