
Unbreak journal clients
Closed, Migrated (Edits Locked)

Description

Since the migration to confluent-kafka-python, there has been a recurring issue of journal clients getting stuck for no apparent reason. Debug logs don't seem to show anything relevant.

My observations so far:

  • a consumer group running confluent-kafka-python may block other consumer groups that don't use it (e.g. groups using python-kafka)
  • when the number of consumers is low (<10), using fewer topics or more consumers seems to unstick them (unsure whether that also holds with a higher number of consumers)
  • most days, switching the same consumer between confluent-kafka-python and python-kafka very clearly results in "confluent-kafka-python = broken, python-kafka = works", but some days both work, and on some rare days neither does

Consumer logs are full of errors like this, regardless of whether they are stuck or not:

Oct 10 12:05:03 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out FetchRequest in flight (after 61043ms, timeout #0)
Oct 10 12:05:03 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:05 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out FetchRequest in flight (after 61053ms, timeout #0)
Oct 10 12:05:05 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:14 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out FetchRequest in flight (after 61039ms, timeout #0)
Oct 10 12:05:14 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out FetchRequest in flight (after 61044ms, timeout #0)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #1)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #2)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #3)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #4)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out 8 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out FetchRequest in flight (after 61049ms, timeout #0)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out MetadataRequest in flight (after 60956ms, timeout #1)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out MetadataRequest in flight (after 60956ms, timeout #2)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out 3 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:24 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out FetchRequest in flight (after 61045ms, timeout #0)
Oct 10 12:05:24 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests

Though it seems like there are fewer errors about esnode1 when the consumers are not stuck.
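
For reference, those REQTMOUT lines are librdkafka's own logs surfacing through Python logging; confluent-kafka-python lets you hand it a standard logger. A minimal sketch of that wiring, with placeholder broker/group values rather than our actual configuration:

    import logging
    from confluent_kafka import Consumer

    logger = logging.getLogger("swh.journal.client")

    conf = {
        "bootstrap.servers": "esnode1.internal.softwareheritage.org:9092",  # placeholder
        "group.id": "example-replayer",                                     # placeholder
        # client-side errors (e.g. _TRANSPORT) are reported here instead of raising
        "error_cb": lambda err: logger.info("Received non-fatal kafka error: %s", err),
    }

    # librdkafka's internal log lines (REQTMOUT, MAXPOLL, ...) are forwarded to this logger
    consumer = Consumer(conf, logger=logger)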

Event Timeline

vlorentz created this task.
vlorentz edited projects, added Journal; removed Storage manager.

Nope, still not completely fixed

So I'm experiencing some of this on uffizi, where I'm running the s3 content copier as a group of 64 coordinated journal clients.

My latest experiment is running 63 of these with default verbosity, and one at debug verbosity.

Of course the instance running at DEBUG verbosity has not hung yet. Sometimes it spends around a minute waiting for a rebalance, but so far it has always succeeded in doing so and resumed processing objects.
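
For reference, the per-instance "debug verbosity" boils down to librdkafka's debug contexts, which can be enabled on a single consumer through its configuration. A sketch, assuming a plain confluent-kafka-python Consumer and an example context list (not necessarily the one actually deployed):

    from confluent_kafka import Consumer

    debug_conf = {
        "bootstrap.servers": "esnode1.internal.softwareheritage.org:9092",  # placeholder
        "group.id": "content-replayer-s3",                                  # placeholder (guessed name)
        # comma-separated librdkafka debug contexts; very chatty
        "debug": "consumer,cgrp,topic,fetch",
    }
    debug_consumer = Consumer(debug_conf)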

When issuing a SIGTERM on the hung instances, the following log messages are emitted:

Nov 15 12:58:46 uffizi content-replayer-s3-50[2093298]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 166ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Nov 15 12:58:46 uffizi content-replayer-s3-50[2093298]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="esnode1.internal.softwareheritage.org:9092/11: Disconnected (after 600003ms in state UP)"}

I suppose these messages had been queued since the process started hanging, and were never actually processed by rdkafka until the termination handler forcefully ran consumer.close().
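
The MAXPOLL warning itself just means the application did not call poll()/consume() within max.poll.interval.ms (300000 ms here), so rdkafka leaves the consumer group on its own. If per-batch processing can legitimately take that long, one possible mitigation (a sketch, not necessarily what we want to deploy) is to raise that interval in the consumer configuration:

    # passed to confluent_kafka.Consumer(...)
    conf = {
        "bootstrap.servers": "...",         # placeholder
        "group.id": "content-replayer-s3",  # placeholder (guessed name)
        # librdkafka default is 300000 (5 min); give long-running batches more headroom
        "max.poll.interval.ms": 900000,
    }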

I'm now looking into options to remotely attach pdb on a running process, so I can get a trace from one of the stuck processes...

I've bodged https://github.com/ionelmc/python-remote-pdb into the process, which should help me understand what's up.
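
For reference, the bodge is roughly of this shape (the host, port and signal-based trigger are my own illustration, not the actual patch):

    import signal
    from remote_pdb import RemotePdb

    def _break_into_pdb(signum, frame):
        # Blocks until a client connects, e.g.: telnet 127.0.0.1 4444
        RemotePdb("127.0.0.1", 4444).set_trace()

    # Hypothetical wiring: ask a (possibly hung) process for a trace via SIGUSR1.
    signal.signal(signal.SIGUSR1, _break_into_pdb)

One caveat: CPython only runs signal handlers in the main thread, between bytecodes, so if the main thread is blocked inside a C-level call the breakpoint may not trigger until that call returns.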

My suspicion is that we indeed spend too much time doing stuff between calls to poll() / consume(), and that this puts us in a state where the rdkafka background thread unsubscribes us within a consume() call, never allowing us to resume.
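
One way to check that hypothesis would be to time the gap between successive consume() calls in the client loop; a rough sketch (the loop is simplified and the names are mine, not swh.journal's actual code):

    import logging
    import time
    from confluent_kafka import Consumer

    logger = logging.getLogger(__name__)
    MAX_POLL_INTERVAL_MS = 300000  # must match the consumer's max.poll.interval.ms

    def run(consumer: Consumer, process_batch):
        last_consume = time.monotonic()
        while True:
            # gap == time spent processing the previous batch, outside of rdkafka
            gap_ms = (time.monotonic() - last_consume) * 1000
            if gap_ms > 0.8 * MAX_POLL_INTERVAL_MS:
                logger.warning("%.0fms between consume() calls; "
                               "rdkafka may have left the group", gap_ms)
            messages = consumer.consume(num_messages=100, timeout=1.0)
            last_consume = time.monotonic()
            if messages:
                process_batch(messages)  # the suspected long-running work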

We'll see when I get a pdb attached to a hung process.

Looks like attaching to a stuck process with remote_pdb doesn't work the first time, but unsticks it when launched a second time. Oops.

I guess the next debugging step is GDB.

I've installed gdb, python3.7-dbg, the debug symbols for librdkafka as well as for libssl1.1 on uffizi.

After looking at various GDB traces for a while, looks like the main thread of the content copier processes just ends up getting stuck waiting for S3 to answer a PUT request.

When that happens, the background librdkafka thread notices that the main thread hasn't called poll for a while, and gracefully exits the consumer group.

Attaching the process with gdb and closing the socket forcefully lets everything resume without an issue: on the next call to poll()/consume(), librdkafka emits its queued log messages, re-joins the consumer group, and processing resumes.

Looks like the particular hanging read() operation isn't affected by the timeout value set in the libcloud instantiation, which smells like a bug.
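
If that's confirmed, a blunt workaround (a sketch only, not the actual fix; it affects every socket created afterwards without an explicit timeout) would be to set a process-wide default before the libcloud driver is instantiated:

    import socket

    # Any socket created later without an explicit timeout will raise socket.timeout
    # instead of blocking forever on read(); the value is an example.
    socket.setdefaulttimeout(300)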

Now I'm curious to check one of @vlorentz's hanging processes to see whether the underlying issue is similar or not.

What is the status of this issue? Do we still face this bug?

Clients I run myself are no longer affected; I'm guessing that's thanks to one of the diffs linked from this task. But AFAIK, olasd still sees some consumers with this issue.

I don't think I've actually seen this specific symptom in prod again, and if so only on workers that were hung up on something else already. We can reopen it if we notice it again.