
Unbreak journal clients
Closed, Migrated (Edits Locked)

Description

Since the migration to confluent-kafka-python, there has been a recurring issue of journal clients getting stuck for no apparent reason. Debug logs don't seem to show anything relevant.

My observations so far:

  • a consumer group running confluent-kafka-python may block other consumer groups that don't use it (e.g. groups using python-kafka)
  • when the number of consumers is low (<10), using fewer topics or more consumers seems to unstick them (unsure whether that also holds with a higher number of consumers)
  • most days, switching the same consumer between confluent-kafka-python and python-kafka very clearly results in "confluent-kafka-python = broken, python-kafka = works", but some days both work, and on some rare days neither does

Consumer logs are full of errors like this, regardless of whether they are stuck or not:

Oct 10 12:05:03 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out FetchRequest in flight (after 61043ms, timeout #0)
Oct 10 12:05:03 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:05 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out FetchRequest in flight (after 61053ms, timeout #0)
Oct 10 12:05:05 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode2.internal.softwareheritage.org:9092/12]: esnode2.internal.softwareheritage.org:9092/12: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:14 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out FetchRequest in flight (after 61039ms, timeout #0)
Oct 10 12:05:14 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out FetchRequest in flight (after 61044ms, timeout #0)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #1)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #2)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #3)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out MetadataRequest in flight (after 60902ms, timeout #4)
Oct 10 12:05:15 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode1.internal.softwareheritage.org:9092/bootstrap]: esnode1.internal.softwareheritage.org:9092/11: Timed out 8 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out FetchRequest in flight (after 61049ms, timeout #0)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out MetadataRequest in flight (after 60956ms, timeout #1)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out MetadataRequest in flight (after 60956ms, timeout #2)
Oct 10 12:05:20 desktop5 replayer-12301[24534]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out 3 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Oct 10 12:05:24 desktop5 replayer-12302[24568]: INFO:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out FetchRequest in flight (after 61045ms, timeout #0)
Oct 10 12:05:24 desktop5 replayer-12302[24568]: WARNING:swh.journal.client:REQTMOUT [rdkafka#consumer-1] [thrd:esnode3.internal.softwareheritage.org:9092/13]: esnode3.internal.softwareheritage.org:9092/13: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests

Though it seems like there are fewer errors about esnode1 when the consumers are not stuck.
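
For reference, those REQTMOUT lines are librdkafka's own logs surfacing through Python logging; confluent-kafka-python lets you hand it a standard logger. A minimal sketch of that wiring, with placeholder broker/group values rather than our actual configuration:

    import logging
    from confluent_kafka import Consumer

    logger = logging.getLogger("swh.journal.client")

    conf = {
        "bootstrap.servers": "esnode1.internal.softwareheritage.org:9092",  # placeholder
        "group.id": "example-replayer",                                     # placeholder
        # client-side errors (e.g. _TRANSPORT) are reported here instead of raising
        "error_cb": lambda err: logger.info("Received non-fatal kafka error: %s", err),
    }

    # librdkafka's internal log lines (REQTMOUT, MAXPOLL, ...) are forwarded to this logger
    consumer = Consumer(conf, logger=logger)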

Event Timeline

vlorentz created this task.
vlorentz edited projects, added Journal; removed Storage manager.

Nope, still not completely fixed

So I'm experiencing some of this on uffizi, where I'm running the s3 content copier as a group of 64 coordinated journal clients.

My latest experiment is running 63 of these with default verbosity, and one at debug verbosity.

Of course the instance running at DEBUG verbosity has not hung yet. Sometimes it spends around a minute waiting for a rebalance, but so far it has always succeeded in doing so and resumed processing objects.
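
For reference, the per-instance "debug verbosity" boils down to librdkafka's debug contexts, which can be enabled on a single consumer through its configuration. A sketch, assuming a plain confluent-kafka-python Consumer and an example context list (not necessarily the one actually deployed):

    from confluent_kafka import Consumer

    debug_conf = {
        "bootstrap.servers": "esnode1.internal.softwareheritage.org:9092",  # placeholder
        "group.id": "content-replayer-s3",                                  # placeholder (guessed name)
        # comma-separated librdkafka debug contexts; very chatty
        "debug": "consumer,cgrp,topic,fetch",
    }
    debug_consumer = Consumer(debug_conf)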

When issuing a SIGTERM on the hung instances, the following log messages are emitted:

Nov 15 12:58:46 uffizi content-replayer-s3-50[2093298]: WARNING:swh.journal.client.rdkafka:MAXPOLL [rdkafka#consumer-1] [thrd:main]: Application maximum poll interval (300000ms) exceeded by 166ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Nov 15 12:58:46 uffizi content-replayer-s3-50[2093298]: INFO:swh.journal.client:Received non-fatal kafka error: KafkaError{code=_TRANSPORT,val=-195,str="esnode1.internal.softwareheritage.org:9092/11: Disconnected (after 600003ms in state UP)"}

I suppose these messages had been queued since the process started hanging, and were never actually processed by rdkafka until the termination handler forcefully ran consumer.close().
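
The MAXPOLL warning itself just means the application did not call poll()/consume() within max.poll.interval.ms (300000 ms here), so rdkafka leaves the consumer group on its own. If per-batch processing can legitimately take that long, one possible mitigation (a sketch, not necessarily what we want to deploy) is to raise that interval in the consumer configuration:

    # passed to confluent_kafka.Consumer(...)
    conf = {
        "bootstrap.servers": "...",         # placeholder
        "group.id": "content-replayer-s3",  # placeholder (guessed name)
        # librdkafka default is 300000 (5 min); give long-running batches more headroom
        "max.poll.interval.ms": 900000,
    }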

I'm now looking into options to remotely attach pdb on a running process, so I can get a trace from one of the stuck processes...

I've bodged https://github.com/ionelmc/python-remote-pdb into the process, which should help me understand what's up.
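
For reference, the bodge is roughly of this shape (the host, port and signal-based trigger are my own illustration, not the actual patch):

    import signal
    from remote_pdb import RemotePdb

    def _break_into_pdb(signum, frame):
        # Blocks until a client connects, e.g.: telnet 127.0.0.1 4444
        RemotePdb("127.0.0.1", 4444).set_trace()

    # Hypothetical wiring: ask a (possibly hung) process for a trace via SIGUSR1.
    signal.signal(signal.SIGUSR1, _break_into_pdb)

One caveat: CPython only runs signal handlers in the main thread, between bytecodes, so if the main thread is blocked inside a C-level call the breakpoint may not trigger until that call returns.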

My suspicion is that we indeed spend too much time doing stuff between calls to poll() / consume(), and that this puts us in a state where the rdkafka background thread unsubscribes us within a consume() call, never allowing us to resume.
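
One way to check that hypothesis would be to time the gap between successive consume() calls in the client loop; a rough sketch (the loop is simplified and the names are mine, not swh.journal's actual code):

    import logging
    import time
    from confluent_kafka import Consumer

    logger = logging.getLogger(__name__)
    MAX_POLL_INTERVAL_MS = 300000  # must match the consumer's max.poll.interval.ms

    def run(consumer: Consumer, process_batch):
        last_consume = time.monotonic()
        while True:
            # gap == time spent processing the previous batch, outside of rdkafka
            gap_ms = (time.monotonic() - last_consume) * 1000
            if gap_ms > 0.8 * MAX_POLL_INTERVAL_MS:
                logger.warning("%.0fms between consume() calls; "
                               "rdkafka may have left the group", gap_ms)
            messages = consumer.consume(num_messages=100, timeout=1.0)
            last_consume = time.monotonic()
            if messages:
                process_batch(messages)  # the suspected long-running work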

We'll see when I get a pdb attached to a hung process.

Looks like attaching to a stuck process with remote_pdb doesn't work the first time, but unsticks it when launched a second time. Oops.

I guess the next debugging step is GDB.

I've installed gdb, python3.7-dbg, the debug symbols for librdkafka as well as for libssl1.1 on uffizi.

After looking at various GDB traces for a while, looks like the main thread of the content copier processes just ends up getting stuck waiting for S3 to answer a PUT request.

When that happens, the background librdkafka thread notices that the main thread hasn't called poll for a while, and gracefully exits the consumer group.

Attaching the process with gdb and closing the socket forcefully lets everything resume without an issue: on the next call to poll()/consume(), librdkafka emits its queued log messages, re-joins the consumer group, and processing resumes.

Looks like the particular hanging read() operation isn't affected by the timeout value set in the libcloud instantiation, which smells like a bug.
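
If that's confirmed, a blunt workaround (a sketch only, not the actual fix; it affects every socket created afterwards without an explicit timeout) would be to set a process-wide default before the libcloud driver is instantiated:

    import socket

    # Any socket created later without an explicit timeout will raise socket.timeout
    # instead of blocking forever on read(); the value is an example.
    socket.setdefaulttimeout(300)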

Now I'm curious to check one of @vlorentz's hanging processes to see whether the underlying issue is similar or not.

What is the status of this issue? Do we still face this bug?

Clients I run myself are no longer affected; I'm guessing that's thanks to one of the diffs linked from this task. But AFAIK, olasd still sees some consumers with this issue.

I don't think I've actually seen this specific symptom in prod again, and if so only on workers that were hung up on something else already. We can reopen it if we notice it again.