Page MenuHomeSoftware Heritage

test_replay: Fix hang since confluent-kafka 1.6 release
ClosedPublic

Authored by anlambert on Feb 9 2021, 10:51 AM.

Details

Summary

Side effect of the following commit (identified using git bisect) in librdkafka 1.6:
https://github.com/edenhill/librdkafka/commit/f418e0f721518d71ff533759698b647cb2e89b80

It seems consumers must have different group ids now when using a mocked kafka cluster
in tests.

Depends on D5048

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build was aborted

Patch application report for D5045 (id=17987)

Rebasing onto b0383833fe...

Current branch diff-target is up to date.
Changes applied before test
commit 01c1b67a6179a202f6613f0491254a0285a11c1a
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Feb 9 10:46:37 2021 +0100

    test_replay: Fix hang since confluent-kafka 1.6 release
    
    Side effect of the following commit in librdkafka 1.6:
    https://github.com/edenhill/librdkafka/commit/f418e0f721518d71ff533759698b647cb2e89b80
    
    It seems consumers must have different group ids now when using a
    mocked kafka cluster in tests.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1146/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1146/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 9 2021, 11:03 AM
Harbormaster failed remote builds in B19108: Diff 17987!

Looks like there is still a hang when executing tests in swh/storage/tests/test_kafka_writer.py but it does not appear when only that test file is executed ...

Looking at that gdb stacktrace It seems there is a race condition when the consumer fixture of swh-journal calls consumer.close() after the execution of test_storage_direct_writer_anonymized.

#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x29372c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x2937270, cond=0x2937298) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x2937298, mutex=0x2937270) at pthread_cond_wait.c:655
#3  0x00007ffff7f5c1a9 in cnd_wait (cond=<optimized out>, mutex=<optimized out>) at cnd_wait.c:24
#4  0x00007ffff3edaff5 in cnd_timedwait_abs (cnd=cnd@entry=0x2937298, mtx=mtx@entry=0x2937270, tspec=tspec@entry=0x7fffffff77b0) at /home/anlambert/dev/librdkafka/src/tinycthread_extra.c:99
#5  0x00007ffff3e5e025 in rd_kafka_q_pop_serve (rkq=rkq@entry=0x2937270, timeout_us=<optimized out>, version=version@entry=0, cb_type=cb_type@entry=RD_KAFKA_Q_CB_RETURN, callback=callback@entry=0x0, 
    opaque=opaque@entry=0x0) at /home/anlambert/dev/librdkafka/src/rdkafka_queue.c:404
#6  0x00007ffff3e5e140 in rd_kafka_q_pop (rkq=rkq@entry=0x2937270, timeout_us=<optimized out>, version=version@entry=0) at /home/anlambert/dev/librdkafka/src/rdkafka_queue.c:428
#7  0x00007ffff3e4c18c in rd_kafka_op_req0 (destq=destq@entry=0x26d9ec0, recvq=recvq@entry=0x2937270, rko=rko@entry=0x1997b80, timeout_ms=timeout_ms@entry=-1) at /home/anlambert/dev/librdkafka/src/rdtime.h:146
#8  0x00007ffff3e4cadf in rd_kafka_op_req (destq=0x26d9ec0, rko=rko@entry=0x1997b80, timeout_ms=timeout_ms@entry=-1) at /home/anlambert/dev/librdkafka/src/rdkafka_op.c:631
#9  0x00007ffff3e821b0 in rd_kafka_assign0 (rk=<optimized out>, assign_method=RD_KAFKA_ASSIGN_METHOD_ASSIGN, partitions=0x7ffeb400bdb0) at /home/anlambert/dev/librdkafka/src/rdkafka_subscription.c:123
#10 0x00007ffff3e82202 in rd_kafka_assign (rk=rk@entry=0x29523f0, partitions=partitions@entry=0x7ffeb400bdb0) at /home/anlambert/dev/librdkafka/src/rdkafka_subscription.c:134
#11 0x00007ffff53722f6 in Consumer_rebalance_cb (rk=0x29523f0, err=RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS, c_parts=0x7ffeb400bdb0, opaque=0x7ffe0d30d0d0)
    at /home/anlambert/dev/confluent-kafka-python/src/confluent_kafka/src/Consumer.c:1365
#12 0x00007ffff3de7f1f in rd_kafka_poll_cb (rk=rk@entry=0x29523f0, rkq=rkq@entry=0x1b48300, rko=rko@entry=0x7ffeb400bd30, cb_type=cb_type@entry=RD_KAFKA_Q_CB_RETURN, opaque=opaque@entry=0x0)
    at /home/anlambert/dev/librdkafka/src/rdkafka.c:3647
#13 0x00007ffff3de8303 in rd_kafka_consumer_close (rk=0x29523f0) at /home/anlambert/dev/librdkafka/src/rdkafka.c:3186
#14 0x00007ffff5373714 in Consumer_close (self=0x7ffe0d30d0d0, ignore=<optimized out>) at /home/anlambert/dev/confluent-kafka-python/src/confluent_kafka/src/Consumer.c:981

Fixed in D5048.

Hmm. Looks like this test was relying on a buggy behavior of the mocked kafka cluster: two subsequent consumers setup with the same group id should receive a different set of messages, rather than the same set of messages.

I guess we should also do replayer.commit() after replayer.process() so that the offsets are properly updated on the broker once we've consumed the messages.

Build is green

Patch application report for D5045 (id=17987)

Rebasing onto b0383833fe...

Current branch diff-target is up to date.
Changes applied before test
commit 01c1b67a6179a202f6613f0491254a0285a11c1a
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Feb 9 10:46:37 2021 +0100

    test_replay: Fix hang since confluent-kafka 1.6 release
    
    Side effect of the following commit in librdkafka 1.6:
    https://github.com/edenhill/librdkafka/commit/f418e0f721518d71ff533759698b647cb2e89b80
    
    It seems consumers must have different group ids now when using a
    mocked kafka cluster in tests.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1149/ for more details.

Improve commit message and explicitly commit messages once consumed.

This revision is now accepted and ready to land.Feb 9 2021, 4:01 PM

Build is green

Patch application report for D5045 (id=18001)

Rebasing onto b0383833fe...

Current branch diff-target is up to date.
Changes applied before test
commit 51df58e82e4f2c3792c7e4229ecd059be19cb050
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Tue Feb 9 10:46:37 2021 +0100

    test_replay: Fix hang since confluent-kafka 1.6 release
    
    Side effect of the following commit in librdkafka 1.6:
    https://github.com/edenhill/librdkafka/commit/f418e0f721518d71ff533759698b647cb2e89b80
    
    Tests was relying on a buggy behavior of the mocked kafka cluster: two
    subsequent consumers setup with the same group id should receive a
    different set of messages, rather than the same set of messages.
    
    Also explicitly commit messages once consumed.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1150/ for more details.