Unstuck provenance diff build hanging and then aborted
Closed, Migrated

Description

The master build is fine [1]. The diff build [2], on top of master, hangs and then
gets aborted for some reason. The only difference between the master commit and the
diff commit is the newly pulled swh.journal dependency.

The diff build seems to hang around test_provenance_storage_content[rabbitmq],
so rabbitmq appears to be involved somehow [3].

[1] https://jenkins.softwareheritage.org/job/DPROV/job/tests/765/console

[2] https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/619/console

[3]

# diff hanging then killed
11:22:37  .tox/py3/lib/python3.7/site-packages/swh/provenance/tests/test_provenance_storage.py::test_provenance_storage_content[rabbitmq-with-path-denormalized] PASSED [ 27%]
11:35:35  Cancelling nested steps due to timeout
11:35:35  .tox/py3/lib/python3.7/site-packages/swh/provenance/tests/test_provenance_storage.py::test_provenance_storage_content[rabbitmq-without-path-denormalized] Sending interrupt signal to process
11:35:43  Terminated
11:35:43  ERROR: Got SIGTERM, handling it as a KeyboardInterrupt
11:35:43  ERROR: got KeyboardInterrupt signal

# master build ok around the same test
11:13:17  .tox/py3/lib/python3.7/site-packages/swh/provenance/tests/test_provenance_storage.py::test_provenance_storage_content[rabbitmq-without-path-denormalized] PASSED [ 27%]
11:13:18  .tox/py3/lib/python3.7/site-packages/swh/provenance/tests/test_provenance_storage.py::test_provenance_storage_directory[mongodb-with-path] PASSED [ 27%]

Event Timeline

ardumont renamed this task from Unstuck provenance diff build to Unstuck provenance diff build hanging and then aborted. Jun 29 2022, 11:34 AM
ardumont triaged this task as Normal priority.
ardumont created this task.

That also seems to incur a heavy load on the jenkins node thyssen [1], as lots of processes seem to be running for that job [2].

[1]

11:28 <+swhbot> icinga PROBLEM: service load on thyssen.internal.softwareheritage.org is WARNING: WARNING - load average: 30.18, 22.54, 11.58

[2]

root@thyssen:~# ps aux | grep -c PROV
31

WIP tryout [1] without those tests, to check whether it is the rabbitmq fixture that hangs.

[1] D8023#209122

Without the rabbitmq fixture, the build passes [2].

[2] D8023#209131

It looks like the rabbitmq provenance storage server, which is run within the pytest context through multiprocessing (so forks a bunch of python processes in the pytest run context), interacts poorly with the confluent-kafka library (which is used in the swh.journal fixtures, and brings up a bunch of internal threads).

When the rabbitmq storage server subprocesses are brought down, the internal rdkafka threads get confused and SIGABRT one of the interpreters, which makes pytest hang waiting for the interpreter to respond that it has shut down.

Isolating the journal and rabbitmq tests in separate pytest runs seems to work around the issue.
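For illustration, the isolation could look roughly like the sketch below. It is not the change that landed in the diff: the helper names are hypothetical, and it assumes that selecting on the "rabbitmq" keyword (which appears in the parametrized test ids in the log above) is enough to keep the rabbitmq tests out of the run that loads the swh.journal fixtures. It only relies on pytest's `-k` expression option.

```python
import subprocess
import sys


def isolated_pytest_commands(test_path, keyword="rabbitmq"):
    """Build two pytest command lines: one selecting tests whose id matches
    `keyword`, one selecting everything else (hypothetical helper)."""
    base = [sys.executable, "-m", "pytest", test_path]
    return [
        base + ["-k", keyword],           # only the rabbitmq-parametrized tests
        base + ["-k", f"not {keyword}"],  # everything else, journal tests included
    ]


def run_isolated(test_path, keyword="rabbitmq", timeout=None):
    """Run each selection in its own interpreter, so no forked server process
    and no rdkafka thread ever share a pytest session."""
    return [
        subprocess.run(cmd, timeout=timeout).returncode
        for cmd in isolated_pytest_commands(test_path, keyword)
    ]
```

Each command starts a fresh interpreter, so whatever state one run leaves behind cannot affect the other. The price is paying session startup twice.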

We could consider explicitly bringing up the rabbitmq storage server in separate processes, which would be free of any internal rdkafka state.
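A minimal sketch of that idea, under assumptions: instead of forking from the pytest process, the server would be started in a brand new interpreter, which cannot inherit any library state from the parent. The real entry point of the rabbitmq storage server is not shown in this task, so `PROBE` below is a hypothetical stand-in that only reports whether confluent_kafka is loaded in the child. The bounded `stop` is likewise an assumption about how one might avoid waiting forever on a child that never exits.

```python
import subprocess
import sys

# Hypothetical stand-in for the server entry point: it only reports whether
# the confluent_kafka module is already loaded in the child interpreter.
PROBE = "import sys; print('rdkafka loaded:', 'confluent_kafka' in sys.modules)"


def start_in_fresh_interpreter(code, **popen_kwargs):
    """Start `code` in a new Python interpreter rather than forking the
    current one, so the child starts without the parent's threads or modules."""
    return subprocess.Popen([sys.executable, "-c", code], **popen_kwargs)


def stop(proc, timeout=10.0):
    """Ask the child to terminate, wait a bounded time, then kill it."""
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


def run_and_collect(code=PROBE, timeout=30.0):
    """Run `code` to completion in a fresh interpreter and return its stdout."""
    proc = start_in_fresh_interpreter(code, stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        out, _ = proc.communicate()
    return out.strip()
```

A test fixture could call `start_in_fresh_interpreter` during setup and `stop` during teardown. The timeout in `stop` bounds the teardown, so a stuck child gets killed instead of stalling the whole run.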

There doesn't seem to be a way to explicitly tear down the rdkafka test broker/internal threads that we use in the swh.journal test fixtures.

Fixes included in the diff landed [1].

Note that the issue on the initial commit (first one in diff [1]) did not reproduce once pushed to the master branch. [2]

In any case, we can close this.

[1] D8023

[2] https://jenkins.softwareheritage.org/job/DPROV/job/tests/767/console

ardumont claimed this task.