
swh-graph CI hangs badly when py4j doesn't find needed files
Closed, Resolved (Public)

Description

The CI runs for swh-graph can hang badly on test_api_client.py; here's the most recent example.

This specific issue is solved now, but a test run really shouldn't hang. See below for an analysis by @olasd of why exceptions aren't forwarded, which blocks test abortion, and how to avoid that.

Event Timeline

zack triaged this task as Unbreak Now! priority. Nov 3 2019, 4:46 PM
zack created this task.
olasd added a comment. Nov 4 2019, 11:43 AM

The .jar file is never installed within the tox environment, so the graph backend process fixture never actually succeeds in launching the server. FWIW, when running tox on my system, the tests hang just the same.

root@thyssen:~# unzip -l /home/jenkins/workspace/DGRPH/tests/.tox/dist/swh.graph-0.1.0.post33.zip 
Archive:  /home/jenkins/workspace/DGRPH/tests/.tox/dist/swh.graph-0.1.0.post33.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/swh.graph.egg-info/
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/
      140  2019-11-04 10:23   swh.graph-0.1.0.post33/MANIFEST.in
       18  2019-11-04 10:23   swh.graph-0.1.0.post33/version.txt
       35  2019-11-04 10:23   swh.graph-0.1.0.post33/requirements-swh.txt
      286  2019-11-04 10:23   swh.graph-0.1.0.post33/README.md
       38  2019-11-04 10:23   swh.graph-0.1.0.post33/setup.cfg
      163  2019-11-04 10:23   swh.graph-0.1.0.post33/Makefile
     1162  2019-11-04 10:23   swh.graph-0.1.0.post33/PKG-INFO
       38  2019-11-04 10:23   swh.graph-0.1.0.post33/requirements.txt
     2411  2019-11-04 10:23   swh.graph-0.1.0.post33/setup.py
      709  2019-11-04 10:23   swh.graph-0.1.0.post33/swh.graph.egg-info/SOURCES.txt
        1  2019-11-04 10:23   swh.graph-0.1.0.post33/swh.graph.egg-info/dependency_links.txt
       89  2019-11-04 10:23   swh.graph-0.1.0.post33/swh.graph.egg-info/requires.txt
      130  2019-11-04 10:23   swh.graph-0.1.0.post33/swh.graph.egg-info/entry_points.txt
     1162  2019-11-04 10:23   swh.graph-0.1.0.post33/swh.graph.egg-info/PKG-INFO
        4  2019-11-04 10:23   swh.graph-0.1.0.post33/swh.graph.egg-info/top_level.txt
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/
      127  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/__init__.py
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/server/
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/tests/
     7032  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/backend.py
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/__init__.py
     5751  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/cli.py
     8863  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/webgraph.py
    13080  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/pid.py
     1565  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/dot.py
     3278  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/client.py
     5356  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/graph.py
       27  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/py.typed
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/server/__init__.py
     5175  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/server/app.py
        0  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/tests/__init__.py
     1361  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/tests/conftest.py
     4540  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/tests/test_api_client.py
     1526  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/tests/test_cli.py
     4329  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/tests/test_graph.py
     7219  2019-11-04 10:23   swh.graph-0.1.0.post33/swh/graph/tests/test_pid.py
---------                     -------
    75615                     39 files

I guess you at least need a fix to MANIFEST.in to make sure the jar file is properly included in the sdist that tox uses to generate its venvs. Looks like the test data isn't included either.
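The manual `unzip -l` check above could also be scripted so that a missing jar is caught before the tests run. The following is only a sketch: the helper name `missing_suffixes` and the choice of checking by file suffix are assumptions for illustration, not something swh-graph provides.

```python
import zipfile


def missing_suffixes(archive, required_suffixes=(".jar",)):
    """Return the required suffixes that no member of the zip archive ends with.

    `archive` is a path or a file-like object pointing at a zip sdist,
    such as the one tox builds under .tox/dist/.
    """
    with zipfile.ZipFile(archive) as sdist:
        names = sdist.namelist()
    # Keep only the suffixes for which no archive member matches.
    return [
        suffix
        for suffix in required_suffixes
        if not any(name.endswith(suffix) for name in names)
    ]
```

An empty list means every required kind of file is present; for the archive listed above, the result would contain `".jar"`.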

Why this results in a hang instead of a proper failure needs to be checked as well. I'm guessing the exception is never actually passed to the queue if the GraphServerProcess.start() fails, and so the fixture's queue.get() hangs. The path manipulations to try and find a .jar file and the dataset also seem quite brittle, and we should consider moving to importlib.resources.
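A minimal sketch of the guess above, assuming the fixture waits on a queue for the server's status: the worker always puts an outcome on the queue, including any exception, and the waiting side uses a timeout so it can never block forever. All names here (`launch_server`, `run_server`, `wait_for_server`, `start`) are hypothetical and do not reflect the actual `GraphServerProcess` code. A thread and `queue.Queue` are used to keep the example self-contained; `multiprocessing.Queue.get` accepts the same `timeout` argument and also raises `queue.Empty`.

```python
import queue
import threading


def launch_server(jar_path):
    # Hypothetical stand-in for the real launch step; it fails the way a
    # missing .jar file would.
    raise FileNotFoundError(f"jar not found: {jar_path}")


def run_server(result_queue, jar_path):
    """Worker body: always report an outcome, whether success or failure."""
    try:
        launch_server(jar_path)
    except Exception as exc:
        # Forward the exception instead of dying silently.
        result_queue.put(exc)
    else:
        result_queue.put("ready")


def wait_for_server(result_queue, timeout=5.0):
    """Wait for the worker's outcome, but never longer than `timeout` seconds."""
    try:
        outcome = result_queue.get(timeout=timeout)
    except queue.Empty:
        raise RuntimeError(
            f"server did not report its status within {timeout}s"
        ) from None
    if isinstance(outcome, Exception):
        # Re-raise in the caller so the test fails instead of hanging.
        raise outcome
    return outcome


def start(jar_path, timeout=5.0):
    result_queue = queue.Queue()
    worker = threading.Thread(
        target=run_server, args=(result_queue, jar_path), daemon=True
    )
    worker.start()
    return wait_for_server(result_queue, timeout)
```

With this shape, a startup failure surfaces as an ordinary exception in the fixture, and a worker that never reports anything turns into a timeout error rather than an indefinite hang.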

zack renamed this task from swh-graph CI hangs badly on test_api_client.py to swh-graph CI hangs badly when py4j doesn't find needed files. Nov 4 2019, 1:45 PM
zack lowered the priority of this task from Unbreak Now! to High.
zack updated the task description.
zack added a comment. Nov 11 2019, 1:45 PM

AFAICT this is a more general problem: the Java backend can hang forever in unexpected situations (uncaught exceptions? I really don't know…), in which case it stops responding to incoming requests and produces no visible output.
We should make this visible and debuggable.

zack assigned this task to seirl. Nov 12 2019, 4:16 PM

Another simple way to reproduce is to remove the *.jar file and run pytest on test_api_client.py.
This is not even a Java exception, but chances are that fixing this case will fix at least a significant part of the general problem, if not all of it.