Page MenuHomeSoftware Heritage

swh-graph timeouts
Closed, MigratedEdits Locked

Description

This task is to investigate the number of timeouts generated by the swh-graph
It seems it's not due to the number of parallel clients querying the graph in parallel as the exception are also
raised for a unique client.

For example:
The query:

http "http://granet:5009/graph/neighbors/swh:1:rev:bffd5ed1e4f1d784b670ec83bce630aee6a0a67d?edges=rev:rev&direction=forward&max_edges=0&return_types=rev"                                                                                                                                                                                                     09:54:19
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Mon, 20 Jun 2022 07:54:21 GMT
Server: Python/3.7 aiohttp/3.6.2
Transfer-Encoding: chunked


http: error: ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'HTTP/1.1 500 Internal Server Error\\r\\n', 0 bytes read)", InvalidChunkLength(got length b'HTTP/1.1 500 Internal Server Error\r\n', 0 bytes read))

The swh-graph logs:

Jun 20 07:55:31 granet swh[59760]: ERROR:root:Exception is not set.
Jun 20 07:55:31 granet swh[59760]: Traceback (most recent call last):
Jun 20 07:55:31 granet swh[59760]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 163, in java_call_iterator
Jun 20 07:55:31 granet swh[59760]:     async for value in reader:
Jun 20 07:55:31 granet swh[59760]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 114, in read_node_ids
Jun 20 07:55:31 granet swh[59760]:     with (await asyncio.wait_for(open_thread, timeout=2)) as f:
Jun 20 07:55:31 granet swh[59760]:   File "/usr/lib/python3.7/asyncio/tasks.py", line 423, in wait_for
Jun 20 07:55:31 granet swh[59760]:     raise futures.TimeoutError()
Jun 20 07:55:31 granet swh[59760]: concurrent.futures._base.TimeoutError
Jun 20 07:55:31 granet swh[59760]: During handling of the above exception, another exception occurred:
Jun 20 07:55:31 granet swh[59760]: Traceback (most recent call last):
Jun 20 07:55:31 granet swh[59760]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/core/api/asynchronous.py", line 71, in middleware_handler
Jun 20 07:55:31 granet swh[59760]:     return await handler(request)
Jun 20 07:55:31 granet swh[59760]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/aiohttp/web_urldispatcher.py", line 892, in _iter
Jun 20 07:55:31 granet swh[59760]:     resp = await method()
Jun 20 07:55:31 granet swh[59760]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/server/app.py", line 164, in get
Jun 20 07:55:31 granet swh[59760]:     await self.stream_response()
Jun 20 07:55:31 granet swh[59760]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/server/app.py", line 220, in stream_response
Jun 20 07:55:31 granet swh[59760]:     self.return_types,
Jun 20 07:55:31 granet swh[59760]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 84, in traversal
Jun 20 07:55:31 granet swh[59760]:     async for line in method(*args):
Jun 20 07:55:31 granet swh[59760]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 170, in java_call_iterator
Jun 20 07:55:31 granet swh[59760]:     task_exc = java_task.exception()
Jun 20 07:55:31 granet swh[59760]: asyncio.base_futures.InvalidStateError: Exception is not set.
Jun 20 07:55:31 granet swh[59760]: INFO:aiohttp.access:192.168.101.200 [20/Jun/2022:07:55:29 +0000] "GET /graph/neighbors/swh:1:rev:bffd5ed1e4f1d784b670ec83bce630aee6a0a67d?edges=rev:rev&direction=forward&max_edges=0&return_types=rev HTTP/1.1" 500 1960 "-" "HTTPie/2.2.0"

For information, these are the statistics of the queries done by the provenance clients:

root@granet:~# journalctl -u swhgraphdev.service | head -n1
-- Journal begins at Mon 2022-06-20 00:59:12 UTC, ends at Mon 2022-06-20 08:16:36 UTC. --
root@granet:~# journalctl -u swhgraphdev.service | grep " 500 " | wc -l
95453
root@granet:~# journalctl -u swhgraphdev.service | grep -v monitoring | grep " 200 " | wc -l
0
root@granet:~# journalctl -u swhgraphdev.service | grep " 400 " | wc -l
1875970

All the requests return a 500 or a 400

Revisions and Commits

rDGRPH Compressed graph representation
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890
D7890

Event Timeline

vsellier created this task.
vsellier updated the task description. (Show Details)
vsellier updated the task description. (Show Details)

I reversed engineered the py4j communication protocol, so next time it will hang, we should be able to tell if the issue is on the gateway server side or on the python side:

  • Create a name pipe
mkfifo /tmp/test
chmod a+w /tmp/test
tail -F /tmp/test
  • query the graph
ss -ltp | grep java
<get the port number>
telnet localhost <port number>
c
o0
get_handler
s/tmp/test
e

!yro1   <----- o1 is the object id, it will change at each request

c
o1    <----------------- change the object id according the result of the previous request
neighbors
sforward
srev:rev
sswh:1:rev:fe5ce7b8b8babffb8853127392506cc5e6dae045
i0
srev
e

The result is sent to the pipe:

tail -F /tmp/test                                                                                                                                                                                 
swh:1:rev:0ba6d132a0cf26f58dc6b34f1be0e6d663a2bb04

It's confirmed that the issue seems to be on the python part of the current implementation so I'm eager to see D7890 landed ;)

When the problem occurs, the java side of the gateway is responding well but not from the python api:

http "http://granet:5009/graph/neighbors/swh:1:rev:fe5ce7b8b8babffb8853127392506cc5e6dae045?edges=rev:rev&direction=forward&max_edges=0&return_types=rev"                                                                                                                                                                      10:09:07
HTTP/1.1 200 OK
Content-Type: text/plain
Date: Fri, 24 Jun 2022 08:09:09 GMT
Server: Python/3.7 aiohttp/3.6.2
Transfer-Encoding: chunked


http: error: ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'HTTP/1.1 500 Internal Server Error\\r\\n', 0 bytes read)", InvalidChunkLength(got length b'HTTP/1.1 500 Internal Server Error\r\n', 0 bytes read))

granet logs :

Jun 24 08:09:11 granet swh[2216885]: ERROR:root:Exception is not set.
Jun 24 08:09:11 granet swh[2216885]: Traceback (most recent call last):
Jun 24 08:09:11 granet swh[2216885]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 163, in java_call_iterator
Jun 24 08:09:11 granet swh[2216885]:     async for value in reader:
Jun 24 08:09:11 granet swh[2216885]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 114, in read_node_ids
Jun 24 08:09:11 granet swh[2216885]:     with (await asyncio.wait_for(open_thread, timeout=2)) as f:
Jun 24 08:09:11 granet swh[2216885]:   File "/usr/lib/python3.7/asyncio/tasks.py", line 423, in wait_for
Jun 24 08:09:11 granet swh[2216885]:     raise futures.TimeoutError()
Jun 24 08:09:11 granet swh[2216885]: concurrent.futures._base.TimeoutError
Jun 24 08:09:11 granet swh[2216885]: During handling of the above exception, another exception occurred:
Jun 24 08:09:11 granet swh[2216885]: Traceback (most recent call last):
Jun 24 08:09:11 granet swh[2216885]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/core/api/asynchronous.py", line 71, in middleware_handler
Jun 24 08:09:11 granet swh[2216885]:     return await handler(request)
Jun 24 08:09:11 granet swh[2216885]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/aiohttp/web_urldispatcher.py", line 892, in _iter
Jun 24 08:09:11 granet swh[2216885]:     resp = await method()
Jun 24 08:09:11 granet swh[2216885]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/server/app.py", line 164, in get
Jun 24 08:09:11 granet swh[2216885]:     await self.stream_response()
Jun 24 08:09:11 granet swh[2216885]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/server/app.py", line 220, in stream_response
Jun 24 08:09:11 granet swh[2216885]:     self.return_types,
Jun 24 08:09:11 granet swh[2216885]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 84, in traversal
Jun 24 08:09:11 granet swh[2216885]:     async for line in method(*args):
Jun 24 08:09:11 granet swh[2216885]:   File "/opt/swhgraph_venv/lib/python3.7/site-packages/swh/graph/backend.py", line 170, in java_call_iterator
Jun 24 08:09:11 granet swh[2216885]:     task_exc = java_task.exception()
Jun 24 08:09:11 granet swh[2216885]: asyncio.base_futures.InvalidStateError: Exception is not set.
Jun 24 08:09:11 granet swh[2216885]: INFO:aiohttp.access:192.168.101.200 [24/Jun/2022:08:09:09 +0000] "GET /graph/neighbors/swh:1:rev:fe5ce7b8b8babffb8853127392506cc5e6dae045?edges=rev:rev&direction=forward&max_edges=0&return_types=rev HTTP/1.1" 500 1960 "-" "HTTPie/2.2.0"

Direct access to the java part:

root@granet:~# ss -lnp | grep java
tcp    LISTEN  0        50                                   [::ffff:127.0.0.1]:37537                                                 *:*                        users:(("java",pid=2217234,fd=9))                                              
root@granet:~# telnet localhost 37537
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
c
o0
check_swhid
sswh:1:rev:fe5ce7b8b8babffb8853127392506cc5e6dae045
e
!yv
c
o0
get_handler
s/tmp/test
e
!yro5336651
c
o5336651
neighbors
sforward
srev:rev
sswh:1:rev:fe5ce7b8b8babffb8853127392506cc5e6dae045
i0
srev
e
!yv
^[^]
telnet> Connection closed.
root@granet:~# cat /tmp/test
swh:1:rev:0ba6d132a0cf26f58dc6b34f1be0e6d663a2bb04
vsellier claimed this task.

I will be solved by D7890

vlorentz changed the task status from Wontfix to Resolved.Aug 26 2022, 11:12 AM
vlorentz changed the task status from Resolved to Wontfix.
vlorentz moved this task from Backlog to Deployed on the Compressed graph service board.
vlorentz moved this task from Deployed to Wontfix / Invalid on the Compressed graph service board.