Running the metadata indexers with 10 workers and a concurrency >2 (~500 requests/s) fully uses all 8 CPUs. This seems excessive: it is the metadata indexers' bottleneck, and it makes some tasks crash.
Once in a while, an indexer worker crashes with the following error:
[2019-01-31 13:01:13,034: ERROR/ForkPoolWorker-486] Problem when processing origin 7385949
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 570, in run
    res = self.index(origin, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/indexer/metadata.py", line 281, in index
    rev = list(self.storage.revision_get([rev_id]))
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 61, in revision_get
    return self.post('revision', {'revisions': revisions})
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 128, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 157, in _decode_response
    raise pickle.loads(decode_response(response))
psycopg2.extensions.QueryCanceledError: canceling statement due to statement timeout
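Note that the QueryCanceledError is not raised by the worker's own database access: it is a server-side exception that the storage RPC client re-raises locally after unpickling it from the HTTP error response (the "raise pickle.loads(decode_response(response))" frame above). A minimal sketch of that propagation pattern, using hypothetical helper names rather than swh.core's actual API:

# Sketch of the error-propagation pattern visible in the traceback: the server
# pickles the exception it hit, the client unpickles it from the HTTP response
# body and re-raises it locally. All names here are illustrative, not
# swh.core's real API.
import pickle


def encode_error_response(exc: BaseException) -> bytes:
    """Server side: serialize the exception so the client can re-raise it."""
    return pickle.dumps(exc)


def raise_from_response(status_code: int, body: bytes) -> bytes:
    """Client side: a non-2xx response carries a pickled server exception."""
    if status_code >= 400:
        raise pickle.loads(body)  # e.g. psycopg2's QueryCanceledError
    return body


# A 400 response produced from a server-side statement timeout:
body = encode_error_response(TimeoutError("canceling statement due to statement timeout"))
try:
    raise_from_response(400, body)
except TimeoutError as exc:
    print("worker sees the server-side error:", exc)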
Checking the storage0 service's logs, a similar query gets cancelled with the following error:
Jan 31 13:04:36 storage0 python3[2434334]: 2019-01-31 13:04:36 [2434334] [ERROR] canceling statement due to statement timeout
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3/dist-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/lib/python3/dist-packages/swh/storage/api/server.py", line 147, in revision_get
    return encode_data(get_storage().revision_get(**decode_request(request)))
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 176, in encode_data_server
    msgpack_dumps(data),
  File "/usr/lib/python3/dist-packages/swh/core/serializers.py", line 170, in msgpack_dumps
    return msgpack.packb(data, use_bin_type=True, default=encode_types)
  File "/usr/lib/python3/dist-packages/msgpack/__init__.py", line 47, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 231, in msgpack._packer.Packer.pack (msgpack/_packer.cpp:3661)
  File "msgpack/_packer.pyx", line 233, in msgpack._packer.Packer.pack (msgpack/_packer.cpp:3503)
  File "msgpack/_packer.pyx", line 224, in msgpack._packer.Packer._pack (msgpack/_packer.cpp:3312)
  File "/usr/lib/python3/dist-packages/swh/core/serializers.py", line 154, in encode_types
    return list(obj)
  File "/usr/lib/python3/dist-packages/swh/storage/common.py", line 78, in _meth
    yield from meth(self, *args, db=db, cur=cur, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/storage.py", line 622, in revision_get
    for line in db.revision_get_from_list(revisions, cur):
  File "/usr/lib/python3/dist-packages/swh/storage/db.py", line 667, in revision_get_from_list
    ((id,) for id in revisions))
  File "/usr/lib/python3/dist-packages/swh/storage/db.py", line 77, in execute_values_to_bytes
    for line in execute_values_generator(*args, **kwargs):
  File "/usr/lib/python3/dist-packages/swh/storage/db_utils.py", line 122, in execute_values_generator
    cur.execute(b''.join(parts))
psycopg2.extensions.QueryCanceledError: canceling statement due to statement timeout
Jan 31 13:04:36 storage0 python3[2434334]: 2019-01-31 13:04:36 [2434334] [INFO] b'' - - [31/Jan/2019:13:04:36 +0000] "POST /revision HTTP/1.0" 400 210 "-" "python-requests/2.12.4"
source: retrieved from the machine's journalctl output
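On the server side, note where the timeout fires: inside the msgpack serializer. revision_get returns a generator backed by a database cursor, and the query only actually runs when encode_types forces it with list(obj) while encoding the response. A small illustration of that lazy-evaluation pattern (hypothetical helpers, not swh.storage's actual code):

# Why the timeout surfaces inside the serializer: the endpoint returns a
# generator, and the SQL only runs when the serializer forces it with list().
# run_slow_query() stands in for the real cursor iteration and is hypothetical.
def revision_get(ids):
    for rev_id in ids:
        # In the real service this iterates a database cursor; the statement
        # timeout can fire here, long after revision_get() has "returned".
        yield run_slow_query(rev_id)


def encode_types(obj):
    # Serializer fallback: force generators into lists. This is the point
    # where the pending database work (and its timeout) actually happens.
    return list(obj)


def run_slow_query(rev_id):
    raise TimeoutError("canceling statement due to statement timeout")


try:
    encode_types(revision_get([bytes.fromhex("3afbfabe2c23f2f9ee27b0e8fc97277123981f38")]))
except TimeoutError as exc:
    print("raised during response encoding:", exc)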
It's not yet clear why, though.
Yet, running an equivalent SQL query by hand, the plan looks fine:
softwareheritage=> explain SELECT t.id, revision.date, revision.date_offset, revision.date_neg_utc_offset,
       revision.committer_date, revision.committer_date_offset, revision.committer_date_neg_utc_offset,
       revision.type, revision.directory, revision.message,
       author.fullname, author.name, author.email,
       committer.fullname, committer.name, committer.email,
       revision.metadata, revision.synthetic, author.id, committer.id,
       ARRAY(SELECT rh.parent_id::bytea
             FROM revision_history rh
             WHERE rh.id = t.id
             ORDER BY rh.parent_rank)
FROM (VALUES ('\x3afbfabe2c23f2f9ee27b0e8fc97277123981f38'::bytea)) as t(id)
LEFT JOIN revision ON t.id = revision.id
LEFT JOIN person author ON revision.author = author.id
LEFT JOIN person committer ON revision.committer = committer.id;
                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=1.58..133.13 rows=1 width=727)
   ->  Nested Loop Left Join  (cost=1.14..4.52 rows=1 width=623)
         ->  Nested Loop Left Join  (cost=0.70..2.74 rows=1 width=551)
               ->  Result  (cost=0.00..0.01 rows=1 width=32)
               ->  Index Scan using revision_pkey on revision  (cost=0.70..2.72 rows=1 width=540)
                     Index Cond: (('\x3afbfabe2c23f2f9ee27b0e8fc97277123981f38'::bytea) = (id)::bytea)
         ->  Index Scan using person_pkey on person author  (cost=0.44..1.78 rows=1 width=80)
               Index Cond: (revision.author = id)
   ->  Index Scan using person_pkey on person committer  (cost=0.44..1.78 rows=1 width=80)
         Index Cond: (revision.committer = id)
   SubPlan 1
     ->  Index Scan using revision_history_pkey on revision_history rh  (cost=0.70..126.84 rows=122 width=36)
           Index Cond: ((id)::bytea = ('\x3afbfabe2c23f2f9ee27b0e8fc97277123981f38'::bytea))
(13 rows)
source: pg_activity extract from dbreplica0.
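To double-check that a single lookup really is cheap, the same kind of query can also be timed directly against the database, bypassing the RPC layer. A sketch using a reduced variant of the query above (the connection string "service=swh-replica" is a placeholder):

# Time one revision lookup straight against PostgreSQL, bypassing the RPC
# layer, to confirm the per-row plan above runs in milliseconds.
# The DSN is a placeholder and the query is a reduced variant of the plan's.
import psycopg2

QUERY = """
EXPLAIN ANALYZE
SELECT revision.id, revision.date, revision.message
FROM (VALUES (%s::bytea)) AS t(id)
LEFT JOIN revision ON t.id = revision.id
"""

rev_id = bytes.fromhex("3afbfabe2c23f2f9ee27b0e8fc97277123981f38")
with psycopg2.connect("service=swh-replica") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY, (rev_id,))
        for (line,) in cur.fetchall():
            print(line)  # the last line reports the total execution time in ms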
Oh, I think I know... We have a timeout limit on some endpoints [1] (including the revision_get endpoint mentioned in the stacktrace).
Yes, per [1] we have a timeout of 500ms; that might be too small "once in a while"...
[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/storage.py$610
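For context, a per-statement timeout of that kind is what produces exactly the QueryCanceledError above: PostgreSQL cancels any statement that exceeds statement_timeout. A minimal reproduction in isolation, assuming a 500ms limit (the DSN is a placeholder, and this is not necessarily how swh-storage wires the timeout in [1]):

# Reproduce the failure mode in isolation: with a 500ms statement_timeout,
# any statement slower than that is cancelled with QueryCanceledError.
# The DSN is a placeholder; pg_sleep(2) stands in for a slow revision_get.
import psycopg2
from psycopg2.extensions import QueryCanceledError

conn = psycopg2.connect("service=swh")  # hypothetical DSN
try:
    with conn:
        with conn.cursor() as cur:
            cur.execute("SET LOCAL statement_timeout = 500")  # milliseconds
            cur.execute("SELECT pg_sleep(2)")
except QueryCanceledError as exc:
    print("cancelled by the server:", exc)
finally:
    conn.close()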
To test my hypothesis, I've increased those timeouts for the storage service endpoints on storage0.euwest.azure (a quick probe of the new limits is sketched after the list):
- revision_get: 500ms to 2s
- directory_ls: 20s to 40s
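A quick way to sanity-check those new values is the same kind of probe as above, run with the bumped limits (DSN and the pg_sleep duration are placeholders):

# Probe the bumped limits: a statement that used to overrun the 500ms budget
# should now complete within the new 2s revision_get window.
import psycopg2
from psycopg2.extensions import QueryCanceledError

NEW_TIMEOUTS_MS = {"revision_get": 2000, "directory_ls": 40000}

conn = psycopg2.connect("service=swh")  # hypothetical DSN
try:
    with conn:
        with conn.cursor() as cur:
            cur.execute("SET LOCAL statement_timeout = %s",
                        (NEW_TIMEOUTS_MS["revision_get"],))
            cur.execute("SELECT pg_sleep(1)")  # slow, but within the new limit
            print("completed within the new 2s limit")
except QueryCanceledError:
    print("still too slow even with the larger timeout")
finally:
    conn.close()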
So far so good: checking the dashboard, I no longer see errors for the impacted endpoints (revision_get as per the previous comment, and directory_ls, whose errors became apparent once the revision_get ones were fixed).
Some errors remain:
- errors directly related to the indexers themselves (bad input), which make the indexer fail
- errors in the indexer's API calls to the indexer storage, which fail from time to time. I've created a dashboard to track those errors [1]
@ardumont These timeouts are either caused by the high CPU usage on swh-storage, or they are an unrelated issue.