
Investigate CPU usage of swh-storage on storage0.euwest.azure
Closed, Migrated. Edits Locked.

Description

Running the metadata indexers with 10 workers and concurrency > 2 (~500 requests/s) fully uses all 8 CPUs of the storage host. This seems excessive; it is the metadata indexers' bottleneck and it makes some tasks crash.

Event Timeline

vlorentz triaged this task as Normal priority.
vlorentz created this task.

Once in a while, an indexer worker crashes with the following error [1]:

[2019-01-31 13:01:13,034: ERROR/ForkPoolWorker-486] Problem when processing origin 7385949
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 570, in run
    res = self.index(origin, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/indexer/metadata.py", line 281, in index
    rev = list(self.storage.revision_get([rev_id]))
  File "/usr/lib/python3/dist-packages/swh/storage/api/client.py", line 61, in revision_get
    return self.post('revision', {'revisions': revisions})
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 128, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 157, in _decode_response
    raise pickle.loads(decode_response(response))
psycopg2.extensions.QueryCanceledError: canceling statement due to statement timeout

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/29df7540-17e9-11e9-b8ce-cf95f437ce37

Checking storage0's service logs, a similar query is cancelled with the following error:

Jan 31 13:04:36 storage0 python3[2434334]: 2019-01-31 13:04:36 [2434334] [ERROR] canceling statement due to statement timeout
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3/dist-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/lib/python3/dist-packages/swh/storage/api/server.py", line 147, in revision_get
    return encode_data(get_storage().revision_get(**decode_request(request)))
  File "/usr/lib/python3/dist-packages/swh/core/api.py", line 176, in encode_data_server
    msgpack_dumps(data),
  File "/usr/lib/python3/dist-packages/swh/core/serializers.py", line 170, in msgpack_dumps
    return msgpack.packb(data, use_bin_type=True, default=encode_types)
  File "/usr/lib/python3/dist-packages/msgpack/__init__.py", line 47, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 231, in msgpack._packer.Packer.pack (msgpack/_packer.cpp:3661)
  File "msgpack/_packer.pyx", line 233, in msgpack._packer.Packer.pack (msgpack/_packer.cpp:3503)
  File "msgpack/_packer.pyx", line 224, in msgpack._packer.Packer._pack (msgpack/_packer.cpp:3312)
  File "/usr/lib/python3/dist-packages/swh/core/serializers.py", line 154, in encode_types
    return list(obj)
  File "/usr/lib/python3/dist-packages/swh/storage/common.py", line 78, in _meth
    yield from meth(self, *args, db=db, cur=cur, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/storage/storage.py", line 622, in revision_get
    for line in db.revision_get_from_list(revisions, cur):
  File "/usr/lib/python3/dist-packages/swh/storage/db.py", line 667, in revision_get_from_list
    ((id,) for id in revisions))
  File "/usr/lib/python3/dist-packages/swh/storage/db.py", line 77, in execute_values_to_bytes
    for line in execute_values_generator(*args, **kwargs):
  File "/usr/lib/python3/dist-packages/swh/storage/db_utils.py", line 122, in execute_values_generator
    cur.execute(b''.join(parts))
psycopg2.extensions.QueryCanceledError: canceling statement due to statement timeout
Jan 31 13:04:36 storage0 python3[2434334]: 2019-01-31 13:04:36 [2434334] [INFO] b'' - - [31/Jan/2019:13:04:36 +0000] "POST /revision HTTP/1.0" 400 210 "-" "python-requests/2.12.4"

source: retrieved from the machine's journalctl output

It's not clear why yet, though.
Running EXPLAIN on a similar SQL query to the one behind that call shows a cheap plan with only index scans, so the query itself looks fine:

softwareheritage=> explain SELECT t.id, revision.date, revision.date_offset, revision.date_neg_utc_offset, revision.committer_date, revision.committer_date_offset, revision.committer_date_neg_utc_offset, revision.type, revision.directory, revision.message, author.fullname, author.name, author.email, committer.fullname, committer.name, committer.email, revision.metadata, revision.synthetic, author.id, committer.id, ARRAY ( SELECT rh.parent_id::bytea FROM revision_history  rh WHERE rh.id = t.id ORDER BY rh.parent_rank ) FROM (VALUES ('\x3afbfabe2c23f2f9ee27b0e8fc97277123981f38'::bytea)) as t (id) LEFT JOIN revision ON t.id = revision.id LEFT JOIN person author ON revision.author = author.id LEFT JOIN person committer ON revision.committer = committer.id;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=1.58..133.13 rows=1 width=727)
   ->  Nested Loop Left Join  (cost=1.14..4.52 rows=1 width=623)
         ->  Nested Loop Left Join  (cost=0.70..2.74 rows=1 width=551)
               ->  Result  (cost=0.00..0.01 rows=1 width=32)
               ->  Index Scan using revision_pkey on revision  (cost=0.70..2.72 rows=1 width=540)
                     Index Cond: (('\x3afbfabe2c23f2f9ee27b0e8fc97277123981f38'::bytea) = (id)::bytea)
         ->  Index Scan using person_pkey on person author  (cost=0.44..1.78 rows=1 width=80)
               Index Cond: (revision.author = id)
   ->  Index Scan using person_pkey on person committer  (cost=0.44..1.78 rows=1 width=80)
         Index Cond: (revision.committer = id)
   SubPlan 1
     ->  Index Scan using revision_history_pkey on revision_history rh  (cost=0.70..126.84 rows=122 width=36)
           Index Cond: ((id)::bytea = ('\x3afbfabe2c23f2f9ee27b0e8fc97277123981f38'::bytea))
(13 rows)

source: pg_activity extract from dbreplica0.

Oh, I think I know: we have a timeout limit on some endpoints [1], including the revision_get endpoint mentioned in the stacktrace.

Yes, we have a timeout of 500ms [1]; that might be too small "once in a while"...

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/storage.py$610
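
For context, this is PostgreSQL's statement_timeout mechanism: when a statement runs longer than the configured limit, the server cancels it and psycopg2 raises QueryCanceledError, which is exactly the error in the tracebacks above. A minimal standalone sketch of that mechanism (not the actual swh-storage code; the connection string and the query are placeholders):

import psycopg2
import psycopg2.extensions

# Placeholder connection string, for illustration only.
conn = psycopg2.connect("dbname=softwareheritage")

with conn.cursor() as cur:
    # Any statement running longer than this limit is cancelled server-side.
    cur.execute("SET statement_timeout = 500")  # milliseconds
    try:
        # pg_sleep(1) stands in for a revision_get query that takes too long.
        cur.execute("SELECT pg_sleep(1)")
    except psycopg2.extensions.QueryCanceledError as exc:
        print(exc)  # canceling statement due to statement timeout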

To test this hypothesis, I've increased those timeouts for storage0.euwest.azure's storage service endpoints (see the sketch after this list):

  • revision_get: 500ms to 2s
  • directory_ls: 20s to 40s
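
Below is a hypothetical sketch of how such per-endpoint limits can be applied; the decorator name and the TIMEOUTS_MS mapping are made up for illustration (the actual implementation lives in swh/storage/storage.py), but the db/cur-passing pattern follows the traceback above.

# Hypothetical per-endpoint statement timeouts (values from the change above).
TIMEOUTS_MS = {
    'revision_get': 2000,    # was 500
    'directory_ls': 40000,   # was 20000
}

def with_statement_timeout(endpoint):
    """Illustrative decorator setting a server-side timeout for one endpoint."""
    def decorator(meth):
        def wrapper(self, *args, db=None, cur=None, **kwargs):
            # SET LOCAL scopes the timeout to the current transaction only.
            cur.execute("SET LOCAL statement_timeout = %s",
                        (TIMEOUTS_MS[endpoint],))
            return meth(self, *args, db=db, cur=cur, **kwargs)
        return wrapper
    return decorator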

So far so good: checking the dashboard, I no longer see errors for the impacted endpoints (revision_get, as per the previous comment, and directory_ls, which became apparent after the first one was fixed).

There now remain some errors:

  • errors directly related to the indexers (bad input), which make the indexer fail;
  • errors in the indexer's API calls to the indexer storage, which fail intermittently. I've created a dashboard to track those errors [1].

[1] http://kibana0.internal.softwareheritage.org:5601/app/kibana#/dashboard/22066c20-256d-11e9-b8ce-cf95f437ce37

@ardumont These timeouts are either caused by the high CPU usage of swh-storage or by an unrelated issue.

Awesome, I guess D1053 will fix that then (once it's OK for it to land).