Page MenuHomeSoftware Heritage

[cassandra] Git loader performance are very bad
Closed, MigratedEdits Locked

Description

With all the replayers stopped and no particular load on the cluster, the loading of almost all the git repository are failing in timeout.
Not tested with other types of origin

Reference in current production:

swhworker@worker01:~$ time SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
Enumerating objects: 1461, done.
Counting objects: 100% (307/307), done.
Compressing objects: 100% (186/186), done.
Total 1461 (delta 166), reused 193 (delta 109), pack-reused 1154
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
{'status': 'eventful'}

real	0m9.096s
user	0m1.551s
sys	0m0.064s

With the cassandra backend:

swh@6490bac3ba28:/$ time swh loader run git https://github.com/slackhq/nebula
INFO:swh.loader.git.loader.GitLoader:Load origin 'https://github.com/slackhq/nebula' with type 'git'
Enumerating objects: 3317, done.
Counting objects: 100% (1128/1128), done.
Compressing objects: 100% (508/508), done.
Total 3317 (delta 696), reused 915 (delta 601), pack-reused 2189
INFO:swh.loader.git.loader.GitLoader:Listed 293 refs for repo https://github.com/slackhq/nebula
ERROR:swh.loader.git.loader.GitLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 339, in load
    self.store_data()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 467, in store_data
    self.flush()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 168, in flush
    self.storage.flush()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/buffer.py", line 156, in flush
    stats = add_fn(list(batch))
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 59, in revision_add
    missing_ids = self._filter_missing_ids("revision", (r.id for r in revisions))
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 116, in _filter_missing_ids
    return set(fn(missing_ids))
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 354, in _decode_response
    self.raise_for_status(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/api/client.py", line 29, in raise_for_status
    super().raise_for_status(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 344, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 ReadTimeout: ['Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out - received only 0 responses." info={\'consistency\': \'LOCAL_ONE\', \'required_responses\': 1, \'received_responses\': 0}']>


Traceback (most recent call last):
  File "/srv/softwareheritage/venv/bin/swh", line 8, in <module>
    sys.exit(main())
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/cli.py", line 105, in run
    result = loader.load()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 382, in load
    self.flush()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 168, in flush
    self.storage.flush()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/buffer.py", line 156, in flush
    stats = add_fn(list(batch))
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 59, in revision_add
    missing_ids = self._filter_missing_ids("revision", (r.id for r in revisions))
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/proxies/filter.py", line 116, in _filter_missing_ids
    return set(fn(missing_ids))
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 181, in meth_
    return self.post(meth._endpoint_path, post_data)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 278, in post
    return self._decode_response(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 354, in _decode_response
    self.raise_for_status(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/api/client.py", line 29, in raise_for_status
    super().raise_for_status(response)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 344, in raise_for_status
    raise exception from None
swh.core.api.RemoteException: <RemoteException 500 ReadTimeout: ['Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out - received only 0 responses." info={\'consistency\': \'LOCAL_ONE\', \'required_responses\': 1, \'received_responses\': 0}']>

real	8m59.338s     <------------ zzzZZZZZ
user	0m5.499s
sys	0m0.179s

Storage logs:

swh-storage_1                        | [2021-08-19 13:56:01 +0000] [31] [DEBUG] POST /origin/add_multi
swh-storage_1                        | [2021-08-19 13:56:01 +0000] [31] [DEBUG] POST /origin/visit/add
swh-storage_1                        | [2021-08-19 13:56:01 +0000] [31] [DEBUG] POST /origin/visit/get_latest
swh-storage_1                        | [2021-08-19 13:56:01 +0000] [31] [DEBUG] POST /origin/visit_status/get_latest
swh-storage_1                        | [2021-08-19 13:56:01 +0000] [31] [DEBUG] POST /snapshot/get_branches
swh-storage_1                        | [2021-08-19 13:56:05 +0000] [32] [DEBUG] POST /content/missing
swh-storage_1                        | [2021-08-19 13:56:56 +0000] [32] [DEBUG] POST /content/add
swh-storage_1                        | [2021-08-19 13:57:40 +0000] [32] [DEBUG] POST /directory/missing
swh-storage_1                        | [2021-08-19 13:57:41 +0000] [32] [DEBUG] POST /directory/add
swh-storage_1                        | [2021-08-19 13:59:12 +0000] [32] [DEBUG] POST /revision/missing
swh-storage_1                        | ERROR:root:Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0}
swh-storage_1                        | Traceback (most recent call last):
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request                                                                                                                       
swh-storage_1                        |     rv = self.dispatch_request()
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request                                                                                                                            
swh-storage_1                        |     return self.view_functions[rule.endpoint](**req.view_args)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf                                                                                                                          
swh-storage_1                        |     return f.negotiator(*args, **kwargs)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__                                                                                                                       
swh-storage_1                        |     result = self.func(*args, **kwargs)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f                                                                                                                               
swh-storage_1                        |     return obj_meth(**kw)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 593, in revision_missing                                                                                                         
swh-storage_1                        |     return self._cql_runner.revision_missing(revisions)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 145, in newf                                                                                                                         
swh-storage_1                        |     self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 542, in revision_missing                                                                                                             
swh-storage_1                        |     return self._missing(statement, ids)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 305, in _missing                                                                                                                     
swh-storage_1                        |     rows = self._execute_with_retries(statement, [ids])
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 324, in wrapped_f                                                                                                                            
swh-storage_1                        |     return self(f, *args, **kw)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 404, in __call__                                                                                                                             
swh-storage_1                        |     do = self.iter(retry_state=retry_state)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 349, in iter                                                                                                                                 
swh-storage_1                        |     return fut.result()
swh-storage_1                        |   File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 428, in result
swh-storage_1                        |     return self.__get_result()
swh-storage_1                        |   File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
swh-storage_1                        |     raise self._exception
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 407, in __call__                                                                                                                             
swh-storage_1                        |     result = fn(*args, **kwargs)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 272, in _execute_with_retries                                                                                                        
swh-storage_1                        |     return self._session.execute(statement, args, timeout=1000.0)
swh-storage_1                        |   File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
swh-storage_1                        |   File "cassandra/cluster.py", line 4894, in cassandra.cluster.ResponseFuture.result
swh-storage_1                        | cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_ONE', 'required_responses':
1, 'received_responses': 0}
swh-storage_1                        | [2021-08-19 13:59:17 +0000] [32] [DEBUG] POST /origin/visit_status/add
swh-storage_1                        | [2021-08-19 13:59:17 +0000] [32] [DEBUG] POST /content/missing
swh-storage_1                        | [2021-08-19 14:04:53 +0000] [32] [DEBUG] POST /content/add
swh-storage_1                        | [2021-08-19 14:04:53 +0000] [32] [DEBUG] POST /directory/missing
swh-storage_1                        | [2021-08-19 14:04:53 +0000] [32] [DEBUG] POST /directory/add
swh-storage_1                        | [2021-08-19 14:04:53 +0000] [32] [DEBUG] POST /revision/missing
swh-storage_1                        | ERROR:root:Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0}
swh-storage_1                        | Traceback (most recent call last):
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
swh-storage_1                        |     rv = self.dispatch_request()
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
swh-storage_1                        |     return self.view_functions[rule.endpoint](**req.view_args)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf
swh-storage_1                        |     return f.negotiator(*args, **kwargs)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__
swh-storage_1                        |     result = self.func(*args, **kwargs)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f
swh-storage_1                        |     return obj_meth(**kw)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 593, in revision_missing
swh-storage_1                        |     return self._cql_runner.revision_missing(revisions)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 145, in newf
swh-storage_1                        |     self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 542, in revision_missing
swh-storage_1                        |     return self._missing(statement, ids)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 305, in _missing
swh-storage_1                        |     rows = self._execute_with_retries(statement, [ids])
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 324, in wrapped_f
swh-storage_1                        |     return self(f, *args, **kw)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 404, in __call__
swh-storage_1                        |     do = self.iter(retry_state=retry_state)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 349, in iter
swh-storage_1                        |     return fut.result()
swh-storage_1                        |   File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 428, in result
swh-storage_1                        |     return self.__get_result()
swh-storage_1                        |   File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
swh-storage_1                        |     raise self._exception
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 407, in __call__
swh-storage_1                        |     result = fn(*args, **kwargs)
swh-storage_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 272, in _execute_with_retries
swh-storage_1                        |     return self._session.execute(statement, args, timeout=1000.0)
swh-storage_1                        |   File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
swh-storage_1                        |   File "cassandra/cluster.py", line 4894, in cassandra.cluster.ResponseFuture.result
swh-storage_1                        | cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_ONE', 'required_responses': 1, 'received_responses': 0}
swh-storage_1                        | [2021-08-19 14:04:58 +0000] [32] [DEBUG] Closing connection.

Some queries are logged in the cassandra logs:

INFO  [ScheduledTasks:1] 2021-08-19 16:00:16,534 NoSpamLogger.java:92 - Some operations were slow, details available at debug level (debug.log)
DEBUG [ScheduledTasks:1] 2021-08-19 16:00:16,535 MonitoringTask.java:175 - 1 operations were slow in the last 4999 msecs:
<SELECT ctime, length, status FROM swh.content WHERE token(sha1, sha1_git, sha256, blake2s256) >= 1987356847901115199 AND token(sha1, sha1_git, sha256, blake2s256) <= 1987356847901115199 LIMIT 5000>, time 1067 msec - slow timeout 1000 msec/cross-node
DEBUG [ScheduledTasks:1] 2021-08-19 16:02:11,537 MonitoringTask.java:175 - 1 operations were slow in the last 5000 msecs:
<SELECT ctime, length, status FROM swh.content WHERE token(sha1, sha1_git, sha256, blake2s256) >= 8339318752667056020 AND token(sha1, sha1_git, sha256, blake2s256) <= 8339318752667056020 LIMIT 5000>, time 1114 msec - slow timeout 1000 msec
DEBUG [ScheduledTasks:1] 2021-08-19 16:02:14,058 SSLFactory.java:354 - Checking whether certificates have been updated []

Event Timeline

vsellier triaged this task as Normal priority.Aug 19 2021, 4:32 PM
vsellier created this task.

it seems some more precise information can be logged by activating the full query logs without a big performance impact: https://cassandra.apache.org/doc/latest/cassandra/new/fqllogging.html

vsellier renamed this task from Git loader performance are very bad to [cassandra] Git loader performance are very bad.Aug 24 2021, 12:07 PM

Some live data from a git loader with a batch size of 1000 for each object types (with D6118 applied):

"object type";"input count";"missing_id duration (s)";"_missing_id count","_add duration(s)"
content;1000;0.4928;999;35.3384
content;1000;0.4095;1000;34.1440
content;1000;0.4374;998;35.6249
content;492;0.2960;488;16.7028
directory;1000;0.3978;999;71.2518
directory;1000;0.4484;1000;39.6845
directory;1000;0.4356;1000;54.0077
directory;1000;0.3833;1000;36.1437
directory;1000;0.4319;1000;30.5690
directory;402;0.1718;402;19.2335
revision;1000;0.8671;1000;10.3417
revision;575;0.4639;575;4.0819

and the associated flame graph:

It was the first time the origin (https://gitlab.com/obob_websites/samba2020) is loaded

D6139 should address the bottleneck in the flame graph

vsellier moved this task from Backlog to done on the System administration board.

changing the status to resolved as the main issues are solved.
Other tests with more parallel workers will be launched, if other problems will be detected, they will be tracked in new dedicated tickets.