
Parallel loaders performance
Closed, Resolved · Public

Description

In order to simulate the cluster behavior under a usual swh load, a test should be prepared with an ingestion infrastructure running a lot of loaders in parallel.

Some adaptations need to be made to the bench lab to have a central swh-storage responding to the workers.
A fake objstorage needs to be added to handle the workers' content additions,
and of course all the ingestion services: a lister / scheduler / rabbitmq ...

The docker infrastructure will probably be used (or perhaps kubernetes, as grid5000 provides a k8s deployment environment).
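Such a bench deployment could be sketched as a compose file along these lines (service layout, image names and commands are assumptions for illustration, not the actual bench lab configuration):

```yaml
# Hypothetical docker-compose sketch of the bench ingestion stack;
# image names, commands and dependencies are illustrative assumptions.
services:
  rabbitmq:
    image: rabbitmq:3-management
  swh-storage:
    image: softwareheritage/base   # central storage RPC service
    command: storage
  swh-objstorage:
    image: softwareheritage/base   # fake objstorage absorbing writes
    command: objstorage
  swh-scheduler:
    image: softwareheritage/base
    command: scheduler
  loaders:
    image: softwareheritage/base   # scaled up to N replicas
    command: loader
    depends_on: [rabbitmq, swh-storage, swh-objstorage]
```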

Event Timeline

vsellier changed the task status from Open to Work in Progress.Sep 15 2021, 5:49 PM
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier updated the task description. (Show Details)

Intermediate status:

  • the bench lab is easily deployable on g5k on several workers to distribute the load [1]
  • it works well when the load is not too high. When the number of workers is increased, the workers seem to have trouble talking to rabbitmq:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-p9ds5                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-n6pvm                    
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-mrcjj                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-7bn4s                                                                                       
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-lg2bd

and also an unexplained time drift:

[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-lxjpl may mean clocks are out of sync.  Current drift is 
[loaders-77cdd444df-flcv9 loaders] 356 seconds.  [orig: 2021-09-30 23:46:55.447181 recv: 2021-09-30 23:40:59.633444]                                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders]                                                                                                                                                                            
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-jd6v9 may mean clocks are out of sync.  Current drift is                                                                                              
[loaders-77cdd444df-flcv9 loaders] 355 seconds.  [orig: 2021-09-30 23:46:55.447552 recv: 2021-09-30 23:41:00.723983]                                  
[loaders-77cdd444df-flcv9 loaders]

and several other issues with interrupted connections.

there are also issues with storage and objstorage tuning that need to be resolved to have a stable test environment and relevant metrics.

[loaders-77cdd444df-mf85h loaders] [2021-09-30 23:47:00,865: WARNING/MainProcess] Retrying RPC call                                                                  
[loaders-77cdd444df-wzntq loaders] [2021-09-30 23:47:00,864: WARNING/MainProcess] Retrying RPC call                                                                                                                                                                                      
[loaders-77cdd444df-hskqx loaders] [2021-09-30 23:47:00,866: WARNING/MainProcess] Retrying RPC call                                                             
[loaders-77cdd444df-cfgj8 loaders] [2021-09-30 23:47:00,866: WARNING/MainProcess] Retrying RPC call

The stats are not impressive for the moment, but everything collapses on the swh stack (cassandra is performing well):

It can do much better

[1] rDSNIP7cc495e333e2753f849aa8756e04d77713e5b64e

The loaders were finally stabilized. The problem was a wrong celery configuration.
Changing the pool configuration from solo to prefork solved the problem, even though the concurrency is kept at one.
Solo looked appropriate for an environment like this POC, but for obvious reasons it was not working as expected:
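The change boils down to one Celery setting. A hedged sketch (using Celery's lowercase setting names; the exact swh worker config layout is an assumption): the solo pool runs the task inside the worker's main process, so a long git import blocks the loop that answers AMQP heartbeats, and peers log "missed heartbeat" and clock-drift warnings. The prefork pool runs tasks in a child process and keeps the main process responsive, even with a single task at a time.

```python
# Hedged sketch of the pool change described above (setting names are
# Celery's; the surrounding swh configuration is an assumption).
worker_pool = "prefork"   # was: "solo", which blocked heartbeats
worker_concurrency = 1    # concurrency intentionally kept at 1
```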

It was possible to increase the load with a stable behavior. Now it seems cassandra is more in pain with 60 git loaders.
Several timeouts are logged in the storage, all on the revision_missing check:

ERROR:root:Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0}
Traceback (most recent call last):
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf
    return f.negotiator(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__
    result = self.func(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f
    return obj_meth(**kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 24, in d
    return f(*a, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 683, in revision_missing
    return self._cql_runner.revision_missing(revisions)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 162, in newf
    self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 609, in revision_missing
    return self._missing(statement, ids)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 337, in _missing
    rows = self._execute_with_retries(statement, [list(id_group)])
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 324, in wrapped_f
    return self(f, *args, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 349, in iter
    return fut.result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 407, in __call__
    result = fn(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 293, in _execute_with_retries
    return self._session.execute(statement, args, timeout=1000.0)
  File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4894, in cassandra.cluster.ResponseFuture.result
cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0}

The objstorage also shows some stability issues:

  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf
    return f.negotiator(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__
    result = self.func(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f
    return obj_meth(**kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 24, in d
    return f(*a, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 77, in d
    r = f(*a, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 265, in content_add
    return self._content_add(list(contents), with_data=True)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 197, in _content_add
    c for c in contents if c.status != "absent"
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/objstorage.py", line 62, in content_add
    summary = self.objstorage.add_batch({cont.sha1: cont.data for cont in contents})
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/objstorage/api/client.py", line 47, in add_batch
    {"contents": contents, "check_presence": check_presence,},
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 272, in post
    **opts,
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 256, in raw_verb
    raise self.api_exception(e)
swh.objstorage.exc.ObjStorageAPIError: An unexpected error occurred in the api backend: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
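For illustration, a connection reset like the one above is typically absorbed by a client-side retry with backoff. This is only a sketch (swh.core already retries RPC calls via tenacity, and the real fix was to stabilize the service, not to retry harder):

```python
# Hedged sketch of a client-side retry around a flaky RPC call such as
# the objstorage add_batch above; illustrative only, not swh's code.
import time

def with_retries(call, attempts=3, delay=1.0):
    """Call `call()`, retrying on connection resets with backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionResetError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay * (2 ** attempt))  # exponential backoff
```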

The I/O, mostly reads, is quite high on the cassandra servers:

Immediate actions for the next run:

  • as there is plenty of memory on the servers, create a memory volume on all the worker nodes and start several objstorages writing their content on them
  • check the revision_missing function for possible improvements

Some runs with the fix:
It globally improves the stability of the benchmark by reducing the timeouts.

The first 6 runs were with different algorithms, with 100 workers.

  1. group_pk_concurrent
  2. group_pk_concurrent
  3. concurrent
  4. concurrent
  5. grouped_naive
  6. group_pk_serial

The last one is with 150 workers and the grouped-pk-concurrent algorithm.
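The idea behind the grouped-pk-concurrent variant can be sketched as follows (function names and the hash-based grouping are assumptions for illustration, not swh.storage's actual Cassandra code): ids are grouped by partition token so each SELECT stays within one partition group, and the groups are queried concurrently instead of issuing one big `IN (...)` query that a single coordinator must fan out.

```python
# Hedged sketch of a grouped-by-partition-key, concurrent missing
# check; the grouping and query functions are stand-ins.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition_token(id_, nb_groups=16):
    # stand-in for the Murmur3 token of the partition key
    return hash(id_) % nb_groups

def grouped_pk_concurrent_missing(ids, known_ids, max_workers=4):
    groups = defaultdict(list)
    for id_ in ids:
        groups[partition_token(id_)].append(id_)

    def query_group(group):
        # stand-in for one prepared SELECT over a single token group
        return [i for i in group if i not in known_ids]

    missing = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for found in pool.map(query_group, groups.values()):
            missing.extend(found)
    return missing
```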

There is not a big difference between the runs in number of visits/s; the differences are mostly in the number of client timeouts:

It's not clear for the moment where the bottleneck is located, but according to the disk I/O profile it looks like it is due to reads, probably still in the _missing methods.

The most important latencies are on range selects

This is the impact of configuring the add batch insert size. Tested values were 500, 1000 and 2000.
A full repair of a cassandra node was in progress due to a data loss (disk erased by someone else's experiment after the disk reservation expired).

The configuration was:

  • 4 swh-storage running (128 workers / 1 thread)
  • 4 objstorage writing in a ramfs
  • 100 loaders
  • directory entry insert algo: batch
  • Patched with D6423 but with default values
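The tunable here is simply how the entries to insert are sliced before being sent as Cassandra BATCH statements. A minimal sketch (the function name is an assumption; the batch size is what D6423 makes configurable, instead of the hard-coded 1000):

```python
# Illustrative sketch of the batched insert being tuned; smaller
# slices mean smaller Cassandra BATCH statements per request.
def iter_batches(entries, batch_size=500):
    """Yield successive slices of at most batch_size entries."""
    for i in range(0, len(entries), batch_size):
        yield entries[i:i + batch_size]
```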

The graphs are the request durations per decile. The lower the better

  • content: total / normalized, missing duration, add duration (graphs)
  • directory: total / normalized, missing duration, add duration (graphs)
  • revision: total / normalized, missing duration, add duration (graphs)

Apparently a smaller batch size gives better performance in all scenarios.
The 1000 value seems to show a weird behavior for directories; I launched the test several times for this size but it always gives the same kind of results, and I can't explain why.

This is the impact of the different algorithms used for retrieving the missing objects.
The configuration is the same as previously; the storage is patched with D6423, D6494 and D6495.
The first run, without D6423, is a version patched with D6494 and D6495 only.

  • content: total / normalized, missing duration, add duration (graphs)
  • directory: total / normalized, missing duration, add duration (graphs)
  • revision: total / normalized, missing duration, add duration (graphs)

The results are quite different depending on the type of object. It looks like the grouped pk concurrent implementation could lead to an overall performance improvement (revision, directory).
The counterpart is that, as the *_missing functions are faster, the add functions are more stressed.

Closing here the tests of the workers.

The max number of visits/s reached was around 20 ops/s.
The limit is reached on the cassandra HDDs, bounded by the read bandwidth, so the results are specific to the current hardware.

It seems a potential configuration could be:

  • insert batch size reduced to 500 (hard coded to 1000 in the code currently)
  • storage:
consistency_level: LOCAL_QUORUM
select_missing_algo: grouped-pk-concurrent
directory_entries_insert_algo: batch

  • loaders (same configuration, copied from the replayers):

- cls: buffer
  min_batch_size:
    content: 500
    content_bytes: 104857600
    directory: 1000
    revision: 50

It will need to be adapted according to the new hardware.
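For illustration, the "buffer" proxy configured above behaves roughly like the following sketch (simplified; not swh.storage's actual buffer proxy code): writes are held per object type and flushed to the backend once the configured min_batch_size threshold is reached.

```python
# Illustrative sketch of a buffering storage proxy; the class name and
# backend interface are assumptions, not swh.storage's real API.
class BufferingStorage:
    def __init__(self, backend, min_batch_size):
        self.backend = backend            # callable(otype, objects)
        self.min_batch_size = min_batch_size
        self.buffers = {otype: [] for otype in min_batch_size}

    def add(self, otype, obj):
        buf = self.buffers[otype]
        buf.append(obj)
        # flush this object type once its threshold is reached
        if len(buf) >= self.min_batch_size[otype]:
            self.flush(otype)

    def flush(self, otype):
        if self.buffers[otype]:
            self.backend(otype, self.buffers[otype])
            self.buffers[otype] = []
```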

A last run will be launched tonight with a bigger cache on cassandra. The hit ratio is quite low (less than ~25%)
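The resized cache would correspond to a cassandra.yaml fragment like this (hedged: assuming the cache being tuned is the key cache, whose 100MB default matches the "100Mb (default)" run reported here):

```yaml
# cassandra.yaml fragment; assumption: the resized cache is the key
# cache (default: min(5% of heap, 100MB))
key_cache_size_in_mb: 512
```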

Changing the cache size doesn't seem very effective for the particular workload of the loaders:
Except for some contextual differences, probably due to the different origins loaded, the performances are quite similar.
The hit ratios between the different configurations are also very close.

  • content
totalnormalized
missing duration
add_duration
  • directory
totalnormalized
missing duration
add_duration
  • revision
totalnormalized
missing duration
add_duration

Cache hit ratio (graphs):

  • 100MB (default)
  • 512MB
  • 1024MB

vsellier moved this task from Backlog to done on the System administration board.

Stopping here the investigations, as the limit is now the hardware running cassandra (HDD). Great improvements were made compared to the first tries.