
Parallel loaders performance
Closed, Resolved · Public

Description

In order to simulate the cluster behavior under a usual swh load, a test should be prepared with an ingestion infrastructure running a lot of loaders in parallel.

Some adaptations need to be made to the bench lab to have a central swh-storage responding to the workers.
A fake objstorage needs to be added to handle the workers' content additions,
and of course all the ingestion services: a lister / scheduler / rabbitmq ...

The docker infrastructure will probably be used (or perhaps kubernetes, as grid5000 provides a k8s deployment environment).
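Such a bench deployment could be sketched as a compose file along these lines (service layout, image names and commands are assumptions for illustration, not the actual bench lab configuration):

```yaml
# Hypothetical docker-compose sketch of the bench ingestion stack;
# image names, commands and dependencies are illustrative assumptions.
services:
  rabbitmq:
    image: rabbitmq:3-management
  swh-storage:
    image: softwareheritage/base   # central storage RPC service
    command: storage
  swh-objstorage:
    image: softwareheritage/base   # fake objstorage absorbing writes
    command: objstorage
  swh-scheduler:
    image: softwareheritage/base
    command: scheduler
  loaders:
    image: softwareheritage/base   # scaled up to N replicas
    command: loader
    depends_on: [rabbitmq, swh-storage, swh-objstorage]
```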

Event Timeline

vsellier changed the task status from Open to Work in Progress.Sep 15 2021, 5:49 PM
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier updated the task description. (Show Details)

Intermediate status:

  • the bench lab is easily deployable on g5k on several workers to distribute the load [1]
  • it works well when the load is not too high. When the number of workers is increased, the workers seem to have trouble talking to rabbitmq:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-p9ds5                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-n6pvm                    
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-mrcjj                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-7bn4s                                                                                       
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-lg2bd

and also an unexplained time drift:

[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-lxjpl may mean clocks are out of sync.  Current drift is 
[loaders-77cdd444df-flcv9 loaders] 356 seconds.  [orig: 2021-09-30 23:46:55.447181 recv: 2021-09-30 23:40:59.633444]                                                                                                                                                                     
[loaders-77cdd444df-flcv9 loaders]                                                                                                                                                                            
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-jd6v9 may mean clocks are out of sync.  Current drift is                                                                                              
[loaders-77cdd444df-flcv9 loaders] 355 seconds.  [orig: 2021-09-30 23:46:55.447552 recv: 2021-09-30 23:41:00.723983]                                  
[loaders-77cdd444df-flcv9 loaders]

and several other issues with interrupted connections.

there are also issues with storage and objstorage tuning that need to be resolved to have a stable test environment and relevant metrics.

[loaders-77cdd444df-mf85h loaders] [2021-09-30 23:47:00,865: WARNING/MainProcess] Retrying RPC call                                                                  
[loaders-77cdd444df-wzntq loaders] [2021-09-30 23:47:00,864: WARNING/MainProcess] Retrying RPC call                                                                                                                                                                                      
[loaders-77cdd444df-hskqx loaders] [2021-09-30 23:47:00,866: WARNING/MainProcess] Retrying RPC call                                                             
[loaders-77cdd444df-cfgj8 loaders] [2021-09-30 23:47:00,866: WARNING/MainProcess] Retrying RPC call

The stats are not impressive for the moment, but everything collapses on the swh stack (cassandra is performing well):

It can do much better

[1] rDSNIP7cc495e333e2753f849aa8756e04d77713e5b64e

The loaders were finally stabilized. The problem was a wrong celery configuration.
Changing the pool configuration from solo to prefork solved the problem, even though the concurrency is kept at one.
Solo looked appropriate for an environment like this POC, but for obvious reasons it was not working as expected:
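The change boils down to one Celery setting. A hedged sketch (using Celery's lowercase setting names; the exact swh worker config layout is an assumption): the solo pool runs the task inside the worker's main process, so a long git import blocks the loop that answers AMQP heartbeats, and peers log "missed heartbeat" and clock-drift warnings. The prefork pool runs tasks in a child process and keeps the main process responsive, even with a single task at a time.

```python
# Hedged sketch of the pool change described above (setting names are
# Celery's; the surrounding swh configuration is an assumption).
worker_pool = "prefork"   # was: "solo", which blocked heartbeats
worker_concurrency = 1    # concurrency intentionally kept at 1
```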

It was possible to increase the load with a stable behavior. Now it seems cassandra is more in pain with 60 git loaders.
Several timeouts are logged in the storage, all on the revision_missing check:

ERROR:root:Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0}
Traceback (most recent call last):
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf
    return f.negotiator(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__
    result = self.func(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f
    return obj_meth(**kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 24, in d
    return f(*a, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 683, in revision_missing
    return self._cql_runner.revision_missing(revisions)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 162, in newf
    self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 609, in revision_missing
    return self._missing(statement, ids)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 337, in _missing
    rows = self._execute_with_retries(statement, [list(id_group)])
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 324, in wrapped_f
    return self(f, *args, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 349, in iter
    return fut.result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 407, in __call__
    result = fn(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 293, in _execute_with_retries
    return self._session.execute(statement, args, timeout=1000.0)
  File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4894, in cassandra.cluster.ResponseFuture.result
cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0}

The objstorage also shows some stability issues:

  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf
    return f.negotiator(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__
    result = self.func(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f
    return obj_meth(**kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 24, in d
    return f(*a, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 77, in d
    r = f(*a, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 265, in content_add
    return self._content_add(list(contents), with_data=True)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 197, in _content_add
    c for c in contents if c.status != "absent"
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/objstorage.py", line 62, in content_add
    summary = self.objstorage.add_batch({cont.sha1: cont.data for cont in contents})
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/objstorage/api/client.py", line 47, in add_batch
    {"contents": contents, "check_presence": check_presence,},
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 272, in post
    **opts,
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 256, in raw_verb
    raise self.api_exception(e)
swh.objstorage.exc.ObjStorageAPIError: An unexpected error occurred in the api backend: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
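For illustration, a connection reset like the one above is typically absorbed by a client-side retry with backoff. This is only a sketch (swh.core already retries RPC calls via tenacity, and the real fix was to stabilize the service, not to retry harder):

```python
# Hedged sketch of a client-side retry around a flaky RPC call such as
# the objstorage add_batch above; illustrative only, not swh's code.
import time

def with_retries(call, attempts=3, delay=1.0):
    """Call `call()`, retrying on connection resets with backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionResetError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay * (2 ** attempt))  # exponential backoff
```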

The I/O, mostly reads, is quite high on the cassandra servers:

Immediate actions for the next run:

  • as there is plenty of memory on the servers, create a memory volume on all the worker nodes and start several objstorages writing their content on them
  • check the revision_missing function for possible improvements

Some runs with the fix:
It globally improves the stability of the benchmark by reducing the timeouts.

The first 6 runs were with different algorithms, with 100 workers.

  1. group_pk_concurrent
  2. group_pk_concurrent
  3. concurrent
  4. concurrent
  5. grouped_naive
  6. group_pk_serial

The last one is with 150 workers and the grouped-pk-concurrent algorithm.
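The idea behind the grouped-pk-concurrent variant can be sketched as follows (function names and the hash-based grouping are assumptions for illustration, not swh.storage's actual Cassandra code): ids are grouped by partition token so each SELECT stays within one partition group, and the groups are queried concurrently instead of issuing one big `IN (...)` query that a single coordinator must fan out.

```python
# Hedged sketch of a grouped-by-partition-key, concurrent missing
# check; the grouping and query functions are stand-ins.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition_token(id_, nb_groups=16):
    # stand-in for the Murmur3 token of the partition key
    return hash(id_) % nb_groups

def grouped_pk_concurrent_missing(ids, known_ids, max_workers=4):
    groups = defaultdict(list)
    for id_ in ids:
        groups[partition_token(id_)].append(id_)

    def query_group(group):
        # stand-in for one prepared SELECT over a single token group
        return [i for i in group if i not in known_ids]

    missing = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for found in pool.map(query_group, groups.values()):
            missing.extend(found)
    return missing
```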

There is not a big difference between the runs in number of visits/s; the differences are mostly in the number of client timeouts:

It's not clear for the moment where the bottleneck is located, but according to the disk I/O profile it looks like it is due to reads, probably still in the _missing methods.

The most important latencies are on range selects

This is the impact of configuring the add batch insert size. Tested values were 500, 1000 and 2000.
A full repair of a cassandra node was in progress due to a data loss (disk erased by someone else's experiment after the disk reservation expired).

The configuration was:

  • 4 swh-storage running (128 workers / 1 thread)
  • 4 objstorage writing in a ramfs
  • 100 loaders
  • directory entry insert algo: batch
  • Patched with D6423 but with default values
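The tunable here is simply how the entries to insert are sliced before being sent as Cassandra BATCH statements. A minimal sketch (the function name is an assumption; the batch size is what D6423 makes configurable, instead of the hard-coded 1000):

```python
# Illustrative sketch of the batched insert being tuned; smaller
# slices mean smaller Cassandra BATCH statements per request.
def iter_batches(entries, batch_size=500):
    """Yield successive slices of at most batch_size entries."""
    for i in range(0, len(entries), batch_size):
        yield entries[i:i + batch_size]
```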

The graphs are the request durations per decile. The lower the better

  • content: total / normalized, missing duration, add duration (graphs)
  • directory: total / normalized, missing duration, add duration (graphs)
  • revision: total / normalized, missing duration, add duration (graphs)

Apparently a smaller batch size gives better performance in all scenarios.
The 1000 value seems to show a weird behavior for directories; I launched the test several times for this size but it always gives the same kind of results, and I can't explain why.

This is the impact of the different algorithms used for retrieving the missing objects.
The configuration is the same as previously; the storage is patched with D6423, D6494 and D6495.
The first run, without D6423, is a version patched with D6494 and D6495 only.

  • content: total / normalized, missing duration, add duration (graphs)
  • directory: total / normalized, missing duration, add duration (graphs)
  • revision: total / normalized, missing duration, add duration (graphs)

The results are quite different depending on the type of object. It looks like the grouped pk concurrent implementation could lead to an overall performance improvement (revision, directory).
The counterpart is that, as the *_missing functions are faster, the add functions are more stressed.

Closing here the tests of the workers.

The max number of visits/s reached was around 20 ops/s.
The limit is reached on the cassandra HDDs, bounded by the read bandwidth, so the results are specific to the current hardware.

It seems a potential configuration could be:

  • insert batch size reduced to 500 (hard coded to 1000 in the code currently)
  • storage:
consistency_level: LOCAL_QUORUM
select_missing_algo: grouped-pk-concurrent
directory_entries_insert_algo: batch

  • loaders (same configuration, copied from the replayers):

- cls: buffer
  min_batch_size:
    content: 500
    content_bytes: 104857600
    directory: 1000
    revision: 50

It will need to be adapted according to the new hardware.
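For illustration, the "buffer" proxy configured above behaves roughly like the following sketch (simplified; not swh.storage's actual buffer proxy code): writes are held per object type and flushed to the backend once the configured min_batch_size threshold is reached.

```python
# Illustrative sketch of a buffering storage proxy; the class name and
# backend interface are assumptions, not swh.storage's real API.
class BufferingStorage:
    def __init__(self, backend, min_batch_size):
        self.backend = backend            # callable(otype, objects)
        self.min_batch_size = min_batch_size
        self.buffers = {otype: [] for otype in min_batch_size}

    def add(self, otype, obj):
        buf = self.buffers[otype]
        buf.append(obj)
        # flush this object type once its threshold is reached
        if len(buf) >= self.min_batch_size[otype]:
            self.flush(otype)

    def flush(self, otype):
        if self.buffers[otype]:
            self.backend(otype, self.buffers[otype])
            self.buffers[otype] = []
```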

A last run will be launched tonight with a bigger cache on cassandra. The hit ratio is quite low (less than ~25%)
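The resized cache would correspond to a cassandra.yaml fragment like this (hedged: assuming the cache being tuned is the key cache, whose 100MB default matches the "100Mb (default)" run reported here):

```yaml
# cassandra.yaml fragment; assumption: the resized cache is the key
# cache (default: min(5% of heap, 100MB))
key_cache_size_in_mb: 512
```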

Changing the cache size doesn't seem very effective for the particular workload of the loaders:
Except for some contextual differences, probably due to the different origins loaded, the performances are quite similar.
The hit ratios between the different configurations are also very close.

  • content
totalnormalized
missing duration
add_duration
  • directory
totalnormalized
missing duration
add_duration
  • revision
totalnormalized
missing duration
add_duration

Cache hit ratio (graphs):

  • 100MB (default)
  • 512MB
  • 1024MB

vsellier moved this task from Backlog to done on the System administration board.

Stopping here the investigations, as the limit is now the hardware running cassandra (HDD). Great improvements were made compared to the first tries.