Page MenuHomeSoftware Heritage

Optimize SVN loader performance and memory consumption on large repositories
Open, NormalPublic

Description

During the week end, I have executed the loading of a large subversion repository (http://svn.osdn.net/svnroot/tortoisesvn/ containing
29334 commits and quite a lot of files and directories) in the docker environment to check the subversion loader behavior on such cases.

I noticed the overall loader performance was really poor but also that the memory consumption was way too high (the python process
executing the loading task consumed more than 16G of RAM before it got killed).

Consequently, after 23 hours of execution, 26654 commits were processed but the celery worker got killed afterwards, see stack trace below:

swh-loader_1                        | [2022-01-09 23:15:31,640: DEBUG/ForkPoolWorker-1] rev: 26654, swhrev: 38d2bb6e11f2e96b6b015ca68b8cea0d46b13430, dir: f83163c3505d990dc78c6d20a5135013b918976d
swh-loader_1                        | [2022-01-09 23:15:31,795: DEBUG/ForkPoolWorker-1] Flushing 43696 objects of type content (2157677387 bytes)
swh-loader_1                        | [2022-01-09 23:15:33,198: DEBUG/ForkPoolWorker-1] Flushing 1 objects of type revision (1 parents, 143 estimated bytes)
swh-loader_1                        | [2022-01-09 23:15:33,382: DEBUG/ForkPoolWorker-1] Flushing 7398 objects of type directory (154536 entries)
swh-loader_1                        | [2022-01-09 23:15:49,207: DEBUG/MainProcess] heartbeat_tick : for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:15:49,207: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: 1275/829966, now - 1275/830128, monotonic - 465794.696619455, last_heartbeat_sent - 465754.692346602, heartbeat int. - 60 for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:15:53,303: INFO/MainProcess] missed heartbeat from loader-opam@0c1a14b916dc
swh-loader_1                        | [2022-01-09 23:16:08,365: INFO/MainProcess] missed heartbeat from indexer@9c0259909b65
swh-loader_1                        | [2022-01-09 23:16:08,426: INFO/MainProcess] missed heartbeat from loader-deposit@37d0762c65e1
swh-loader_1                        | [2022-01-09 23:16:08,426: INFO/MainProcess] missed heartbeat from listers@4021bd5fb6ac
swh-loader_1                        | [2022-01-09 23:16:08,426: INFO/MainProcess] missed heartbeat from vault@7fd3ad6d389b
swh-loader_1                        | [2022-01-09 23:16:08,426: INFO/MainProcess] missed heartbeat from loader-opam@0c1a14b916dc
swh-loader_1                        | [2022-01-09 23:16:09,633: DEBUG/MainProcess] heartbeat_tick : for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:16:09,634: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: 1275/830128, now - 1275/830248, monotonic - 465815.123781095, last_heartbeat_sent - 465754.692346602, heartbeat int. - 60 for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:16:09,634: DEBUG/MainProcess] heartbeat_tick: sending heartbeat for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:16:29,793: DEBUG/MainProcess] heartbeat_tick : for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:16:30,040: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: 1275/830248, now - 1276/830248, monotonic - 465835.528932356, last_heartbeat_sent - 465835.528926193, heartbeat int. - 60 for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:16:51,428: DEBUG/MainProcess] heartbeat_tick : for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:16:53,279: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: 1276/830248, now - 1276/830248, monotonic - 465858.767095142, last_heartbeat_sent - 465835.528926193, heartbeat int. - 60 for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:18:09,542: DEBUG/MainProcess] heartbeat_tick : for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:18:17,446: DEBUG/MainProcess] heartbeat_tick : Prev sent/recv: 1276/830248, now - 1276/830248, monotonic - 465942.935785125, last_heartbeat_sent - 465835.528926193, heartbeat int. - 60 for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:18:17,622: DEBUG/MainProcess] heartbeat_tick: sending heartbeat for connection 2f504e27a35f4fbeb34bcb851fbcab82
swh-loader_1                        | [2022-01-09 23:18:17,623: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
swh-loader_1                        | Traceback (most recent call last):
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 326, in start
swh-loader_1                        |     blueprint.start(self)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/bootsteps.py", line 116, in start
swh-loader_1                        |     step.start(parent)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 618, in start
swh-loader_1                        |     c.loop(*c.loop_args())
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/celery/worker/loops.py", line 81, in asynloop
swh-loader_1                        |     next(loop)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 301, in create_loop
swh-loader_1                        |     poll_timeout = fire_timers(propagate=propagate) if scheduled else 1
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 143, in fire_timers
swh-loader_1                        |     entry()
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/kombu/asynchronous/timer.py", line 64, in __call__
swh-loader_1                        |     return self.fun(*self.args, **self.kwargs)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/kombu/asynchronous/timer.py", line 126, in _reschedules
swh-loader_1                        |     return fun(*args, **kwargs)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/kombu/connection.py", line 306, in heartbeat_check
swh-loader_1                        |     return self.transport.heartbeat_check(self.connection, rate=rate)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/kombu/transport/pyamqp.py", line 220, in heartbeat_check
swh-loader_1                        |     return connection.heartbeat_tick(rate=rate)
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/amqp/connection.py", line 767, in heartbeat_tick
swh-loader_1                        |     raise ConnectionForced('Too many heartbeats missed')
swh-loader_1                        | amqp.exceptions.ConnectionForced: Too many heartbeats missed
swh-loader_1                        | [2022-01-09 23:19:27,875: DEBUG/MainProcess] Closed channel #1
swh-loader_1                        | [2022-01-09 23:19:27,875: DEBUG/MainProcess] Closed channel #2
swh-loader_1                        | [2022-01-09 23:19:27,875: DEBUG/MainProcess] Closed channel #3
swh-loader_1                        | [2022-01-09 23:19:27,876: WARNING/MainProcess] /srv/softwareheritage/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py:361: CPendingDeprecationWarning: 
swh-loader_1                        | In Celery 5.1 we introduced an optional breaking change which
swh-loader_1                        | on connection loss cancels all currently executed tasks with late acknowledgement enabled.
swh-loader_1                        | These tasks cannot be acknowledged as the connection is gone, and the tasks are automatically redelivered back to the queue.
swh-loader_1                        | You can enable this behavior using the worker_cancel_long_running_tasks_on_connection_loss setting.
swh-loader_1                        | In Celery 5.1 it is set to False by default. The setting will be set to True by default in Celery 6.0.
swh-loader_1                        | 
swh-loader_1                        |   warnings.warn(CANCEL_TASKS_BY_DEFAULT, CPendingDeprecationWarning)
swh-loader_1                        | 
swh-loader_1                        | [2022-01-09 23:19:27,938: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@300aa7e628ab', 'copyright': 'Copyright (C) 2007-2018 Pivotal Software, Inc.', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'platform': 'Erlang/OTP 19.2.1', 'product': 'RabbitMQ', 'version': '3.6.16'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
swh-loader_1                        | [2022-01-09 23:19:27,942: INFO/MainProcess] Connected to amqp://guest:**@amqp:5672//
swh-loader_1                        | [2022-01-09 23:19:27,946: DEBUG/MainProcess] Closed channel #1
swh-loader_1                        | [2022-01-09 23:19:27,983: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@300aa7e628ab', 'copyright': 'Copyright (C) 2007-2018 Pivotal Software, Inc.', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'platform': 'Erlang/OTP 19.2.1', 'product': 'RabbitMQ', 'version': '3.6.16'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
swh-loader_1                        | [2022-01-09 23:19:27,988: DEBUG/MainProcess] using channel_id: 1
swh-loader_1                        | [2022-01-09 23:19:28,001: DEBUG/MainProcess] Channel open
swh-loader_1                        | [2022-01-09 23:19:28,004: INFO/MainProcess] mingle: searching for neighbors
swh-loader_1                        | [2022-01-09 23:19:28,004: DEBUG/MainProcess] using channel_id: 1
swh-loader_1                        | [2022-01-09 23:19:28,006: DEBUG/MainProcess] Channel open
swh-loader_1                        | [2022-01-09 23:19:29,048: INFO/MainProcess] mingle: all alone
swh-loader_1                        | [2022-01-09 23:19:29,048: DEBUG/MainProcess] using channel_id: 2
swh-loader_1                        | [2022-01-09 23:19:29,049: DEBUG/MainProcess] Channel open
swh-loader_1                        | [2022-01-09 23:19:29,165: DEBUG/MainProcess] using channel_id: 3
swh-loader_1                        | [2022-01-09 23:19:29,172: DEBUG/MainProcess] Channel open
swh-loader_1                        | Process 'ForkPoolWorker-1' pid:128 exited with 'signal 9 (SIGKILL)'
swh-loader_1                        | [2022-01-09 23:19:29,294: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 6.')
swh-loader_1                        | Traceback (most recent call last):
swh-loader_1                        |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
swh-loader_1                        |     human_status(exitcode), job._job),
swh-loader_1                        | billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 6.
swh-loader_1                        | [2022-01-09 23:19:29,353: DEBUG/ForkPoolWorker-2] Closed channel #1

We need to address these issues as a lot of large subversion repositories will likely fail to load with the current loader implementation.

Regarding slow performance, the main bottleneck is when objects to archive are flushed to the storage.
Indeed, the subversion loader does not perform any filtering on the objects to send to the storage.
After each replayed SVN revision, the whole set of contents and directories of the reconstructed filesystem
are retrieved through the use of the swh.model.from_disk.iter_directory function (see loader code).
The objects to archive filtering is then delegated to the storage filtering proxy.
While performance are acceptable for small repositories, they are not for large ones as the same large sets of objects to archive are
filtered again and again after each revision replay.

Regarding high memory consumption, it might comes from the use of the swh.model.from_disk.iter_directory function
which loads content data in memory (see function code).

The first improvements that comes to my mind to improve performance and likely the high memory consumption
would be to send to the storage only the objects that have been added or modified in a given revision after its replay.
As we use the subversion replay API in the loader implementation, we should be able to maintain a list of such objects
as we know which contents and directories are added or modified in a given revision.