[cassandra] Profile the replayer cpu consumption
Closed, Migrated

Authored By

	vsellier
	Sep 8 2022, 10:38 AM

Description

Following a suggestion from @vlorentz, we should profile the CPU consumption of the replayer to check whether some optimization could improve the replaying speed, which is limited by our available CPU resources.

10:31 <vlorentz> could you run a replayer with profiling (python3 -m cProfile -o /tmp/blablabla.pyprof $(which swh) ...) for about a quarter of an hour, kill it (with SIGTERM, not SIGKILL), and send me the .pyprof?
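
For reference, a file written by `python3 -m cProfile -o` can be inspected with the standard-library `pstats` module. The snippet below is only a minimal sketch of one way to read such a dump; the helper name `summarize_profile`, the example path, and the choice of sorting by cumulative time are assumptions for illustration, not something used in this task. A gzipped dump has to be decompressed first.

```python
import io
import pstats


def summarize_profile(path: str, limit: int = 20, sort_key: str = "cumulative") -> str:
    """Return the `limit` most expensive entries of a cProfile dump as text.

    `path` must point to an uncompressed file produced by
    `python3 -m cProfile -o <path> ...` (hypothetical helper, for illustration).
    """
    buffer = io.StringIO()
    # pstats.Stats loads the marshalled profile; `stream` redirects its report.
    stats = pstats.Stats(path, stream=buffer)
    # Drop directory prefixes for readability, sort, then print the top entries.
    stats.strip_dirs().sort_stats(sort_key).print_stats(limit)
    return buffer.getvalue()


# Example (path is an assumption):
# print(summarize_profile("/tmp/blablabla.pyprof"))
```

Sorting by `"tottime"` instead of `"cumulative"` highlights functions that burn CPU in their own body rather than in their callees.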

Related Objects

		Status	Assigned	Task
		Migrated	gitlab-migration	T4373 [cassandra] Test the new hardware
		Migrated	gitlab-migration	T4510 [cassandra] Profile the replayer cpu consumption

Event Timeline

vsellier triaged this task as Normal priority. (Sep 8 2022, 10:38 AM)

vsellier created this task.

vsellier changed the task status from Open to Work in Progress. (Sep 8 2022, 6:29 PM)

vsellier moved this task from Backlog to in-progress on the System administration board.

Here is some profiling of a couple of replayers:

directory

swh@storage-replayer-directory-798dbd5b84-s648s:~$ time python -m cProfile -o /tmp/directory.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 1000
WARNING:cassandra.cluster:Downgrading core protocol version from 66 to 65 for 192.168.100.185:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
WARNING:cassandra.cluster:Downgrading core protocol version from 65 to 5 for 192.168.100.185:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
INFO:cassandra.policies:Using datacenter 'sesi_rocquencourt' for DCAwareRoundRobinPolicy (via host '192.168.100.185:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
Done.

real	13m36.035s
user	2m0.203s
sys	0m19.395s

directory.pyprof.gz (239 KB)

origin-visit

swh@storage-replayer-origin-visit-76f6bf9d75-znqfs:~$ time python -m cProfile -o /tmp/origin-visit.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 10000
WARNING:cassandra.cluster:Downgrading core protocol version from 66 to 65 for 192.168.100.181:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
WARNING:cassandra.cluster:Downgrading core protocol version from 65 to 5 for 192.168.100.181:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
INFO:cassandra.policies:Using datacenter 'sesi_rocquencourt' for DCAwareRoundRobinPolicy (via host '192.168.100.181:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
Done.

real	7m43.700s
user	2m42.825s
sys	0m27.594s

origin-visit.pyprof.gz (240 KB)

revision

swh@storage-replayer-revision-d7f4c666-prwd5:~$ time python -m cProfile -o /tmp/revision.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 20000

... A lot of logs like the following one...
ERROR:swh.storage.replay:Object has id a1e746dc5db73c6f2a6665367d3a563181a9691e, but it should be 2af7da7563c6d41ad0d2a35c6e1aa9e01b8aee6f: Revision(message=b'Update versions in documentation\n', author=Person(fullname=b'Mark <REDACTED> <REDACTED@REDACTED.com>', name=b'Mark REDACTED email=b'REDACTED@REDACTED.com'), ..., date=TimestampWithTimezone(timestamp=Timestamp(seconds=1376758767, microseconds=0), offset_bytes=b'+0000'), committer_date=TimestampWithTimezone(timestamp=Timestamp(seconds=1376758767, microseconds=0), offset_bytes=b'+0000'), type=RevisionType.GIT, directory=hash_to_bytes('f24d2178f07949b36876e7749f5b392610fdb31e'), synthetic=False, metadata=None, parents=(b'\xff\x8f\xe7q!\x02\x9e\x00@/\x9fr\xf9\xaa\xf1O@\xde\x07X',), id=hash_to_bytes('a1e746dc5db73c6f2a6665367d3a563181a9691e'), extra_headers=(), raw_manifest=None)
...
real	8m8.459s
user	4m13.189s
sys	0m29.120s

revision.pyprof.gz (243 KB)
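
As a rough reading of the three `time` outputs above, the share of wall-clock time spent on CPU is (user + sys) / real. The sketch below just does that arithmetic on the figures copied from the runs; the helper names `parse_duration` and `cpu_share` are made up for illustration. It ignores profiling overhead and any competition for CPU between replayers, so treat the result as an indication only.

```python
import re


def parse_duration(text: str) -> float:
    """Convert a bash `time` value such as '13m36.035s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def cpu_share(real: str, user: str, sys_time: str) -> float:
    """Fraction of wall-clock time spent on CPU: (user + sys) / real."""
    return (parse_duration(user) + parse_duration(sys_time)) / parse_duration(real)


# (real, user, sys) copied from the three runs above.
RUNS = {
    "directory": ("13m36.035s", "2m0.203s", "0m19.395s"),
    "origin-visit": ("7m43.700s", "2m42.825s", "0m27.594s"),
    "revision": ("8m8.459s", "4m13.189s", "0m29.120s"),
}

for name, (real, user, sys_time) in RUNS.items():
    print(f"{name}: {cpu_share(real, user, sys_time):.1%} of wall-clock time on CPU")
```

With these figures it prints about 17.1% for directory, 41.1% for origin-visit and 57.8% for revision.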

For the record, here is a dot export of the directory profiling file:

These are the results of testing the different algorithms for directory_add (with 20 directory replayers):

one-by-one

concurrent

batch

The batch algorithm is clearly the most effective.

I am closing this issue because, after @vlorentz's analysis, it seems there is not much left to improve.

This task has been migrated to GitLab.
