
[cassandra] Profile the replayer cpu consumption
Closed, Migrated

Description

At @vlorentz's suggestion, we should profile the CPU consumption of the replayer to check whether any optimization could improve the replaying speed, which is limited by our available CPU resources.

10:31 <vlorentz> could you run a replayer with profiling (python3 -m cProfile -o /tmp/blablabla.pyprof $(which swh) ...) for about a quarter of an hour, kill it (with sigterm, not sigkill), and send me the .pyprof?
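For reference, a file written by `cProfile -o` can be inspected with the standard `pstats` module. The following is a minimal sketch, not the analysis that was actually performed on this task: the helper name `top_functions` and the stand-in workload are assumptions made for illustration.

```python
import cProfile
import io
import os
import pstats
import tempfile


def top_functions(profile_path, limit=20):
    """Return a text report of the `limit` most expensive functions,
    ordered by cumulative time (time spent in a function and its callees)."""
    out = io.StringIO()
    stats = pstats.Stats(profile_path, stream=out)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


if __name__ == "__main__":
    # Stand-in workload so the example is self-contained; on a real run the
    # path would be the .pyprof file produced by the replayer.
    profiler = cProfile.Profile()
    profiler.enable()
    sum(i * i for i in range(100_000))
    profiler.disable()

    path = os.path.join(tempfile.mkdtemp(), "example.pyprof")
    profiler.dump_stats(path)
    print(top_functions(path, limit=5))
```

Sorting by `"cumulative"` surfaces the call paths that dominate the run; sorting by `"tottime"` instead highlights the functions whose own bodies are expensive.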

Event Timeline

vsellier triaged this task as Normal priority. Sep 8 2022, 10:38 AM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Sep 8 2022, 6:29 PM
vsellier moved this task from Backlog to in-progress on the System administration board.

Here are some profiling runs of a few replayers:

  • directory
swh@storage-replayer-directory-798dbd5b84-s648s:~$ time python -m cProfile -o /tmp/directory.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 1000
WARNING:cassandra.cluster:Downgrading core protocol version from 66 to 65 for 192.168.100.185:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
WARNING:cassandra.cluster:Downgrading core protocol version from 65 to 5 for 192.168.100.185:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
INFO:cassandra.policies:Using datacenter 'sesi_rocquencourt' for DCAwareRoundRobinPolicy (via host '192.168.100.185:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
Done.

real	13m36.035s
user	2m0.203s
sys	0m19.395s

  • origin-visit
swh@storage-replayer-origin-visit-76f6bf9d75-znqfs:~$ time python -m cProfile -o /tmp/origin-visit.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 10000
WARNING:cassandra.cluster:Downgrading core protocol version from 66 to 65 for 192.168.100.181:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
WARNING:cassandra.cluster:Downgrading core protocol version from 65 to 5 for 192.168.100.181:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
INFO:cassandra.policies:Using datacenter 'sesi_rocquencourt' for DCAwareRoundRobinPolicy (via host '192.168.100.181:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
Done.

real	7m43.700s
user	2m42.825s
sys	0m27.594s

  • revision
swh@storage-replayer-revision-d7f4c666-prwd5:~$ time python -m cProfile -o /tmp/revision.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 20000

... A lot of logs like the following one...
ERROR:swh.storage.replay:Object has id a1e746dc5db73c6f2a6665367d3a563181a9691e, but it should be 2af7da7563c6d41ad0d2a35c6e1aa9e01b8aee6f: Revision(message=b'Update versions in documentation\n', author=Person(fullname=b'Mark <REDACTED> <REDACTED@REDACTED.com>', name=b'Mark REDACTED', email=b'REDACTED@REDACTED.com'), ..., date=TimestampWithTimezone(timestamp=Timestamp(seconds=1376758767, microseconds=0), offset_bytes=b'+0000'), committer_date=TimestampWithTimezone(timestamp=Timestamp(seconds=1376758767, microseconds=0), offset_bytes=b'+0000'), type=RevisionType.GIT, directory=hash_to_bytes('f24d2178f07949b36876e7749f5b392610fdb31e'), synthetic=False, metadata=None, parents=(b'\xff\x8f\xe7q!\x02\x9e\x00@/\x9fr\xf9\xaa\xf1O@\xde\x07X',), id=hash_to_bytes('a1e746dc5db73c6f2a6665367d3a563181a9691e'), extra_headers=(), raw_manifest=None)
...
real	8m8.459s
user	4m13.189s
sys	0m29.120s
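To compare the three runs, the `time` figures above can be turned into a CPU-to-wall-clock ratio and an object rate. This is a rough sketch: it assumes each run replayed exactly the number of objects given to `--stop-after-objects`, and the measured CPU time includes the overhead of cProfile itself. The helper names are made up for the example.

```python
import re


def parse_duration(text):
    """Convert a `time` duration such as '13m36.035s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def summarize(real, user, sys_, objects):
    """Return the (user + sys) / real ratio and the objects replayed per second."""
    real_s = parse_duration(real)
    cpu_s = parse_duration(user) + parse_duration(sys_)
    return {"cpu_ratio": cpu_s / real_s, "objects_per_second": objects / real_s}


# (real, user, sys, --stop-after-objects) copied from the transcripts above.
RUNS = {
    "directory": ("13m36.035s", "2m0.203s", "0m19.395s", 1000),
    "origin-visit": ("7m43.700s", "2m42.825s", "0m27.594s", 10000),
    "revision": ("8m8.459s", "4m13.189s", "0m29.120s", 20000),
}

for name, run in RUNS.items():
    summary = summarize(*run)
    print(
        f"{name}: CPU/wall {summary['cpu_ratio']:.1%}, "
        f"{summary['objects_per_second']:.2f} objects/s"
    )
```

With these figures, the CPU-to-wall ratio is roughly 17% for directory, 41% for origin-visit and 58% for revision.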

FTR, here is a dot export of the directory profiling file:
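The tool used to produce that export is not stated here. As an illustration only, a call graph in DOT format can be derived from a profile with `pstats` alone; the function `profile_to_dot` below is a hypothetical helper, far simpler than dedicated tools such as gprof2dot.

```python
import cProfile
import pstats


def _label(func):
    """Build a short, DOT-safe label from a pstats (file, line, name) key."""
    filename, lineno, name = func
    text = f"{filename.rsplit('/', 1)[-1]}:{lineno}:{name}"
    return text.replace("\\", "\\\\").replace('"', '\\"')


def profile_to_dot(stats, min_cumulative=0.0):
    """Render the caller -> callee edges of a pstats.Stats object as DOT text.

    Functions whose cumulative time is below `min_cumulative` seconds are skipped.
    """
    lines = ["digraph profile {"]
    # stats.stats maps each function to (cc, nc, tt, ct, callers).
    for callee, (_cc, _nc, _tt, ct, callers) in stats.stats.items():
        if ct < min_cumulative:
            continue
        lines.append(f'  "{_label(callee)}" [label="{_label(callee)}\\n{ct:.3f}s"];')
        for caller in callers:
            lines.append(f'  "{_label(caller)}" -> "{_label(callee)}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    def inner():
        return sum(range(10_000))

    def outer():
        return inner()

    profiler = cProfile.Profile()
    profiler.runcall(outer)
    print(profile_to_dot(pstats.Stats(profiler)))
```

The resulting text can be rendered with Graphviz, for example `dot -Tsvg`.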

These are the results of testing the different algorithms for directory_add (with 20 directory replayers):

  • one-by-one

  • concurrent

  • batch

The batch algorithm is clearly the most effective.
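The implementations behind these three labels are not shown in this task. The sketch below only illustrates the general shape of each strategy, under the assumption that rows are written through some `write(rows)` callable; that callable, the function names and the default sizes are all hypothetical and are not taken from swh-storage.

```python
from concurrent.futures import ThreadPoolExecutor


def insert_one_by_one(write, rows):
    """Issue one write per row, waiting for each to finish before the next."""
    for row in rows:
        write([row])


def insert_concurrent(write, rows, max_workers=8):
    """Issue one write per row, with several writes in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(lambda row: write([row]), rows))


def insert_batch(write, rows, batch_size=100):
    """Group rows and issue one write per group."""
    for start in range(0, len(rows), batch_size):
        write(rows[start:start + batch_size])


if __name__ == "__main__":
    rows = list(range(250))
    for strategy in (insert_one_by_one, insert_concurrent, insert_batch):
        calls = []  # each element is the list of rows passed to one write
        strategy(calls.append, rows)
        print(f"{strategy.__name__}: {len(calls)} writes for {len(rows)} rows")
```

With the DataStax Python driver, the closest building blocks would be `cassandra.concurrent.execute_concurrent_with_args` for the concurrent case and `cassandra.query.BatchStatement` for the batch case; whether the tested algorithms used them is not confirmed by this task.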

vsellier moved this task from in-progress to done on the System administration board.

I am closing this issue because, after @vlorentz's analysis, it seems there is not much left to improve.