
[cassandra] directory and content read benchmarks
Closed, Migrated

Description

Top-level issue tracking the directory_ls and content retrieval performance problems

Will be used to track the tests of D6228, D6229 and potentially others

Event Timeline

vsellier changed the task status from Open to Work in Progress. Sep 15 2021, 11:11 AM
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier renamed this task from [cassandra] directory and content read benchmarkss to [cassandra] directory and content read benchmarks. Sep 15 2021, 11:19 AM
vsellier updated the task description.

The directory_ls and, indirectly, the get_content performance was tested with this small script: P1163
A cold restart (all buffers cleared, cassandra restarted) is done between each test (P1164)
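For reference, here is a minimal sketch of what such a timing script could look like (P1163 itself is not reproduced here; the storage endpoint URL and the recursive flag are assumptions):

import sys
import time

from swh.storage import get_storage

# Hypothetical RPC endpoint; P1163 may configure the storage differently.
storage = get_storage(cls="remote", url="http://localhost:5002")

dir_id = bytes.fromhex(sys.argv[1])

start = time.monotonic()
# Walk the whole tree; every sub-directory triggers reads on the backend.
entries = list(storage.directory_ls(dir_id, recursive=True))
elapsed = time.monotonic() - start

print(f"{sys.argv[1]}: {elapsed:.4f}s")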

These are the results of the different runs:

  • run a directory_ls on a big directory
  • postgresql (saam)
vsellier@saam ~ % python3 directory_ls.py c864e846cb339a94da9fd91ae12cabcf083a8685 
c864e846cb339a94da9fd91ae12cabcf083a8685: 8.9192s
  • cassandra

10 directory_ls runs were launched to test the impact of having the data in cache:

run id | one-by-one [1] | concurrent [2]
1      | 233.6952s      | 184.1209s
2      | 102.1767s      | 89.1716s
3      | 41.1162s       | 33.0051s
4      | 25.4258s       | 21.7695s
5      | 18.2770s       | 19.0004s
6      | 16.8347s       | 14.6394s
7      | 15.4968s       | 13.4750s
8      | 14.1320s       | 12.1201s
9      | 13.3825s       | 10.6304s
10     | 13.1336s       | 11.1261s

[1] D6228 and D6229 reverted
[2] master of swh-storage https://archive.softwareheritage.org/swh:1:snp:dd585ae36b25c37fc9a4b5ab16fb4d0482a075a7;origin=https://forge.softwareheritage.org/source/swh-storage.git

The concurrent version is globally faster, but the performance is not impressive compared to the postgresql version.
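For illustration only (this is not the actual D6228/D6229 code, and the entry dictionary keys are assumed from the usual directory_ls output), the difference between the two modes boils down to whether sub-directory listings are fetched one after the other or with several queries in flight at once:

from concurrent.futures import ThreadPoolExecutor

def walk_one_by_one(storage, root_id):
    # Sequential traversal: each sub-directory is listed only after the
    # previous one has been fetched, so every directory adds a full backend
    # round-trip to the total time.
    stack = [root_id]
    while stack:
        for entry in storage.directory_ls(stack.pop()):
            yield entry
            if entry["type"] == "dir":
                stack.append(entry["target"])

def walk_concurrent(storage, root_id, max_workers=10):
    # Concurrent traversal: all sub-directories of the current level are
    # listed in parallel, so their round-trips overlap instead of adding up.
    level = [root_id]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        while level:
            next_level = []
            for entries in executor.map(
                lambda d: list(storage.directory_ls(d)), level
            ):
                for entry in entries:
                    yield entry
                    if entry["type"] == "dir":
                        next_level.append(entry["target"])
            level = next_level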

  • run 2000 directory_ls with 50 threads in parallel

IN PROGRESS

time cat directory_2000.lst | parallel --bar -j50 python3 directory_ls.py {} | tee directory_ls_[concurrent|one-by-one]-2000.output
  • one-by-one:
real    11m34.692s
user    19m6.793s
sys     2m12.372s

The total time is mostly the loading time of the slowest repository:

bc0a1450e393f47cb34a6f1f1a6ee9206c739579: 670.7532s
  • concurrent:
real    9m4.044s
user    20m12.418s
sys     2m13.689s
bc0a1450e393f47cb34a6f1f1a6ee9206c739579: 520.5311s

Two flame graphs of the previous directory_ls runs:

  • one-by-one

first run (cache cold):

last run (cache hot):

  • concurrent:

first run (cache cold):

last run (cache hot):

Test of the new D6269 patch:

  • directory_ls on c864e846cb339a94da9fd91ae12cabcf083a8685
run | duration
1   | 23.0080s
2   | 10.0942s
3   | 11.0577s
4   | 7.4534s
5   | 7.1187s
6   | 7.2937s
7   | 6.4423s
8   | 6.3384s
9   | 6.2031s
10  | 6.0875s
  • directory_ls on bc0a1450e393f47cb34a6f1f1a6ee9206c739579 (the biggest directory in the 2000 ids)
run | duration
1   | 67.8805s
2   | 38.4308s
3   | 33.2637s
4   | 30.7730s
5   | 30.1840s
6   | 27.8511s
7   | 28.2692s
8   | 27.8125s
9   | 27.4106s
10  | 27.6957s

and the associated flame graphs:

first run:

last run:

  • 2000 directory_ls with 50 threads in parallel:
real    1m40.011s
user    22m5.676s
sys     2m22.493s
bc0a1450e393f47cb34a6f1f1a6ee9206c739579: 70.3515s

Much better than previously.

Some flame graphs of the storage were captured during the ingestion with 50 workers in parallel.

It seems most of them are doing a lot of directory_add

svgpp 1-2021-10-14T06:12:58Z.svg | grep storage.py
                directory_add (swh/storage/cassandra/storage.py:517) (112
                directory_add (swh/storage/cassandra/storage.py:517) (77
                directory_add (swh/storage/cassandra/storage.py:517) (88
                directory_add (swh/storage/cassandra/storage.py:517) (123
                content_add (swh/storage/cassandra/storage.py:285) (43 samples,
                directory_add (swh/storage/cassandra/storage.py:517) (78
                directory_add (swh/storage/cassandra/storage.py:517) (104
                directory_add (swh/storage/cassandra/storage.py:517) (49
                directory_add (swh/storage/cassandra/storage.py:517) (73
                directory_add (swh/storage/cassandra/storage.py:517) (76
                directory_add (swh/storage/cassandra/storage.py:517) (45
                directory_add (swh/storage/cassandra/storage.py:517) (69
                directory_add (swh/storage/cassandra/storage.py:517) (62
                directory_add (swh/storage/cassandra/storage.py:517) (93
                directory_add (swh/storage/cassandra/storage.py:517) (42
                directory_add (swh/storage/cassandra/storage.py:517) (104
                directory_add (swh/storage/cassandra/storage.py:517) (40
                directory_add (swh/storage/cassandra/storage.py:517) (42
                directory_add (swh/storage/cassandra/storage.py:517) (54
                directory_add (swh/storage/cassandra/storage.py:517) (73
                directory_add (swh/storage/cassandra/storage.py:517) (145
                directory_add (swh/storage/cassandra/storage.py:517) (68
                directory_add (swh/storage/cassandra/storage.py:517) (51
                directory_add (swh/storage/cassandra/storage.py:517) (52
                directory_add (swh/storage/cassandra/storage.py:517) (45
                directory_add (swh/storage/cassandra/storage.py:517) (67
                content_add (swh/storage/cassandra/storage.py:285) (48 samples,
                directory_add (swh/storage/cassandra/storage.py:517) (235
                directory_add (swh/storage/cassandra/storage.py:517) (67
                directory_add (swh/storage/cassandra/storage.py:517) (53
                directory_add (swh/storage/cassandra/storage.py:517) (210
                directory_add (swh/storage/cassandra/storage.py:517) (45
                directory_add (swh/storage/cassandra/storage.py:517) (129
                directory_add (swh/storage/cassandra/storage.py:517) (46
                directory_add (swh/storage/cassandra/storage.py:517) (41
                directory_add (swh/storage/cassandra/storage.py:517) (69
                directory_add (swh/storage/cassandra/storage.py:517) (70

It sounds like it's the longest operation, but it looks strange that there are no occurrences of the functions called by the loader filters. I don't know if it's due to the sampling frequency of the py-spy captures (reduced because of the number of threads).

I will try another approach by logging the method call durations on the loader side (need to rebase and adapt P1155)
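A rough sketch of what such loader-side timing could look like (P1155 is not reproduced here; the proxy and logger names are illustrative, not the actual patch):

import functools
import logging
import time

logger = logging.getLogger("storage_timing")

class TimingStorageProxy:
    """Wrap a storage object and log the duration of every method call."""

    def __init__(self, storage):
        self._storage = storage

    def __getattr__(self, name):
        attr = getattr(self._storage, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def timed(*args, **kwargs):
            start = time.monotonic()
            try:
                return attr(*args, **kwargs)
            finally:
                logger.info("%s took %.4fs", name, time.monotonic() - start)

        return timed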

What "directory_entries_insert_algo" did you use for this?

It seems most of them are doing a lot of directory_add

Do you mean they spend a lot of time in directory_add, or that they call it a lot?

It's not surprising this endpoint is the longest to run, it has a lot of data to insert.
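Schematically (a simplified sketch with hypothetical helper names, not the code actually living at swh/storage/cassandra/storage.py:517), the write volume per directory looks roughly like this:

def directory_add_sketch(backend, directories):
    # Hypothetical backend helpers; the point is the write volume: one insert
    # per directory entry, so a directory with thousands of entries means
    # thousands of backend writes before the directory row itself is written.
    for directory in directories:
        for entry in directory.entries:
            backend.insert_directory_entry(directory.id, entry)  # hypothetical
        backend.insert_directory(directory.id)  # hypothetical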

vsellier moved this task from Backlog to done on the System administration board.

It was not easy to know whether it's a lot of calls or long-running calls, because the profiler takes regular samples and we don't have this granularity.

There is no more work scheduled on this task, so I am changing its status to resolved.