Page MenuHomeSoftware Heritage

cassandra: Make content_missing run in linear time instead of quadratic
ClosedPublic

Authored by vlorentz on Jan 7 2022, 1:04 PM.

Details

Summary

Assuming all contents passed to content_missing() have (at least) a missing algo,
the function used to iterate over the size of the arg squared
in the worst case (when all contents are found).

With this commit, it starts with bucketing them by hash, so it does not
need to iterate over *all* found contents for each content passed as arg.

Depends on D6888.

Diff Detail

Event Timeline

Build is green

Patch application report for D6889 (id=24992)

Could not rebase; Attempt merge onto 4a24505049...

Updating 4a245050..9ee7b92d
Fast-forward
 swh/storage/cassandra/cql.py     |  4 ++--
 swh/storage/cassandra/storage.py | 50 ++++++++++++++++++++++++++++++++++------
 swh/storage/in_memory.py         |  2 +-
 3 files changed, 46 insertions(+), 10 deletions(-)
Changes applied before test
commit 9ee7b92de56a85b7acf8f775573907a1a1f949e2
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 7 13:04:25 2022 +0100

    cassandra: Make content_missing run in linear time instead of quadratic
    
    Assuming all contents passed to content_missing() have (at least) a missing algo,
    the function used to iterate over the size of the arg squared
    in the worst case (when all contents are found).
    
    With this commit, it starts with bucketing them by hash, so it does not
    need to iterate over *all* found contents for each content passed as arg.

commit 55141ff2d57ca147efc2235eba2b006814c03817
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 18 13:25:20 2021 +0200

    cassandra: Rewrite content_missing to run queries concurrently.
    
    This is twice as fast, according to
    https://forge.softwareheritage.org/T3577#72791

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1520/ for more details.

This revision is now accepted and ready to land.Jan 12 2022, 11:32 AM
anlambert added a subscriber: anlambert.

It should be faster indeed, looks good to me.

Build is green

Patch application report for D6889 (id=25081)

Could not rebase; Attempt merge onto 4a24505049...

Updating 4a245050..40a57d43
Fast-forward
 swh/storage/cassandra/cql.py     |  4 +--
 swh/storage/cassandra/storage.py | 58 +++++++++++++++++++++++++++++++++++-----
 swh/storage/in_memory.py         |  2 +-
 3 files changed, 54 insertions(+), 10 deletions(-)
Changes applied before test
commit 40a57d43515528134cc76e53fd73198aa4bd0152
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Jan 7 13:04:25 2022 +0100

    cassandra: Make content_missing run in linear time instead of quadratic
    
    Assuming all contents passed to content_missing() have (at least) a missing algo,
    the function used to iterate over the size of the arg squared
    in the worst case (when all contents are found).
    
    With this commit, it starts with bucketing them by hash, so it does not
    need to iterate over *all* found contents for each content passed as arg.

commit d5f1f0ec055477461a000f7eeaece974fa1265b1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 18 13:25:20 2021 +0200

    cassandra: Rewrite content_missing to run queries concurrently.
    
    This is twice as fast, according to
    https://forge.softwareheritage.org/T3577#72791

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1523/ for more details.