cassandra: Use concurrent queries in *_missing() instead of naive grouping
Closed, Public

Authored by vlorentz on Jan 6 2022, 12:44 PM.

Details

Summary

Instead of grouping ids in queries in arbitrary batches (which forces
the server node to coordinate with other nodes to complete the query),
this sends queries with one id each, directly to the right node.

This is the 'concurrent' algorithm from https://forge.softwareheritage.org/T3577#72791
which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.

This is essentially D6423, minus the option to select other algorithms.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D6885 (id=24967)

Rebasing onto 259bf6fe1e...

Current branch diff-target is up to date.
Changes applied before test
commit 4a24505049d5c34c264d2b27e5feb24719b9e674
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 6 12:41:45 2022 +0100

    cassandra: Use concurrent queries in *_missing() instead of naive grouping
    
    Instead of grouping ids in queries in arbitrary batches (which forces
    the server node to coordinate with other nodes to complete the query),
    this sends queries with one id each, directly to the right node.
    
    This is the 'concurrent' algorithm from https://forge.softwareheritage.org/T3577#72791
    which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1517/ for more details.

This revision is now accepted and ready to land. Jan 6 2022, 4:01 PM