
cassandra: Use concurrent queries in *_missing() instead of naive grouping
Closed, Public

Authored by vlorentz on Jan 6 2022, 12:44 PM.

Details

Summary

Instead of grouping ids into arbitrary batches, one batch per query (which
forces the server node to coordinate with other nodes to complete the query),
this sends one query per id, directly to the right node.

This is the 'concurrent' algorithm from https://forge.softwareheritage.org/T3577#72791
which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.

This is essentially D6423, minus the option to select other algos.
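
The one-query-per-id idea described above can be sketched roughly as follows. This is not the code from this diff: it is a standard-library stand-in, and `missing_ids_concurrent`, `id_exists`, and `concurrency` are hypothetical names chosen for illustration. In the real storage backend the per-id lookup would be a single-partition Cassandra query, which a token-aware driver can route to a node holding that id; here it is just a callable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def missing_ids_concurrent(
    ids: Iterable[bytes],
    id_exists: Callable[[bytes], bool],  # hypothetical per-id lookup
    concurrency: int = 16,               # hypothetical tuning knob
) -> Iterator[bytes]:
    """Yield the ids for which ``id_exists`` returns False.

    One lookup is issued per id, with up to ``concurrency`` lookups in
    flight at once, instead of one grouped query per arbitrary batch.
    """
    id_list = list(ids)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map() returns results in input order, so each result
        # can be paired with the id that produced it.
        for id_, found in zip(id_list, pool.map(id_exists, id_list)):
            if not found:
                yield id_


if __name__ == "__main__":
    # In-memory stand-in for the database: only these ids "exist".
    present = {b"\x01", b"\x03"}
    wanted = [b"\x01", b"\x02", b"\x03", b"\x04"]
    print(list(missing_ids_concurrent(wanted, present.__contains__)))
```

Each lookup touches exactly one id, so no single request has to gather rows spread across several nodes; the trade-off is more round trips, which the concurrency is meant to hide.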

Diff Detail

Repository
rDSTO Storage manager
Branch
concurrent-missing
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 25838
Build 40384: Phabricator diff pipeline on jenkins
Build 40383: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6885 (id=24967)

Rebasing onto 259bf6fe1e...

Current branch diff-target is up to date.
Changes applied before test
commit 4a24505049d5c34c264d2b27e5feb24719b9e674
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jan 6 12:41:45 2022 +0100

    cassandra: Use concurrent queries in *_missing() instead of naive grouping
    
    Instead of grouping ids in queries in arbitrary batches (which forces
    the server node to coordinate with other nodes to complete the query),
    this sends queries with one id each, directly to the right node.
    
    This is the 'concurrent' algorithm from https://forge.softwareheritage.org/T3577#72791
    which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1517/ for more details.

This revision is now accepted and ready to land. Jan 6 2022, 4:01 PM