cassandra: Make content_missing query in batches
ClosedPublic

Authored by vlorentz on Aug 20 2021, 1:53 PM.

Details

Summary

Query Cassandra in batches instead of calling content_find() for each
object, which needs two queries per object.

Given the latency of Cassandra queries, this should be a significant
speed-up (possibly up to 100 times faster, as this is the value of
PARTITION_KEY_RESTRICTION_MAX_SIZE).

This also changes the schema, because CQL does not allow doing IN
queries on compound partition keys.
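
As an illustration of the batching idea only (not the code in this diff), here is a minimal sketch. It assumes a hypothetical `select_present` callable that stands in for a single `IN` query and returns the subset of a batch that already exists; `grouper`, `missing_in_batches` and `FakeBackend` are names invented for this example. Only `PARTITION_KEY_RESTRICTION_MAX_SIZE` and its value of 100 come from the summary above.

```python
from typing import Callable, Iterable, Iterator, List, Set

# Value quoted in the summary; everything else here is hypothetical.
PARTITION_KEY_RESTRICTION_MAX_SIZE = 100


def grouper(items: List[bytes], size: int) -> Iterator[List[bytes]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def missing_in_batches(
    hashes: Iterable[bytes],
    select_present: Callable[[List[bytes]], Set[bytes]],
    batch_size: int = PARTITION_KEY_RESTRICTION_MAX_SIZE,
) -> Iterator[bytes]:
    """Yield the hashes that the backend does not know about.

    `select_present` is an assumed stand-in for one `IN` query: it takes a
    batch of hashes and returns the subset that exists.
    """
    for batch in grouper(list(hashes), batch_size):
        present = select_present(batch)  # one round trip per batch
        for h in batch:
            if h not in present:
                yield h


class FakeBackend:
    """In-memory stand-in for the database that counts round trips."""

    def __init__(self, stored: Iterable[bytes]) -> None:
        self.stored = set(stored)
        self.queries = 0

    def select_present(self, batch: List[bytes]) -> Set[bytes]:
        self.queries += 1
        return self.stored.intersection(batch)
```

With this sketch, checking 250 hashes takes 3 round trips to the fake backend, where the per-object approach described above would need 500 (two per object).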

Test Plan

Both branches are already covered by existing tests.

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23095
Build 36016: Phabricator diff pipeline on jenkins
Build 36015: arc lint + arc unit

Event Timeline

Build is green

Patch application report for D6118 (id=22137)

Rebasing onto 9f00eb9dba...

Current branch diff-target is up to date.
Changes applied before test
commit 0f89a9dc7c86eec7dbf2c75180dfd008d6881196
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1362/ for more details.

mention schema change in commit

vlorentz edited the test plan for this revision.

Build was aborted

Patch application report for D6118 (id=22138)

Rebasing onto 9f00eb9dba...

Current branch diff-target is up to date.
Changes applied before test
commit a3cc0dc7b104bc8b7f05988a7e0e26fae462ac7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).
    
    This also changes the schema, because CQL does not allow doing `IN`
    queries on compound partition keys.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1363/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1363/console

Build is green

Patch application report for D6118 (id=22138)

Rebasing onto 9f00eb9dba...

Current branch diff-target is up to date.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1364/ for more details.

vsellier added a subscriber: vsellier.

Performance is now OK for the read part with a batch size of 1000 for content, directory, and revision.

This revision is now accepted and ready to land. Aug 24 2021, 3:09 PM
vlorentz edited the summary of this revision.

rebase

Build has FAILED

Patch application report for D6118 (id=22162)

Rebasing onto 7113198fd6...

Current branch diff-target is up to date.
Changes applied before test
commit 54b5abfb26267ad56a67ad9fa2dd9d5d075e30f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).
    
    This also changes the schema, because CQL does not allow doing `IN`
    queries on compound partition keys.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1368/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1368/console

This revision was landed with ongoing or failed builds. Aug 24 2021, 4:14 PM
This revision was automatically updated to reflect the committed changes.