Differential D6118

cassandra: Make content_missing query in batches
Closed · Public

Authored by vlorentz on Aug 20 2021, 1:53 PM.

Details

Reviewers

vsellier

Group Reviewers

Reviewers

Maniphest Tasks

T3493: [cassandra] Git loader performance are very bad

Commits

rDSTO54b5abfb2626: cassandra: Make content_missing query in batches

Summary

Make content_missing query in batches, instead of calling content_find()
for each object, which needs to make two queries per object.

Given the latency of Cassandra queries, this should be a significant
speed-up (possibly up to 100 times faster, as this is the value of
PARTITION_KEY_RESTRICTION_MAX_SIZE).

This also changes the schema, because CQL does not allow doing IN
queries on compound partition keys.
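
As a rough illustration of the batching described above (not the code from this diff), the sketch below splits the ids into groups of at most PARTITION_KEY_RESTRICTION_MAX_SIZE and issues one `IN` query per group. The names `grouper`, `build_in_query`, `missing_ids` and `run_query`, the table name `content_by_sha1`, and the `%s` placeholder style are all assumptions made for the example; only the value 100 comes from the summary.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence

# Value quoted in the summary; the real constant lives in the storage code.
PARTITION_KEY_RESTRICTION_MAX_SIZE = 100


def grouper(ids: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Yield successive lists of at most `size` items taken from `ids`."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_in_query(table: str, column: str, batch_size: int) -> str:
    """Build a CQL statement with one `%s` placeholder per id in the batch."""
    placeholders = ", ".join(["%s"] * batch_size)
    return f"SELECT {column} FROM {table} WHERE {column} IN ({placeholders})"


def missing_ids(
    ids: Iterable[bytes],
    run_query: Callable[[str, Sequence[bytes]], Iterable[bytes]],
    table: str = "content_by_sha1",  # hypothetical table with a single-column partition key
    column: str = "sha1",
) -> List[bytes]:
    """Return the ids not found in `table`, using one query per batch.

    `run_query(statement, params)` is a hypothetical callable that executes
    the statement and returns the ids that were found.
    """
    missing: List[bytes] = []
    for batch in grouper(ids, PARTITION_KEY_RESTRICTION_MAX_SIZE):
        found = set(run_query(build_in_query(table, column, len(batch)), batch))
        missing.extend(i for i in batch if i not in found)
    return missing
```

With this shape, checking 250 ids takes 3 round trips (100 + 100 + 50) rather than two queries per id, which is where the expected speed-up comes from when query latency dominates.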

Test Plan

Both branches are already covered by existing tests.

Diff Detail

Repository

rDSTO Storage manager

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision. · Aug 20 2021, 1:53 PM

Herald added a reviewer: Reviewers. · Aug 20 2021, 1:53 PM

Build is green

Patch application report for D6118 (id=22137)

Rebasing onto 9f00eb9dba...

Current branch diff-target is up to date.

Changes applied before test

commit 0f89a9dc7c86eec7dbf2c75180dfd008d6881196
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1362/ for more details.

Harbormaster completed remote builds in B23095: Diff 22137. · Aug 20 2021, 2:00 PM

vlorentz requested review of this revision. · Aug 20 2021, 2:00 PM

mention schema change in commit

vlorentz edited the summary of this revision. · Aug 20 2021, 2:15 PM

vlorentz edited the test plan for this revision.

Build was aborted

Patch application report for D6118 (id=22138)

Rebasing onto 9f00eb9dba...

Current branch diff-target is up to date.

Changes applied before test

commit a3cc0dc7b104bc8b7f05988a7e0e26fae462ac7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).
    
    This also changes the schema, because CQL does not allow doing `IN`
    queries on compound partition keys.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1363/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1363/console

Harbormaster failed remote builds in B23096: Diff 22138! · Aug 20 2021, 2:36 PM

Build is green

Patch application report for D6118 (id=22138)

Rebasing onto 9f00eb9dba...

Current branch diff-target is up to date.

Changes applied before test

commit a3cc0dc7b104bc8b7f05988a7e0e26fae462ac7f
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).
    
    This also changes the schema, because CQL does not allow doing `IN`
    queries on compound partition keys.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1364/ for more details.

Harbormaster completed remote builds in B23096: Diff 22138. · Aug 20 2021, 2:46 PM

vsellier added a task: T3493: [cassandra] Git loader performance are very bad. · Aug 24 2021, 3:06 PM

Performance is now OK for the read part, with a batch size of 1000 for content, directory and revision.

This revision is now accepted and ready to land. · Aug 24 2021, 3:09 PM

rebase

vsellier mentioned this in T3493: [cassandra] Git loader performance are very bad. · Aug 24 2021, 3:18 PM

Build has FAILED

Patch application report for D6118 (id=22162)

Rebasing onto 7113198fd6...

Current branch diff-target is up to date.

Changes applied before test

commit 54b5abfb26267ad56a67ad9fa2dd9d5d075e30f0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Fri Aug 20 13:52:17 2021 +0200

    cassandra: Make content_missing query in batches
    
    Instead of calling content_find() for each object, which needs to make
    two queries for each.
    
    Given the latency of Cassandra queries, this should be a significant
    speed-up (possibly up to 100 times faster, as this is the value of
    PARTITION_KEY_RESTRICTION_MAX_SIZE).
    
    This also changes the schema, because CQL does not allow doing `IN`
    queries on compound partition keys.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1368/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1368/console

Harbormaster failed remote builds in B23119: Diff 22162! · Aug 24 2021, 3:30 PM

This revision was landed with ongoing or failed builds. · Aug 24 2021, 4:14 PM

Closed by commit rDSTO54b5abfb2626: cassandra: Make content_missing query in batches (authored by vlorentz).

This revision was automatically updated to reflect the committed changes.

vlorentz added a commit: rDSTO54b5abfb2626: cassandra: Make content_missing query in batches.