cassandra: Add alternative algorithms to list missing objects
AbandonedPublicDraft
Actions

Authored by vlorentz on Oct 6 2021, 3:24 PM.

Details

Reviewers

Maniphest Tasks

Summary

The existing implementation is now referred to as 'grouped-naive'.
It is pretty bad, because it groups together requests that need
to be dispatched to multiple servers.

'concurrent' is a new naive strategy, that is easy to implement
and should perform nicely.

'grouped-pk-serial' and 'grouped-pk-concurrent' still group
the ids they request, but in a smarter way, so each request group
only needs to access a single server.
I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
and it *may* be faster than 'concurrent' but we need benchmarks to know.

Should address this issue: https://forge.softwareheritage.org/T3577#71740

Test Plan

Two tests will fail because of some side effect I don't understand,
but they won't affect actual deployments.

Diff Detail

Repository

rDSTO Storage manager

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 24493
Build 38216: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 38215: arc lint + arc unit

Unit TestsFailed
View All

	Time	Test
	1,184 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-naive] swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd87ebb2128> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd8bbe017b8>
	1,195 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-pk-concurrent] swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd867c33588> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd867c182b0>
	1,217 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-pk-serial] swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd89e1a67f0> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd868d079b0>
	966 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_change_content_pk[grouped-naive] swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd85b0bd9b0> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd85b0bf400>
	1,032 ms	Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_change_content_pk[grouped-pk-concurrent] swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd828b98da0> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd82b5b5748>
		View Full Test Results (6 Failed · 1,689 Passed · 61 Skipped)

Event Timeline

vlorentz created this revision.Oct 6 2021, 3:24 PM

vlorentz edited the summary of this revision. (Show Details)Oct 6 2021, 3:24 PM

vlorentz edited the summary of this revision. (Show Details)

vlorentz edited the summary of this revision. (Show Details)Oct 6 2021, 3:35 PM

Build has FAILED

Patch application report for D6423 (id=23341)

Rebasing onto 113088ab06...

Current branch diff-target is up to date.

Changes applied before test

commit 73f8ccf035b87a7d6c99c360f2c0d4b0ac8a6ae0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 6 15:22:24 2021 +0200

    cassandra: Add alternative algorithms to list missing objects
    
    The existing implementation is now referred to as 'grouped-naive'.
    This is pretty bad, because it groups together requests that need
    to be dispatched to multiple server.
    
    'concurrent' is a new naive strategy, that is easy to implement
    and should perform nicely.
    
    'grouped-pk-serial' and 'grouped-pk-concurrent' still group
    the ids they request, but in a smarter way.
    I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
    and it *may* be faster than 'concurrent' but we need benchmarks to know.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1433/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1433/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 6 2021, 3:42 PM

Harbormaster failed remote builds in B24254: Diff 23341!

rebase on D6494

vlorentz added a child revision: D6495: cassandra: Rewrite content_missing to run queries concurrently..Oct 18 2021, 1:26 PM

vlorentz added a parent revision: D6494: cassandra: Fix incomplete check of content existence in object_find_by_sha1_git.Oct 18 2021, 1:26 PM

Build has FAILED

Patch application report for D6423 (id=23598)

Could not rebase; Attempt merge onto e9fd74d72d...

Updating e9fd74d7..56474d66
Fast-forward
 swh/storage/cassandra/cql.py        | 73 ++++++++++++++++++++++++++++---------
 swh/storage/cassandra/storage.py    | 32 +++++++++++++---
 swh/storage/tests/test_cassandra.py |  7 ++--
 3 files changed, 85 insertions(+), 27 deletions(-)

Changes applied before test

commit 56474d6667071785078f977ff6cf16064f326143
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 6 15:22:24 2021 +0200

    cassandra: Add alternative algorithms to list missing objects
    
    The existing implementation is now referred to as 'grouped-naive'.
    This is pretty bad, because it groups together requests that need
    to be dispatched to multiple server.
    
    'concurrent' is a new naive strategy, that is easy to implement
    and should perform nicely.
    
    'grouped-pk-serial' and 'grouped-pk-concurrent' still group
    the ids they request, but in a smarter way.
    I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
    and it *may* be faster than 'concurrent' but we need benchmarks to know.

commit a9867104771069c0f5f964a1995a2f2ddf47da93
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 18 13:00:00 2021 +0200

    cassandra: Fix incomplete check of content existence in object_find_by_sha1_git
    
    content_missing_by_sha1_git only checks the index and not the main table.
    
    This is incorrect, because contents should not be considered written
    before an entry is written to the main table, even if an entry
    exists in one of the indexes.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1456/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1456/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 18 2021, 1:45 PM

Harbormaster failed remote builds in B24493: Diff 23598!

vsellier mentioned this in T3577: Parallel loaders performances .Oct 20 2021, 2:50 PM

I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
and it *may* be faster than 'concurrent' but we need benchmarks to know.

the 'grouped-pk-concurrent' algo looks to be faster overall (closed to the 'concurrent' algo). It's not so clear for 'grouped-naive' and 'grouped-pk-serial' [1]

This diff look ok by itself (can't validate it as the status is 'Changes planned')

[1] T3577#72791

vsellier added a task: T3577: Parallel loaders performances .Oct 21 2021, 2:58 PM

vlorentz mentioned this in D6885: cassandra: Use concurrent queries in *_missing() instead of naive grouping.Jan 6 2022, 12:44 PM

vlorentz mentioned this in D6888: cassandra: Rewrite content_missing to run queries concurrently..Jan 6 2022, 5:31 PM

replaced by D6885

vlorentz abandoned this revision.Feb 4 2022, 3:11 PM