Page MenuHomeSoftware Heritage

cassandra: Add alternative algorithms to list missing objects
AbandonedPublicDraft

Authored by vlorentz on Oct 6 2021, 3:24 PM.

Details

Summary

The existing implementation is now referred to as 'grouped-naive'.
It is pretty bad, because it groups together requests that need
to be dispatched to multiple servers.

'concurrent' is a new naive strategy, that is easy to implement
and should perform nicely.

'grouped-pk-serial' and 'grouped-pk-concurrent' still group
the ids they request, but in a smarter way, so each request group
only needs to access a single server.
I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
and it *may* be faster than 'concurrent' but we need benchmarks to know.

Should address this issue: https://forge.softwareheritage.org/T3577#71740

Test Plan

Two tests will fail because of some side effect I don't understand,
but they won't affect actual deployments.

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24254
Build 37859: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 37858: arc lint + arc unit

Unit TestsFailed

TimeTest
1,213 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-naive]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ecfed6320> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5eccd47d30>
1,141 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-pk-concurrent]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ecbc1a780> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5ecbc954e0>
1,087 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-pk-serial]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ecfedb080> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5eccd49908>
926 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_change_content_pk[grouped-naive]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ec9b0f208> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5ec403aef0>
814 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_change_content_pk[grouped-pk-concurrent]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ed40642b0> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5ed4073c18>
View Full Test Results (6 Failed · 1,682 Passed · 61 Skipped)

Event Timeline

vlorentz edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D6423 (id=23341)

Rebasing onto 113088ab06...

Current branch diff-target is up to date.
Changes applied before test
commit 73f8ccf035b87a7d6c99c360f2c0d4b0ac8a6ae0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 6 15:22:24 2021 +0200

    cassandra: Add alternative algorithms to list missing objects
    
    The existing implementation is now referred to as 'grouped-naive'.
    This is pretty bad, because it groups together requests that need
    to be dispatched to multiple server.
    
    'concurrent' is a new naive strategy, that is easy to implement
    and should perform nicely.
    
    'grouped-pk-serial' and 'grouped-pk-concurrent' still group
    the ids they request, but in a smarter way.
    I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
    and it *may* be faster than 'concurrent' but we need benchmarks to know.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1433/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1433/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 6 2021, 3:42 PM
Harbormaster failed remote builds in B24254: Diff 23341!

Build has FAILED

Patch application report for D6423 (id=23598)

Could not rebase; Attempt merge onto e9fd74d72d...

Updating e9fd74d7..56474d66
Fast-forward
 swh/storage/cassandra/cql.py        | 73 ++++++++++++++++++++++++++++---------
 swh/storage/cassandra/storage.py    | 32 +++++++++++++---
 swh/storage/tests/test_cassandra.py |  7 ++--
 3 files changed, 85 insertions(+), 27 deletions(-)
Changes applied before test
commit 56474d6667071785078f977ff6cf16064f326143
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 6 15:22:24 2021 +0200

    cassandra: Add alternative algorithms to list missing objects
    
    The existing implementation is now referred to as 'grouped-naive'.
    This is pretty bad, because it groups together requests that need
    to be dispatched to multiple server.
    
    'concurrent' is a new naive strategy, that is easy to implement
    and should perform nicely.
    
    'grouped-pk-serial' and 'grouped-pk-concurrent' still group
    the ids they request, but in a smarter way.
    I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
    and it *may* be faster than 'concurrent' but we need benchmarks to know.

commit a9867104771069c0f5f964a1995a2f2ddf47da93
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 18 13:00:00 2021 +0200

    cassandra: Fix incomplete check of content existence in object_find_by_sha1_git
    
    content_missing_by_sha1_git only checks the index and not the main table.
    
    This is incorrect, because contents should not be considered written
    before an entry is written to the main table, even if an entry
    exists in one of the indexes.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1456/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1456/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 18 2021, 1:45 PM
Harbormaster failed remote builds in B24493: Diff 23598!

I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
and it *may* be faster than 'concurrent' but we need benchmarks to know.

the 'grouped-pk-concurrent' algo looks to be faster overall (closed to the 'concurrent' algo). It's not so clear for 'grouped-naive' and 'grouped-pk-serial' [1]

This diff look ok by itself (can't validate it as the status is 'Changes planned')

[1] T3577#72791