Page MenuHomeSoftware Heritage

cassandra: Add alternative algorithms to list missing objects
Changes PlannedPublicDraft

Authored by vlorentz on Wed, Oct 6, 3:24 PM.

Details

Reviewers
vsellier
Summary

The existing implementation is now referred to as 'grouped-naive'.
It is pretty bad, because it groups together requests that need
to be dispatched to multiple servers.

'concurrent' is a new naive strategy, that is easy to implement
and should perform nicely.

'grouped-pk-serial' and 'grouped-pk-concurrent' still group
the ids they request, but in a smarter way, so each request group
only needs to access a single server.
I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
and it *may* be faster than 'concurrent' but we need benchmarks to know.

Should address this issue: https://forge.softwareheritage.org/T3577#71740

Test Plan

Two tests will fail because of some side effect I don't understand,
but they won't affect actual deployments.

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 24254
Build 37859: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 37858: arc lint + arc unit

Unit TestsFailed

TimeTest
1,213 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-naive]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ecfed6320> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5eccd47d30>
1,141 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-pk-concurrent]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ecbc1a780> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5ecbc954e0>
1,087 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-pk-serial]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ecfedb080> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5eccd49908>
926 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_change_content_pk[grouped-naive]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ec9b0f208> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5ec403aef0>
814 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_change_content_pk[grouped-pk-concurrent]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7f5ed40642b0> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '9ba6582569500b3dd952', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7f5ed4073c18>
View Full Test Results (6 Failed · 1,682 Passed · 61 Skipped)

Event Timeline

vlorentz edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D6423 (id=23341)

Rebasing onto 113088ab06...

Current branch diff-target is up to date.
Changes applied before test
commit 73f8ccf035b87a7d6c99c360f2c0d4b0ac8a6ae0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 6 15:22:24 2021 +0200

    cassandra: Add alternative algorithms to list missing objects
    
    The existing implementation is now referred to as 'grouped-naive'.
    This is pretty bad, because it groups together requests that need
    to be dispatched to multiple server.
    
    'concurrent' is a new naive strategy, that is easy to implement
    and should perform nicely.
    
    'grouped-pk-serial' and 'grouped-pk-concurrent' still group
    the ids they request, but in a smarter way.
    I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
    and it *may* be faster than 'concurrent' but we need benchmarks to know.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1433/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1433/console

Harbormaster returned this revision to the author for changes because remote builds failed.Wed, Oct 6, 3:42 PM
Harbormaster failed remote builds in B24254: Diff 23341!