Page MenuHomeSoftware Heritage

cassandra: Add alternative algorithms to list missing objects
Changes PlannedPublicDraft

Authored by vlorentz on Oct 6 2021, 3:24 PM.

Details

Summary

The existing implementation is now referred to as 'grouped-naive'.
It is pretty bad, because it groups together requests that need
to be dispatched to multiple servers.

'concurrent' is a new naive strategy, that is easy to implement
and should perform nicely.

'grouped-pk-serial' and 'grouped-pk-concurrent' still group
the ids they request, but in a smarter way, so each request group
only needs to access a single server.
I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
and it *may* be faster than 'concurrent' but we need benchmarks to know.

Should address this issue: https://forge.softwareheritage.org/T3577#71740

Test Plan

Two tests will fail because of some side effect I don't understand,
but they won't affect actual deployments.

Diff Detail

Unit TestsFailed

TimeTest
1,184 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-naive]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd87ebb2128> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd8bbe017b8>
1,195 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-pk-concurrent]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd867c33588> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd867c182b0>
1,217 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_add_content_column[grouped-pk-serial]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd89e1a67f0> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd868d079b0>
966 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_change_content_pk[grouped-naive]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd85b0bd9b0> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd85b0bf400>
1,032 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra_migration::test_change_content_pk[grouped-pk-concurrent]
swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fd828b98da0> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '33f04be9b029a6247c1a', ...} mocker = <pytest_mock.plugin.MockerFixture object at 0x7fd82b5b5748>
View Full Test Results (6 Failed · 1,689 Passed · 61 Skipped)

Event Timeline

vlorentz edited the summary of this revision. (Show Details)

Build has FAILED

Patch application report for D6423 (id=23341)

Rebasing onto 113088ab06...

Current branch diff-target is up to date.
Changes applied before test
commit 73f8ccf035b87a7d6c99c360f2c0d4b0ac8a6ae0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 6 15:22:24 2021 +0200

    cassandra: Add alternative algorithms to list missing objects
    
    The existing implementation is now referred to as 'grouped-naive'.
    This is pretty bad, because it groups together requests that need
    to be dispatched to multiple server.
    
    'concurrent' is a new naive strategy, that is easy to implement
    and should perform nicely.
    
    'grouped-pk-serial' and 'grouped-pk-concurrent' still group
    the ids they request, but in a smarter way.
    I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
    and it *may* be faster than 'concurrent' but we need benchmarks to know.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1433/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1433/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 6 2021, 3:42 PM
Harbormaster failed remote builds in B24254: Diff 23341!

Build has FAILED

Patch application report for D6423 (id=23598)

Could not rebase; Attempt merge onto e9fd74d72d...

Updating e9fd74d7..56474d66
Fast-forward
 swh/storage/cassandra/cql.py        | 73 ++++++++++++++++++++++++++++---------
 swh/storage/cassandra/storage.py    | 32 +++++++++++++---
 swh/storage/tests/test_cassandra.py |  7 ++--
 3 files changed, 85 insertions(+), 27 deletions(-)
Changes applied before test
commit 56474d6667071785078f977ff6cf16064f326143
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Wed Oct 6 15:22:24 2021 +0200

    cassandra: Add alternative algorithms to list missing objects
    
    The existing implementation is now referred to as 'grouped-naive'.
    This is pretty bad, because it groups together requests that need
    to be dispatched to multiple server.
    
    'concurrent' is a new naive strategy, that is easy to implement
    and should perform nicely.
    
    'grouped-pk-serial' and 'grouped-pk-concurrent' still group
    the ids they request, but in a smarter way.
    I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
    and it *may* be faster than 'concurrent' but we need benchmarks to know.

commit a9867104771069c0f5f964a1995a2f2ddf47da93
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 18 13:00:00 2021 +0200

    cassandra: Fix incomplete check of content existence in object_find_by_sha1_git
    
    content_missing_by_sha1_git only checks the index and not the main table.
    
    This is incorrect, because contents should not be considered written
    before an entry is written to the main table, even if an entry
    exists in one of the indexes.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1456/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1456/console

Harbormaster returned this revision to the author for changes because remote builds failed.Oct 18 2021, 1:45 PM
Harbormaster failed remote builds in B24493: Diff 23598!

I expect 'grouped-pk-concurrent' to be faster than 'grouped-pk-serial',
and it *may* be faster than 'concurrent' but we need benchmarks to know.

the 'grouped-pk-concurrent' algo looks to be faster overall (closed to the 'concurrent' algo). It's not so clear for 'grouped-naive' and 'grouped-pk-serial' [1]

This diff look ok by itself (can't validate it as the status is 'Changes planned')

[1] T3577#72791