
cassandra: Rewrite content_missing to run queries concurrently.
Closed · Public

Authored by vlorentz on Jan 6 2022, 5:31 PM.

Details

Summary

This is twice as fast, according to
https://forge.softwareheritage.org/T3577#72791

This is the same commit as D6495, rebased on D6885 instead of D6423.

Diff Detail

Repository
rDSTO Storage manager
Branch
concurrent-missing
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 25855
Build 40406: Phabricator diff pipeline on jenkins (Jenkins console · Jenkins)
Build 40405: arc lint + arc unit

Unit Tests: Failed

Time      Test
1,792 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_missing
          self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7ff1c18017b8> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7ff1c18018d0>
39 ms     Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_missing_per_sha1
          self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7ff1c0e84198> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7ff1c08603c8>
41 ms     Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_missing_per_sha1_git
          self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7ff1c225fa58> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7ff1c0e84860>
1,639 ms  Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_missing_unknown_algo
          self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7ff1c1eee7b8> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7ff1c17dc550>
48 ms     Jenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_object_find_by_sha1_git
          self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7ff1c1c3bb38> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7ff1c24af9b0>
View Full Test Results (21 Failed · 1,146 Passed · 40 Skipped)

Event Timeline

Build has FAILED

Patch application report for D6888 (id=24984)

Rebasing onto 4a24505049...

Current branch diff-target is up to date.
Changes applied before test
commit 2a38ccc4d43fefcc0652da780dd76ad403468323
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 18 13:25:20 2021 +0200

    cassandra: Rewrite content_missing to run queries concurrently.
    
    This is twice as fast, according to
    https://forge.softwareheritage.org/T3577#72791

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1518/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1518/console

Harbormaster returned this revision to the author for changes because remote builds failed. (Jan 6 2022, 5:39 PM)
Harbormaster failed remote builds in B25855: Diff 24984!

Build is green

Patch application report for D6888 (id=24991)

Rebasing onto 4a24505049...

Current branch diff-target is up to date.
Changes applied before test
commit 55141ff2d57ca147efc2235eba2b006814c03817
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 18 13:25:20 2021 +0200

    cassandra: Rewrite content_missing to run queries concurrently.
    
    This is twice as fast, according to
    https://forge.softwareheritage.org/T3577#72791

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1519/ for more details.

anlambert added a subscriber: anlambert.

Looks good to me.

This revision is now accepted and ready to land. (Jan 12 2022, 11:27 AM)
douardda added a subscriber: douardda.

Fine for me (but please give a bit more insight).

swh/storage/cassandra/storage.py
414

It would be nice to have a comment explaining why this more convoluted code is better (i.e., reminding the reader of the concurrency gained by using content_find_many).

418

Not a big fan of the double for loop, but meh (an alternative implementation would probably be much worse).

swh/storage/cassandra/storage.py
418

The alternative implementation is in D6889.
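
For readers without the diff at hand, the pattern discussed in the inline comments can be sketched roughly as follows. This is a minimal illustration, not the code from this revision: the in-memory `_ROWS` table and the helpers `find_one`, `find_many_concurrently`, and `missing_sketch` are hypothetical stand-ins, and a thread pool is assumed in place of whatever concurrency mechanism the Cassandra backend actually uses. It shows the two points raised in the review: lookups are issued concurrently rather than one after another, and a nested loop then compares each requested content against the rows found for it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List

Hashes = Dict[str, bytes]

# Hypothetical in-memory stand-in for the content table.
_ROWS: List[Hashes] = [
    {"sha1": b"a", "sha1_git": b"A"},
    {"sha1": b"b", "sha1_git": b"B"},
]


def find_one(sha1: bytes) -> List[Hashes]:
    """Hypothetical single lookup: all rows sharing the given sha1."""
    return [row for row in _ROWS if row["sha1"] == sha1]


def find_many_concurrently(sha1s: List[bytes]) -> List[List[Hashes]]:
    """Run the lookups concurrently; results keep the input order."""
    if not sha1s:
        return []
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(find_one, sha1s))


def missing_sketch(contents: List[Hashes], key_hash: str = "sha1") -> Iterator[bytes]:
    """Yield the key_hash of every content with no fully matching row."""
    contents = list(contents)
    # All queries are in flight at once instead of being run sequentially.
    results = find_many_concurrently([c["sha1"] for c in contents])
    # The "double for loop": one pass over the requested contents,
    # one pass over the candidate rows found for each of them.
    for content, rows in zip(contents, results):
        for row in rows:
            if all(row.get(algo) == value for algo, value in content.items()):
                break  # a row matches every provided hash: not missing
        else:
            yield content[key_hash]
```

In this sketch, a content whose `sha1` is known but whose `sha1_git` differs from the stored row is still reported as missing, because the inner loop requires every provided hash to match.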

Build is green

Patch application report for D6888 (id=25080)

Rebasing onto 4a24505049...

Current branch diff-target is up to date.
Changes applied before test
commit d5f1f0ec055477461a000f7eeaece974fa1265b1
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Mon Oct 18 13:25:20 2021 +0200

    cassandra: Rewrite content_missing to run queries concurrently.
    
    This is twice as fast, according to
    https://forge.softwareheritage.org/T3577#72791

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1522/ for more details.