Page MenuHomeSoftware Heritage

Cassandra: Deduplicate lists passed to *_add endpoints
ClosedPublic

Authored by KShivendu on Apr 5 2021, 1:45 PM.

Details

Summary

Related T2316

Previously only release_add supported deduplication.
This commit aligns other _add endpoints with it

Diff Detail

Repository
rDSTO Storage manager
Branch
deduplication-of-add-endpoints
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 20503
Build 31822: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 31821: arc lint + arc unit

Unit TestsFailed

TimeTest
28 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_add
self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7f31a089c0b8> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7f31a08994e0>
35 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_add_collision
self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7f31a0c55d30> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7f31a0c55b00>
27 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_add_different_input
self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7f31a0a345c0> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7f31a1107828>
35 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_add_duplicate
self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7f31a0c55a58> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7f31a0c55128>
26 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorageApi::test_content_add_from_lazy_content
self = <swh.storage.tests.test_api_client.TestStorageApi object at 0x7f31a03e1710> swh_storage = <RemoteStorage url=mock://example.com/> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7f31a03e1860>
View Full Test Results (33 Failed · 773 Passed · 23 Skipped)

Event Timeline

Build has FAILED

Patch application report for D5419 (id=19373)

Rebasing onto 0a270d1a7a...

First, rewinding head to replay your work on top of it...
Applying: Cassandra: Deduplicate lists passed to *_add endpoints
Changes applied before test
commit 1da153ea6881808224a5dd6cbf7b51314903c98f
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1243/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1243/console

Harbormaster returned this revision to the author for changes because remote builds failed.Apr 5 2021, 1:51 PM
Harbormaster failed remote builds in B20430: Diff 19373!

I don't think you need to convert the sets back to lists

Build has FAILED

Patch application report for D5419 (id=19373)

Rebasing onto 0a270d1a7a...

First, rewinding head to replay your work on top of it...
Applying: Cassandra: Deduplicate lists passed to *_add endpoints
Changes applied before test
commit 59383a65f4b85bb87a6d22f67635140de6982230
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1244/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1244/console

I don't think you need to convert the sets back to lists

I did that and got type errors from mypy.
Imo, it's okay to leave it as list(set(..)) because anyhow it gets transformed back into a list within the next 2-3 lines.
What do you think?

I did that and got type errors from mypy.

Then only keep list() when needed. For example, it isn't needed in content_add or snapshot_add

Imo, it's okay to leave it as list(set(..)) because anyhow it gets transformed back into a list within the next 2-3 lines.

But it creates a copy that we don't need to do

I just discovered that tests failed because the set's internally used hash function throws an error for if a dictionary is passed.
Do you know any other trick which can do the de-duplication in one line? Or should I just create a common function to loop over the list and find the unique ones?

These objects has an id attribute. You can use it for deduplication (eg. via a dict)

Updating D5419: Cassandra: Deduplicate lists passed to *_add endpoints

Build has FAILED

Patch application report for D5419 (id=19459)

Rebasing onto 39507b24d0...

First, rewinding head to replay your work on top of it...
Applying: Cassandra: Deduplicate lists passed to *_add endpoints
Changes applied before test
commit 3526f21c81e4d0f8feeb09a88485d1e9790b03c3
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1249/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1249/console

Please don't change the signature to accommodate implementation details of the function

Updating D5419: Cassandra: Deduplicate lists passed to *_add endpoints

Build has FAILED

Patch application report for D5419 (id=19463)

Rebasing onto 39507b24d0...

First, rewinding head to replay your work on top of it...
Applying: Cassandra: Deduplicate lists passed to *_add endpoints
Changes applied before test
commit 9e1a825dbe11a214fe84b0a44d2b390bca85e712
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1250/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1250/console

These objects has an id attribute. You can use it for deduplication (eg. via a dict)

Not all objects had id so I used swhid. But some of the tests are failing.

Exactly these 3 tests are failing for each of in_memory, cassandra, and api_client :

  1. test_content_add_collision
  2. test_release_add
  3. test_release_add_twice

I have ideas of fix these but I think they are most likely wrong. So can you please review these and tell me what can be done ?

Not all objects had id so I used swhid. But some of the tests are failing.

Only content does not have an id, because they should only be deduplicated using all hashes, hence the test_content_add_collision failure

I have ideas of fix these but I think they are most likely wrong. So can you please review these and tell me what can be done ?

What do you think should be done for releases?

Updating D5419: Cassandra: Fixed failing tests

Build is green

Patch application report for D5419 (id=19532)

Rebasing onto ccaac113a0...

First, rewinding head to replay your work on top of it...
Applying: Cassandra: Deduplicate lists passed to *_add endpoints
Changes applied before test
commit 22b3a91e5e180d85483080de05e861cba98070ad
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1251/ for more details.

What do you think should be done for releases?

Turns out I did some mistakes while renaming the variables.

swh/storage/cassandra/storage.py
203

Not covered. Should I do something about this?

vlorentz requested changes to this revision.Apr 8 2021, 6:25 PM
vlorentz added inline comments.
swh/storage/cassandra/storage.py
195–200

use a tuple

203

it's intentional, see test_cassandra.py

358

d.id (swhid() is formatted based on the id, so this just adds useless computation)

490

r.id

633

s.id

906

o.url

911

no need for set() now

This revision now requires changes to proceed.Apr 8 2021, 6:25 PM

Updating D5419: Cassandra: Deduplicate lists passed to *_add endpoints

swh/storage/cassandra/storage.py
195–200

you don't need to convert to strings

911

The comment is outdated, and this line actually does nothing now (it sorts a list using the list's own order...)

Build was aborted

Patch application report for D5419 (id=19547)

Rebasing onto ccaac113a0...

First, rewinding head to replay your work on top of it...
Applying: Cassandra: Deduplicate lists passed to *_add endpoints
Changes applied before test
commit 77989fe5f2800a39fa822806fe4181b1cb3362c4
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1252/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1252/console

Build is green

Patch application report for D5419 (id=19547)

Rebasing onto ccaac113a0...

First, rewinding head to replay your work on top of it...
Applying: Cassandra: Deduplicate lists passed to *_add endpoints
Changes applied before test
commit a24f61d1b3d35007f6773b5843a75ab0cfa0ad10
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1253/ for more details.

Updating D5419: Cassandra: Deduplicate lists passed to *_add endpoints

Build is green

Patch application report for D5419 (id=19563)

Rebasing onto ccaac113a0...

First, rewinding head to replay your work on top of it...
Applying: Cassandra: Deduplicate lists passed to *_add endpoints
Changes applied before test
commit 551c51d543a57c015655d7ababf373e1229bd420
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1254/ for more details.

This revision is now accepted and ready to land.Apr 12 2021, 1:10 PM

Updating D5419: Cassandra: Deduplicate lists passed to *_add endpoints

This revision was landed with ongoing or failed builds.Apr 12 2021, 1:30 PM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D5419 (id=19605)

Rebasing onto 933289e2ef...

Current branch diff-target is up to date.
Changes applied before test
commit c96942b4064866cda0a36bbbc865462d0d58b6ca
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Mon Apr 5 17:09:33 2021 +0530

    Cassandra: Deduplicate lists passed to *_add endpoints
    
    Previously only release_add supported deduplication.
    This commit aligns other _add endpoints with it

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1257/ for more details.