Details

Reviewers

vlorentz
ardumont

Group Reviewers

Reviewers

Maniphest Tasks

T1349: Storage.content_find should return all matches, not just one.

Commits

rDSTOC02134a705a12: Changes the output of content_find method to a list in case of hash collisions…
rDSTO02134a705a12: Changes the output of content_find method to a list in case of hash collisions…

Required Signatures

L3 Software Heritage Contributor License Agreement, version 1.0

Summary

This changes the output of the content_find method to a list, so that all matches are returned in case of hash collisions.

Diff Detail

Repository

rDSTO Storage manager

Branch

t1349

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 5410
Build 7331: tox-on-jenkins	Jenkins
Build 7330: arc lint + arc unit

Event Timeline


Harbormaster completed remote builds in B4777: Diff 4065. Mar 24 2019, 6:13 PM

In test_content_find_with_present_content, you should not iterate over the results, because the test would then pass even if there are no results. Instead, you must check that there is exactly one result, and then check the content of that result.

You must also add another test that checks what happens when a content is duplicated.
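
To illustrate why looping over the results can hide a bug, here is a minimal sketch. `fake_content_find`, `loop_style_check` and `strict_check` are hypothetical names made up for this example; `fake_content_find` is a deliberately broken stand-in, not the real `Storage.content_find`.

```python
def fake_content_find(finder):
    """Hypothetical buggy stand-in: always finds nothing."""
    return []


def loop_style_check(results, expected_sha1):
    """Weak check: the loop body never runs when `results` is empty."""
    for result in results:
        assert result["sha1"] == expected_sha1
    return True


def strict_check(results, expected_sha1):
    """Stronger check: require exactly one result, then inspect it."""
    results = list(results)
    assert len(results) == 1, "expected exactly one result, got %d" % len(results)
    assert results[0]["sha1"] == expected_sha1
    return True


results = fake_content_find({"sha1": b"\x01"})

# The weak check passes even though nothing was found.
weak_passed = loop_style_check(results, b"\x01")

# The strict check catches the missing result.
try:
    strict_check(results, b"\x01")
    strict_passed = True
except AssertionError:
    strict_passed = False
```

With an empty result list the loop-based check succeeds vacuously, while the length check fails as it should.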

swh/storage/in_memory.py
294–295	You can remove that FIXME, because you fixed it :)
321–322	This comment is no longer relevant
swh/storage/storage.py
513–523	Could you rename the variables here? `c` was a bad choice (before your diff, it's not your fault)

This revision now requires changes to proceed. Mar 25 2019, 10:32 AM

Thanks, will get back to you with required changes.

fwiw, from irc discussion:

20:07 <faux__> pinkieval: I have made all the required changes but I am still confused about the test as in what should we do when the content is duplicated? Sorry to be a bit late as I was travelling so rarely had internet connectivity
20:15 <+pinkieval> faux__: The content is not duplicated in the existing tests. You must add a new test where the content is duplicated, to see how content_find behaves
20:17 <faux__> By using content_add right? I did that but apparently content_find only finds one result and not two of the same result in the list
20:31 <+pinkieval> the goal of your change is to make content_find find more than one
10:17 <+ardumont> because content_add filters on existing contents so if you inject the same content twice, you will have only 1 content in the db

In D1288#28317, @ardumont wrote:

fwiw, from irc discussion: [IRC log quoted above]

I found the same thing when I was adding duplicate content to the database using content_add. So if it automatically filters duplicate data, then content_find should return only one item, as a list, right?
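
A toy sketch of the behaviour described in the IRC exchange, assuming a storage that skips contents it already holds. `ToyStorage` and its choice of key are hypothetical and far simpler than the real swh.storage backends; only the method names `content_add` and `content_find` come from the discussion.

```python
import hashlib


class ToyStorage:
    """Hypothetical in-memory storage that deduplicates on add."""

    def __init__(self):
        self._contents = {}  # keyed by (sha1, sha256) hex digests

    def content_add(self, data_items):
        for data in data_items:
            key = (hashlib.sha1(data).hexdigest(),
                   hashlib.sha256(data).hexdigest())
            # Already-known contents are filtered out, so adding the same
            # bytes twice leaves a single entry.
            self._contents.setdefault(
                key, {"sha1": key[0], "sha256": key[1], "length": len(data)})

    def content_find(self, finder):
        """Return every stored content matching all hashes in `finder`."""
        return [c for c in self._contents.values()
                if all(c[algo] == value for algo, value in finder.items())]


storage = ToyStorage()
storage.content_add([b"hello", b"hello"])  # same content injected twice
found = storage.content_find({"sha1": hashlib.sha1(b"hello").hexdigest()})
```

In this sketch `found` holds a single entry, which suggests that a test for the multi-match case needs two distinct contents sharing one hash rather than one content added twice.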

ardumont added inline comments. Apr 10 2019, 11:01 PM

swh/storage/db.py
220–221	If there is no longer a need for limit, then remove it.
225	Please, remove the print statement ;)

I have made the requested changes:
In db.py: changed the content_find method to build the SQL query on the Python side.
In test_storage.py: added a test for duplicate content.

Harbormaster failed remote builds in B5320: Diff 4513! Apr 10 2019, 11:36 PM

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/357/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/357/console

ardumont added inline comments. Apr 11 2019, 9:47 AM

swh/storage/db.py
238	Remove the limit if there is no need for it. (`LIMIT ALL` seems like no limit to me ;)

ardumont added inline comments. Apr 11 2019, 10:01 AM

swh/storage/db.py
234	Your query building seems a bit complicated. Can you please try to adapt it along the lines of D1345#inline-8002? It's simpler to read. Thanks.

As per previous comment.

This revision now requires changes to proceed. Apr 11 2019, 10:03 AM

vlorentz requested changes to this revision. Apr 11 2019, 10:08 AM

vlorentz added inline comments.

swh/storage/tests/test_storage.py
2425–2426	You shouldn't need an `if` in a test. If you expect `content_find` to return an object, then assume it returns an object. (same comment on all the conditionals below)
2525–2530	You know the length of the expected return value of `self.storage.content_find` (2), no need for a for loop that runs only twice. (Because, if there is a bug in `content_find` that makes it return only a single element, the for loop will not run, and the test won't catch the bug)
2525–2530	only once *

Almost there. Just stuck on the query part. I do think it is similar to https://forge.softwareheritage.org/D1345#inline-8002.

Removed the if(s), for loop, LIMIT ALL. Made the query a bit more readable.

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/361/ for more details.

Harbormaster completed remote builds in B5347: Diff 4539. Apr 12 2019, 5:52 AM

vlorentz requested changes to this revision. Apr 12 2019, 12:02 PM

vlorentz added inline comments.

swh/storage/tests/test_storage.py
2425–2426	You should also test the length of the list returned by `content_find`. (same comment on the calls below)
2525–2530	You should fully test the return value of `content_find`:
    actual_result = list(self.storage.content_find(finder))
    expected_result = [
        { ... },
        { ... },
    ]
    self.assertEqual(expected_result, actual_result)

This revision now requires changes to proceed. Apr 12 2019, 12:02 PM

Added more tests for content_find.

Looking good!

Could you just add one more test, this time where only sha256 (or blake2s256) collides, and which tests:

- content_find with only sha256 in the finder
- content_find with both sha256 and blake2s256 in the finder
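
The two requested cases can be sketched on plain dictionaries. This is an illustration only: `find_contents` is a hypothetical helper, the hash values are made-up placeholders rather than real digests, and the all-hashes-must-match rule is an assumption about how the finder is applied.

```python
def find_contents(contents, finder):
    """Return all rows whose hashes match every non-None entry of `finder`."""
    return [
        row for row in contents
        if all(row[algo] == value
               for algo, value in finder.items() if value is not None)
    ]


# Two made-up rows that share a sha256 but differ in blake2s256.
contents = [
    {"sha1": "aaa", "sha256": "same", "blake2s256": "b-one"},
    {"sha1": "bbb", "sha256": "same", "blake2s256": "b-two"},
]

# Finder with only the colliding hash: both rows are expected.
only_sha256 = find_contents(contents, {"sha256": "same"})

# Finder with the colliding hash plus a distinguishing one: a single row.
both_hashes = find_contents(
    contents, {"sha256": "same", "blake2s256": "b-one"})
```

The first lookup exercises the list return type with more than one element, while the second checks that an extra hash narrows the result back down to one.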

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/364/ for more details.

Harbormaster completed remote builds in B5407: Diff 4556. Apr 12 2019, 4:43 PM

Added the tests for colliding sha256 and blake2s256 hashes

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/366/ for more details.

Harbormaster completed remote builds in B5410: Diff 4559. Apr 13 2019, 9:30 AM

vlorentz requested changes to this revision. Apr 13 2019, 10:50 AM

vlorentz added inline comments.

swh/storage/db.py
234	You can merge this `for` loop with the other one.
243–247	There is nothing wrong with returning an empty list. So always return `content` even if it's empty. It only made sense to return `None` when we returned a single item.
swh/storage/tests/test_storage.py
2566	You don't need to compare the length, assertCountEqual does it already.
2624	same
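
As a small standalone check of the point about `assertCountEqual`, the sketch below uses only the standard `unittest` module. The test class and its sample dictionaries are invented for illustration and are not taken from test_storage.py.

```python
import unittest


class AssertCountEqualDemo(unittest.TestCase):
    def test_order_is_ignored(self):
        expected = [{"sha1": "aaa"}, {"sha1": "bbb"}]
        actual = [{"sha1": "bbb"}, {"sha1": "aaa"}]
        # Same elements with the same counts, in a different order.
        self.assertCountEqual(expected, actual)

    def test_length_mismatch_fails(self):
        expected = [{"sha1": "aaa"}, {"sha1": "bbb"}]
        actual = [{"sha1": "aaa"}]
        # A missing element already makes assertCountEqual fail, so a
        # separate length assertion adds nothing.
        with self.assertRaises(AssertionError):
            self.assertCountEqual(expected, actual)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AssertCountEqualDemo)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the assertion compares element counts, a result list that is too short or too long is rejected without an explicit `len` comparison.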

This revision now requires changes to proceed. Apr 13 2019, 10:50 AM

@vlorentz is doing a great job already ;)

Merged the for loops, removed the assertEqual on length, and removed the `if` from content_find.

Sorry, pushed the wrong thing...

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/367/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/367/console

Harbormaster failed remote builds in B5412: Diff 4560! Apr 13 2019, 11:19 AM

Made the required changes

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/368/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/368/console

Harbormaster failed remote builds in B5426: Diff 4564! Apr 13 2019, 11:27 AM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/369/ for more details.

Harbormaster completed remote builds in B5426: Diff 4564. Apr 13 2019, 11:36 AM

Missed a function in in_memory

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/374/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/374/console

Harbormaster failed remote builds in B5459: Diff 4594! Apr 16 2019, 2:23 PM

anlambert added a child revision: D1420: Made changes to adapt it to new content_find return type. May 2 2019, 11:33 AM

First build failed; uploading again.

faux marked 4 inline comments as done. May 15 2019, 10:41 PM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/384/ for more details.

Harbormaster completed remote builds in B5746: Diff 4843. May 15 2019, 10:44 PM

Used the for loop only once; removed new_checksum_dict as it was not needed.

faux marked an inline comment as done. May 15 2019, 11:04 PM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/385/ for more details.

Harbormaster completed remote builds in B5747: Diff 4844. May 15 2019, 11:10 PM

vlorentz added inline comments. May 16 2019, 10:45 AM

swh/storage/db.py
222–230	nitpick: you can rewrite it like this to be more readable:
    where_parts = []
    args = []
    for algorithm in checksum_dict:
        if checksum_dict[algorithm] is not None:
            args.append(checksum_dict[algorithm])
            where_parts.append(algorithm + ' = %s')
then use `' AND '.join(where_parts)` below.
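
A runnable sketch of that suggestion, reading the `parts.append` in the comment as `args.append`. The wrapper function `build_where`, the sorted iteration, and the sample checksum values are assumptions added for this example. It also assumes the keys of `checksum_dict` are trusted column names, since they are concatenated into the SQL text, and that the driver uses `%s` placeholders (as psycopg2 does).

```python
def build_where(checksum_dict):
    """Build a parameterised WHERE clause from the non-None checksums."""
    where_parts = []
    args = []
    # Sorted iteration keeps the generated SQL deterministic (an addition
    # made for this sketch, not part of the review comment).
    for algorithm in sorted(checksum_dict):
        if checksum_dict[algorithm] is not None:
            args.append(checksum_dict[algorithm])
            where_parts.append(algorithm + ' = %s')
    return ' AND '.join(where_parts), args


clause, args = build_where({
    'sha1': b'\x01',
    'sha1_git': None,   # skipped: no value supplied for this algorithm
    'sha256': b'\x02',
})
# clause == 'sha1 = %s AND sha256 = %s'
# args == [b'\x01', b'\x02']
```

Only the checksum values travel as query parameters; the clause and the argument list stay aligned because both are appended to in the same loop iteration.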