Page MenuHomeSoftware Heritage

Storage.content_find returns list instead of single value
Needs ReviewPublic

Authored by faux on Mar 24 2019, 6:10 PM.

Details

Summary

This changes the output of content_find method to a list in case of hash collisions

Diff Detail

Repository
rDSTO Storage manager
Branch
t1349
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5459
Build 7400: tox-on-jenkinsJenkins
Build 7399: arc lint + arc unit

Event Timeline

faux created this revision.Mar 24 2019, 6:10 PM
faux retitled this revision from changes the output of content_find method to a list in case of hash collisions to Storage.content_find returns list instead of single value.Mar 24 2019, 6:12 PM
faux edited the summary of this revision. (Show Details)
vlorentz requested changes to this revision.Mar 25 2019, 10:32 AM
vlorentz added a subscriber: vlorentz.

In test_content_find_with_present_content, you should not iterate over results, because it would make the test pass if there are no result. Instead, you must check that there is exactly one result, and then check the content of that result.

You must also add another test that checks what happens when a content is duplicated.

swh/storage/in_memory.py
294–295

You can remove that FIXME, because you fixed it :)

321–322

This comment is no longer relevant

swh/storage/storage.py
513–523

Could you rename the variables here? c was a bad choice (before your diff, it's not your fault)

This revision now requires changes to proceed.Mar 25 2019, 10:32 AM
faux added a comment.Wed, Mar 27, 4:53 PM

Thanks, will get back to you with required changes.

fwiw, from irc discussion:

20:07 <faux__> pinkieval: I have made all the required changes but I am still confused about the test as in what should we do when the content is duplicated? Sorry to be a bit late as I was travelling so rarely had internet connectivity
20:15 <+pinkieval> faux__: The content is not duplicated in the existing tests. You must add a new test where the content is duplicated, to see how content_find behaves
20:17 <faux__> By using content_add right? I did that but apparently content_find only finds one result and not two of the same result in the list
20:31 <+pinkieval> the goal of your change is to make content_find find more than one
10:17 <+ardumont> because content_add filters on existing contents so if you inject the same content twice, you will have only 1 content in the db
faux added a comment.Tue, Apr 2, 5:30 AM

fwiw, from irc discussion:

20:07 <faux__> pinkieval: I have made all the required changes but I am still confused about the test as in what should we do when the content is duplicated? Sorry to be a bit late as I was travelling so rarely had internet connectivity
20:15 <+pinkieval> faux__: The content is not duplicated in the existing tests. You must add a new test where the content is duplicated, to see how content_find behaves
20:17 <faux__> By using content_add right? I did that but apparently content_find only finds one result and not two of the same result in the list
20:31 <+pinkieval> the goal of your change is to make content_find find more than one
10:17 <+ardumont> because content_add filters on existing contents so if you inject the same content twice, you will have only 1 content in the db

I also found out the same thing when I was adding duplicate content to database using content_add........ so if it will automatically filter duplicate data then content_find should return only one data as a list, right?.....

ardumont added inline comments.Wed, Apr 10, 11:01 PM
swh/storage/db.py
220–221

If there is no longer a need for limit, then remove it.

226

Please, remove the print statement ;)

faux updated this revision to Diff 4513.Wed, Apr 10, 11:35 PM
faux edited the summary of this revision. (Show Details)

Have made the requested changes
In db.py : I have changed the content_find method to make the sql query on python side
In test_storage.py : Have added test for duplicate content

ardumont added inline comments.Thu, Apr 11, 9:47 AM
swh/storage/db.py
239

Remove the limit if there is no need for it.

(LIMIT ALL seems like no limit to me ;)

ardumont added inline comments.Thu, Apr 11, 10:01 AM
swh/storage/db.py
235

Your query build seems a bit complicated.
Can you please try and adapt more along the lines D1345#inline-8002?

It's simpler to read.
Thanks.

ardumont requested changes to this revision.Thu, Apr 11, 10:03 AM

As per previous comment.

This revision now requires changes to proceed.Thu, Apr 11, 10:03 AM
vlorentz requested changes to this revision.Thu, Apr 11, 10:08 AM
vlorentz added inline comments.
swh/storage/tests/test_storage.py
2425–2426

You shouldn't need an if in a test.

If you expect content_find to return an object, then assume it returns an object.

(same comment on all the conditionals below)

2526–2531

You know the length of the expected return value of self.storage.content_find (2), no need for a for loop that runs only twice.

(Because, if there is a bug in content_find that makes it return only a single element, the for loop will not run, and the test won't catch the bug)

2526–2531

only once *

faux added a comment.EditedThu, Apr 11, 3:38 PM

Almost there. Just stuck on the query part. I do think it is similar to https://forge.softwareheritage.org/D1345#inline-8002.

faux updated this revision to Diff 4539.Fri, Apr 12, 5:47 AM

Removed the if(s), for loop, LIMIT ALL. Made the query a bit more readable.

vlorentz requested changes to this revision.Fri, Apr 12, 12:02 PM
vlorentz added inline comments.
swh/storage/tests/test_storage.py
2425–2426

You should also test the length of the list returned by content_find.

(same comment on the calls below)

2526–2531

You should fully test the return value of content_find:

result = list(self.storage.content_find(finder))


expected_result = [
    {
        ...
    },
    {
        ...
    },
]
self.assertEqual(expected_result, actual_result)
This revision now requires changes to proceed.Fri, Apr 12, 12:02 PM
faux updated this revision to Diff 4556.Fri, Apr 12, 4:37 PM
faux marked 9 inline comments as done.

Added more tests for content_find.

Looking good!

Could you just add one more test, this time with only sha256 (or blake2s256) that collides, and tests:

  1. content_find with only sha256 in the finder
  2. content_find with both sha256 and blake2s256 in the finder
faux updated this revision to Diff 4559.Sat, Apr 13, 9:25 AM

Added the tests for colliding sha256 and blake2s256 hashes

vlorentz requested changes to this revision.Sat, Apr 13, 10:50 AM
vlorentz added inline comments.
swh/storage/db.py
234

You can merge this for loop with the other one.

243–247

There is nothing wrong with returning an empty list. So always return content even if it's empty.
It only made sense to return None when we returned a single item.

swh/storage/tests/test_storage.py
2566

You don't need to compare the length, assertCountEqual does it already.

2624

same

This revision now requires changes to proceed.Sat, Apr 13, 10:50 AM
ardumont resigned from this revision.Sat, Apr 13, 11:06 AM

@vlorentz is doing a great job already ;)

faux updated this revision to Diff 4560.EditedSat, Apr 13, 11:14 AM
faux marked 2 inline comments as done.

Merged the for loop removed assertEqual for length and removed if from content find

faux added a comment.Sat, Apr 13, 11:18 AM

Sorry pushed the wrong thing...

faux updated this revision to Diff 4564.Sat, Apr 13, 11:22 AM

Made the required changes

faux updated this revision to Diff 4594.Tue, Apr 16, 2:19 PM

Missed a function in in_memory