Page MenuHomeSoftware Heritage

Storage.content_find returns list instead of single value
ClosedPublic

Authored by faux on Mar 24 2019, 6:10 PM.

Diff Detail

Repository
rDSTO Storage manager
Branch
t1349
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5412
Build 7335: tox-on-jenkinsJenkins
Build 7334: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
vlorentz added a subscriber: vlorentz.

In test_content_find_with_present_content, you should not iterate over results, because it would make the test pass if there are no result. Instead, you must check that there is exactly one result, and then check the content of that result.

You must also add another test that checks what happens when a content is duplicated.

swh/storage/in_memory.py
294–295

You can remove that FIXME, because you fixed it :)

321–322

This comment is no longer relevant

swh/storage/storage.py
513–523

Could you rename the variables here? c was a bad choice (before your diff, it's not your fault)

This revision now requires changes to proceed.Mar 25 2019, 10:32 AM

Thanks, will get back to you with required changes.

fwiw, from irc discussion:

20:07 <faux__> pinkieval: I have made all the required changes but I am still confused about the test as in what should we do when the content is duplicated? Sorry to be a bit late as I was travelling so rarely had internet connectivity
20:15 <+pinkieval> faux__: The content is not duplicated in the existing tests. You must add a new test where the content is duplicated, to see how content_find behaves
20:17 <faux__> By using content_add right? I did that but apparently content_find only finds one result and not two of the same result in the list
20:31 <+pinkieval> the goal of your change is to make content_find find more than one
10:17 <+ardumont> because content_add filters on existing contents so if you inject the same content twice, you will have only 1 content in the db

fwiw, from irc discussion:

20:07 <faux__> pinkieval: I have made all the required changes but I am still confused about the test as in what should we do when the content is duplicated? Sorry to be a bit late as I was travelling so rarely had internet connectivity
20:15 <+pinkieval> faux__: The content is not duplicated in the existing tests. You must add a new test where the content is duplicated, to see how content_find behaves
20:17 <faux__> By using content_add right? I did that but apparently content_find only finds one result and not two of the same result in the list
20:31 <+pinkieval> the goal of your change is to make content_find find more than one
10:17 <+ardumont> because content_add filters on existing contents so if you inject the same content twice, you will have only 1 content in the db

I also found out the same thing when I was adding duplicate content to database using content_add........ so if it will automatically filter duplicate data then content_find should return only one data as a list, right?.....

swh/storage/db.py
220–221

If there is no longer a need for limit, then remove it.

225

Please, remove the print statement ;)

faux edited the summary of this revision. (Show Details)

Have made the requested changes
In db.py : I have changed the content_find method to make the sql query on python side
In test_storage.py : Have added test for duplicate content

swh/storage/db.py
238

Remove the limit if there is no need for it.

(LIMIT ALL seems like no limit to me ;)

swh/storage/db.py
234

Your query build seems a bit complicated.
Can you please try and adapt more along the lines D1345#inline-8002?

It's simpler to read.
Thanks.

As per previous comment.

This revision now requires changes to proceed.Apr 11 2019, 10:03 AM
vlorentz added inline comments.
swh/storage/tests/test_storage.py
2425–2426

You shouldn't need an if in a test.

If you expect content_find to return an object, then assume it returns an object.

(same comment on all the conditionals below)

2525–2530

You know the length of the expected return value of self.storage.content_find (2), no need for a for loop that runs only twice.

(Because, if there is a bug in content_find that makes it return only a single element, the for loop will not run, and the test won't catch the bug)

2525–2530

only once *

Almost there. Just stuck on the query part. I do think it is similar to https://forge.softwareheritage.org/D1345#inline-8002.

Removed the if(s), for loop, LIMIT ALL. Made the query a bit more readable.

vlorentz added inline comments.
swh/storage/tests/test_storage.py
2425–2426

You should also test the length of the list returned by content_find.

(same comment on the calls below)

2525–2530

You should fully test the return value of content_find:

result = list(self.storage.content_find(finder))


expected_result = [
    {
        ...
    },
    {
        ...
    },
]
self.assertEqual(expected_result, actual_result)
This revision now requires changes to proceed.Apr 12 2019, 12:02 PM
faux marked 9 inline comments as done.

Added more tests for content_find.

Looking good!

Could you just add one more test, this time with only sha256 (or blake2s256) that collides, and tests:

  1. content_find with only sha256 in the finder
  2. content_find with both sha256 and blake2s256 in the finder

Added the tests for colliding sha256 and blake2s256 hashes

vlorentz added inline comments.
swh/storage/db.py
233

You can merge this for loop with the other one.

242–246

There is nothing wrong with returning an empty list. So always return content even if it's empty.
It only made sense to return None when we returned a single item.

swh/storage/tests/test_storage.py
2565

You don't need to compare the length, assertCountEqual does it already.

2623

same

This revision now requires changes to proceed.Apr 13 2019, 10:50 AM

@vlorentz is doing a great job already ;)

faux marked 2 inline comments as done.

Merged the for loop removed assertEqual for length and removed if from content find

Sorry pushed the wrong thing...

Made the required changes

Missed a function in in_memory

First build failed uploading again

Used the for loop once removed new_checksum_dict as it was not needed.

swh/storage/db.py
222–230

nitpick: you can rewrite it like this to be more readable:

where_parts = []
args = []
for algorithm in checksum_dict:
    if checksum_dict[algorithm] is not None:
        parts.append(checksum_dict[algorithm])
        where_parts.append(algorithm + ' = %s')

then use ' AND '.join(where_parts) below.

This revision is now accepted and ready to land.May 16 2019, 1:43 PM
This revision was automatically updated to reflect the committed changes.