Page MenuHomeSoftware Heritage

Implement content_get and content_get_range for the in-mem storage.
ClosedPublic

Authored by vlorentz on Wed, Nov 21, 6:32 PM.

Details

Summary

Needed to work on T1307 for the indexers.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.Wed, Nov 21, 6:32 PM
vlorentz edited the summary of this revision. (Show Details)Wed, Nov 21, 6:33 PM
vlorentz updated this revision to Diff 2177.Wed, Nov 21, 6:36 PM
  • remove irrelevant fixme.
ardumont requested changes to this revision.Thu, Nov 22, 10:06 AM
ardumont added subscribers: zack, ardumont.

Some question, docstring improvments, formatting alignment are the requirement changes.

Other that that, this seems ready to roll.

Thanks.

swh/storage/in_memory.py
88

Why?
A comment would help.

102

Dict[bytes, bytes]?

104

sha1 (bytes): content id

105

data (bytes): content's raw data

112

remove irrelevant fixme.

You mean irrelevant for the in-memory usage (test) i guess.

Explaining why i ask.
For the initial storage, as we starting developing paginated endpoints, i'm wondering if it would not be better to align the endpoint as well. If it's possible that is (i'm not sure we enforce an order for example).

That would be out of scope for that diff ;)
The question might be worth discussing (the conclusion could also be to drop the conditional altogether ;).

126

Can you please clarify what you mean by 'enforced with multiplicity'?

128

Can't we skip those?

Pushing filtering on the endpoint consumer feels bad to me.

(and if in the main storage there is such a case, that's an out of scope issue, an issue nonetheless).

156

I never said it before.
We are growing inconsistent with the formatting.
That seems like nothing but that can end up doing some irrelevant noise in commits.

So this should be avoided.

Naturally, that's up for discussion but in the mean time, that'd be great to align to current conventions.
So. in this case, ending character aligned at the beginning of the initial keyword.

return {
    'contents': matched,
    'next': sha1,
}

As per tool, that'd be great not to bother with this and have some tool doing that for us.
@zack mentioned black 'recently' but we never (re)acted on that matter...

This revision now requires changes to proceed.Thu, Nov 22, 10:06 AM
douardda requested changes to this revision.Thu, Nov 22, 10:44 AM
douardda added a subscriber: douardda.
douardda added inline comments.
swh/storage/in_memory.py
147–164

Same as a previous comment, prefer a single return statement when possible/not too obfuscating. Here, something like:

from itertools import chain, islice
next = None,
contents = []
sha1s = islice(self._sorted_sha1s, 0, from_index)
it_contents = (((sha1, v) for v in self._content_indexes['sha1'][sha1])
            for sha1 in sha1s)

for sha1, key in islice(chain.from_iterable(it_contents), limit):
    if sha1 > end:
        break
    contents.append({
            'data': self._contents_data[key],
            **self._contents[key],
            })
else:
    next = sha1
return dict(next=next, contents=contents)
vlorentz marked 9 inline comments as done.Thu, Nov 22, 12:01 PM
vlorentz added inline comments.
swh/storage/in_memory.py
88

Why what? It just inserts the new sha1 in the list

112

nvm, I misunderstood that comment

128

That sentence was wrong, content_get_range returns no None item.

147–164

You can't do that, because it assigns the last returned value to next, instead of the first non-returned one. As a consequence, UnboundLocalError when there is no result because the for loop does not run.

So the for loop needs to be rewritten like this:

next = None
for sha1, key in chain.from_iterable(it_contents):
    if sha1 > end:
        break
    if len(contents) >= limit:
        next = sha1
        break
    contents.append({
            'data': self._contents_data[key],
            **self._contents[key],
            })
return dict(next=next, contents=contents)

Which I find slightly less readable than this:

for sha1, key in chain.from_iterable(it_contents):
    if sha1 > end:
        break
    if len(contents) >= limit:
        return dict(next=sha1, contents=contents)
    contents.append({
            'data': self._contents_data[key],
            **self._contents[key],
            })
return dict(next=None, contents=contents)

because the former uses an extra next variable to do the same thing (with a "break then immediately return" instead of just a return).

vlorentz updated this revision to Diff 2182.Thu, Nov 22, 12:01 PM
ardumont added inline comments.Thu, Nov 22, 12:46 PM
swh/storage/in_memory.py
147–164

Yes, i was wondering whether what david proposed would work.

Pick either one of the last one you proposed.
They sound both fine to me.

swh/storage/storage.py
231–232

[bytes, bytes]?

douardda added inline comments.Thu, Nov 22, 1:27 PM
swh/storage/in_memory.py
147–164

Indeed I changed the behavior in my snippet, but a small fix could be:

for sha1, key in islice(chain.from_iterable(it_contents), limit+1):
  #[...]
return dict(next=next, contents=contents[:limit])

would do the trick, but again, it's readability is debatable.

Whatever, I still don't like your return "hidden" in the middle of the loop, but I won't fight.

ardumont added inline comments.Thu, Nov 22, 1:39 PM
swh/storage/in_memory.py
88

You change something in the function content_add.
Your diff is about content_get and content_get_range.
It's not immediately apparent what this does.

Now it's clearer ;)

102

Why is the key str? What did i miss?

112

Since you touch the comment, you can now fix it to make it clearer then ;)

vlorentz marked 3 inline comments as done.Thu, Nov 22, 1:45 PM
vlorentz added inline comments.
swh/storage/in_memory.py
88

Yeah, I added self._sorted_sha1 to make searches run in logtime.

102

Keys are 'sha1' and 'data'.

swh/storage/storage.py
231–232

No, keys are 'sha1' and 'data'.

vlorentz updated this revision to Diff 2185.Thu, Nov 22, 1:47 PM
  • improve the fixme.
ardumont added inline comments.Thu, Nov 22, 2:33 PM
swh/storage/in_memory.py
102

right!

swh/storage/storage.py
231–232

yes, my bad ;)

vlorentz updated this revision to Diff 2230.Fri, Nov 23, 5:35 PM
  • Use a generator to merge the two loops.
douardda accepted this revision.Mon, Nov 26, 10:30 AM
ardumont accepted this revision.Mon, Nov 26, 11:01 AM
This revision is now accepted and ready to land.Mon, Nov 26, 11:01 AM
This revision was automatically updated to reflect the committed changes.