Page MenuHomeSoftware Heritage

Implement content_get and content_get_range for the in-mem storage.
ClosedPublic

Authored by vlorentz on Nov 21 2018, 6:32 PM.

Diff Detail

Repository
rDSTO Storage manager
Branch
inmem-content-get-range
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2525
Build 3132: tox-on-jenkinsJenkins
Build 3131: arc lint + arc unit

Event Timeline

  • remove irrelevant fixme.
ardumont added subscribers: zack, ardumont.

Some question, docstring improvments, formatting alignment are the requirement changes.

Other that that, this seems ready to roll.

Thanks.

swh/storage/in_memory.py
87

Why?
A comment would help.

101

Dict[bytes, bytes]?

103

sha1 (bytes): content id

104

data (bytes): content's raw data

111

remove irrelevant fixme.

You mean irrelevant for the in-memory usage (test) i guess.

Explaining why i ask.
For the initial storage, as we starting developing paginated endpoints, i'm wondering if it would not be better to align the endpoint as well. If it's possible that is (i'm not sure we enforce an order for example).

That would be out of scope for that diff ;)
The question might be worth discussing (the conclusion could also be to drop the conditional altogether ;).

125

Can you please clarify what you mean by 'enforced with multiplicity'?

127

Can't we skip those?

Pushing filtering on the endpoint consumer feels bad to me.

(and if in the main storage there is such a case, that's an out of scope issue, an issue nonetheless).

155

I never said it before.
We are growing inconsistent with the formatting.
That seems like nothing but that can end up doing some irrelevant noise in commits.

So this should be avoided.

Naturally, that's up for discussion but in the mean time, that'd be great to align to current conventions.
So. in this case, ending character aligned at the beginning of the initial keyword.

return {
    'contents': matched,
    'next': sha1,
}

As per tool, that'd be great not to bother with this and have some tool doing that for us.
@zack mentioned black 'recently' but we never (re)acted on that matter...

This revision now requires changes to proceed.Nov 22 2018, 10:06 AM
douardda added a subscriber: douardda.
douardda added inline comments.
swh/storage/in_memory.py
146–163

Same as a previous comment, prefer a single return statement when possible/not too obfuscating. Here, something like:

from itertools import chain, islice
next = None,
contents = []
sha1s = islice(self._sorted_sha1s, 0, from_index)
it_contents = (((sha1, v) for v in self._content_indexes['sha1'][sha1])
            for sha1 in sha1s)

for sha1, key in islice(chain.from_iterable(it_contents), limit):
    if sha1 > end:
        break
    contents.append({
            'data': self._contents_data[key],
            **self._contents[key],
            })
else:
    next = sha1
return dict(next=next, contents=contents)
vlorentz added inline comments.
swh/storage/in_memory.py
87

Why what? It just inserts the new sha1 in the list

111

nvm, I misunderstood that comment

127

That sentence was wrong, content_get_range returns no None item.

146–163

You can't do that, because it assigns the last returned value to next, instead of the first non-returned one. As a consequence, UnboundLocalError when there is no result because the for loop does not run.

So the for loop needs to be rewritten like this:

next = None
for sha1, key in chain.from_iterable(it_contents):
    if sha1 > end:
        break
    if len(contents) >= limit:
        next = sha1
        break
    contents.append({
            'data': self._contents_data[key],
            **self._contents[key],
            })
return dict(next=next, contents=contents)

Which I find slightly less readable than this:

for sha1, key in chain.from_iterable(it_contents):
    if sha1 > end:
        break
    if len(contents) >= limit:
        return dict(next=sha1, contents=contents)
    contents.append({
            'data': self._contents_data[key],
            **self._contents[key],
            })
return dict(next=None, contents=contents)

because the former uses an extra next variable to do the same thing (with a "break then immediately return" instead of just a return).

swh/storage/in_memory.py
146–163

Yes, i was wondering whether what david proposed would work.

Pick either one of the last one you proposed.
They sound both fine to me.

swh/storage/storage.py
231–232

[bytes, bytes]?

swh/storage/in_memory.py
146–163

Indeed I changed the behavior in my snippet, but a small fix could be:

for sha1, key in islice(chain.from_iterable(it_contents), limit+1):
  #[...]
return dict(next=next, contents=contents[:limit])

would do the trick, but again, it's readability is debatable.

Whatever, I still don't like your return "hidden" in the middle of the loop, but I won't fight.

swh/storage/in_memory.py
87

You change something in the function content_add.
Your diff is about content_get and content_get_range.
It's not immediately apparent what this does.

Now it's clearer ;)

101

Why is the key str? What did i miss?

111

Since you touch the comment, you can now fix it to make it clearer then ;)

vlorentz added inline comments.
swh/storage/in_memory.py
87

Yeah, I added self._sorted_sha1 to make searches run in logtime.

101

Keys are 'sha1' and 'data'.

swh/storage/storage.py
231–232

No, keys are 'sha1' and 'data'.

swh/storage/in_memory.py
101

right!

swh/storage/storage.py
231–232

yes, my bad ;)

  • Use a generator to merge the two loops.
This revision is now accepted and ready to land.Nov 26 2018, 11:01 AM
This revision was automatically updated to reflect the committed changes.