Page MenuHomeSoftware Heritage

Add in-memory storage.
ClosedPublic

Authored by vlorentz on Nov 9 2018, 12:57 PM.

Details

Test Plan

Land the two parent diffs, then rebase this one on top of them.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
swh/storage/in_memory.py
58–62

forget it I misread the condition. Nonetheless I don't understand the reason for this type check, nor I get the true utility of this method. eg. at line 598 below we can read:

visit_id = self._origin_visit_key({'origin': origin, 'date': visit_date})

I prefer a simple:

visit_id = OriginVisitKey(origin, visit_date)

Simple, readable, no useless indirection

492–500

wouldn't be better to create the snapshot dict only in the if block?

554–558

isn't

return self.snapshot_get(self._origin_visits[visit]['snapshot'])

enough here?

589–590

I find this a bit easier to read, but might not be the case for everyone:

list(itertools.chain(*self._origins[origin]['visits_dates'].values()))
593–595

why not use itertools also in here?

vlorentz marked 20 inline comments as done.
  • Move protected methods at the end of the class.
  • Don't make the tool config part of the key.
  • key tuples.
  • no temp dict
  • Create snap dict only in the if block.
  • itertools.chain
  • tool_add is a generator.
vlorentz added inline comments.
swh/storage/in_memory.py
58–62

The goal is to check that visit['origin'] is an origin's key, not the origin's data.

68–71

I don't know why I did that.

84–87

Even though it doesn't affect this particular backend, I believe it should remain here for users of the API to be backend-agnostic.

134
311–312

Do we care about the efficiency of this backend?

349–352

Oops.

413–414

Either way is fine with me

440

TIL

554–558

"Explicit is better than implicit."

But I can change it if you prefer.

973

I use *_key methods so all key computation is at the same place, and keys can be used as a black box by other methods (eg. can be turned into hashes later).

1008–1036

The old storage's docstrings said it returned an iterable, I didn't notice its implementation didn't match.

I'll work on your other comments after D642 lands; it's a bit hard to test without it.

vlorentz added inline comments.
swh/storage/in_memory.py
32

It would be nice to also add a revision that modifies the storage.py to make it clear what's the "public API" of this class and make sure it's properly documented.

I don't understand. The public API is the set of methods that don't have an underscore prefix, and they all have a docstring.

34–47

I have another diff in the waiting that does this; but it's quite a big one.

134

Note: these comments used to be on this line: key = random.sample(objs, 1)[0]

149–163

one would expect the method to return only the blob that has the given sha1 and the given sha254, whereas this implementation will return potentially 2 different blobs

No, this implementation only returns one; it does a set intersection of all the matching keys, and returns one item of the set.

(cf. your FIXME)

Actually, that was @olasd 's FIXME.

(Note: these comments used to be on content_find.)

  • Fix docstring.
  • Remove person ids.
  • Use the same HashCollision as the pgsql storage.
swh/storage/in_memory.py
68–71

I think we've done it in the past (e.g. when we were tuning the config for a given tool), so yeah I think it's useful.

134

(the "view on previous revision" button - that looks like a big rewind button - makes these comments useless)

134

In the future we could have several objects with the same sha1. The current API "contract" currently expects us to return only one of these objects.

149–163

Same as the previous remark for content_get_metadata, we could have several objects that match all the hashes we were given as input. So, formally, this method should return a list of all the matches.

311–312

This method should probably be a client-side algorithm (that can itself do something less stack-heavy than recursion) rather than a server-side recursion anyway.

vlorentz added inline comments.
swh/storage/in_memory.py
149–163

I see. Shall I open a new task to fix this one this Diff is merged (as it affects the pg Storage as well as this new one)?

311–312

Shall I open a new task for this as well?

  • Fix order of args of HashCollision to match the pg storage.
swh/storage/in_memory.py
32

My point was to ensure all those 'public API' (as defined as methods nto starting by a single _) are really "public API" and, is so, ensure that is clear soehow (hence the "documented" part of teh comment).

Now, I don't fully get this result https://forge.softwareheritage.org/P333

34–47

I don't understand why this has to be done in a different diff or set of revisions, but I won't veto the diff for this.

58–62

But who are you trying to protect here? from who? It's a private method, and it's the only one in which you check the argument type. Once again, It's not big deal, it's just it looks to me some kind of (tiny) over engineering.

68–71

I would love some more explanation then, cause I don't get the point (not in the code, an IRL discussion is fine for me there).

84–87

I disagree. The "Note: in case of DB errors [...]" is in fact not an API spec, it an implementation detail of the db backend. We do not have a abstract base class where the API is documented, but that does not imply that the current only implementation (DB) IS the API, it's the API plus an implementation. So there is no need to tell irrelevant things on other implementations of the API.

134

There should be a comment stating this, then. Otherwise, reading this line is pretty frightening. And explain why random is better than [0] there...

149–163

No, this implementation only returns one; it does a set intersection of all the matching keys, and returns one item of the set.

Sorry, I missed this intersection step.

149–163

I see. Shall I open a new task to fix this one this Diff is merged (as it affects the pg Storage as well as this new one)?

yes, it would be the way to go

311–312

We do care about universe's entropy 😄

554–558

"Explicit is better than implicit."

yes but smaller is better; easier to read, understand and maintain (if it does not obfuscate the statement)

swh/storage/in_memory.py
830

I'd prefer to use the exact same code as storage.py when possible (would make a potential refactoring/factorization easier) when possible, eg. here. unless I'm wrong you could write:

return {
  'origin': origin,
 #...
876

[0:limit] ? really? get rid of this useless and unpythonic 0 please 😈

877

I'm repeating myself, but I really find this _origin_visit_key useless and adding unnecessary indirection.

It's in fact used 3 times in this class, 2 of them in which you create a dict especially for calling the method.

Once again, it's a detail in view of the amount of code involved, but it appears to me like a (small) bad smell...

892

Why do you specify the `None` value here? (not the only place where you do that).

I know 'explicit is better', but...

973

(eg. can be turned into hashes later).

YAGNI!

Do NEVER add uneeded complexity today for an hypothetical need tomorrow!

976

All these `copy.deepcopy(dict)` makes me a bit sick. Not sure a better solution is easy to consider here, just a gratuitous complaint!

1009–1011

Why on earth building this inserted list then yield from it?
Why not just yield the elements from within the for loop?

swh/storage/tests/test_storage.py
789–794

These hunks (in test_storage.py) should not be in this Diff. Having to do so in order to make your test_in_memory pass (I guess it's the reason for these modifications) prove your implementation does not respect the "API" of the original storage class.

It's probably no big deal here, but it's a bad practice...

vlorentz added inline comments.
swh/storage/in_memory.py
32

My point was to ensure all those 'public API' (as defined as methods nto starting by a single _) are really "public API" and, is so, ensure that is clear soehow (hence the "documented" part of teh comment).

I see. That seems outside the scope of this Diff, could you open a Task?

Now, I don't fully get this result https://forge.softwareheritage.org/P333

I did not implement methods that are not covered by tests. As you pointed out, it's not obvious which methods are public, so I used the assumption "tested == public".

34–47

That other diff programmatically enforces the keys. It's better than informative comments that may not always be accurate.

58–62

From myself, that's why I used an assertion. It came useful while writing the code and I figured there was no harm in keeping it. I agree it's unclear; removed.

84–87

Agreed.

vlorentz added inline comments.
swh/storage/in_memory.py
554–558

I just re-read the code, and I really don't think return self.snapshot_get(self._origin_visits[visit]['snapshot']) is better. It just happens to work because snapshot_get behaves like this with non-bytes as argument, but this is not documented, there is no guarantee this will last forever.

vlorentz added inline comments.
swh/storage/in_memory.py
892

I only learned last week the second argument of dict.get is optional ^^

976

Wait for the next diff, I'll store everything in namedtuples with immutable values :)

1009–1011

If the generator is not consumed, yield statements block, so nothing would be inserted.

swh/storage/tests/test_storage.py
789–794

comments like # hack: ids generated show that these IDs are quirks from the SQL backend, not features.

789–794

Actually, according to Storage's docstrings, actual_result['author'] and actual_result['committer'] should not even exist.

vlorentz marked an inline comment as done.
  • Remove useless assertion.
  • Remove irrelevant doc.
  • Small fixes.
  • Fix test.
  • Remove _origin_visit_key.
  • dict.get(foo, None) -> dict.get(foo)
  • Remove _origin_metadata_key.
  • Explain why we use random.sample.
This revision is now accepted and ready to land.Nov 19 2018, 9:37 AM
This revision was automatically updated to reflect the committed changes.