⚙ D645 Add in-memory storage.

douardda added inline comments. Nov 9 2018, 5:22 PM

swh/storage/in_memory.py
58–62	Forget it, I misread the condition. Nonetheless, I don't understand the reason for this type check, nor do I see the real utility of this method. E.g. at line 598 below we can read: `visit_id = self._origin_visit_key({'origin': origin, 'date': visit_date})`. I prefer a simple `visit_id = OriginVisitKey(origin, visit_date)`: simple, readable, no useless indirection.
492–500	Wouldn't it be better to create the snapshot dict only in the if block?
554–558	Isn't `return self.snapshot_get(self._origin_visits[visit]['snapshot'])` enough here?
589–590	I find this a bit easier to read, but that might not be the case for everyone: `list(itertools.chain(*self._origins[origin]['visits_dates'].values()))`
593–595	Why not use itertools here as well?
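
For illustration, a minimal sketch of the flattening suggested for lines 589–590. It assumes `visits_dates` maps a date to a list of visit ids; the data below is made up, and the real structure in `in_memory.py` may differ. `itertools.chain.from_iterable` is an equivalent form that avoids unpacking the argument list.

```python
import itertools

# Hypothetical stand-in for self._origins[origin]['visits_dates'].
visits_dates = {
    "2018-11-09": [1, 2],
    "2018-11-12": [3],
}

# Form suggested in the review: unpack the lists into chain().
flat_unpacked = list(itertools.chain(*visits_dates.values()))

# Equivalent form that takes the iterable of lists directly.
flat_from_iterable = list(itertools.chain.from_iterable(visits_dates.values()))
```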

Move protected methods at the end of the class.
Don't make the tool config part of the key.
key tuples.
no temp dict
Create snap dict only in the if block.
itertools.chain
tool_add is a generator.

vlorentz added a subscriber: olasd. Nov 9 2018, 6:36 PM

vlorentz added inline comments.

swh/storage/in_memory.py
58–62	The goal is to check that `visit['origin']` is an origin's key, not the origin's data.
68–71	I don't know why I did that.
84–87	Even though it doesn't affect this particular backend, I believe it should remain here for users of the API to be backend-agnostic.
134	@olasd ?
311–312	Do we care about the efficiency of this backend?
349–352	Oops.
413–414	Either way is fine with me
440	TIL
554–558	"Explicit is better than implicit." But I can change it if you prefer.
973	I use `*_key` methods so all key computation is at the same place, and keys can be used as a black box by other methods (eg. can be turned into hashes later).
1008–1036	The old storage's docstrings said it returned an iterable, I didn't notice its implementation didn't match.
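
The two positions on key computation (a `*_key` helper versus constructing the key directly) can be compared with a small sketch. `OriginVisitKey` is the name proposed in the review; its fields and the helper below are assumptions for illustration, not the code in the Diff.

```python
from collections import namedtuple

# Hypothetical key type; the field names are assumed.
OriginVisitKey = namedtuple("OriginVisitKey", ["origin", "date"])


def origin_visit_key(visit):
    """Helper-style alternative: build the key from a visit dict."""
    return OriginVisitKey(visit["origin"], visit["date"])


# Direct construction, as the reviewer prefers.
direct = OriginVisitKey(1, "2018-11-09")

# Construction through the helper, which requires a dict argument.
via_helper = origin_visit_key({"origin": 1, "date": "2018-11-09"})

# Either way the key is a plain hashable tuple, usable as a dict key.
visits = {direct: {"snapshot": None}}
```

Both forms produce the same key; the helper only pays off if the key representation is expected to change later.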

I'll work on your other comments after D642 lands; it's a bit hard to test without it.

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/58/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/58/console

Harbormaster failed remote builds in B2318: Diff 2009! Nov 9 2018, 6:42 PM

vlorentz marked 6 inline comments as done. Nov 12 2018, 11:23 AM

vlorentz added inline comments.

swh/storage/in_memory.py
32	> It would be nice to also add a revision that modifies the storage.py to make it clear what's the "public API" of this class and make sure it's properly documented.
	I don't understand. The public API is the set of methods that don't have an underscore prefix, and they all have a docstring.
34–47	I have another diff in the waiting that does this; but it's quite a big one.
134	Note: these comments used to be on this line: `key = random.sample(objs, 1)[0]`
149–163	> one would expect the method to return only the blob that has the given sha1 and the given sha256, whereas this implementation will return potentially 2 different blobs
	No, this implementation only returns one; it does a set intersection of all the matching keys, and returns one item of the set.
	> (cf. your FIXME)
	Actually, that was @olasd's FIXME. (Note: these comments used to be on `content_find`.)
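
A minimal sketch of the set-intersection lookup described for lines 149–163, under assumed data structures (one index per hash algorithm, mapping a hash value to the set of content keys that have it). The names `indexes`, `add` and `find_all` are hypothetical and do not come from the Diff.

```python
from collections import defaultdict

# Hypothetical index: algorithm -> hash value -> set of content keys.
indexes = defaultdict(lambda: defaultdict(set))


def add(key, hashes):
    """Register a content key under each of its hashes."""
    for algo, value in hashes.items():
        indexes[algo][value].add(key)


def find_all(hashes):
    """Return the keys that match every given hash."""
    matching = [indexes[algo].get(value, set()) for algo, value in hashes.items()]
    return set.intersection(*matching) if matching else set()


# Two contents sharing a sha1 (a simulated collision) but not a sha256.
add("a", {"sha1": "s1", "sha256": "x"})
add("b", {"sha1": "s1", "sha256": "y"})

by_sha1 = find_all({"sha1": "s1"})
by_both = find_all({"sha1": "s1", "sha256": "x"})
no_match = find_all({"sha1": "s1", "sha256": "z"})
```

Returning a single item, as the comment describes, would amount to picking one element of the resulting set.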

Fix docstring.
Remove person ids.

Harbormaster failed remote builds in B2321: Diff 2012! Nov 12 2018, 2:39 PM

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/63/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/63/console

Use the same HashCollision as the pgsql storage.

Harbormaster failed remote builds in B2362: Diff 2045! Nov 14 2018, 2:50 PM

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/71/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/71/console

olasd added inline comments. Nov 14 2018, 3:00 PM

swh/storage/in_memory.py
68–71	I think we've done it in the past (e.g. when we were tuning the config for a given tool), so yeah I think it's useful.
134	(the "view on previous revision" button - that looks like a big rewind button - makes these comments useless)
134	In the future we could have several objects with the same sha1. The current API "contract" currently expects us to return only one of these objects.
149–163	Same as the previous remark for content_get_metadata, we could have several objects that match all the hashes we were given as input. So, formally, this method should return a list of all the matches.
311–312	This method should probably be a client-side algorithm (that can itself do something less stack-heavy than recursion) rather than a server-side recursion anyway.
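
As a rough sketch of the less stack-heavy approach mentioned for lines 311–312, a directory tree can be walked with an explicit stack instead of recursion. The `directories` mapping and the entry layout below are assumptions for illustration, not the storage's actual schema.

```python
# Hypothetical layout: directory id -> list of (name, type, target) entries.
directories = {
    "root": [("a.txt", "file", "c1"), ("sub", "dir", "d1")],
    "d1": [("b.txt", "file", "c2")],
}


def list_files(root_id):
    """Yield (path, target) for every file below root_id, without recursion."""
    stack = [(root_id, "")]
    while stack:
        dir_id, prefix = stack.pop()
        for name, type_, target in directories.get(dir_id, []):
            path = prefix + name
            if type_ == "dir":
                # Defer the subdirectory instead of recursing into it.
                stack.append((target, path + "/"))
            else:
                yield path, target


files = sorted(list_files("root"))
```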

vlorentz marked 2 inline comments as done. Nov 14 2018, 3:11 PM

vlorentz added inline comments.

swh/storage/in_memory.py
149–163	I see. Shall I open a new task to fix this once this Diff is merged (as it affects the pg Storage as well as this new one)?
311–312	Shall I open a new task for this as well?

Rebase (D642 has landed)

Fix order of args of HashCollision to match the pg storage.

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tox/79/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tox/79/console

Harbormaster failed remote builds in B2376: Diff 2064! Nov 14 2018, 4:05 PM

Harbormaster completed remote builds in B2377: Diff 2065. Nov 14 2018, 4:15 PM

Build is green
See https://jenkins.softwareheritage.org/job/DSTO/job/tox/81/ for more details.

douardda added inline comments. Nov 15 2018, 10:38 AM

swh/storage/in_memory.py
32	My point was to ensure that all those 'public API' methods (defined as methods not starting with a single _) really are "public API" and, if so, to ensure that this is clear somehow (hence the "documented" part of the comment). Now, I don't fully get this result: https://forge.softwareheritage.org/P333
34–47	I don't understand why this has to be done in a different diff or set of revisions, but I won't veto the diff for this.
58–62	But who are you trying to protect here? From whom? It's a private method, and it's the only one in which you check the argument type. Once again, it's not a big deal, it just looks to me like some kind of (tiny) over-engineering.
68–71	I would love some more explanation then, cause I don't get the point (not in the code, an IRL discussion is fine for me there).
84–87	I disagree. The "Note: in case of DB errors [...]" is in fact not an API spec, it is an implementation detail of the db backend. We do not have an abstract base class where the API is documented, but that does not imply that the current only implementation (DB) IS the API; it's the API plus an implementation. So there is no need to state things that are irrelevant to other implementations of the API.
134	There should be a comment stating this, then. Otherwise, reading this line is pretty frightening. And explain why random is better than [0] there...
149–163	> No, this implementation only returns one; it does a set intersection of all the matching keys, and returns one item of the set.
	Sorry, I missed this intersection step.
149–163	> I see. Shall I open a new task to fix this once this Diff is merged (as it affects the pg Storage as well as this new one)?
	Yes, that would be the way to go.
311–312	We do care about universe's entropy 😄
554–558	> "Explicit is better than implicit."
	Yes, but smaller is better: easier to read, understand and maintain (if it does not obfuscate the statement).

douardda added inline comments. Nov 15 2018, 11:25 AM

swh/storage/in_memory.py
830	I'd prefer to use the exact same code as storage.py when possible (it would make a potential refactoring/factorization easier), e.g. here. Unless I'm wrong, you could write: return { 'origin': origin, #...
876	`[0:limit]` ? really? get rid of this useless and unpythonic 0 please 😈
877	I'm repeating myself, but I really find this _origin_visit_key useless and adding unnecessary indirection. It's in fact used 3 times in this class, 2 of them in which you create a dict especially for calling the method. Once again, it's a detail in view of the amount of code involved, but it appears to me like a (small) bad smell...
892	Why do you specify the `None` value here? (not the only place where you do that). I know 'explicit is better', but...
973	> (eg. can be turned into hashes later).
	YAGNI! NEVER add unneeded complexity today for a hypothetical need tomorrow!
976	All these `copy.deepcopy(dict)` calls make me a bit sick. Not sure a better solution is easy to find here, just a gratuitous complaint!
1009–1011	Why on earth build this `inserted` list and then yield from it? Why not just yield the elements from within the `for` loop?
swh/storage/tests/test_storage.py
785–790	These hunks (in test_storage.py) should not be in this Diff. Having to do so in order to make your test_in_memory pass (I guess that's the reason for these modifications) proves your implementation does not respect the "API" of the original storage class. It's probably no big deal here, but it's a bad practice...
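
The two style remarks above (lines 876 and 892) rest on Python defaults, which a short sketch with made-up data can confirm: a slice starts at 0 when no start is given, and `dict.get` returns `None` when no default is passed.

```python
visits = [10, 11, 12, 13]
limit = 2

# The leading 0 is implied by slice syntax.
explicit_start = visits[0:limit]
implicit_start = visits[:limit]

# dict.get falls back to None when the key is missing.
visit = {"origin": 1}
explicit_default = visit.get("snapshot", None)
implicit_default = visit.get("snapshot")
```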

vlorentz marked 9 inline comments as done. Nov 15 2018, 11:40 AM

vlorentz added inline comments.

swh/storage/in_memory.py
32	> My point was to ensure that all those 'public API' methods (defined as methods not starting with a single _) really are "public API" and, if so, to ensure that this is clear somehow (hence the "documented" part of the comment).
	I see. That seems outside the scope of this Diff, could you open a Task?
	> Now, I don't fully get this result https://forge.softwareheritage.org/P333
	I did not implement methods that are not covered by tests. As you pointed out, it's not obvious which methods are public, so I used the assumption "tested == public".
34–47	That other diff programmatically enforces the keys. It's better than informative comments that may not always be accurate.
58–62	From myself; that's why I used an assertion. It came in useful while writing the code and I figured there was no harm in keeping it. I agree it's unclear; removed.
84–87	Agreed.

vlorentz marked 3 inline comments as done. Nov 15 2018, 11:48 AM

vlorentz added inline comments.

swh/storage/in_memory.py
554–558	I just re-read the code, and I really don't think `return self.snapshot_get(self._origin_visits[visit]['snapshot'])` is better. It just happens to work because `snapshot_get` behaves like this with non-`bytes` as argument, but this is not documented, there is no guarantee this will last forever.

vlorentz marked 11 inline comments as done. Nov 15 2018, 12:04 PM

vlorentz added inline comments.

swh/storage/in_memory.py
892	I only learned last week the second argument of dict.get is optional ^^
976	Wait for the next diff, I'll store everything in namedtuples with immutable values :)
1009–1011	If the generator is not consumed, `yield` statements block, so nothing would be inserted.
swh/storage/tests/test_storage.py
785–790	comments like `# hack: ids generated` show that these IDs are quirks from the SQL backend, not features.
785–790	Actually, according to Storage's docstrings, `actual_result['author']` and `actual_result['committer']` should not even exist.
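
On the exchange about lines 1009–1011: the sketch below only illustrates general Python generator semantics and is not the code under review. All function names are hypothetical. A generator function's body does not run until the generator is iterated, so side effects placed next to `yield` happen one step at a time, while building the list first performs all of them on the first step.

```python
def add_lazy(store, items):
    """Insert and yield one item per iteration step."""
    for item in items:
        store.append(item)
        yield item


def add_then_yield(store, items):
    """Insert everything on the first step, then yield the results."""
    inserted = []
    for item in items:
        store.append(item)
        inserted.append(item)
    yield from inserted


def add_eager(store, items):
    """Plain function: inserts immediately and returns a list."""
    store.extend(items)
    return list(items)


lazy_store = []
lazy_gen = add_lazy(lazy_store, ["a", "b"])
before_iteration = list(lazy_store)  # generator not started yet
next(lazy_gen)
after_one_step = list(lazy_store)

batch_store = []
batch_gen = add_then_yield(batch_store, ["a", "b"])
next(batch_gen)
batch_after_one_step = list(batch_store)

eager_store = []
add_eager(eager_store, ["a", "b"])
```

Note that both generator variants insert nothing if the caller never iterates; only the plain function guarantees the insertions regardless of how the result is used.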