Page MenuHomeSoftware Heritage

Improve code quality and doc in BufferedProxyStorage
ClosedPublic

Authored by tenma on Sep 23 2020, 1:08 PM.

Details

Summary
  • better names related to the object buffers
  • extracted parameter dicts from the constructor
  • used more generic typing in function parameters and more specific in

other contexts in order to apply the principle of robustness

Close T2287

Diff Detail

Event Timeline

Build is green

Patch application report for D4017 (id=14169)

Rebasing onto 30cdb78f17...

Current branch diff-target is up to date.
Changes applied before test
commit aa7ca077a4c516f3fa30e45b0cf880a519232743
Author: tenma <tenma+swh@mailbox.org>
Date:   Tue Sep 22 17:57:24 2020 +0200

    Improve code quality and doc in BufferedProxyStorage
    
    - better names related to the object buffers
    - extracted parameter dicts from the constructor
    - used more generic typing in function parameters and more specific in
    other contexts in order to apply the principle of robustness

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/963/ for more details.

ardumont added inline comments.
swh/storage/buffer.py
63
63–64

why not != ?

swh/storage/buffer.py
75

Lifting some doubt i have now (or again), i gather content and skipped_content (included in OBJECT_TYPES) are not concerned by that conditional anyway (asking because their id is not under the "id" name).

I recall that since content_add and skipped_content_add are namely overridden here, they don't pass into the __getattr__ call.

(I noticed that it is the current implementation anyways but still, not really apparent ;)

swh/storage/buffer.py
63

That's common py>=3.5 dict merging, but I hesitated to use this which is shallow merge vs. core/config merge_configs which is a deep merge. For now it is OK, but it would be great that this function was moved to a core utility module with a generic name like deepmerge_dicts so that we could really use it here.

63–64

because it IS a case for identity equality rather than structural one : we test on default value. This check is not needed for correctness but for both clarity and performance.

lgtm

swh/storage/buffer.py
58–60

what does that mean?

63

I think this is enough for our use case though.

i think the todo before the constructor can go away ;)

This revision is now accepted and ready to land.Sep 23 2020, 3:56 PM
swh/storage/buffer.py
75

Yeah I also wondered and checked that getattr gets called after lookup fail, whereas getattribute gets called before

swh/storage/buffer.py
58–60

got it, renaming the parameters, min_batch_size to buffer_thresholds...

swh/storage/buffer.py
21

T`H`RESHOLDS

Fix typo and remove TODO comment

Build is green

Patch application report for D4017 (id=14187)

Rebasing onto 924621fff8...

First, rewinding head to replay your work on top of it...
Applying: Improve code quality and doc in BufferedProxyStorage
Changes applied before test
commit 9d99ba09470f92205bdd3fc8cb13ea7103fdf6dc
Author: tenma <tenma+swh@mailbox.org>
Date:   Tue Sep 22 17:57:24 2020 +0200

    Improve code quality and doc in BufferedProxyStorage
    
    - better names related to the object buffers
    - extracted parameter dicts from the constructor
    - used more generic typing in function parameters and more specific in
    other contexts in order to apply the principle of robustness

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/965/ for more details.

Build is green

Patch application report for D4017 (id=14202)

Rebasing onto e37f63930b...

First, rewinding head to replay your work on top of it...
Fast-forwarded diff-target to base-revision-968-D4017.
Changes applied before test

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/968/ for more details.

olasd requested changes to this revision.Sep 24 2020, 9:36 AM
olasd added a subscriber: olasd.

Thanks for these improvements, which look good in general! I only have a nit to pick on the min_batch_size default argument.

We usually avoid using mutable structures (e.g. a dict) as default function arguments (because that's a footgun: if you mutate the argument for some reason, you can end up mutating the default value). The common pattern is to make the default value None, and to use (or copy) the default value there.

Identity checking a mutable default argument is a very surprising construct (at least I don't think we're using it anywhere else), so I'd rather we avoid it altogether. By making the default value None, you end up killing two birds with one stone.

This revision now requires changes to proceed.Sep 24 2020, 9:36 AM

I view it that way:

  • identity checking a default value is always good, even if mutable, precisely because static values like these can never change identity (identity/value dichotomy). It would be wrong to value equality check it through if it is mutable of course;
  • this value can only be modified inside the module (modulo import time hacking), so the question are:
      • do we want it immutable (we don't plan to have code that mutate it)?
      • trust the module code for well behaving? it is a real policy question, I think your answer is no, but then why manual replacing None if we can do without it?
    • this pattern is equivalent to the None one, except we statically (as opposed to manually replace None by a default) never allow any invalid type like NoneType. I try to leverage static typing to the fullest, and avoid the potential "billion dollar mistake".

After this rationale, which could be a valid policy to introduce in the project, now the proposition is:

Do I make it immutable with ImmutableDict?

vlorentz added inline comments.
swh/storage/buffer.py
63

Just use min_batch_size: Optional[Dict], and test if it's None instead of doing identity comparison on dicts.

As for merging dicts: https://forge.softwareheritage.org/source/swh-core/browse/master/swh/core/config.py$173

Sorry, I missed your two comments before writing mine.

In D4017#99650, @tenma wrote:

I view it that way:

  • identity checking a default value is always good, even if mutable, precisely because static values like these can never change identity (identity/value dichotomy). It would be wrong to value equality check it through if it is mutable of course;

It's just a footgun, and is not an idiomatic pattern in python

  • this value can only be modified inside the module (modulo import time hacking), so the question are:
    • do we want it immutable (we don't plan to have code that mutate it)?

possibly yes, but we don't generally do it, again because it's not idiomatic in Python. We only use ImmutableDict in the model because that one is really important.

  • trust the module code for well behaving? it is a real policy question, I think your answer is no, but then why manual replacing None if we can do without it?

Somewhat. But if the constant has a name in allcaps, I think it's good enough to make sure we don't accidentally mutate it.

  • this pattern is equivalent to the None one, except we statically (as opposed to manually replace None by a default) never allow any invalid type like NoneType.

Again, is None is idiomatic in Python, is <other dict> isn't. And NoneType is not an "invalid type".

I try to leverage static typing to the fullest, and avoid the potential "billion dollar mistake".

When we write the type is Optional[X], we explicitly mention the None case, and mypy will check it.

After this rationale, which could be a valid policy to introduce in the project, now the proposition is:

Do I make it immutable with ImmutableDict?

I'm not against it

In D4017#99650, @tenma wrote:

I view it that way:

  • identity checking a default value is always good, even if mutable, precisely because static values like these can never change identity (identity/value dichotomy). It would be wrong to value equality check it through if it is mutable of course;

Doing an identity check on a mutable value is very often a bug. Plenty of static code checkers (pylint, pyflakes-bugbear) will throw a warning, with good reason.

  • this value can only be modified inside the module (modulo import time hacking), so the question are:
    • do we want it immutable (we don't plan to have code that mutate it)?
    • trust the module code for well behaving? it is a real policy question, I think your answer is no, but then why manual replacing None if we can do without it?

Again, having a mutable default value is a common antipattern that /will/ throw a warning if we add more static code checkers to our tool belt.

  • this pattern is equivalent to the None one, except we statically (as opposed to manually replace None by a default) never allow any invalid type like NoneType. I try to leverage static typing to the fullest, and avoid the potential "billion dollar mistake".

This static type checking argument seems a bit specious to me. None is the idiomatic "default value" to pass to such a function taking a dict as an argument. What's really missing, however, is documenting the actual behavior of the argument: what happens when I pass a mapping with a missing key? Once that's documented, the intent behind the default None will be crystal clear.

After this rationale, which could be a valid policy to introduce in the project, now the proposition is:

Do I make it immutable with ImmutableDict?

*shrug* I really don't think that would be useful.

I see that e37f63930b323a1e811861d298e62516ed94649a has been pushed. Meh.

To properly push after a rebase, you need to do arc diff --update to make sure that the changes after rebase will be picked up by phabricator and the diff will be automatically closed.

In D4017#99688, @olasd wrote:

I see that e37f63930b323a1e811861d298e62516ed94649a has been pushed. Meh.

To properly push after a rebase, you need to do arc diff --update to make sure that the changes after rebase will be picked up by phabricator and the diff will be automatically closed.

Yeah because it was validated by ardumont earlier, and I just made a little change then updated the commit.
As for why it didn't closed it: I did arc diff --update after the merge to master, I think that's why. Then I didn't had time to undo it.

The fact that is not so common a pattern does not make it bad.
It IS exactly like using None if we make this dict immutable, with added benefits of readability and checking. It even has no performance penalty or anything.

The is None pattern is so common because it is the easiest way to work around the mutable argument problem which exists by design (I am not saying it is wrong, but that we can have better).
There are reasons some other languages encourage specific values instead of the null/None singleton, for one it is overloaded with too many usages/meanings that are often conflated and make the code brittle.
I was saying None is invalid because we semantically do not want our value to be None, proof of that is that we always replace it by a meaningful value.

In D4017#99716, @tenma wrote:

The fact that is not so common a pattern does not make it bad.

Please let this go @tenma. Two people in the team have already argued against it. Thanks.

Well, first, this code:

    def __init__(
	        self, storage: Mapping, min_batch_size: Mapping = DEFAULT_BUFFER_THRESHOLDS,
	    ):
	        self.storage: StorageInterface = get_storage(**storage)
	
	        if min_batch_size is not DEFAULT_BUFFER_THRESHOLDS:
	            self._buffer_thresholds = {**DEFAULT_BUFFER_THRESHOLDS, **min_batch_size}

is incorrect, because if the min_batch_size is not given, then self._buffer_thresholds is not set, which will cause a crash later.

A correct version of this code is:

    def __init__(
	        self, storage: Mapping, min_batch_size: Mapping = DEFAULT_BUFFER_THRESHOLDS,
	    ):
	        self.storage: StorageInterface = get_storage(**storage)
	
	        if min_batch_size is not DEFAULT_BUFFER_THRESHOLDS:
	            self._buffer_thresholds = {**DEFAULT_BUFFER_THRESHOLDS, **min_batch_size}
	        else:
	            self._buffer_thresholds = {**DEFAULT_BUFFER_THRESHOLDS}

which is equivalent to:

    def __init__(
	        self, storage: Mapping, min_batch_size: Mapping = {},
	    ):
	        self.storage: StorageInterface = get_storage(**storage)
	
	        self._buffer_thresholds = {**DEFAULT_BUFFER_THRESHOLDS, **min_batch_size}

which solves everyone's concerns.

And look! I can get rid of the identity check, just it would merge with itself. This one is to avoid perf penalty.

@vlorentz: true, this is bad. I added the check after, and didn't check this...

@zack @olasd @vlorentz I am puzzled. Do I go with the ImmutableDict solution or throw away another part of what I did the last 3 days (did it also in another commit)? Already removed code that was against the strict signature checker in the unit tests.

Wanted also to rename constructor parameters but they are tied to the config parameters.

ImmutableDict and None are both fine with me, as long as there's no identity check on the dict

Last attempt:
https://docs.python.org/3/reference/datamodel.html#objects-values-and-types
clearly states what is object identity and that it does not change and that we compare it with is.

https://docs.python.org/3/reference/compound_stmts.html#function-definitions, paragraph 6
"the expression is evaluated once, when the function is defined, and that the same “pre-computed” value is used for each call".

This is using features of Python, it is not a hack or something arcane/impl detail.

Now if you definitely want to revert to None, I will. You may also want to say "just this time then".

This diff should be closed (as the commit has landed on master already) and any further changes should happen on a further diff.

I've given all my arguments already, I won't rehash them again. I'm fine with @vlorentz's second example code (with s/{}/None/, even though this mutable default is always used correctly; but I won't argue about it too hard.)

I don't think ImmutableDict should leak anywhere else than in swh.model and related objects.

This revision is now accepted and ready to land.Sep 25 2020, 3:34 PM