Page MenuHomeSoftware Heritage

swh.core.api: Expose a serializable PagedResult object
ClosedPublic

Authored by ardumont on Jul 28 2020, 3:42 PM.

Details

Summary

This will allow typing paginated endpoints across our modules (swh.storage,
swh.search, swh.indexer, ...).

As a first step, this will simplify D3627 which introduced it ;)

Related to T645

Test Plan

tox

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Build is green

Patch application report for D3632 (id=12778)

Rebasing onto 028ef739da...

Current branch diff-target is up to date.
Changes applied before test
commit 3cab2527744dfcdbb831d7251258ded514036a1b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 28 15:41:23 2020 +0200

    swh.core.api: Expose a serializable PagedResult object
    
    This will allow typing paginated endpoints across our modules (swh.storage,
    swh.search, swh.indexer, ...).
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/25/ for more details.

vlorentz added a subscriber: vlorentz.

cool!

swh/core/api/serializers.py
56

I wanted to name it like this for a moment, but I have a small preference for "api_class" now. Up to you.

or "swh_core_class" / "swh_api_class"

This revision is now accepted and ready to land.Jul 28 2020, 3:56 PM

Oh, and please rename "swh/core/api/model.py" to "swh/core/api/classes.py", it has nothing to do with the model

swh/core/api/serializers.py
56

api_class appeals to me ;)
(more than the swh prefixed one)

swh/core/api/tests/test_serializers.py
135

Just to show that the nesting encoding/decoding actually works already.
Which is nice, nothing fancy to do to deal with our data model object (for D3627 for example ;).

Oh, and please rename "swh/core/api/model.py" to "swh/core/api/classes.py", it has nothing to do with the model

well i could not make up a proper name so...
I'll adapt accordingly.

olasd requested changes to this revision.Jul 28 2020, 4:27 PM
olasd added a subscriber: olasd.

This looks like a good direction to go in, and a generic useful thing, but I'm a bit worried about how open-ended/lax the serialization/deserialization is, as it stands. Nothing would prevent a user expecting a PagedResult[ResultTypeA] from receiving a response full of ResultTypeB results if the internal API wires get crossed somehow (which seems unlikely, but...).

Of course, doing runtime type checking on these objects means opening a https://stackoverflow.com/questions/48572831/how-to-access-the-type-arguments-of-typing-generic-shaped can of worms, which doesn't seem very appealing either.

I'm not sure if the runtime type checking is tractable, but please at least address the inline comments (you can drop me from reviewers after you've done that)

swh/core/api/model.py
18 ↗(On Diff #12778)

Please give this type variable a meaningful name (e.g. ResultType).

25 ↗(On Diff #12778)

If the class is frozen then this shouldn't be a mutable object. But I don't think making it frozen has much value in the first place.

I'm not enthused by the lack of runtime type checking. Deserializing a PagedResult containing serialized results of an unexpected type would not report any error.

27 ↗(On Diff #12778)

I think this should be a generic type too. We may need to use composite keys (e.g. what swh.scheduler needs for get_listed_origins).

swh/core/api/serializers.py
33

I don't think this is needed.

56

Why not just call it paged_result? your serializer/deserializer only supports these anyway.

This revision now requires changes to proceed.Jul 28 2020, 4:27 PM
swh/core/api/model.py
18 ↗(On Diff #12778)

This goes against conventions in multiple languages (at least Python, C++, Rust) and PEP8:

Names of type variables introduced in PEP 484 should normally use CapWords preferring short names: T, AnyStr, Num.

https://www.python.org/dev/peps/pep-0008/#type-variable-names

27 ↗(On Diff #12778)

The whole point of next_page_token is that it's an opaque token that can change at the server's will. If you need composite values, the server can use msgpack + base64 (that's what swh-search does)

swh/core/api/serializers.py
33

It will be if we ever need another class that works like PagedResult

56

In case we need another class in the future.

Adapt according to the first review first:

  • core_class -> api_class
  • rename model to classes
swh/core/api/model.py
18 ↗(On Diff #12778)

ResultType is a short, CapWords name.

27 ↗(On Diff #12778)

This is a return type for our backend API/RPC calls. It doesn't have to be opaque, let's not make it gratuitously so.

swh/core/api/serializers.py
33

So, it's not needed yet.

56

Then add another encoder and decoder pair then.

Build is green

Patch application report for D3632 (id=12784)

Rebasing onto 028ef739da...

Current branch diff-target is up to date.
Changes applied before test
commit 94947bcba5b9e934f841b47c22d188388ab05ed3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 28 15:41:23 2020 +0200

    swh.core.api: Expose a serializable PagedResult object
    
    This will allow typing paginated endpoints across our modules (swh.storage,
    swh.search, swh.indexer, ...).
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/26/ for more details.

swh/core/api/model.py
27 ↗(On Diff #12778)

It's not gratuitous, it's hiding implementation details. It also keeps PagedResult simpler and with a type validator

swh/core/api/model.py
27 ↗(On Diff #12778)

Yeah, i was gonna ask to "type validate" it.

swh/core/api/serializers.py
56

then why do we need to abstract now the Optional[str] for the PagedResult next_page_token field?

swh/core/api/classes.py
18

ok, ResultType it is then.

swh/core/api/serializers.py
33

oh yeah, i don't even use it...
Dropping it.

Adapt according to most of the second review:

  • Drop unnecessary type
  • Rename generic T type to ResultType
  • Rename api_class to paged_result
  • Drop the frozen property of PagedResult

What did not change:

  • next_page_token is still typed Optional[str]
ardumont added inline comments.
swh/core/api/model.py
27 ↗(On Diff #12778)

how* to "type validate" it...

Build is green

Patch application report for D3632 (id=12785)

Rebasing onto 028ef739da...

Current branch diff-target is up to date.
Changes applied before test
commit 14f75721488bf0085f543cbdc5dd49e0b9f5bb9c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 28 15:41:23 2020 +0200

    swh.core.api: Expose a serializable PagedResult object
    
    This will allow typing paginated endpoints across our modules (swh.storage,
    swh.search, swh.indexer, ...).
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/27/ for more details.

swh/core/api/serializers.py
56

We have other typed paginated API endpoints already existing in our codebase; One of them uses a composite next page token:

https://forge.softwareheritage.org/source/swh-scheduler/browse/master/swh/scheduler/model.py$183-193

If you hardcode the next page token type as a string, then to be able to reuse this code, you would need the API function using this return type to handle encoding / decoding an opaque next page token, instead of just using its own explicit and transparent types.

I'm not convinced these internal APIs generally need opaque page tokens, on the contrary.

swh/core/api/serializers.py
56

We have other typed paginated API endpoints already existing in our codebase; One of them uses a composite
next page token:

I missed that part. Thanks.
I'll try to do the necessary to reconcile those... (no idea how yet ;)

I'm not convinced these internal APIs generally need opaque page tokens, on the contrary.

Well, I recall we decided to go that road for cassandra storage and for storage paginated endpoints in general then. Though i don't have links that demonstrates this...
(And i don't have the will to dig in for that...)
I just want to type the storage properly... ;)

(Note that i'm not entirely clear on what you explained to me yet, still thinking, it's on me heh ;)

swh/core/api/serializers.py
56

Yeah, sorry, I didn't mean to question whether the specific swh.storage APIs you're looking at should have opaque tokens or not; I understand why they ended up that way and I think that's the most sensible choice.

I'm saying having opaque pagination tokens in internal APIs should be a last resort for specific use-cases where we can't do better.

Abstract the TokenType as well

Build is green

Patch application report for D3632 (id=12786)

Rebasing onto 028ef739da...

Current branch diff-target is up to date.
Changes applied before test
commit 2d88a36722d162eb458af437e16844a664832771
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 28 15:41:23 2020 +0200

    swh.core.api: Expose a serializable PagedResult object
    
    This will allow typing paginated endpoints across our modules (swh.storage,
    swh.search, swh.indexer, ...).
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/28/ for more details.

swh/core/api/classes.py
25

Otherwise:

swh/core/api/classes.py:25: error: 'TokenType' is a type variable and only valid in type context
swh/core/api/model.py
18 ↗(On Diff #12778)

I disagree that it's short. The examples in PEP8 all shorten nouns (ie. they are not AnyString nor Number)

But as a compromise we could use TRes, or even TResult.

swh/core/api/serializers.py
56

should be a last resort for specific use-cases where we can't do better.

Why do you see them as a bad thing?

Even if they are for internal APIs, it's good to hide implementation details of one component from the others.

And if you want clients to use the data they contain, then this data shouldn't be in a pagination token.

olasd removed reviewers: Reviewers, olasd.
This revision is now accepted and ready to land.Jul 28 2020, 6:50 PM

Rename new generic types to shorter name:

  • ResultType -> TResult
  • TokenType -> TToken
swh/core/api/classes.py
18

Compromising, TResult for shorter as expressed by val below.

Build is green

Patch application report for D3632 (id=12791)

Rebasing onto 028ef739da...

Current branch diff-target is up to date.
Changes applied before test
commit 1e48fca78aa0a180c2af18be2cf2766aba6e3379
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 28 15:41:23 2020 +0200

    core.api: Expose serializable PagedResult object for pagination api
    
    This will allow typing paginated endpoints across our modules (swh.storage,
    swh.search, swh.indexer, swh.scheduler, ...).
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/29/ for more details.

We no longer use attrs-strict, so you should remove it from the requirements (and add attrs directly instead).

Actually, since we're no longer using any features specific to attrs, could you switch to a dataclass, so that's one less dependency for swh-core?

This revision now requires changes to proceed.Jul 29 2020, 9:21 AM
swh/core/api/classes.py
24

I'd prefer to use a default_factory (dataclasses.field(default_factory=list) when you'll switch to dataclasses). As lists are mutable, it's a footgun to have a mutable object as default.

25

wtf...

We no longer use attrs-strict, so you should remove it from the requirements (and add attrs directly instead).

yeah i saw, was gonna remove it but got side-tracked on how to properly type stuff in the storage...

Actually, since we're no longer using any features specific to attrs, could you switch to a dataclass, so that's one less dependency for swh-core?

do you have a link for the dataclass stuff please?

swh/core/api/classes.py
24

Do we want to have a List here or an Iterable or a Sequence?

do you have a link for the dataclass stuff please?

https://docs.python.org/3/library/dataclasses.html

You shouldn't need any more than what the example at the top does

swh/core/api/classes.py
24

all three are fine, but I don't see any reason not to use List.

swh/core/api/classes.py
24

In my tryouts, in storage, at some point, I tried a specific type involving PagedResult[BaseModel, str].
Which I eliminated because too broad.

Still, mypy tried to lure me into the variance/co-variance/invariance things.
On top of which it said, if you want this to work, you might want to use a Sequence instead of a List.
So I thought of asking ;)


Keeping List then ;)

Adapt according to discussion:

  • Use dataclasses over attr, 1 less deps
  • Drop attr_strict as it's no longer used here

Build is green

Patch application report for D3632 (id=12793)

Rebasing onto 028ef739da...

Current branch diff-target is up to date.
Changes applied before test
commit 83d82112cf984ae9e473348d9532a254cfd90d9a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 28 15:41:23 2020 +0200

    core.api: Expose serializable PagedResult object for pagination api
    
    This will allow typing paginated endpoints across our modules (swh.storage,
    swh.search, swh.indexer, swh.scheduler, ...).
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/30/ for more details.

This revision is now accepted and ready to land.Jul 29 2020, 9:50 AM

Build is green

Patch application report for D3632 (id=12795)

Rebasing onto 028ef739da...

Current branch diff-target is up to date.
Changes applied before test
commit 1ec4df82e4774de702bd6614ab64b0609dcad720
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 28 15:41:23 2020 +0200

    core.api: Expose serializable PagedResult object for pagination api
    
    This will allow typing paginated endpoints across our modules (swh.storage,
    swh.search, swh.indexer, swh.scheduler, ...).
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/32/ for more details.