Page MenuHomeSoftware Heritage

storage*.origin_visit_get_random: Return model object instead of dict
ClosedPublic

Authored by ardumont on Jul 22 2020, 2:30 PM.

Details

Summary

This is used by tests.

Just to have a feeling on how hard it is to change the implementation from dict
to model objects.

Note that it's particular here as we are returning an optional tuple of
OriginVisit and OriginVisitStatus. Whereas before we returned a mixed
visit/visit-status dict (which was no longer neither a visit nor a visit status...)

As an incremental change approach, this introduces similar functions to existing
ones to manipulate directly model objects (instead of dicts as currently).
This does not refactor the common behavior too much yet to avoid impacting
the diff size and the other endpoints. This will be eventually dealt with in subsequent
diffs.

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
read-model-objects
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 13874
Build 21272: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 21271: arc lint + arc unit

Event Timeline

ardumont edited the summary of this revision. (Show Details)
  • Simplify some part
  • Update docstrings some more

Build is green

Patch application report for D3594 (id=12648)

Could not rebase; Attempt merge onto bbe840e089...

Updating bbe840e0..7f0a0268
Fast-forward
 swh/storage/cassandra/converters.py          |   16 +-
 swh/storage/cassandra/storage.py             |   31 +-
 swh/storage/in_memory.py                     |   31 +-
 swh/storage/interface.py                     |    9 +-
 swh/storage/pytest_plugin.py                 |   91 +--
 swh/storage/storage.py                       |   17 +-
 swh/storage/tests/algos/test_origin.py       |   69 +-
 swh/storage/tests/algos/test_snapshot.py     |   16 +-
 swh/storage/tests/storage_data.py            | 1010 +++++++++++++-------------
 swh/storage/tests/test_api_client.py         |    4 +-
 swh/storage/tests/test_buffer.py             |  102 ++-
 swh/storage/tests/test_cassandra.py          |   20 +-
 swh/storage/tests/test_filter.py             |   20 +-
 swh/storage/tests/test_pytest_plugin.py      |   55 +-
 swh/storage/tests/test_retry.py              |  164 +++--
 swh/storage/tests/test_revision_bw_compat.py |    6 +-
 swh/storage/tests/test_storage.py            |  865 +++++++++++-----------
 swh/storage/tests/test_storage_data.py       |   29 +
 18 files changed, 1271 insertions(+), 1284 deletions(-)
 create mode 100644 swh/storage/tests/test_storage_data.py
Changes applied before test
commit 7f0a02684b21d3eb5cf3d2ab137546eff2bed9e9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects

commit ccbd2e9c21e6862fe8efbbf9b99e4ac3b838be97
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 08:46:10 2020 +0200

    pytest_plugin: Make sample_data an object
    
    Note that this:
    - drops the no longer needed copy done by the StorageData instance (used by
    sample_data) since now it returned immutable BaseModel objects.
    
    - centralizes some left-over tests to use sample_data as well

commit 67a909e0fb69be9ea11c951d771002f37544d847
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 21 18:15:41 2020 +0200

    pytest_plugin: Rename sample_data_model to sample_data
    
    Related to T2494

commit e005900bd7edd3dd12b10beb2cf9c00db8488093
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 21 18:07:44 2020 +0200

    pytest_plugin: Drop sample_data in favor of sample_data_model
    
    Related to T2494

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/547/ for more details.

Build is green

Patch application report for D3594 (id=12649)

Could not rebase; Attempt merge onto bbe840e089...

Updating bbe840e0..b9b0b361
Fast-forward
 swh/storage/cassandra/converters.py          |   16 +-
 swh/storage/cassandra/storage.py             |   30 +-
 swh/storage/in_memory.py                     |   31 +-
 swh/storage/interface.py                     |   10 +-
 swh/storage/pytest_plugin.py                 |   91 +--
 swh/storage/storage.py                       |   17 +-
 swh/storage/tests/algos/test_origin.py       |   69 +-
 swh/storage/tests/algos/test_snapshot.py     |   16 +-
 swh/storage/tests/storage_data.py            | 1010 +++++++++++++-------------
 swh/storage/tests/test_api_client.py         |    4 +-
 swh/storage/tests/test_buffer.py             |  102 ++-
 swh/storage/tests/test_cassandra.py          |   20 +-
 swh/storage/tests/test_filter.py             |   20 +-
 swh/storage/tests/test_pytest_plugin.py      |   55 +-
 swh/storage/tests/test_retry.py              |  164 +++--
 swh/storage/tests/test_revision_bw_compat.py |    6 +-
 swh/storage/tests/test_storage.py            |  865 +++++++++++-----------
 swh/storage/tests/test_storage_data.py       |   29 +
 18 files changed, 1269 insertions(+), 1286 deletions(-)
 create mode 100644 swh/storage/tests/test_storage_data.py
Changes applied before test
commit b9b0b3613844bfdeee24125ffefe7864c439e767
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects

commit ccbd2e9c21e6862fe8efbbf9b99e4ac3b838be97
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 08:46:10 2020 +0200

    pytest_plugin: Make sample_data an object
    
    Note that this:
    - drops the no longer needed copy done by the StorageData instance (used by
    sample_data) since now it returned immutable BaseModel objects.
    
    - centralizes some left-over tests to use sample_data as well

commit 67a909e0fb69be9ea11c951d771002f37544d847
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 21 18:15:41 2020 +0200

    pytest_plugin: Rename sample_data_model to sample_data
    
    Related to T2494

commit e005900bd7edd3dd12b10beb2cf9c00db8488093
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 21 18:07:44 2020 +0200

    pytest_plugin: Drop sample_data in favor of sample_data_model
    
    Related to T2494

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/548/ for more details.

swh/storage/cassandra/storage.py
877

I initially thought that this missing coverage was a lie but apparently that's dead code now!
I'll remove it.

Remove dead code

swh/storage/cassandra/storage.py
877

done

Build is green

Patch application report for D3594 (id=12665)

Rebasing onto ccbd2e9c21...

Current branch diff-target is up to date.
Changes applied before test
commit c84c76833dffaaf758481c556babe6b3d88e1c10
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/549/ for more details.

vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/storage/cassandra/converters.py
84

ResultSet is an iterable of Rows.

97–98

so this annotation is wrong too

swh/storage/cassandra/storage.py
41–48

considering how many converters we are going to have, I suggest we use from . import converters instead.

And converters are not involved in long expressions anyway, so it won't be an issue for readability.

swh/storage/interface.py
916–919

meh.

Could you rename the endpoint to origin_visit_status_get_random, if we want to return a status?

swh/storage/storage.py
970–978

This violates the abstraction, as storage.py now needs to hardcode the map between indices and names.

I recommend you keep the dict(zip()) and use dict access.

Or alternatively, make db.py return namedtuples (like cql.py does with Row)

This revision now requires changes to proceed.Jul 23 2020, 9:55 AM
swh/storage/storage.py
970–978

I was interested in "performance".
I'm a bit wary of how many conversions we keep on doing everywhere...

The alternative i did not comprehend fully yet interests me, i'll look into that ;)

swh/storage/storage.py
970–978

Nothing to worry about performance-wise:

In [1]: from swh.model.model import OriginVisitStatus, OriginVisit

In [2]: import datetime

In [3]: cols = ["origin", "visit", "date", "type", "status", "metadata", "snapshot"]

In [4]: row = ("o", 42, datetime.datetime.now(), "t", "full", None, None)

In [5]: %timeit d = dict(zip(cols, row)); OriginVisit(origin=d["origin"], visit=d["visit"], date=d["date"], type=d["type"]); OriginVisitStatus(origin=d["origin"], visit=d["visit"], date=d["date"], status=d["status"], metadata=d["metadata"], snapshot=d["snapshot"])
100000 loops, best of 5: 16.6 µs per loop

In [6]: %timeit OriginVisit(origin=row[0], visit=row[1], date=row[2], type=row[3]); OriginVisitStatus(origin=row[0], visit=row[1], date=row[2], status=row[4], metadata=row[5], snapshot=row[6])
100000 loops, best of 5: 15.8 µs per loop

Building the OriginVisit and OriginVisitStatus objects takes much longer (about 15µs) than the dict and tuple operations (respectively 100ns and 500ns in your version and mine)

970–978

not to mention the cost of network operations, which trump all this

swh/storage/cassandra/converters.py
97–98

So it's plain Row i want, got it.
Thanks.

(why don't mypy say something?)

ardumont added inline comments.
swh/storage/interface.py
916–919

I keep the Tuple as result though, right.

Because you know, the origin-visit holds the type the status can't have.
More generally, how do we deal with the origin-visit*get endpoint now, do we do as this one?

swh/storage/storage.py
970–978

cool, thanks ;)

Adapt mostly according to review:

  • cassandra: Import converters module
  • Rename to origin_visit_status_get_random
  • Kept the dict(zip... stanza (did not go with the namedtuple finally)

Mostly because, heads up, I dropped the wrong type from the converters method.
I did not find a way to make mypy comply with the Row type. When typing it with
type Row, I got the following:

swh/storage/cassandra/converters.py:88: error: "Tuple[Any, ...]" has no attribute "origin"
swh/storage/cassandra/converters.py:89: error: "Tuple[Any, ...]" has no attribute "visit"
swh/storage/cassandra/converters.py:90: error: "Tuple[Any, ...]" has no attribute "date"
swh/storage/cassandra/converters.py:91: error: "Tuple[Any, ...]" has no attribute "type"
swh/storage/cassandra/converters.py:101: error: "Tuple[Any, ...]" has no attribute "_asdict"
swh/storage/cassandra/converters.py:102: error: "Tuple[Any, ...]" has no attribute "origin"
swh/storage/cassandra/converters.py:103: error: "Tuple[Any, ...]" has no attribute "date"
swh/storage/cassandra/converters.py:104: error: "Tuple[Any, ...]" has no attribute "metadata"

It's slightly off-topic so I consider this can be dealt with later (unless you
know what to do and I'll adapt ;).

Cheers,

Build is green

Patch application report for D3594 (id=12680)

Rebasing onto b2055f4c45...

Current branch diff-target is up to date.
Changes applied before test
commit d91282a35b0a86a44d59031b51df69eb792b9462
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/553/ for more details.

swh/storage/interface.py
916–919

I guess. Or maybe we should move the type to the status...

This revision is now accepted and ready to land.Jul 23 2020, 1:46 PM
swh/storage/interface.py
916–919

lol at the ...

That leaves a taste like back to square one ¯\_(ツ)_/¯ ;)

swh/storage/cassandra/converters.py
106

damn, the type ignore is useless...

  • remove useless type ignore
  • amend commit message to reference the task

Build is green

Patch application report for D3594 (id=12682)

Rebasing onto b2055f4c45...

Current branch diff-target is up to date.
Changes applied before test
commit d8583eb4685deabbdc58d58c6640d8972af71855
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/554/ for more details.