Page MenuHomeSoftware Heritage

storage*.origin_visit_get_random: Return model object instead of dict
ClosedPublic

Authored by ardumont on Wed, Jul 22, 2:30 PM.

Details

Summary

This is used by tests.

Just to have a feeling on how hard it is to change the implementation from dict
to model objects.

Note that it's particular here as we are returning an optional tuple of
OriginVisit and OriginVisitStatus. Whereas before we returned a mixed
visit/visit-status dict (which was no longer neither a visit nor a visit status...)

As an incremental change approach, this introduces similar functions to existing
ones to manipulate directly model objects (instead of dicts as currently).
This does not refactor the common behavior too much yet to avoid impacting
the diff size and the other endpoints. This will be eventually dealt with in subsequent
diffs.

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Wed, Jul 22, 2:30 PM
ardumont edited the summary of this revision. (Show Details)Wed, Jul 22, 2:35 PM
ardumont updated this revision to Diff 12649.Wed, Jul 22, 2:41 PM
ardumont edited the summary of this revision. (Show Details)
  • Simplify some part
  • Update docstrings some more

Build is green

Patch application report for D3594 (id=12648)

Could not rebase; Attempt merge onto bbe840e089...

Updating bbe840e0..7f0a0268
Fast-forward
 swh/storage/cassandra/converters.py          |   16 +-
 swh/storage/cassandra/storage.py             |   31 +-
 swh/storage/in_memory.py                     |   31 +-
 swh/storage/interface.py                     |    9 +-
 swh/storage/pytest_plugin.py                 |   91 +--
 swh/storage/storage.py                       |   17 +-
 swh/storage/tests/algos/test_origin.py       |   69 +-
 swh/storage/tests/algos/test_snapshot.py     |   16 +-
 swh/storage/tests/storage_data.py            | 1010 +++++++++++++-------------
 swh/storage/tests/test_api_client.py         |    4 +-
 swh/storage/tests/test_buffer.py             |  102 ++-
 swh/storage/tests/test_cassandra.py          |   20 +-
 swh/storage/tests/test_filter.py             |   20 +-
 swh/storage/tests/test_pytest_plugin.py      |   55 +-
 swh/storage/tests/test_retry.py              |  164 +++--
 swh/storage/tests/test_revision_bw_compat.py |    6 +-
 swh/storage/tests/test_storage.py            |  865 +++++++++++-----------
 swh/storage/tests/test_storage_data.py       |   29 +
 18 files changed, 1271 insertions(+), 1284 deletions(-)
 create mode 100644 swh/storage/tests/test_storage_data.py
Changes applied before test
commit 7f0a02684b21d3eb5cf3d2ab137546eff2bed9e9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects

commit ccbd2e9c21e6862fe8efbbf9b99e4ac3b838be97
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 08:46:10 2020 +0200

    pytest_plugin: Make sample_data an object
    
    Note that this:
    - drops the no longer needed copy done by the StorageData instance (used by
    sample_data) since now it returned immutable BaseModel objects.
    
    - centralizes some left-over tests to use sample_data as well

commit 67a909e0fb69be9ea11c951d771002f37544d847
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 21 18:15:41 2020 +0200

    pytest_plugin: Rename sample_data_model to sample_data
    
    Related to T2494

commit e005900bd7edd3dd12b10beb2cf9c00db8488093
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 21 18:07:44 2020 +0200

    pytest_plugin: Drop sample_data in favor of sample_data_model
    
    Related to T2494

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/547/ for more details.

ardumont edited the summary of this revision. (Show Details)Wed, Jul 22, 2:47 PM

Build is green

Patch application report for D3594 (id=12649)

Could not rebase; Attempt merge onto bbe840e089...

Updating bbe840e0..b9b0b361
Fast-forward
 swh/storage/cassandra/converters.py          |   16 +-
 swh/storage/cassandra/storage.py             |   30 +-
 swh/storage/in_memory.py                     |   31 +-
 swh/storage/interface.py                     |   10 +-
 swh/storage/pytest_plugin.py                 |   91 +--
 swh/storage/storage.py                       |   17 +-
 swh/storage/tests/algos/test_origin.py       |   69 +-
 swh/storage/tests/algos/test_snapshot.py     |   16 +-
 swh/storage/tests/storage_data.py            | 1010 +++++++++++++-------------
 swh/storage/tests/test_api_client.py         |    4 +-
 swh/storage/tests/test_buffer.py             |  102 ++-
 swh/storage/tests/test_cassandra.py          |   20 +-
 swh/storage/tests/test_filter.py             |   20 +-
 swh/storage/tests/test_pytest_plugin.py      |   55 +-
 swh/storage/tests/test_retry.py              |  164 +++--
 swh/storage/tests/test_revision_bw_compat.py |    6 +-
 swh/storage/tests/test_storage.py            |  865 +++++++++++-----------
 swh/storage/tests/test_storage_data.py       |   29 +
 18 files changed, 1269 insertions(+), 1286 deletions(-)
 create mode 100644 swh/storage/tests/test_storage_data.py
Changes applied before test
commit b9b0b3613844bfdeee24125ffefe7864c439e767
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects

commit ccbd2e9c21e6862fe8efbbf9b99e4ac3b838be97
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 08:46:10 2020 +0200

    pytest_plugin: Make sample_data an object
    
    Note that this:
    - drops the no longer needed copy done by the StorageData instance (used by
    sample_data) since now it returned immutable BaseModel objects.
    
    - centralizes some left-over tests to use sample_data as well

commit 67a909e0fb69be9ea11c951d771002f37544d847
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 21 18:15:41 2020 +0200

    pytest_plugin: Rename sample_data_model to sample_data
    
    Related to T2494

commit e005900bd7edd3dd12b10beb2cf9c00db8488093
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Tue Jul 21 18:07:44 2020 +0200

    pytest_plugin: Drop sample_data in favor of sample_data_model
    
    Related to T2494

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/548/ for more details.

ardumont added inline comments.Wed, Jul 22, 5:06 PM
swh/storage/cassandra/storage.py
857–858

I initially thought that this missing coverage was a lie but apparently that's dead code now!
I'll remove it.

ardumont edited the summary of this revision. (Show Details)Wed, Jul 22, 5:07 PM
ardumont updated this revision to Diff 12665.Wed, Jul 22, 5:11 PM

Remove dead code

swh/storage/cassandra/storage.py
857–858

done

Build is green

Patch application report for D3594 (id=12665)

Rebasing onto ccbd2e9c21...

Current branch diff-target is up to date.
Changes applied before test
commit c84c76833dffaaf758481c556babe6b3d88e1c10
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/549/ for more details.

vlorentz requested changes to this revision.Thu, Jul 23, 9:55 AM
vlorentz added a subscriber: vlorentz.
vlorentz added inline comments.
swh/storage/cassandra/converters.py
82–83

so this annotation is wrong too

83

ResultSet is an iterable of Rows.

swh/storage/cassandra/storage.py
40–41

considering how many converters we are going to have, I suggest we use from . import converters instead.

And converters are not involved in long expressions anyway, so it won't be an issue for readability.

swh/storage/interface.py
915–919

meh.

Could you rename the endpoint to origin_visit_status_get_random, if we want to return a status?

swh/storage/storage.py
943–951

This violates the abstraction, as storage.py now needs to hardcode the map between indices and names.

I recommend you keep the dict(zip()) and use dict access.

Or alternatively, make db.py return namedtuples (like cql.py does with Row)

This revision now requires changes to proceed.Thu, Jul 23, 9:55 AM
ardumont added inline comments.Thu, Jul 23, 9:58 AM
swh/storage/storage.py
943–951

I was interested in "performance".
I'm a bit wary of how many conversions we keep on doing everywhere...

The alternative i did not comprehend fully yet interests me, i'll look into that ;)

vlorentz added inline comments.Thu, Jul 23, 10:16 AM
swh/storage/storage.py
943–951

Nothing to worry about performance-wise:

In [1]: from swh.model.model import OriginVisitStatus, OriginVisit

In [2]: import datetime

In [3]: cols = ["origin", "visit", "date", "type", "status", "metadata", "snapshot"]

In [4]: row = ("o", 42, datetime.datetime.now(), "t", "full", None, None)

In [5]: %timeit d = dict(zip(cols, row)); OriginVisit(origin=d["origin"], visit=d["visit"], date=d["date"], type=d["type"]); OriginVisitStatus(origin=d["origin"], visit=d["visit"], date=d["date"], status=d["status"], metadata=d["metadata"], snapshot=d["snapshot"])
100000 loops, best of 5: 16.6 µs per loop

In [6]: %timeit OriginVisit(origin=row[0], visit=row[1], date=row[2], type=row[3]); OriginVisitStatus(origin=row[0], visit=row[1], date=row[2], status=row[4], metadata=row[5], snapshot=row[6])
100000 loops, best of 5: 15.8 µs per loop

Building the OriginVisit and OriginVisitStatus objects takes much longer (about 15µs) than the dict and tuple operations (respectively 100ns and 500ns in your version and mine)

943–951

not to mention the cost of network operations, which trump all this

ardumont added inline comments.Thu, Jul 23, 10:39 AM
swh/storage/cassandra/converters.py
82–83

So it's plain Row i want, got it.
Thanks.

(why don't mypy say something?)

ardumont marked 3 inline comments as done.Thu, Jul 23, 10:46 AM
ardumont added inline comments.
swh/storage/interface.py
915–919

I keep the Tuple as result though, right.

Because you know, the origin-visit holds the type the status can't have.
More generally, how do we deal with the origin-visit*get endpoint now, do we do as this one?

swh/storage/storage.py
943–951

cool, thanks ;)

ardumont updated this revision to Diff 12680.Thu, Jul 23, 12:42 PM

Adapt mostly according to review:

  • cassandra: Import converters module
  • Rename to origin_visit_status_get_random
  • Kept the dict(zip... stanza (did not go with the namedtuple finally)

Mostly because, heads up, I dropped the wrong type from the converters method.
I did not find a way to make mypy comply with the Row type. When typing it with
type Row, I got the following:

swh/storage/cassandra/converters.py:88: error: "Tuple[Any, ...]" has no attribute "origin"
swh/storage/cassandra/converters.py:89: error: "Tuple[Any, ...]" has no attribute "visit"
swh/storage/cassandra/converters.py:90: error: "Tuple[Any, ...]" has no attribute "date"
swh/storage/cassandra/converters.py:91: error: "Tuple[Any, ...]" has no attribute "type"
swh/storage/cassandra/converters.py:101: error: "Tuple[Any, ...]" has no attribute "_asdict"
swh/storage/cassandra/converters.py:102: error: "Tuple[Any, ...]" has no attribute "origin"
swh/storage/cassandra/converters.py:103: error: "Tuple[Any, ...]" has no attribute "date"
swh/storage/cassandra/converters.py:104: error: "Tuple[Any, ...]" has no attribute "metadata"

It's slightly off-topic so I consider this can be dealt with later (unless you
know what to do and I'll adapt ;).

Cheers,

Build is green

Patch application report for D3594 (id=12680)

Rebasing onto b2055f4c45...

Current branch diff-target is up to date.
Changes applied before test
commit d91282a35b0a86a44d59031b51df69eb792b9462
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/553/ for more details.

vlorentz added inline comments.Thu, Jul 23, 1:45 PM
swh/storage/interface.py
915–919

I guess. Or maybe we should move the type to the status...

vlorentz accepted this revision.Thu, Jul 23, 1:46 PM
This revision is now accepted and ready to land.Thu, Jul 23, 1:46 PM
ardumont added inline comments.Thu, Jul 23, 2:07 PM
swh/storage/interface.py
915–919

lol at the ...

That leaves a taste like back to square one ¯\_(ツ)_/¯ ;)

ardumont added inline comments.Thu, Jul 23, 2:08 PM
swh/storage/cassandra/converters.py
92

damn, the type ignore is useless...

ardumont updated this revision to Diff 12682.Thu, Jul 23, 2:09 PM
  • remove useless type ignore
  • amend commit message to reference the task

Build is green

Patch application report for D3594 (id=12682)

Rebasing onto b2055f4c45...

Current branch diff-target is up to date.
Changes applied before test
commit d8583eb4685deabbdc58d58c6640d8972af71855
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 22 14:22:45 2020 +0200

    storage*.origin_visit_get_random: Read model objects
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/554/ for more details.

This revision was automatically updated to reflect the committed changes.