Page MenuHomeSoftware Heritage

storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
ClosedPublic

Authored by ardumont on Jul 29 2020, 4:35 PM.

Details

Summary

This opens new origin-visit-status-get endpoint.

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 14065
Build 21594: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 21593: arc lint + arc unit

Unit TestsFailed

TimeTest
1,529 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_api_client.TestStorage::test_types
self = <swh.storage.tests.test_api_client.TestStorage object at 0x7fc7c94c6a58> swh_storage_backend_config = {'cls': 'local', 'db': 'postgresql://postgres@127.0.0.1:22104/tests', 'journal_writer': {'cls': 'memory'}, 'objstorage': {'args': {}, 'cls': 'memory'}}
632 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra.TestCassandraStorage::test_origin_visit_status_get__unknown_cases
self = <swh.storage.tests.test_cassandra.TestCassandraStorage object at 0x7fc7006b4160> swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fc700639b00> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7fc7004aea58>
779 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra.TestCassandraStorage::test_origin_visit_status_get__validation_failure
self = <swh.storage.tests.test_cassandra.TestCassandraStorage object at 0x7fc7007e73c8> swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fc7007e7358> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7fc7003ea2b0>
708 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra.TestCassandraStorage::test_origin_visit_status_get_all
self = <swh.storage.tests.test_cassandra.TestCassandraStorage object at 0x7fc70072c6d8> swh_storage = <swh.storage.cassandra.storage.CassandraStorage object at 0x7fc700537668> sample_data = <swh.storage.tests.storage_data.StorageData object at 0x7fc70058a2e8>
790 msJenkins > .tox.py3.lib.python3.7.site-packages.swh.storage.tests.test_cassandra.TestCassandraStorage::test_types
self = <swh.storage.tests.test_cassandra.TestCassandraStorage object at 0x7fc7c430ceb8> swh_storage_backend_config = {'cls': 'cassandra', 'hosts': ['127.0.0.1'], 'journal_writer': {'cls': 'memory'}, 'keyspace': '222db03d5fb6945683bc', ...}
View Full Test Results (7 Failed · 747 Passed · 17 Skipped)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
swh/storage/db.py
577

salvaging that no longer used endpoint.

Build has FAILED

Patch application report for D3641 (id=12823)

Rebasing onto 24559bb3ba...

Current branch diff-target is up to date.
Changes applied before test
commit 64f773a133f723c247ba1944a47f1184a98f0847
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/602/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/602/console

done:

  • Add in-memory storage implementation
  • Improve opaque identifier tests for that endpoints
  • Improve docstring
  • Simplify next_page_token computation

todo:

  • Remain cassandra implementation to do

Build has FAILED

Patch application report for D3641 (id=12829)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit ddfb3a992c91508a12b012b6c1832f9a191f9b72
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/608/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/608/console

Add cassandra implementations

For some reason, the tests on signature type is not happy though...

I'll delve into that later:

>           assert expected_signature == actual_signature, meth_name
E           AssertionError: origin_visit_status_get
E           assert <Signature (origin: str, visit: int, page_token: Union[str, NoneType] = None, order: swh.storage.interface.ListOrder = <ListOrder.ASC: 'asc'>, limit: int = 10) -> swh.core.api.classes.PagedResult[swh.model.model.OriginVisitStatus, str]> == <Signature (origin: str, visit: int, page_token: Union[str, NoneType] = None, order: swh.storage.interface.ListOrder = <ListOrder.ASC: 'asc'>, limit: int = 10) -> swh.core.api.classes.PagedResult[swh.model.model.OriginVisit, str]>
ardumont retitled this revision from wip: storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus] to storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus].Jul 29 2020, 6:23 PM
ardumont edited the summary of this revision. (Show Details)
ardumont edited the test plan for this revision. (Show Details)
ardumont added inline comments.
swh/storage/in_memory.py
955

@ardumont have a look here... see... that's why it fails ;)

swh/storage/storage.py
984

same here, it's OriginVisitStatus you want ;)

Fix mistyped signatures (pgstorage, in-memory)

Build has FAILED

Patch application report for D3641 (id=12830)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit d4a505a7e9aea2f5ce92526f2ec59135b2b6755a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/609/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/609/console

Build is green

Patch application report for D3641 (id=12831)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit 63ae21d27d6fd28fbcb059ade83629627ff15b4b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/610/ for more details.

Build is green

Patch application report for D3641 (id=12832)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit b4972fbb2e0a435b256c66617837a3dc3f6a23eb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/611/ for more details.

  • Rebase on latest master.
  • Fix wrong names (was origin_visit instead of origin_visit_status, values wise, it was ok nonetheless)

Build is green

Patch application report for D3641 (id=12835)

Rebasing onto 7667f7eef9...

Current branch diff-target is up to date.
Changes applied before test
commit de11ef60ea74474c16df45149a97e9cfd8244ff0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/612/ for more details.

vlorentz added inline comments.
swh/storage/cassandra/cql.py
822–825 ↗(On Diff #12832)

why the name page_name?

swh/storage/tests/test_storage.py
1355–1356

assert date_visit1 < date_visit2 < date_visit3

swh/storage/cassandra/cql.py
822–825 ↗(On Diff #12832)

i don't know, i kept the naming logic from the function just before.
i adapted the value from pagination to token as you seemed keen on that (from the test).

What do you propose instead (nothing comes to mind for my part)?

swh/storage/cassandra/cql.py
822–825 ↗(On Diff #12832)

It's related to pagination so page for short.

douardda added inline comments.
swh/storage/cassandra/storage.py
873–876 ↗(On Diff #12835)

do we really need those? I mean, these are type annotated, so it seems unnecessary to me to add execution time type validation.

885 ↗(On Diff #12835)

a small comment on the reason for this +1 could be handy (aka get one more result to handle the next_page logic).

Also not sure this extra_limit variable is really helpful, as well as the assertion below seems unnecessary to me (add unit tests for _cql_runner.origin_visit_status_get_range is needed, then trust them).

895 ↗(On Diff #12835)

this is wrong, the next_page_token should be before the visit_statuses above.

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

long story short, here is my proposal for this part of the code:

def origin_visit_status_get([...]):
    next_page_token = None
    date_from = None
    if page_token is not None:
        date_from = datetime.datetime.fromisoformat(page_token)
    rows = self._cql_runner.origin_visit_status_get_range(
        origin, visit, date_from, limit+1, order
    )
    visit_statuses = [converters.row_to_visit_status(row) for row in rows[:limit]]
    if len(rows) > limit:
        next_page_token = str(converters.row_to_visit_status(rows[limit+1].date)
    return PagedResult(results=visit_statuses, next_page_token=next_page_token)
swh/storage/cassandra/storage.py
873–876 ↗(On Diff #12835)

except until recently it was moot...
D3635 fixed it...

But that reassures me a bit to have those ;)

885 ↗(On Diff #12835)

i use the extra-limit variable to avoid having + 1 all over the places which i found annoying to read and prone to error.

895 ↗(On Diff #12835)

tests disagrees there.

898 ↗(On Diff #12835)

let's see if that works ;)

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

Note: this applies to other storage implementations too, i guess.

swh/storage/cassandra/storage.py
895 ↗(On Diff #12835)

then either tests are wrong or the limit+1 is useless, since you never use the extra returned value...

swh/storage/cassandra/storage.py
895 ↗(On Diff #12835)

I'm using it but not for its value for its presence.

swh/storage/cassandra/storage.py
873–876 ↗(On Diff #12835)

but it adds code and (at least) "visual complexity", so I'd rather say "no", but I won't fight too hard, it's not a big deal, really.

swh/storage/cassandra/storage.py
895 ↗(On Diff #12835)

but like i said, i'll try your version which seems simpler to read and to the point alright ;)

swh/storage/cassandra/storage.py
885 ↗(On Diff #12835)

You can avoid multiple +1 without extra_limit:

        rows = self._cql_runner.origin_visit_status_get_range(
            origin, visit, date_from, limit+1, order
        )
...
        if len(visit_statuses) > limit:
...
895 ↗(On Diff #12835)

It's not just the test that disagrees. next_page_token is the date of the last returned visit, and the next call will return all visits with a date strictly greater than the one in the token

898 ↗(On Diff #12835)

if you do that, then you need to change the inequalities in the backend to be non-strict inequalities

swh/storage/cassandra/storage.py
895 ↗(On Diff #12835)

ok then this MUST be documented somehow because it really feels wrong for the fellow dev reading this piece of code.

BTW, do we really need to gather this extra result from the base? why not just add the next link if len(rows) == limit (according we do not use limit+1 when calling it).

Worst case, we do an extra call with no results if the total number of visits is a multiple of limit.

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

oh and your code is incorrect david, because you're returning the limit first results, and use the limit+2's date as token

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

yes it's indeed incorrect since I did not get the behavior of _cql_runner.origin_visit_status_get_range() w.r.t the date_from argument. ( I mean calling a function with a date_from argument, I expect to have results starting from this date by default).
So I say this should be explained in a oneliner comment.

But as said above, I'm not sure I see the real advantage of gathering limit+1 results from the base (just add the next_page_token iff len(rows) == limit )

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

It does not work for the reasons val exposed.

either the tests are wrong...

I think the tests are fine, spent quite some time on it (either here and for origin-visit-get endpoint).

In the end, the improvments i can do here is:

  • using the list comprehension proposed.
  • drop the extra check on the argument (plus it's not consistent, we don't do it on all parameters)

Note:
len(rows) does not work, rows is a ResultSet and they complain about it (mypy and tests).

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

( I mean calling a function with a date_from argument, I expect to have results starting from this date by default).

ah yes, that's reasonable.

(just add the next_page_token iff len(rows) == limit

That saves one call to the api from the user perspective?
Here we know and say as much, there is still something to iterate again over.

without the extra limit, i gather, the user needs to make one more call with possibly nothing to retrieve.

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

ah yes, that's reasonable.

I'll check how hard it is to invert the strictness in the backend...
A priori, i should not have anything to change in the tests...
and in the end, everything should stay green... i think

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

That saves one call to the api from the user perspective?

and David's argument is that it's not worth the extra complexity + fetching an extra record.

Personally I don't care either way.

swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

Yes, but I'm more for the "extra complexity" argument, as you would have guessed ;-) But I don't really care either as long as the code is easy to read and understand (i.e. with comments for traps like the date_from argument)

swh/storage/cassandra/cql.py
822–825 ↗(On Diff #12832)

I'll try something...

date_name = "date"
...
date_name = "no_date"
swh/storage/cassandra/storage.py
898 ↗(On Diff #12835)

(i.e. with comments for traps like the date_from argument)

I successfully inverted the strictness.
So it's no longer a trap, we include the date as starting point.

(And note that i did not realize it was a trap, otherwise, i would have done something about it...)

Though that means we need to exclude the last result from the fetch.
So that means we need the +1 to fetch that result but not returning it. (to respect the limit number,)

We use its date as value for the next-page-token (so it is used!).

i'll check the extra comments you made to adapt according before pushing an update.

Note: I can remove the extra-limit though and just use limit + 1 etc...

Adapt according to discussion:

  • Add some comments
  • Inverse strictness comparison check so we can start from the date (page-token) including it
  • rename a variable in cql to something more clear
  • Drop considered extraneous checks on parameters (and drop the corresponding tests)

For the part about strictness comparison check, note I did not touch the tests at all \m/

Build is green

Patch application report for D3641 (id=12846)

Rebasing onto 7667f7eef9...

Current branch diff-target is up to date.
Changes applied before test
commit 8de3a714051e326a01aa122e6fb09bf081a316cf
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/618/ for more details.

Build is green

Patch application report for D3641 (id=12850)

Rebasing onto e63b78c6f9...

Current branch diff-target is up to date.
Changes applied before test
commit a40a982cf61e2f430cd33456db0fccd522a5d14f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/621/ for more details.

This revision is now accepted and ready to land.Jul 30 2020, 1:58 PM

Build is green

Patch application report for D3641 (id=12853)

Rebasing onto 8cf6efa2e6...

Current branch diff-target is up to date.
Changes applied before test
commit b81f928fa7b941d569bd5d2c9ae8ac7208dbb2c3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/623/ for more details.