Page MenuHomeSoftware Heritage

storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
ClosedPublic

Authored by ardumont on Wed, Jul 29, 4:35 PM.

Details

Summary

This opens new origin-visit-status-get endpoint.

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ardumont added inline comments.Wed, Jul 29, 4:40 PM
swh/storage/db.py
577

salvaging that no longer used endpoint.

Build has FAILED

Patch application report for D3641 (id=12823)

Rebasing onto 24559bb3ba...

Current branch diff-target is up to date.
Changes applied before test
commit 64f773a133f723c247ba1944a47f1184a98f0847
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/602/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/602/console

ardumont planned changes to this revision.Wed, Jul 29, 5:15 PM
ardumont updated this revision to Diff 12829.EditedWed, Jul 29, 5:53 PM

done:

  • Add in-memory storage implementation
  • Improve opaque identifier tests for that endpoints
  • Improve docstring
  • Simplify next_page_token computation

todo:

  • Remain cassandra implementation to do

Build has FAILED

Patch application report for D3641 (id=12829)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit ddfb3a992c91508a12b012b6c1832f9a191f9b72
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/608/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/608/console

ardumont updated this revision to Diff 12830.Wed, Jul 29, 6:22 PM

Add cassandra implementations

For some reason, the tests on signature type is not happy though...

I'll delve into that later:

>           assert expected_signature == actual_signature, meth_name
E           AssertionError: origin_visit_status_get
E           assert <Signature (origin: str, visit: int, page_token: Union[str, NoneType] = None, order: swh.storage.interface.ListOrder = <ListOrder.ASC: 'asc'>, limit: int = 10) -> swh.core.api.classes.PagedResult[swh.model.model.OriginVisitStatus, str]> == <Signature (origin: str, visit: int, page_token: Union[str, NoneType] = None, order: swh.storage.interface.ListOrder = <ListOrder.ASC: 'asc'>, limit: int = 10) -> swh.core.api.classes.PagedResult[swh.model.model.OriginVisit, str]>
ardumont retitled this revision from wip: storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus] to storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus].Wed, Jul 29, 6:23 PM
ardumont edited the summary of this revision. (Show Details)
ardumont edited the test plan for this revision. (Show Details)
ardumont added inline comments.
swh/storage/in_memory.py
956

@ardumont have a look here... see... that's why it fails ;)

ardumont added inline comments.Wed, Jul 29, 6:26 PM
swh/storage/storage.py
985

same here, it's OriginVisitStatus you want ;)

ardumont updated this revision to Diff 12831.Wed, Jul 29, 6:27 PM

Fix mistyped signatures (pgstorage, in-memory)

ardumont edited the test plan for this revision. (Show Details)Wed, Jul 29, 6:27 PM

Build has FAILED

Patch application report for D3641 (id=12830)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit d4a505a7e9aea2f5ce92526f2ec59135b2b6755a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/609/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/609/console

ardumont updated this revision to Diff 12832.Wed, Jul 29, 6:35 PM

Fix db dosctring

Build is green

Patch application report for D3641 (id=12831)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit 63ae21d27d6fd28fbcb059ade83629627ff15b4b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/610/ for more details.

Build is green

Patch application report for D3641 (id=12832)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit b4972fbb2e0a435b256c66617837a3dc3f6a23eb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/611/ for more details.

ardumont updated this revision to Diff 12835.Wed, Jul 29, 7:16 PM
  • Rebase on latest master.
  • Fix wrong names (was origin_visit instead of origin_visit_status, values wise, it was ok nonetheless)

Build is green

Patch application report for D3641 (id=12835)

Rebasing onto 7667f7eef9...

Current branch diff-target is up to date.
Changes applied before test
commit de11ef60ea74474c16df45149a97e9cfd8244ff0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/612/ for more details.

vlorentz added inline comments.
swh/storage/cassandra/cql.py
822–825

why the name page_name?

swh/storage/tests/test_storage.py
1374–1375

assert date_visit1 < date_visit2 < date_visit3

ardumont added inline comments.Thu, Jul 30, 10:56 AM
swh/storage/cassandra/cql.py
822–825

i don't know, i kept the naming logic from the function just before.
i adapted the value from pagination to token as you seemed keen on that (from the test).

What do you propose instead (nothing comes to mind for my part)?

ardumont added inline comments.Thu, Jul 30, 11:21 AM
swh/storage/cassandra/cql.py
822–825

It's related to pagination so page for short.

douardda added inline comments.
swh/storage/cassandra/storage.py
873–876

do we really need those? I mean, these are type annotated, so it seems unnecessary to me to add execution time type validation.

885

a small comment on the reason for this +1 could be handy (aka get one more result to handle the next_page logic).

Also not sure this extra_limit variable is really helpful, as well as the assertion below seems unnecessary to me (add unit tests for _cql_runner.origin_visit_status_get_range is needed, then trust them).

895

this is wrong, the next_page_token should be before the visit_statuses above.

douardda added inline comments.Thu, Jul 30, 11:30 AM
swh/storage/cassandra/storage.py
898

long story short, here is my proposal for this part of the code:

def origin_visit_status_get([...]):
    next_page_token = None
    date_from = None
    if page_token is not None:
        date_from = datetime.datetime.fromisoformat(page_token)
    rows = self._cql_runner.origin_visit_status_get_range(
        origin, visit, date_from, limit+1, order
    )
    visit_statuses = [converters.row_to_visit_status(row) for row in rows[:limit]]
    if len(rows) > limit:
        next_page_token = str(converters.row_to_visit_status(rows[limit+1].date)
    return PagedResult(results=visit_statuses, next_page_token=next_page_token)
ardumont added inline comments.Thu, Jul 30, 11:32 AM
swh/storage/cassandra/storage.py
873–876

except until recently it was moot...
D3635 fixed it...

But that reassures me a bit to have those ;)

885

i use the extra-limit variable to avoid having + 1 all over the places which i found annoying to read and prone to error.

895

tests disagrees there.

898

let's see if that works ;)

douardda added inline comments.Thu, Jul 30, 11:32 AM
swh/storage/cassandra/storage.py
898

Note: this applies to other storage implementations too, i guess.

douardda added inline comments.Thu, Jul 30, 11:34 AM
swh/storage/cassandra/storage.py
895

then either tests are wrong or the limit+1 is useless, since you never use the extra returned value...

ardumont added inline comments.Thu, Jul 30, 11:35 AM
swh/storage/cassandra/storage.py
895

I'm using it but not for its value for its presence.

douardda added inline comments.Thu, Jul 30, 11:36 AM
swh/storage/cassandra/storage.py
873–876

but it adds code and (at least) "visual complexity", so I'd rather say "no", but I won't fight too hard, it's not a big deal, really.

ardumont added inline comments.Thu, Jul 30, 11:37 AM
swh/storage/cassandra/storage.py
895

but like i said, i'll try your version which seems simpler to read and to the point alright ;)

vlorentz added inline comments.Thu, Jul 30, 11:38 AM
swh/storage/cassandra/storage.py
885

You can avoid multiple +1 without extra_limit:

        rows = self._cql_runner.origin_visit_status_get_range(
            origin, visit, date_from, limit+1, order
        )
...
        if len(visit_statuses) > limit:
...
895

It's not just the test that disagrees. next_page_token is the date of the last returned visit, and the next call will return all visits with a date strictly greater than the one in the token

898

if you do that, then you need to change the inequalities in the backend to be non-strict inequalities

douardda added inline comments.Thu, Jul 30, 11:41 AM
swh/storage/cassandra/storage.py
895

ok then this MUST be documented somehow because it really feels wrong for the fellow dev reading this piece of code.

BTW, do we really need to gather this extra result from the base? why not just add the next link if len(rows) == limit (according we do not use limit+1 when calling it).

Worst case, we do an extra call with no results if the total number of visits is a multiple of limit.

vlorentz added inline comments.Thu, Jul 30, 11:45 AM
swh/storage/cassandra/storage.py
898

oh and your code is incorrect david, because you're returning the limit first results, and use the limit+2's date as token

douardda added inline comments.Thu, Jul 30, 11:56 AM
swh/storage/cassandra/storage.py
898

yes it's indeed incorrect since I did not get the behavior of _cql_runner.origin_visit_status_get_range() w.r.t the date_from argument. ( I mean calling a function with a date_from argument, I expect to have results starting from this date by default).
So I say this should be explained in a oneliner comment.

But as said above, I'm not sure I see the real advantage of gathering limit+1 results from the base (just add the next_page_token iff len(rows) == limit )

ardumont added inline comments.Thu, Jul 30, 11:56 AM
swh/storage/cassandra/storage.py
898

It does not work for the reasons val exposed.

either the tests are wrong...

I think the tests are fine, spent quite some time on it (either here and for origin-visit-get endpoint).

In the end, the improvments i can do here is:

  • using the list comprehension proposed.
  • drop the extra check on the argument (plus it's not consistent, we don't do it on all parameters)

Note:
len(rows) does not work, rows is a ResultSet and they complain about it (mypy and tests).

ardumont added inline comments.Thu, Jul 30, 12:00 PM
swh/storage/cassandra/storage.py
898

( I mean calling a function with a date_from argument, I expect to have results starting from this date by default).

ah yes, that's reasonable.

(just add the next_page_token iff len(rows) == limit

That saves one call to the api from the user perspective?
Here we know and say as much, there is still something to iterate again over.

without the extra limit, i gather, the user needs to make one more call with possibly nothing to retrieve.

ardumont added inline comments.Thu, Jul 30, 12:03 PM
swh/storage/cassandra/storage.py
898

ah yes, that's reasonable.

I'll check how hard it is to invert the strictness in the backend...
A priori, i should not have anything to change in the tests...
and in the end, everything should stay green... i think

vlorentz added inline comments.Thu, Jul 30, 12:10 PM
swh/storage/cassandra/storage.py
898

That saves one call to the api from the user perspective?

and David's argument is that it's not worth the extra complexity + fetching an extra record.

Personally I don't care either way.

douardda added inline comments.Thu, Jul 30, 12:14 PM
swh/storage/cassandra/storage.py
898

Yes, but I'm more for the "extra complexity" argument, as you would have guessed ;-) But I don't really care either as long as the code is easy to read and understand (i.e. with comments for traps like the date_from argument)

ardumont added inline comments.Thu, Jul 30, 12:30 PM
swh/storage/cassandra/cql.py
822–825

I'll try something...

date_name = "date"
...
date_name = "no_date"
swh/storage/cassandra/storage.py
898

(i.e. with comments for traps like the date_from argument)

I successfully inverted the strictness.
So it's no longer a trap, we include the date as starting point.

(And note that i did not realize it was a trap, otherwise, i would have done something about it...)

Though that means we need to exclude the last result from the fetch.
So that means we need the +1 to fetch that result but not returning it. (to respect the limit number,)

We use its date as value for the next-page-token (so it is used!).

i'll check the extra comments you made to adapt according before pushing an update.

Note: I can remove the extra-limit though and just use limit + 1 etc...

ardumont updated this revision to Diff 12846.Thu, Jul 30, 12:42 PM

Adapt according to discussion:

  • Add some comments
  • Inverse strictness comparison check so we can start from the date (page-token) including it
  • rename a variable in cql to something more clear
  • Drop considered extraneous checks on parameters (and drop the corresponding tests)

For the part about strictness comparison check, note I did not touch the tests at all \m/

ardumont marked 17 inline comments as done.Thu, Jul 30, 12:43 PM

Build is green

Patch application report for D3641 (id=12846)

Rebasing onto 7667f7eef9...

Current branch diff-target is up to date.
Changes applied before test
commit 8de3a714051e326a01aa122e6fb09bf081a316cf
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/618/ for more details.

Build is green

Patch application report for D3641 (id=12850)

Rebasing onto e63b78c6f9...

Current branch diff-target is up to date.
Changes applied before test
commit a40a982cf61e2f430cd33456db0fccd522a5d14f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/621/ for more details.

This revision is now accepted and ready to land.Thu, Jul 30, 1:58 PM

Build is green

Patch application report for D3641 (id=12853)

Rebasing onto 8cf6efa2e6...

Current branch diff-target is up to date.
Changes applied before test
commit b81f928fa7b941d569bd5d2c9ae8ac7208dbb2c3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/623/ for more details.