Page MenuHomeSoftware Heritage

storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
ClosedPublic

Authored by ardumont on Jul 29 2020, 4:35 PM.

Details

Summary

This opens new origin-visit-status-get endpoint.

Related to T645

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
open-snapshot-id
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 14089
Build 21638: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 21637: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
swh/storage/db.py
577

salvaging that no longer used endpoint.

Build has FAILED

Patch application report for D3641 (id=12823)

Rebasing onto 24559bb3ba...

Current branch diff-target is up to date.
Changes applied before test
commit 64f773a133f723c247ba1944a47f1184a98f0847
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/602/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/602/console

done:

  • Add in-memory storage implementation
  • Improve opaque identifier tests for that endpoints
  • Improve docstring
  • Simplify next_page_token computation

todo:

  • Remain cassandra implementation to do

Build has FAILED

Patch application report for D3641 (id=12829)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit ddfb3a992c91508a12b012b6c1832f9a191f9b72
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/608/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/608/console

Add cassandra implementations

For some reason, the tests on signature type is not happy though...

I'll delve into that later:

>           assert expected_signature == actual_signature, meth_name
E           AssertionError: origin_visit_status_get
E           assert <Signature (origin: str, visit: int, page_token: Union[str, NoneType] = None, order: swh.storage.interface.ListOrder = <ListOrder.ASC: 'asc'>, limit: int = 10) -> swh.core.api.classes.PagedResult[swh.model.model.OriginVisitStatus, str]> == <Signature (origin: str, visit: int, page_token: Union[str, NoneType] = None, order: swh.storage.interface.ListOrder = <ListOrder.ASC: 'asc'>, limit: int = 10) -> swh.core.api.classes.PagedResult[swh.model.model.OriginVisit, str]>
ardumont retitled this revision from wip: storage*: new origin_visit_status_get(...) -> PagedResult[OriginVisitStatus] to storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus].Jul 29 2020, 6:23 PM
ardumont edited the summary of this revision. (Show Details)
ardumont edited the test plan for this revision. (Show Details)
ardumont added inline comments.
swh/storage/in_memory.py
955

@ardumont have a look here... see... that's why it fails ;)

swh/storage/storage.py
984

same here, it's OriginVisitStatus you want ;)

Fix mistyped signatures (pgstorage, in-memory)

Build has FAILED

Patch application report for D3641 (id=12830)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit d4a505a7e9aea2f5ce92526f2ec59135b2b6755a
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/609/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/609/console

Build is green

Patch application report for D3641 (id=12831)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit 63ae21d27d6fd28fbcb059ade83629627ff15b4b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/610/ for more details.

Build is green

Patch application report for D3641 (id=12832)

Rebasing onto 2bfd2f7091...

Current branch diff-target is up to date.
Changes applied before test
commit b4972fbb2e0a435b256c66617837a3dc3f6a23eb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/611/ for more details.

  • Rebase on latest master.
  • Fix wrong names (was origin_visit instead of origin_visit_status, values wise, it was ok nonetheless)

Build is green

Patch application report for D3641 (id=12835)

Rebasing onto 7667f7eef9...

Current branch diff-target is up to date.
Changes applied before test
commit de11ef60ea74474c16df45149a97e9cfd8244ff0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/612/ for more details.

vlorentz added inline comments.
swh/storage/cassandra/cql.py
821–824

why the name page_name?

swh/storage/tests/test_storage.py
1373–1374

assert date_visit1 < date_visit2 < date_visit3

swh/storage/cassandra/cql.py
821–824

i don't know, i kept the naming logic from the function just before.
i adapted the value from pagination to token as you seemed keen on that (from the test).

What do you propose instead (nothing comes to mind for my part)?

swh/storage/cassandra/cql.py
821–824

It's related to pagination so page for short.

douardda added inline comments.
swh/storage/cassandra/storage.py
872–875

do we really need those? I mean, these are type annotated, so it seems unnecessary to me to add execution time type validation.

884

a small comment on the reason for this +1 could be handy (aka get one more result to handle the next_page logic).

Also not sure this extra_limit variable is really helpful, as well as the assertion below seems unnecessary to me (add unit tests for _cql_runner.origin_visit_status_get_range is needed, then trust them).

894

this is wrong, the next_page_token should be before the visit_statuses above.

swh/storage/cassandra/storage.py
897

long story short, here is my proposal for this part of the code:

def origin_visit_status_get([...]):
    next_page_token = None
    date_from = None
    if page_token is not None:
        date_from = datetime.datetime.fromisoformat(page_token)
    rows = self._cql_runner.origin_visit_status_get_range(
        origin, visit, date_from, limit+1, order
    )
    visit_statuses = [converters.row_to_visit_status(row) for row in rows[:limit]]
    if len(rows) > limit:
        next_page_token = str(converters.row_to_visit_status(rows[limit+1].date)
    return PagedResult(results=visit_statuses, next_page_token=next_page_token)
swh/storage/cassandra/storage.py
872–875

except until recently it was moot...
D3635 fixed it...

But that reassures me a bit to have those ;)

884

i use the extra-limit variable to avoid having + 1 all over the places which i found annoying to read and prone to error.

894

tests disagrees there.

897

let's see if that works ;)

swh/storage/cassandra/storage.py
897

Note: this applies to other storage implementations too, i guess.

swh/storage/cassandra/storage.py
894

then either tests are wrong or the limit+1 is useless, since you never use the extra returned value...

swh/storage/cassandra/storage.py
894

I'm using it but not for its value for its presence.

swh/storage/cassandra/storage.py
872–875

but it adds code and (at least) "visual complexity", so I'd rather say "no", but I won't fight too hard, it's not a big deal, really.

swh/storage/cassandra/storage.py
894

but like i said, i'll try your version which seems simpler to read and to the point alright ;)

swh/storage/cassandra/storage.py
884

You can avoid multiple +1 without extra_limit:

        rows = self._cql_runner.origin_visit_status_get_range(
            origin, visit, date_from, limit+1, order
        )
...
        if len(visit_statuses) > limit:
...
894

It's not just the test that disagrees. next_page_token is the date of the last returned visit, and the next call will return all visits with a date strictly greater than the one in the token

897

if you do that, then you need to change the inequalities in the backend to be non-strict inequalities

swh/storage/cassandra/storage.py
894

ok then this MUST be documented somehow because it really feels wrong for the fellow dev reading this piece of code.

BTW, do we really need to gather this extra result from the base? why not just add the next link if len(rows) == limit (according we do not use limit+1 when calling it).

Worst case, we do an extra call with no results if the total number of visits is a multiple of limit.

swh/storage/cassandra/storage.py
897

oh and your code is incorrect david, because you're returning the limit first results, and use the limit+2's date as token

swh/storage/cassandra/storage.py
897

yes it's indeed incorrect since I did not get the behavior of _cql_runner.origin_visit_status_get_range() w.r.t the date_from argument. ( I mean calling a function with a date_from argument, I expect to have results starting from this date by default).
So I say this should be explained in a oneliner comment.

But as said above, I'm not sure I see the real advantage of gathering limit+1 results from the base (just add the next_page_token iff len(rows) == limit )

swh/storage/cassandra/storage.py
897

It does not work for the reasons val exposed.

either the tests are wrong...

I think the tests are fine, spent quite some time on it (either here and for origin-visit-get endpoint).

In the end, the improvments i can do here is:

  • using the list comprehension proposed.
  • drop the extra check on the argument (plus it's not consistent, we don't do it on all parameters)

Note:
len(rows) does not work, rows is a ResultSet and they complain about it (mypy and tests).

swh/storage/cassandra/storage.py
897

( I mean calling a function with a date_from argument, I expect to have results starting from this date by default).

ah yes, that's reasonable.

(just add the next_page_token iff len(rows) == limit

That saves one call to the api from the user perspective?
Here we know and say as much, there is still something to iterate again over.

without the extra limit, i gather, the user needs to make one more call with possibly nothing to retrieve.

swh/storage/cassandra/storage.py
897

ah yes, that's reasonable.

I'll check how hard it is to invert the strictness in the backend...
A priori, i should not have anything to change in the tests...
and in the end, everything should stay green... i think

swh/storage/cassandra/storage.py
897

That saves one call to the api from the user perspective?

and David's argument is that it's not worth the extra complexity + fetching an extra record.

Personally I don't care either way.

swh/storage/cassandra/storage.py
897

Yes, but I'm more for the "extra complexity" argument, as you would have guessed ;-) But I don't really care either as long as the code is easy to read and understand (i.e. with comments for traps like the date_from argument)

swh/storage/cassandra/cql.py
821–824

I'll try something...

date_name = "date"
...
date_name = "no_date"
swh/storage/cassandra/storage.py
897

(i.e. with comments for traps like the date_from argument)

I successfully inverted the strictness.
So it's no longer a trap, we include the date as starting point.

(And note that i did not realize it was a trap, otherwise, i would have done something about it...)

Though that means we need to exclude the last result from the fetch.
So that means we need the +1 to fetch that result but not returning it. (to respect the limit number,)

We use its date as value for the next-page-token (so it is used!).

i'll check the extra comments you made to adapt according before pushing an update.

Note: I can remove the extra-limit though and just use limit + 1 etc...

Adapt according to discussion:

  • Add some comments
  • Inverse strictness comparison check so we can start from the date (page-token) including it
  • rename a variable in cql to something more clear
  • Drop considered extraneous checks on parameters (and drop the corresponding tests)

For the part about strictness comparison check, note I did not touch the tests at all \m/

Build is green

Patch application report for D3641 (id=12846)

Rebasing onto 7667f7eef9...

Current branch diff-target is up to date.
Changes applied before test
commit 8de3a714051e326a01aa122e6fb09bf081a316cf
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/618/ for more details.

Build is green

Patch application report for D3641 (id=12850)

Rebasing onto e63b78c6f9...

Current branch diff-target is up to date.
Changes applied before test
commit a40a982cf61e2f430cd33456db0fccd522a5d14f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/621/ for more details.

This revision is now accepted and ready to land.Jul 30 2020, 1:58 PM

Build is green

Patch application report for D3641 (id=12853)

Rebasing onto 8cf6efa2e6...

Current branch diff-target is up to date.
Changes applied before test
commit b81f928fa7b941d569bd5d2c9ae8ac7208dbb2c3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jul 29 16:33:31 2020 +0200

    storage*: add origin_visit_status_get(...) -> PagedResult[OriginVisitStatus]
    
    Related to T645

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/623/ for more details.