Page MenuHomeSoftware Heritage

Iterate over paginated visits in batches to retrieve latest visit/snapshot
ClosedPublic

Authored by ardumont on Jun 24 2020, 6:11 PM.

Details

Summary

This should stop the current timeouts on origin with a high number of visits [1]

[1] https://sentry.softwareheritage.org/share/issue/3786bfd7b0414464a037d6a2a6c6ff68/

Depends on D3359

Related to T2310

Test Plan

tox

Diff Detail

Repository
rDSTO Storage manager
Branch
add-limit-to-snapshot-get-latest
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 13149
Build 20069: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 20068: arc lint + arc unit

Event Timeline

swh/storage/cassandra/cql.py
675 ↗(On Diff #11873)

Those methods were private and used only once so no need for them.

692 ↗(On Diff #11873)

Instead i used the same pattern used in storage (which i found clearer).

Build has FAILED

Patch application report for D3350 (id=11873)

Could not rebase; Attempt merge onto 621fc8d377...

Updating 621fc8d3..ccf83882
Fast-forward
 swh/storage/algos/origin.py            |  5 ++-
 swh/storage/algos/snapshot.py          |  8 ++++-
 swh/storage/cassandra/cql.py           | 46 +++++++++++++++------------
 swh/storage/cassandra/storage.py       |  9 ++++--
 swh/storage/db.py                      | 42 +++++++++++++-----------
 swh/storage/in_memory.py               | 11 +++++--
 swh/storage/interface.py               |  8 ++++-
 swh/storage/storage.py                 |  4 ++-
 swh/storage/tests/algos/test_origin.py |  8 +++++
 swh/storage/tests/test_storage.py      | 58 ++++++++++++++++++++++++++++++++++
 10 files changed, 150 insertions(+), 49 deletions(-)
Changes applied before test
commit ccf838822aa3d1aeba460edc36233d28ad3521a4
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 24 18:04:29 2020 +0200

    storage: Limit the snapshot-get-latest lookup
    
    This opens a sort parameter to the origin-visit-get storage endpoints to allow
    the search to be consistent.

commit e2981059d761a6773cbec4a327c5e72450834bbb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 24 17:11:36 2020 +0200

    test_storage: Add missing tests on origin_visit_get method

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/345/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/345/console

swh/storage/algos/origin.py
71–72

oh, no longer need to list it... ;)

swh/storage/cassandra/cql.py
692 ↗(On Diff #11873)

(in reply to: "Those methods were private and used only once so no need for them." comment above to the left).

After discussion with vlorentz, I now believe this diff can be splitted in 2.

One diff to change the sort order by default to "desc" on visit id for the
"origin-visit-get" endpoint. Currently it's historically "asc" but given its
current use, that does not seem to be the reasonable order. "desc" sounds like
a better default. So that means no need to open the "order" parameter as it's
done in the current diff state.

The other diff would be about the current "limit" parameter to open to both
"snapshot-get-latest" and "origin-get-latest-visit-status" functions.

After discussion with vlorentz, I now believe this diff can be splitted in 2.

One diff to change the sort order by default to "desc" on visit id for the
"origin-visit-get" endpoint. Currently it's historically "asc" but given its
current use, that does not seem to be the reasonable order. "desc" sounds like
a better default. So that means no need to open the "order" parameter as it's
done in the current diff state.

Well, I'm not so sure anymore.
Looking at clients using that endoint, he webapp uses the last-visit
parameter which allows paginated results (assuming an ascending order).

I don't measure well the impacts of doing it in the descending order...

swh/storage/algos/origin.py
71–72

wrong, need it for the no result case.

swh/storage/cassandra/cql.py
692 ↗(On Diff #11873)

Although, prepared statement for cassandra are faster.
So that might need a bit of rework.

swh/storage/tests/algos/test_origin.py
340 ↗(On Diff #11873)

Should be None.

  • Rebase on latest master
  • Add more tests on edge case
  • (Ensure green tests in ci)

TODO:

  • Rework cassandra implementation with prepared-statements

Build is green

Patch application report for D3350 (id=11903)

Rebasing onto b991e69707...

Current branch diff-target is up to date.
Changes applied before test
commit 1ad9926442377ef4e507bb252a3f67e48a29c2c0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 24 18:04:29 2020 +0200

    storage: Limit the *-get-latest functions lookup
    
    Impacts functions:
    - snapshot-get-latest
    - origin-get-latest-visit-status
    
    To this end, this opens a sort parameter to the origin-visit-get storage
    endpoints to allow the search from most recent to oldest.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/358/ for more details.

  • Split the diff in 2 D3359 and this one (to reduce the concern to only the limit part)

Depends on D3359

Build is green

Patch application report for D3350 (id=11911)

Could not rebase; Attempt merge onto b991e69707...

Updating b991e697..1f6e780e
Fast-forward
 swh/storage/algos/origin.py              |  5 ++--
 swh/storage/algos/snapshot.py            |  8 ++++-
 swh/storage/cassandra/cql.py             | 51 +++++++++++++++++++++++++++-----
 swh/storage/cassandra/storage.py         |  9 ++++--
 swh/storage/db.py                        | 42 ++++++++++++++------------
 swh/storage/in_memory.py                 | 11 +++++--
 swh/storage/interface.py                 |  8 ++++-
 swh/storage/storage.py                   |  4 ++-
 swh/storage/tests/algos/test_origin.py   |  7 +++++
 swh/storage/tests/algos/test_snapshot.py | 26 +++++++++-------
 swh/storage/tests/test_storage.py        |  8 +++++
 11 files changed, 133 insertions(+), 46 deletions(-)
Changes applied before test
commit 1f6e780e7b3b0fbf7ebc8ee09abefbce3f173a21
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 25 18:37:18 2020 +0200

    storage: Limit the *-get-latest functions search
    
    Impacts functions:
    - snapshot-get-latest
    - origin-get-latest-visit-status

commit 82e6070c0712ba47bfbedcc2c3329f1828c14fd0
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 24 18:04:29 2020 +0200

    storage: Open order parameter to origin-visit-get endpoint
    
    This allows clients to search from most recent to oldest calling the endpoint
    with order parameter to "desc" (visit id desc)
    
    This keeps and explicits the existing sorting order as "asc" (visit id asc)

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/365/ for more details.

ardumont edited the summary of this revision. (Show Details)
  • Rebase on latest D3359
  • Rework comments
swh/storage/tests/algos/test_snapshot.py
139

that should go away.

Remove unneeded extra fields in origin-visit (i'm cleaning those up everywhere)

Build is green

Patch application report for D3350 (id=11922)

Could not rebase; Attempt merge onto 8620519891...

Updating 86205198..90c91255
Fast-forward
 swh/storage/algos/origin.py              |  5 +-
 swh/storage/algos/snapshot.py            |  8 ++-
 swh/storage/cassandra/cql.py             | 93 ++++++++++++++++++++++++++++----
 swh/storage/cassandra/storage.py         |  9 +++-
 swh/storage/db.py                        | 45 +++++++++-------
 swh/storage/in_memory.py                 | 15 ++++--
 swh/storage/interface.py                 |  7 ++-
 swh/storage/storage.py                   |  4 +-
 swh/storage/tests/algos/test_origin.py   |  7 +++
 swh/storage/tests/algos/test_snapshot.py | 21 ++++----
 swh/storage/tests/test_storage.py        | 39 +++++++++++++-
 11 files changed, 200 insertions(+), 53 deletions(-)
Changes applied before test
commit 90c9125540ed0d97c31200df21ca04b9e3db594c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 25 18:37:18 2020 +0200

    storage: Limit the *-get-latest functions search
    
    Impacts functions:
    - snapshot-get-latest
    - origin-get-latest-visit-status

commit 1d5717fb5e34831fc002f9b625156bfd8d7384af
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 24 18:04:29 2020 +0200

    storage*: Open order parameter to origin-visit-get endpoint
    
    This allows clients to search from most recent to oldest visit when calling the
    endpoint with the "order" parameter set to "desc" (visit id desc).
    
    This keeps and explicits the existing sorting order as visit id "asc".
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/369/ for more details.

Build has FAILED

Patch application report for D3350 (id=11923)

Could not rebase; Attempt merge onto 8620519891...

Updating 86205198..96dd4417
Fast-forward
 swh/storage/algos/origin.py              |  5 +-
 swh/storage/algos/snapshot.py            |  8 ++-
 swh/storage/cassandra/cql.py             | 93 ++++++++++++++++++++++++++++----
 swh/storage/cassandra/storage.py         |  9 +++-
 swh/storage/db.py                        | 45 +++++++++-------
 swh/storage/in_memory.py                 | 15 ++++--
 swh/storage/interface.py                 |  7 ++-
 swh/storage/storage.py                   |  4 +-
 swh/storage/tests/algos/test_origin.py   |  7 +++
 swh/storage/tests/algos/test_snapshot.py | 21 +++-----
 swh/storage/tests/test_storage.py        | 39 +++++++++++++-
 11 files changed, 196 insertions(+), 57 deletions(-)
Changes applied before test
commit 96dd4417f0286dabb55f02a08f01b899415ee344
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 25 18:37:18 2020 +0200

    storage: Limit the *-get-latest functions search
    
    Impacts functions:
    - snapshot-get-latest
    - origin-get-latest-visit-status

commit 1d5717fb5e34831fc002f9b625156bfd8d7384af
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 24 18:04:29 2020 +0200

    storage*: Open order parameter to origin-visit-get endpoint
    
    This allows clients to search from most recent to oldest visit when calling the
    endpoint with the "order" parameter set to "desc" (visit id desc).
    
    This keeps and explicits the existing sorting order as visit id "asc".
    
    Related to T2310

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/371/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/371/console

Build has FAILED

yes, for the wrong reason, flaky tests...

Build is green

Patch application report for D3350 (id=11927)

Could not rebase; Attempt merge onto f75cd41ab8...

Updating f75cd41a..2a56db1b
Fast-forward
 swh/storage/algos/origin.py              |  5 +-
 swh/storage/algos/snapshot.py            |  8 ++-
 swh/storage/cassandra/cql.py             | 93 ++++++++++++++++++++++++++++----
 swh/storage/cassandra/storage.py         |  9 +++-
 swh/storage/db.py                        | 45 +++++++++-------
 swh/storage/in_memory.py                 | 15 ++++--
 swh/storage/interface.py                 |  7 ++-
 swh/storage/storage.py                   |  4 +-
 swh/storage/tests/algos/test_origin.py   |  7 +++
 swh/storage/tests/algos/test_snapshot.py | 21 +++-----
 swh/storage/tests/test_storage.py        | 39 +++++++++++++-
 11 files changed, 196 insertions(+), 57 deletions(-)
Changes applied before test
commit 2a56db1b1b31f055b53705a8cf0914de6894d451
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 25 18:37:18 2020 +0200

    storage: Limit the *-get-latest functions search
    
    Impacts functions:
    - snapshot-get-latest
    - origin-get-latest-visit-status

commit 3b1695b2e7b1f5e227410b41953921e5a5806554
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 24 18:04:29 2020 +0200

    storage*: Open order parameter to origin-visit-get endpoint
    
    This allows clients to search from most recent to oldest visit when calling the
    endpoint with the "order" parameter set to "desc" (visit id desc).
    
    This keeps and explicits the existing sorting order as visit id "asc".
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/374/ for more details.

Build is green

Patch application report for D3350 (id=11931)

Could not rebase; Attempt merge onto f75cd41ab8...

Updating f75cd41a..e5f5a2d1
Fast-forward
 swh/storage/algos/origin.py              |  5 +-
 swh/storage/algos/snapshot.py            |  8 ++-
 swh/storage/cassandra/cql.py             | 93 ++++++++++++++++++++++++++++----
 swh/storage/cassandra/storage.py         |  9 +++-
 swh/storage/db.py                        | 50 ++++++++++-------
 swh/storage/in_memory.py                 | 15 ++++--
 swh/storage/interface.py                 |  7 ++-
 swh/storage/storage.py                   |  4 +-
 swh/storage/tests/algos/test_origin.py   |  7 +++
 swh/storage/tests/algos/test_snapshot.py | 21 +++-----
 swh/storage/tests/test_storage.py        | 39 +++++++++++++-
 11 files changed, 201 insertions(+), 57 deletions(-)
Changes applied before test
commit e5f5a2d1a047faeab35253572a5b629e75af4edd
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 25 18:37:18 2020 +0200

    storage: Limit the *-get-latest functions search
    
    Impacts functions:
    - snapshot-get-latest
    - origin-get-latest-visit-status

commit 182ee497328a55d99609e5230146268a108508e3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Jun 24 18:04:29 2020 +0200

    storage*: Open order parameter to origin-visit-get endpoint
    
    This allows clients to search from most recent to oldest visit when calling the
    endpoint with the "order" parameter set to "desc" (visit id desc).
    
    This keeps and explicits the existing sorting order as visit id "asc".
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/377/ for more details.

That's confusing, limit is usually the number of items *after* filtering, but you're limiting before filtering.

That's confusing, limit is usually the number of items *after* filtering, but you're limiting before filtering.

I don't follow the after/before filtering you mention. Maybe the tests scenario
are confusing because they use such a small limit though. I did not really want
to craft a more complicated scenario ¯\_(ツ)_/¯.

Implementation wise, that limit is passed through to the backend(s) which does
both the filtering and limit the results at the same time...

In any case, do you have a sample in mind for the limit use you are mentioning?

Thanks in advance.

If function(X, limit=1000) returns one item, then function(X, limit=1) should also return one item, not zero.

I don't follow the after/before filtering you mention. Maybe the tests scenario
are confusing because they use such a small limit though. I did not really want
to craft a more complicated scenario ¯\_(ツ)_/¯.

It's not about the tests. origin_get_latest_visit_status gets limit records from the storage, then filters; instead of getting records until it gets enough to return limit records.

If function(X, limit=1000) returns one item, then function(X, limit=1) should also return one item, not zero.

I thought it made sense initially but now i'm no longer sure...
I think "limit" does not convey what i think it means...

It's not about the tests. origin_get_latest_visit_status gets limit records from the storage, then filters; instead of getting records until it gets enough to return limit records.

Yes, points towards "does not convey" what i think.
I think i perceive something...

I don't want to limit the number of records to return because those functions should
always be Optional[Union[OriginVisit, Snapshot]] so 0 or 1 (depending on the function heh).

I want to limit the query records looked up internally by the backend so that
it does not timeout. So "limit" seems to be the wrong term...
Also, I opened the parameter so clients can adapt that "limit".

But maybe what i want is iterate until we find something and not open that parameter at all?
Something like:

last_visit = -1
while last_visit is not None:
    visits = list(storage.origin_visit_get(
        origin_url,
        last_visit=None if last_visit == -1 else last_visit,
        order="desc", limit=10)
    )
    if not visits:
        return None
    last_visit = visits[-1]["visit"]

    visit_status: Optional[OriginVisitStatus] = None
    visit: Dict[str, Any]
    for visit in visits:
        if type is not None and visit["type"] != type:
            continue
        visit_status = storage.origin_visit_status_get_latest(
            origin_url,
            visit["visit"],
            allowed_statuses=allowed_statuses,
            require_snapshot=require_snapshot,
        )
        if visit_status is not None:
            return (OriginVisit.from_dict(visit), visit_status)

(pseudo ^, not tested yet)

If function(X, limit=1000) returns one item, then function(X, limit=1) should also return one item, not zero.

I thought it made sense initially but now i'm no longer sure...
I think "limit" does not convey what i think it means...

Yes, that was my point. Your code is correct, I just don't think the word limit is the right one.

But maybe what i want is iterate until we find something and not open that parameter at all?
Something like:

last_visit = -1
while last_visit is not None:
    visits = list(storage.origin_visit_get(
        origin_url,
        last_visit=None if last_visit == -1 else last_visit,
        order="desc", limit=10)
    )
    if not visits:
        return None
    last_visit = visits[-1]["visit"]

    visit_status: Optional[OriginVisitStatus] = None
    visit: Dict[str, Any]
    for visit in visits:
        if type is not None and visit["type"] != type:
            continue
        visit_status = storage.origin_visit_status_get_latest(
            origin_url,
            visit["visit"],
            allowed_statuses=allowed_statuses,
            require_snapshot=require_snapshot,
        )
        if visit_status is not None:
            return (OriginVisit.from_dict(visit), visit_status)

(pseudo ^, not tested yet)

That would be great. Best of both worlds: we don't fetch too many at once, but still get the correct result all the time.

(pseudo ^, not tested yet)

That actually works.
So no need to open that so-called "limit" parameter after all.
Instead, iterate over paginated results of origin-visit from descending order \o/.

Awesome!
Thanks for making me see the lights :)

Iterate over paginated visits in batches to retrieve latest visit/snapshot

This should stops the current timeouts on origin with a high number of visits.

@vlorentz, awesome, thanks! you rock dude! ;)

Build is green

Patch application report for D3350 (id=11934)

Rebasing onto 182ee49732...

Current branch diff-target is up to date.
Changes applied before test
commit 026c5c62f4bc00ff0f41407a5a0f3daf1f80ecad
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 25 18:37:18 2020 +0200

    Iterate over paginated visits in batches to retrieve latest visit/snapshot
    
    This should stops the current timeouts on origin with a high number of visits.
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/378/ for more details.

Simplify loop and stop condition

Build is green

Patch application report for D3350 (id=11935)

Rebasing onto 182ee49732...

Current branch diff-target is up to date.
Changes applied before test
commit 8c039ef4bc4f707d5513159eb4c50fe409e6321c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 25 18:37:18 2020 +0200

    Iterate over paginated visits in batches to retrieve latest visit/snapshot
    
    This should stops the current timeouts on origin with a high number of visits.
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/379/ for more details.

swh/storage/algos/origin.py
99

This is not needed, we'll get out of the loop with line 79.

Simplify even further stop condition.

ardumont retitled this revision from storage: Limit the snapshot-get-latest/origin-get-latest-visit-status lookup to Iterate over paginated visits in batches to retrieve latest visit/snapshot.Jun 26 2020, 5:42 PM
ardumont edited the summary of this revision. (Show Details)

Build is green

Patch application report for D3350 (id=11936)

Rebasing onto 182ee49732...

Current branch diff-target is up to date.
Changes applied before test
commit 10443b8a17c4bd866884fe9c572ba8d13391656c
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Jun 25 18:37:18 2020 +0200

    Iterate over paginated visits in batches to retrieve latest visit/snapshot
    
    This should stops the current timeouts on origin with a high number of visits.
    
    Related to T2310

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/380/ for more details.

This revision is now accepted and ready to land.Jun 26 2020, 5:50 PM