This should stop the current timeouts on origin with a high number of visits [1]
[1] https://sentry.softwareheritage.org/share/issue/3786bfd7b0414464a037d6a2a6c6ff68/
Depends on D3359
Related to T2310
Differential D3350
Iterate over paginated visits in batches to retrieve latest visit/snapshot ardumont on Jun 24 2020, 6:11 PM. Authored by
Details
This should stop the current timeouts on origin with a high number of visits [1] [1] https://sentry.softwareheritage.org/share/issue/3786bfd7b0414464a037d6a2a6c6ff68/ Depends on D3359 Related to T2310 tox
Diff Detail
Event TimelineComment Actions Build has FAILED Patch application report for D3350 (id=11873)Could not rebase; Attempt merge onto 621fc8d377... Updating 621fc8d3..ccf83882 Fast-forward swh/storage/algos/origin.py | 5 ++- swh/storage/algos/snapshot.py | 8 ++++- swh/storage/cassandra/cql.py | 46 +++++++++++++++------------ swh/storage/cassandra/storage.py | 9 ++++-- swh/storage/db.py | 42 +++++++++++++----------- swh/storage/in_memory.py | 11 +++++-- swh/storage/interface.py | 8 ++++- swh/storage/storage.py | 4 ++- swh/storage/tests/algos/test_origin.py | 8 +++++ swh/storage/tests/test_storage.py | 58 ++++++++++++++++++++++++++++++++++ 10 files changed, 150 insertions(+), 49 deletions(-) Changes applied before testcommit ccf838822aa3d1aeba460edc36233d28ad3521a4 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Wed Jun 24 18:04:29 2020 +0200 storage: Limit the snapshot-get-latest lookup This opens a sort parameter to the origin-visit-get storage endpoints to allow the search to be consistent. commit e2981059d761a6773cbec4a327c5e72450834bbb Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Wed Jun 24 17:11:36 2020 +0200 test_storage: Add missing tests on origin_visit_get method Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/345/ Comment Actions After discussion with vlorentz, I now believe this diff can be splitted in 2. One diff to change the sort order by default to "desc" on visit id for the The other diff would be about the current "limit" parameter to open to both Comment Actions
Well, I'm not so sure anymore. I don't measure well the impacts of doing it in the descending order...
Comment Actions
TODO:
Comment Actions Build is green Patch application report for D3350 (id=11903)Rebasing onto b991e69707... Current branch diff-target is up to date. Changes applied before testcommit 1ad9926442377ef4e507bb252a3f67e48a29c2c0 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Wed Jun 24 18:04:29 2020 +0200 storage: Limit the *-get-latest functions lookup Impacts functions: - snapshot-get-latest - origin-get-latest-visit-status To this end, this opens a sort parameter to the origin-visit-get storage endpoints to allow the search from most recent to oldest. See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/358/ for more details. Comment Actions Build is green Patch application report for D3350 (id=11911)Could not rebase; Attempt merge onto b991e69707... Updating b991e697..1f6e780e Fast-forward swh/storage/algos/origin.py | 5 ++-- swh/storage/algos/snapshot.py | 8 ++++- swh/storage/cassandra/cql.py | 51 +++++++++++++++++++++++++++----- swh/storage/cassandra/storage.py | 9 ++++-- swh/storage/db.py | 42 ++++++++++++++------------ swh/storage/in_memory.py | 11 +++++-- swh/storage/interface.py | 8 ++++- swh/storage/storage.py | 4 ++- swh/storage/tests/algos/test_origin.py | 7 +++++ swh/storage/tests/algos/test_snapshot.py | 26 +++++++++------- swh/storage/tests/test_storage.py | 8 +++++ 11 files changed, 133 insertions(+), 46 deletions(-) Changes applied before testcommit 1f6e780e7b3b0fbf7ebc8ee09abefbce3f173a21 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Jun 25 18:37:18 2020 +0200 storage: Limit the *-get-latest functions search Impacts functions: - snapshot-get-latest - origin-get-latest-visit-status commit 82e6070c0712ba47bfbedcc2c3329f1828c14fd0 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Wed Jun 24 18:04:29 2020 +0200 storage: Open order parameter to origin-visit-get endpoint This allows clients to search from most recent to oldest calling the endpoint with order parameter to "desc" (visit id desc) This keeps and explicits the existing sorting order as "asc" (visit id asc) See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/365/ for more details. Comment Actions Build is green Patch application report for D3350 (id=11922)Could not rebase; Attempt merge onto 8620519891... Updating 86205198..90c91255 Fast-forward swh/storage/algos/origin.py | 5 +- swh/storage/algos/snapshot.py | 8 ++- swh/storage/cassandra/cql.py | 93 ++++++++++++++++++++++++++++---- swh/storage/cassandra/storage.py | 9 +++- swh/storage/db.py | 45 +++++++++------- swh/storage/in_memory.py | 15 ++++-- swh/storage/interface.py | 7 ++- swh/storage/storage.py | 4 +- swh/storage/tests/algos/test_origin.py | 7 +++ swh/storage/tests/algos/test_snapshot.py | 21 ++++---- swh/storage/tests/test_storage.py | 39 +++++++++++++- 11 files changed, 200 insertions(+), 53 deletions(-) Changes applied before testcommit 90c9125540ed0d97c31200df21ca04b9e3db594c Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Jun 25 18:37:18 2020 +0200 storage: Limit the *-get-latest functions search Impacts functions: - snapshot-get-latest - origin-get-latest-visit-status commit 1d5717fb5e34831fc002f9b625156bfd8d7384af Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Wed Jun 24 18:04:29 2020 +0200 storage*: Open order parameter to origin-visit-get endpoint This allows clients to search from most recent to oldest visit when calling the endpoint with the "order" parameter set to "desc" (visit id desc). This keeps and explicits the existing sorting order as visit id "asc". Related to T2310 See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/369/ for more details. Comment Actions Build has FAILED Patch application report for D3350 (id=11923)Could not rebase; Attempt merge onto 8620519891... Updating 86205198..96dd4417 Fast-forward swh/storage/algos/origin.py | 5 +- swh/storage/algos/snapshot.py | 8 ++- swh/storage/cassandra/cql.py | 93 ++++++++++++++++++++++++++++---- swh/storage/cassandra/storage.py | 9 +++- swh/storage/db.py | 45 +++++++++------- swh/storage/in_memory.py | 15 ++++-- swh/storage/interface.py | 7 ++- swh/storage/storage.py | 4 +- swh/storage/tests/algos/test_origin.py | 7 +++ swh/storage/tests/algos/test_snapshot.py | 21 +++----- swh/storage/tests/test_storage.py | 39 +++++++++++++- 11 files changed, 196 insertions(+), 57 deletions(-) Changes applied before testcommit 96dd4417f0286dabb55f02a08f01b899415ee344 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Jun 25 18:37:18 2020 +0200 storage: Limit the *-get-latest functions search Impacts functions: - snapshot-get-latest - origin-get-latest-visit-status commit 1d5717fb5e34831fc002f9b625156bfd8d7384af Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Wed Jun 24 18:04:29 2020 +0200 storage*: Open order parameter to origin-visit-get endpoint This allows clients to search from most recent to oldest visit when calling the endpoint with the "order" parameter set to "desc" (visit id desc). This keeps and explicits the existing sorting order as visit id "asc". Related to T2310 Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/371/ Comment Actions Build is green Patch application report for D3350 (id=11927)Could not rebase; Attempt merge onto f75cd41ab8... Updating f75cd41a..2a56db1b Fast-forward swh/storage/algos/origin.py | 5 +- swh/storage/algos/snapshot.py | 8 ++- swh/storage/cassandra/cql.py | 93 ++++++++++++++++++++++++++++---- swh/storage/cassandra/storage.py | 9 +++- swh/storage/db.py | 45 +++++++++------- swh/storage/in_memory.py | 15 ++++-- swh/storage/interface.py | 7 ++- swh/storage/storage.py | 4 +- swh/storage/tests/algos/test_origin.py | 7 +++ swh/storage/tests/algos/test_snapshot.py | 21 +++----- swh/storage/tests/test_storage.py | 39 +++++++++++++- 11 files changed, 196 insertions(+), 57 deletions(-) Changes applied before testcommit 2a56db1b1b31f055b53705a8cf0914de6894d451 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Jun 25 18:37:18 2020 +0200 storage: Limit the *-get-latest functions search Impacts functions: - snapshot-get-latest - origin-get-latest-visit-status commit 3b1695b2e7b1f5e227410b41953921e5a5806554 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Wed Jun 24 18:04:29 2020 +0200 storage*: Open order parameter to origin-visit-get endpoint This allows clients to search from most recent to oldest visit when calling the endpoint with the "order" parameter set to "desc" (visit id desc). This keeps and explicits the existing sorting order as visit id "asc". Related to T2310 See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/374/ for more details. Comment Actions Build is green Patch application report for D3350 (id=11931)Could not rebase; Attempt merge onto f75cd41ab8... Updating f75cd41a..e5f5a2d1 Fast-forward swh/storage/algos/origin.py | 5 +- swh/storage/algos/snapshot.py | 8 ++- swh/storage/cassandra/cql.py | 93 ++++++++++++++++++++++++++++---- swh/storage/cassandra/storage.py | 9 +++- swh/storage/db.py | 50 ++++++++++------- swh/storage/in_memory.py | 15 ++++-- swh/storage/interface.py | 7 ++- swh/storage/storage.py | 4 +- swh/storage/tests/algos/test_origin.py | 7 +++ swh/storage/tests/algos/test_snapshot.py | 21 +++----- swh/storage/tests/test_storage.py | 39 +++++++++++++- 11 files changed, 201 insertions(+), 57 deletions(-) Changes applied before testcommit e5f5a2d1a047faeab35253572a5b629e75af4edd Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Jun 25 18:37:18 2020 +0200 storage: Limit the *-get-latest functions search Impacts functions: - snapshot-get-latest - origin-get-latest-visit-status commit 182ee497328a55d99609e5230146268a108508e3 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Wed Jun 24 18:04:29 2020 +0200 storage*: Open order parameter to origin-visit-get endpoint This allows clients to search from most recent to oldest visit when calling the endpoint with the "order" parameter set to "desc" (visit id desc). This keeps and explicits the existing sorting order as visit id "asc". Related to T2310 See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/377/ for more details. Comment Actions That's confusing, limit is usually the number of items *after* filtering, but you're limiting before filtering. Comment Actions
I don't follow the after/before filtering you mention. Maybe the tests scenario Implementation wise, that limit is passed through to the backend(s) which does In any case, do you have a sample in mind for the limit use you are mentioning? Thanks in advance. Comment Actions If function(X, limit=1000) returns one item, then function(X, limit=1) should also return one item, not zero. Comment Actions It's not about the tests. origin_get_latest_visit_status gets limit records from the storage, then filters; instead of getting records until it gets enough to return limit records. Comment Actions
I thought it made sense initially but now i'm no longer sure...
Yes, points towards "does not convey" what i think. I don't want to limit the number of records to return because those functions should I want to limit the query records looked up internally by the backend so that But maybe what i want is iterate until we find something and not open that parameter at all? last_visit = -1 while last_visit is not None: visits = list(storage.origin_visit_get( origin_url, last_visit=None if last_visit == -1 else last_visit, order="desc", limit=10) ) if not visits: return None last_visit = visits[-1]["visit"] visit_status: Optional[OriginVisitStatus] = None visit: Dict[str, Any] for visit in visits: if type is not None and visit["type"] != type: continue visit_status = storage.origin_visit_status_get_latest( origin_url, visit["visit"], allowed_statuses=allowed_statuses, require_snapshot=require_snapshot, ) if visit_status is not None: return (OriginVisit.from_dict(visit), visit_status) (pseudo ^, not tested yet) Comment Actions Yes, that was my point. Your code is correct, I just don't think the word limit is the right one.
That would be great. Best of both worlds: we don't fetch too many at once, but still get the correct result all the time. Comment Actions
That actually works. Awesome! Comment Actions Iterate over paginated visits in batches to retrieve latest visit/snapshot This should stops the current timeouts on origin with a high number of visits. @vlorentz, awesome, thanks! you rock dude! ;) Comment Actions Build is green Patch application report for D3350 (id=11934)Rebasing onto 182ee49732... Current branch diff-target is up to date. Changes applied before testcommit 026c5c62f4bc00ff0f41407a5a0f3daf1f80ecad Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Jun 25 18:37:18 2020 +0200 Iterate over paginated visits in batches to retrieve latest visit/snapshot This should stops the current timeouts on origin with a high number of visits. Related to T2310 See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/378/ for more details. Comment Actions Build is green Patch application report for D3350 (id=11935)Rebasing onto 182ee49732... Current branch diff-target is up to date. Changes applied before testcommit 8c039ef4bc4f707d5513159eb4c50fe409e6321c Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Jun 25 18:37:18 2020 +0200 Iterate over paginated visits in batches to retrieve latest visit/snapshot This should stops the current timeouts on origin with a high number of visits. Related to T2310 See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/379/ for more details.
Comment Actions Build is green Patch application report for D3350 (id=11936)Rebasing onto 182ee49732... Current branch diff-target is up to date. Changes applied before testcommit 10443b8a17c4bd866884fe9c572ba8d13391656c Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Jun 25 18:37:18 2020 +0200 Iterate over paginated visits in batches to retrieve latest visit/snapshot This should stops the current timeouts on origin with a high number of visits. Related to T2310 See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/380/ for more details. |