Page MenuHomeSoftware Heritage

Find latest origin visit status by date instead of visits filtering
ClosedPublic

Authored by ardumont on Feb 3 2022, 6:19 PM.

Details

Summary

Currently, in case of a high number of visits for an origin, this makes the page
querying the result crash (with an empty cache). Note that even with a populated cache,
the current version could take some time to process.

The following diff opens a function which uses simpler storage api calls to retrieve the
most recent visit from a given point in time (with its most recent associated visit
status) instead of filtering all visits of an origin to retrieve the information. Hence
no longer relying on the cache of all visits for that origin either.

Related to T3905

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I tried to get rid of the mocks but failed so far (using the archive_data fixture
[1]).

That did not work for now. I guess there is something clashy with tests using both
django and our swh fixtures...

Hency why the changes here still use those damn mocks.

[1] something similar to D7080

Build is green

Patch application report for D7078 (id=25692)

Could not rebase; Attempt merge onto e72b1e7a9d...

Updating e72b1e7a..b9a27c3b
Fast-forward
 swh/web/common/archive.py                | 22 +++++++++++++++-
 swh/web/common/origin_save.py            | 25 ++++++-------------
 swh/web/common/origin_visits.py          | 26 ++++++++++---------
 swh/web/tests/common/test_origin_save.py | 43 ++++++++++++++------------------
 4 files changed, 61 insertions(+), 55 deletions(-)
Changes applied before test
commit b9a27c3b98bcdfa9c525e76614ebd8525fbba8f7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Use origin visit status find by date instead of full visits listing
    
    Instead of iterating over all origin visits to find the one we are insterested in, use a
    simpler storage api call. In case of a huge number of visits for an origins, this makes
    the resolution spin out of control until it crashes (when the cache is initially empty).
    
    Related to T3905

commit bf17f795ce6cbe5c8b536ae1c21adcb67e52f55d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This clarifies the current code and optimizes the re-fetching of all origin visits /
    origin visit statuses when new visits happen. It's an optimization providing previous
    visits are in the cache already.
    
    The second part serves also as a workaround of some problematic origins (the ones with a
    high number of visits). It's only a workaround though because when voiding the
    cache (e.g. restart the cache service), the initial error (e.g. crash 500 or 502, ...)
    will still happen.
    
    Another change is on its way to try and fix that specific problem.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1328/ for more details.

ardumont retitled this revision from wip: Tentatively use origin visit status find by date to wip: Use origin visit status find by date instead of full visits listing.Feb 4 2022, 11:32 AM
ardumont edited the summary of this revision. (Show Details)
swh/web/common/archive.py
1085

I'm currently adding that missing coverage part.

anlambert added a subscriber: anlambert.
anlambert added inline comments.
swh/web/common/origin_save.py
282

storage.origin_visit_find_by_date might return a visit date lesser than the date passed as parameter , see code.

You should keep the old code as a fallback when the returned date is lesser.

This revision now requires changes to proceed.Feb 4 2022, 11:48 AM
swh/web/common/origin_save.py
282

Well, storage.origin-find-by-date does that, but then the actual code does an storage.origin-visit-status-get-latest so that should be enough to compensate and find the actual most recent visit, shouldn't it?

swh/web/common/archive.py
1068

Given our exchange [1] below, i'm of a mind to rename that new function as ^.

[1] D7078#inline-50876

swh/web/common/origin_save.py
282

Nope because you will do it on an invalid visit.

swh/web/common/origin_save.py
282

what's an invalid visit?

Adapt according to exchange

swh/web/common/origin_save.py
282

A visit whose date is lesser than the save code now request.

swh/web/common/origin_save.py
282

We may need a new storage endpoint to simplify this status code retrieval in the webapp. Something like [1].

[1] P1276

Build is green

Patch application report for D7078 (id=25714)

Rebasing onto 4ed4512e7b...

First, rewinding head to replay your work on top of it...
Applying: Find latest origin visit status using date instead of visits filter
Changes applied before test
commit 232c6ac550505d743f6c666ac2cc0e346356dac9
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status using date instead of visits filter
    
    Instead of iterating over all origin visits to find the one we are insterested in, use a
    simpler storage api call. In case of a huge number of visits for an origins, this makes
    the resolution spin out of control until it crashes (when the cache is initially empty).
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1334/ for more details.

swh/web/common/origin_save.py
282

or at least modify the storage.origin-visit-find-by-date so it does the right things (either returning both date closest to the date or simply the most recent one). I don't know why this current api return only the closest to the pivot date...

swh/web/common/archive.py
1081

Following our discussion from last week, I think the code below should do the
trick to ensure a correct visit date will be found.

visit = storage.origin_visit_find_by_date(origin_url, visit_date)
if visit and visit.date < visit_date:
    visits = storage.origin_visit_get(origin_url, page_token=visit.id, limit=1).results
    visit = visits[0] if visits else None
if not visit:
    return None
ardumont retitled this revision from wip: Use origin visit status find by date instead of full visits listing to Find latest origin visit status instead of filtering on all visits.Feb 10 2022, 11:00 AM
ardumont edited the summary of this revision. (Show Details)
ardumont edited the summary of this revision. (Show Details)
ardumont edited the test plan for this revision. (Show Details)

Build is green

Patch application report for D7078 (id=25880)

Rebasing onto b0bba742ec...

Current branch diff-target is up to date.
Changes applied before test
commit 1555f50c95bc15e541f74773eef8f97903ed511f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status instead of visits filtering
    
    Instead of iterating over all origin visits to find the one we are insterested in, use a
    simpler storage api call. In case of a huge number of visits for an origin, this
    currently makes the page queried crash (when the cache is initially empty).
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1367/ for more details.

I missed that we could have visits for the same origins coming from other things than the save code now.
So we could end up displaying other statuses.
So we need to keep the save request date in the loop when fetching those statuses.

[1]

11:19 <anlambert> salut, c'est pas bon pour moi, il faut vraiment se baser sur la date de la requête save code now, tu n'as aucune garantie que la dernière visite d'une origine corresponde à celle du SCN
11:19 <anlambert> ça peut correspondre à une visite déclenchée par un listeur par exemple
11:20 <ardumont> ah!
11:25 <ardumont> good point
ardumont retitled this revision from Find latest origin visit status instead of filtering on all visits to Find latest origin visit status by date instead of visits filtering.Feb 10 2022, 3:07 PM
ardumont edited the summary of this revision. (Show Details)

Adapt according to discussion

ardumont added inline comments.
swh/web/common/archive.py
1081

yep, i've used the _lookup_origin_visits which does the right things about computing the page_token.

swh/web/common/archive.py
1068

I would rather call that function origin_visit_find_by_date for consistency with the others.

1085

I would rather call storage.origin_visit_get(origin_url, page_token=visit.id, limit=1) directly as a single page will be returned here.

ardumont marked an inline comment as done.

Rebase

swh/web/common/archive.py
1085

yep, i've used the _lookup_origin_visits which does the right things about computing the page_token.

The page_token is not null here so calling _lookup_origin_visits is not really useful.
But apparently, the right call is :
storage.origin_visit_get(origin_url, page_token=str(visit.id), limit=1)

Build has FAILED

Patch application report for D7078 (id=25904)

Rebasing onto 13f1591bb2...

First, rewinding head to replay your work on top of it...
Applying: Find latest origin visit status by date instead of visits filtering
Changes applied before test
commit 05496764640adead1a3d9f9833e0a73c1f56daee
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status by date instead of visits filtering
    
    Currently, in case of a high number of visits for an origin, this makes the page
    querying the result crash (with an empty cache). Note that even with a populated cache,
    the current version could take some time to process.
    
    The following diff opens a function which uses simpler storage api calls to retrieve the
    most recent visit from a given point in time (with its most recent associated visit
    status) instead of filtering all visits of an origin to retrieve the information. Hence
    no longer relying on the cache of all visits for that origin either.
    
    Related to T3905

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1369/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1369/console

swh/web/common/archive.py
1085

yes, it is, it avoids to have to "leak" the page_token construction in that new method as well.
(the page_token is built within the _lookup_origin_visits out of the last_visit provided [1]).

[1] https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/archive.py$982-983

Build is green

Patch application report for D7078 (id=25907)

Rebasing onto 13f1591bb2...

Current branch diff-target is up to date.
Changes applied before test
commit 2dc02e7a0eab7a954464e2fd27e0594db8113856
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status by date instead of visits filtering
    
    Currently, in case of a high number of visits for an origin, this makes the page
    querying the result crash (with an empty cache). Note that even with a populated cache,
    the current version could take some time to process.
    
    The following diff opens a function which uses simpler storage api calls to retrieve the
    most recent visit from a given point in time (with its most recent associated visit
    status) instead of filtering all visits of an origin to retrieve the information. Hence
    no longer relying on the cache of all visits for that origin either.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1370/ for more details.

swh/web/common/archive.py
1085

It's just converting an int to a string so not a big leak here imho.

ardumont marked an inline comment as done.
  • Rename function
  • don't use _lookup_origin_visit
ardumont added inline comments.
swh/web/common/archive.py
1068

missed that, fixing it.

1085

ok then it was my first adaptation of your suggestion and if you don't mind it, fine then [1].

[1] it's just something @vlorentz said justly to me about page_token being the concern of the storage implementation (it's indeed not documented in the storage interface). As such, that should not find its way back to the caller (because when the implem change, here we break as well). And that kinda stuck with me so i'm trying to abide by this ;)

Build is green

Patch application report for D7078 (id=25910)

Rebasing onto 13f1591bb2...

Current branch diff-target is up to date.
Changes applied before test
commit 1c5cecac8fd29bbdd2a89927897b4369629e6805
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status by date instead of visits filtering
    
    Currently, in case of a high number of visits for an origin, this makes the page
    querying the result crash (with an empty cache). Note that even with a populated cache,
    the current version could take some time to process.
    
    The following diff opens a function which uses simpler storage api calls to retrieve the
    most recent visit from a given point in time (with its most recent associated visit
    status) instead of filtering all visits of an origin to retrieve the information. Hence
    no longer relying on the cache of all visits for that origin either.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1371/ for more details.

swh/web/common/archive.py
1082–1088

Could you add a test for this case too ?

swh/web/common/archive.py
1082–1088

yes, i was on it but had to go while the tests were cooking ;)

Build has FAILED

Patch application report for D7078 (id=25918)

Rebasing onto 786d081bd3...

Current branch diff-target is up to date.
Changes applied before test
commit cddcbb3d4eedf83ea6f138ce25388670b3a3259d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status by date instead of visits filtering
    
    Currently, in case of a high number of visits for an origin, this makes the page
    querying the result crash (with an empty cache). Note that even with a populated cache,
    the current version could take some time to process.
    
    The following diff opens a function which uses simpler storage api calls to retrieve the
    most recent visit from a given point in time (with its most recent associated visit
    status) instead of filtering all visits of an origin to retrieve the information. Hence
    no longer relying on the cache of all visits for that origin either.
    
    Related to T3905

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1373/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1373/console

Build is green

Patch application report for D7078 (id=25918)

Rebasing onto 786d081bd3...

Current branch diff-target is up to date.
Changes applied before test
commit cddcbb3d4eedf83ea6f138ce25388670b3a3259d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status by date instead of visits filtering
    
    Currently, in case of a high number of visits for an origin, this makes the page
    querying the result crash (with an empty cache). Note that even with a populated cache,
    the current version could take some time to process.
    
    The following diff opens a function which uses simpler storage api calls to retrieve the
    most recent visit from a given point in time (with its most recent associated visit
    status) instead of filtering all visits of an origin to retrieve the information. Hence
    no longer relying on the cache of all visits for that origin either.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1374/ for more details.

hopefully, improve coverage this time

Build is green

Patch application report for D7078 (id=25922)

Rebasing onto 490b5d678a...

First, rewinding head to replay your work on top of it...
Applying: Find latest origin visit status by date instead of visits filtering
Changes applied before test
commit c327e72e8d24cdb67e173e529a0ac7bf7e6dd8fd
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status by date instead of visits filtering
    
    Currently, in case of a high number of visits for an origin, this makes the page
    querying the result crash (with an empty cache). Note that even with a populated cache,
    the current version could take some time to process.
    
    The following diff opens a function which uses simpler storage api calls to retrieve the
    most recent visit from a given point in time (with its most recent associated visit
    status) instead of filtering all visits of an origin to retrieve the information. Hence
    no longer relying on the cache of all visits for that origin either.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1375/ for more details.

Looks good to me, thanks !

This revision is now accepted and ready to land.Feb 11 2022, 11:04 AM

Build is green

Patch application report for D7078 (id=25926)

Rebasing onto 490b5d678a...

Current branch diff-target is up to date.
Changes applied before test
commit af8b38b3e7bfe96e5501af51f530fc3553dc6f6b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Thu Feb 3 18:16:36 2022 +0100

    Find latest origin visit status by date instead of visits filtering
    
    Currently, in case of a high number of visits for an origin, this makes the page
    querying the result crash (with an empty cache). Note that even with a populated cache,
    the current version could take some time to process.
    
    The following diff opens a function which uses simpler storage api calls to retrieve the
    most recent visit from a given point in time (with its most recent associated visit
    status) instead of filtering all visits of an origin to retrieve the information. Hence
    no longer relying on the cache of all visits for that origin either.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1379/ for more details.