Page MenuHomeSoftware Heritage

origin_visits: Reuse cache entries if present
ClosedPublic

Authored by ardumont on Feb 2 2022, 6:21 PM.

Details

Summary

This clarifies the current code and optimizes the re-fetching of all origin visits /
origin visit statuses when new visits happen. It's an optimization providing previous
visits are in the cache already.

The second part serves also as a workaround of some problematic origins (the ones with a
high number of visits). It's only a workaround though because when voiding the
cache (e.g. restart the cache service), the initial error (e.g. crash 500 or 502, ...)
will still happen.

Another change is on its way to try and fix that specific problem.

Related to T3905
Depends on D7080

Test Plan

tox

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D7066 (id=25644)

Rebasing onto f990d789aa...

First, rewinding head to replay your work on top of it...
Applying: origin_visits: Reuse cache entries if present
Changes applied before test
commit e2c29e520c11d7952979ef39c81d2c0e8b88b63f
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This avoids re-fetching all origin visits / origin visit statuses when new visits
    happen. This is only a workaround. When voiding the cache, the error will still happen.
    
    Related to T3905

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1309/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1309/console

Build has FAILED

Patch application report for D7066 (id=25645)

Rebasing onto f990d789aa...

Current branch diff-target is up to date.
Changes applied before test
commit d16f98cb8212cdb568388c5fec71a9bf1b982eec
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This avoids re-fetching all origin visits / origin visit statuses when new visits
    happen. This is only a workaround. When voiding the cache, the error will still happen.
    
    Related to T3905

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1310/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1310/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 2 2022, 6:37 PM
Harbormaster failed remote builds in B26476: Diff 25645!

Build has FAILED

Patch application report for D7066 (id=25648)

Rebasing onto f990d789aa...

Current branch diff-target is up to date.
Changes applied before test
commit 26536b385475979fc7d610e78d274c8491f498cb
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This avoids re-fetching all origin visits / origin visit statuses when new visits
    happen. This is only a workaround. When voiding the cache, the error will still happen.
    
    Related to T3905

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1311/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1311/console

Build has FAILED

Patch application report for D7066 (id=25649)

Rebasing onto f990d789aa...

Current branch diff-target is up to date.
Changes applied before test
commit 9713342e42e03f54b7481074eaeac6f77b5c86b7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This avoids re-fetching all origin visits / origin visit statuses when new visits
    happen. This is only a workaround. When voiding the cache, the error will still happen.
    
    Related to T3905

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1312/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1312/console

Harbormaster returned this revision to the author for changes because remote builds failed.Feb 3 2022, 11:36 AM
Harbormaster failed remote builds in B26480: Diff 25649!

Build has FAILED

sounds unrelated to the diff change (and master build is also broken for some reason [1])

Fix is in D7076.

[1] https://jenkins.softwareheritage.org/view/swh-draft/job/DWAPPS/job/tests/2763/console

Build has FAILED

Patch application report for D7066 (id=25649)

Rebasing onto f990d789aa...

Current branch diff-target is up to date.
Changes applied before test
commit 9713342e42e03f54b7481074eaeac6f77b5c86b7
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This avoids re-fetching all origin visits / origin visit statuses when new visits
    happen. This is only a workaround. When voiding the cache, the error will still happen.
    
    Related to T3905

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1315/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1315/console

Build is green

Patch application report for D7066 (id=25676)

Rebasing onto eebb27ebdb...

Current branch diff-target is up to date.
Changes applied before test
commit d532e2090499256d7fc9d310ec91d47523df13d3
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This avoids re-fetching all origin visits / origin visit statuses when new visits
    happen. This is only a workaround. When voiding the cache, the error will still happen.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1319/ for more details.

ardumont added a subscriber: anlambert.

Build is green

Patch application report for D7066 (id=25679)

Rebasing onto f2cf669b55...

Current branch diff-target is up to date.
Changes applied before test
commit b40f450cd619ff5e1291a81571c691c81280904d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This avoids re-fetching all origin visits / origin visit statuses when new visits
    happen. This is only a workaround. When voiding the cache, the error will still happen.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1320/ for more details.

Indeed this origin visits retrieving process is quite time consuming for origins with numerous visits,
especially when getting all visits again while most of them are already in cache (sighs, my bad for this).

So to optimize performances, we should update that code the following way:

cache_entry_id = "origin_visits_%s" % origin_url
cache_entry = cache.get(cache_entry_id)

last_visit = 0
origin_visits = []
new_visits = []
per_page = archive.MAX_LIMIT
if cache_entry:
    origin_visits = cache_entry
    last_visit = cache_entry[-1]["visit"]
    new_visits = list(
        archive.lookup_origin_visits(origin_url, last_visit=last_visit, per_page=per_page)
    )
    last_visit += per_page
    if not new_visits:
        last_snp = archive.lookup_latest_origin_snapshot(origin_url)
        if not last_snp or last_snp["id"] == cache_entry[-1]["snapshot"]:
            return cache_entry

# get new visits that we did not retrieve yet
while 1:
    visits = list(
        archive.lookup_origin_visits(
            origin_url, last_visit=last_visit, per_page=per_page
        )
    )
    new_visits += visits
    if len(visits) < per_page:
        break
    last_visit += per_page

def _visit_sort_key(visit):
    ts = parse_iso8601_date_to_utc(visit["date"]).timestamp()
    return ts + (float(visit["visit"]) / 10e3)

# cache entry is already sorted with oldest visits
origin_visits += sorted(new_visits, key=lambda v: _visit_sort_key(v))

cache.set(cache_entry_id, origin_visits)

This avoids to call archive.lookup_origin_visits more than needed and ensure only new visits
will be fetched if previous ones are in cache.

To test this, we need to simulate the case where visits are in cache and a lot of new ones,
covering at least two pages returned by archive.lookup_origin_visits, need to be fetched.

swh/web/common/origin_visits.py
63–64

Set last_visit to None before line 45 instead.

swh/web/tests/common/test_origin_visits.py
24–26 ↗(On Diff #25676)

why is this needed ?

67–68 ↗(On Diff #25676)

I do not see the utility of these assertions.

This revision now requires changes to proceed.Feb 3 2022, 5:42 PM
swh/web/tests/common/test_origin_visits.py
24–26 ↗(On Diff #25676)

because one test failed finding only 3 visits instead of the 77, see the first failure of the build (prior to the one you fixed).

67–68 ↗(On Diff #25676)

well, we mock stuff so better check those are actually needed...

Adapt according to anlambert's suggestion

Build is green

Patch application report for D7066 (id=25680)

Rebasing onto 84ba3c0df5...

Current branch diff-target is up to date.
Changes applied before test
commit 733755c9bbf84385b1309ca815f47837480ae7de
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This avoids re-fetching all origin visits / origin visit statuses when new visits
    happen. This is only a workaround. When voiding the cache, the error will still happen.
    
    Related to T3905

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1321/ for more details.

ardumont edited the summary of this revision. (Show Details)

Rebase on top of D7080

Drop the test_origin_visits.py modification (appended by mistake, i missed that change
within the rebase...)

Build was aborted

Patch application report for D7066 (id=25684)

Could not rebase; Attempt merge onto 84ba3c0df5...

Updating 84ba3c0d..fa6a89f6
Fast-forward
 swh/web/common/origin_visits.py            | 25 ++++++++++++++-----------
 swh/web/tests/common/test_origin_visits.py |  9 +++++++--
 2 files changed, 21 insertions(+), 13 deletions(-)
Changes applied before test
commit fa6a89f64e07a4f4c9a267ba97f0b48a6d8e725b
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This clarifies the current code and optimize the re-fetching of all origin visits /
    origin visit statuses when new visits happen. It's an optimization providing previous
    visits are in the cache already.
    
    The second part serves also as a workaround of some problematic origins (the ones with a
    high number of visits). It's only a workaround though because when voiding the
    cache (e.g. restart the cache service), the initial error (e.g. crash 500 or 502, ...)
    will still happen.
    
    Another change is on its way to try and fix that specific problem.
    
    Related to T3905

commit e72b1e7a9d183b82960ed1dca8544cab7492cc89
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Feb 3 19:09:00 2022 +0100

    tests: Remove use of mocks in common/test_origin_visits

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1324/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1324/console

Fix typos in commit message

Build is green

Patch application report for D7066 (id=25685)

Could not rebase; Attempt merge onto 84ba3c0df5...

Updating 84ba3c0d..90fb9c6f
Fast-forward
 swh/web/common/origin_visits.py            |  25 ++--
 swh/web/tests/common/test_origin_visits.py | 179 ++++++++++++-----------------
 2 files changed, 88 insertions(+), 116 deletions(-)
Changes applied before test
commit 90fb9c6fae548762a6d3e7ccd99bb5174dd30e73
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This clarifies the current code and optimize the re-fetching of all origin visits /
    origin visit statuses when new visits happen. It's an optimization providing previous
    visits are in the cache already.
    
    The second part serves also as a workaround of some problematic origins (the ones with a
    high number of visits). It's only a workaround though because when voiding the
    cache (e.g. restart the cache service), the initial error (e.g. crash 500 or 502, ...)
    will still happen.
    
    Another change is on its way to try and fix that specific problem.
    
    Related to T3905

commit e72b1e7a9d183b82960ed1dca8544cab7492cc89
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Feb 3 19:09:00 2022 +0100

    tests: Remove use of mocks in common/test_origin_visits

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1325/ for more details.

Build is green

Patch application report for D7066 (id=25687)

Could not rebase; Attempt merge onto 84ba3c0df5...

Updating 84ba3c0d..b126f9b2
Fast-forward
 swh/web/common/origin_visits.py            |  25 ++--
 swh/web/tests/common/test_origin_visits.py | 179 ++++++++++++-----------------
 2 files changed, 88 insertions(+), 116 deletions(-)
Changes applied before test
commit b126f9b227a61c762c64e93ccbedc183bb979611
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This clarifies the current code and optimizes the re-fetching of all origin visits /
    origin visit statuses when new visits happen. It's an optimization providing previous
    visits are in the cache already.
    
    The second part serves also as a workaround of some problematic origins (the ones with a
    high number of visits). It's only a workaround though because when voiding the
    cache (e.g. restart the cache service), the initial error (e.g. crash 500 or 502, ...)
    will still happen.
    
    Another change is on its way to try and fix that specific problem.
    
    Related to T3905

commit e72b1e7a9d183b82960ed1dca8544cab7492cc89
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Feb 3 19:09:00 2022 +0100

    tests: Remove use of mocks in common/test_origin_visits

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1326/ for more details.

Build is green

Patch application report for D7066 (id=25688)

Could not rebase; Attempt merge onto 84ba3c0df5...

Updating 84ba3c0d..bf17f795
Fast-forward
 swh/web/common/origin_visits.py            |  26 +++--
 swh/web/tests/common/test_origin_visits.py | 179 ++++++++++++-----------------
 2 files changed, 88 insertions(+), 117 deletions(-)
Changes applied before test
commit bf17f795ce6cbe5c8b536ae1c21adcb67e52f55d
Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
Date:   Wed Feb 2 18:17:44 2022 +0100

    origin_visits: Reuse cache entries if present
    
    This clarifies the current code and optimizes the re-fetching of all origin visits /
    origin visit statuses when new visits happen. It's an optimization providing previous
    visits are in the cache already.
    
    The second part serves also as a workaround of some problematic origins (the ones with a
    high number of visits). It's only a workaround though because when voiding the
    cache (e.g. restart the cache service), the initial error (e.g. crash 500 or 502, ...)
    will still happen.
    
    Another change is on its way to try and fix that specific problem.
    
    Related to T3905

commit e72b1e7a9d183b82960ed1dca8544cab7492cc89
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date:   Thu Feb 3 19:09:00 2022 +0100

    tests: Remove use of mocks in common/test_origin_visits

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/1327/ for more details.

This revision is now accepted and ready to land.Feb 4 2022, 11:16 AM