Page MenuHomeSoftware Heritage

common/archive: Avoid db timeouts in lookup_snapshot_sizes
ClosedPublic

Authored by anlambert on Oct 26 2020, 6:43 PM.

Details

Summary

When querying all branch aliases in a snapshot, the underlying database query can
timeout as it is not properly indexable. That diff intends to mitigate that issue.

A first commit moves the branches filtering by target type client side to avoid
sending the costly request that might timeout.
Note that even it seems that all branches of a snapshot are iterated, most of
swh snapshots contains a single branch alias named HEAD that will be iterated
first. Branches iteration will stop once all aliases processed so in that case
the iteration will stop quickly.

A first commit removes aliases resolving in the lookup_snapshot_sizes function.
Branch aliases will be resolved only when required for display from now on.

A second commit puts processed snapshot sizes in cache to avoid sending the same
set of database queries each time a page is loaded when browsing the archive
in the context of a specific snapshot.

Closes T2734

Depends on D4369

Diff Detail

Repository
rDWAPPS Web applications
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build is green

Patch application report for D4356 (id=15405)

Rebasing onto d1643344e5...

Current branch diff-target is up to date.
Changes applied before test
commit 6118c3723f833f381661aa2f99a64498510812f0
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:34:46 2020 +0100

    browse/snapshot_context: Put processed snapshot sizes in cache
    
    Put processed snapshot sizes in cache to avoid sending the same set of
    database queries each time a page is loaded when browsing the archive
    in the context of a specific snapshot.
    
    Closes T2734

commit a0fa085c27f3c8b680764ad490ffb645c5ab6332
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:26:20 2020 +0100

    common/archive: Avoid db timeouts in lookup_snapshot_sizes
    
    When querying all branch aliases in a snapshot, the underlying database query can
    timeout as it is not properly indexable.
    
    This commits moves the branches filtering by target type client side to avoid
    sending the costly request that might timeout.
    
    Related to T2734

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/437/ for more details.

Wouldn't it make sense to have this in swh.storage.algos instead?

Update: Simplify some code and use default branches_count parameter value in lookup_snapshot

Build is green

Patch application report for D4356 (id=15419)

Rebasing onto d1643344e5...

Current branch diff-target is up to date.
Changes applied before test
commit 15654ede55837df06a8f0c76f72c7e92104b27cf
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:34:46 2020 +0100

    browse/snapshot_context: Put processed snapshot sizes in cache
    
    Put processed snapshot sizes in cache to avoid sending the same set of
    database queries each time a page is loaded when browsing the archive
    in the context of a specific snapshot.
    
    Closes T2734

commit d5b8d76e161cae321737ec080eb33a933d8094e1
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:26:20 2020 +0100

    common/archive: Avoid db timeouts in lookup_snapshot_sizes
    
    When querying all branch aliases in a snapshot, the underlying database query can
    timeout as it is not properly indexable.
    
    This commits moves the branches filtering by target type client side to avoid
    sending the costly request that might timeout.
    
    Related to T2734

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/439/ for more details.

Wouldn't it make sense to have this in swh.storage.algos instead?

Indeed we could add a snapshot_resolve_aliases function in storage.algos. Plus the code in that diff only handles one level of aliases so it should be improved.

As pointed by @vlorentz, the logic to resolve snapshot branch aliases could be moved in swh.storage.algos.snapshot.

Update:

  • rebase
  • move snapshot aliases resolving code in swh.storage.algos.snapshot
  • add more tests

Jenkins build will fail until D4369 gets landed and swh-storage gets a new tag.

Build has FAILED

Patch application report for D4356 (id=15456)

Rebasing onto 3ffeb396fa...

Current branch diff-target is up to date.
Changes applied before test
commit f172f3cfc42c64858c529a54df54a9b29e146bc9
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:34:46 2020 +0100

    browse/snapshot_context: Put processed snapshot sizes in cache
    
    Put processed snapshot sizes in cache to avoid sending the same set of
    database queries each time a page is loaded when browsing the archive
    in the context of a specific snapshot.
    
    Closes T2734

commit 806ea48be48437db7e480f92e1cbd7d42f48d10b
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:26:20 2020 +0100

    common/archive: Avoid db timeouts in lookup_snapshot_sizes
    
    When querying all branch aliases in a snapshot, the underlying database query can
    timeout as it is not properly indexable.
    
    This commits moves the branches filtering by target type client side to avoid
    sending the costly request that might timeout.
    
    Related to T2734

Link to build: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/440/
See console output for more information: https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/440/console

Update:

  • Do not resolve snapshot aliases in swh.web.common.archive.lookup_snapshot_sizes anymore
  • Resolve a snapshot alias only when required for display
  • Update tests as revision / release counters have changed since aliases are no more taken into account

Build is green

Patch application report for D4356 (id=15620)

Rebasing onto 527e7b0a87...

Current branch diff-target is up to date.
Changes applied before test
commit 1d459bef95ab5e59fec7b4236ca21bd7175cac15
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:34:46 2020 +0100

    browse/snapshot_context: Put processed snapshot sizes in cache
    
    Put processed snapshot sizes in cache to avoid sending the same set of
    database queries each time a page is loaded when browsing the archive
    in the context of a specific snapshot.
    
    Closes T2734

commit afbce45f99aba5a7bc09a540d3f5485bb34db214
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:26:20 2020 +0100

    common/archive: Avoid db timeouts in lookup_snapshot_sizes
    
    When querying all branch aliases in a snapshot, the underlying database
    query can timeout as it is not properly indexable.
    
    That commit stops to try resolving aliases when getting snapshot sizes
    and only resolve an alias when required for display.
    
    Related to T2734

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/448/ for more details.

douardda added inline comments.
swh/web/common/archive.py
1007–1008

minor point: could use a snapshot_sizes.set_default(target_type, 0) I guess here, or initialize the dict beforehand:

snapshot_sizes = dict.from_keys(("alias", "release", "revision"), 0)
snapshot_sizes.update(storage.snapshot_count_branches(snapshot_id_bin))

(not sure it really more readable however :-)

1127

are we 100% sure that result != None necessarly means len(result)>=1?

swh/web/common/archive.py
1007–1008

Ack, will update

1127

yes, snapshot_resolve_alias returns a tuple with two members in it when the snapshot is valid. This has been properly tested in rDSTO8b1815572c695f7c9751b353f51d507e3fe76e0a.

Build is green

Patch application report for D4356 (id=15624)

Rebasing onto 527e7b0a87...

Current branch diff-target is up to date.
Changes applied before test
commit 25d33a00a0171fab288a6a366b66bf134ca4eba4
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:34:46 2020 +0100

    browse/snapshot_context: Put processed snapshot sizes in cache
    
    Put processed snapshot sizes in cache to avoid sending the same set of
    database queries each time a page is loaded when browsing the archive
    in the context of a specific snapshot.
    
    Closes T2734

commit 907c4760d11224bfb1ddffcabc054b1eb161a197
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:26:20 2020 +0100

    common/archive: Avoid db timeouts in lookup_snapshot_sizes
    
    When querying all branch aliases in a snapshot, the underlying database
    query can timeout as it is not properly indexable.
    
    That commit stops to try resolving aliases when getting snapshot sizes
    and only resolve an alias when required for display.
    
    Related to T2734

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/449/ for more details.

swh/web/common/archive.py
1121–1130

I think it's unclear that result is a tuple, not a list.

What about unpacking instead? eg. (aliases, target_branch) = result?

It remains unclear to me how this diff addresses the original timeout problem. I see the beneficial effect of having a lazy loading mechanism and cache to reduce the load, but not to prevent the timeouts to occur in the first place (thus failing to fill the cache).

It remains unclear to me how this diff addresses the original timeout problem. I see the beneficial effect of having a lazy loading mechanism and cache to reduce the load, but not to prevent the timeouts to occur in the first place (thus failing to fill the cache).

The diff is a mitigation and only avoids to send the expensive PGSQL query filtering the alias branches in a snapshot.

But you are right, we can still have timeout issues when dealing with a large snapshot, see below with https://github/com/v8/v8 origin (200k branches)

INFO:werkzeug:127.0.0.1 - - [05/Nov/2020 12:02:06] "POST /origin/visit_status/get_latest HTTP/1.1" 200 -
b"           SELECT snapshot_id, name, target, target_type\n           FROM swh_snapshot_get_by_id('\\x8891607122ad552e8c2f4a55cc5b3925f8f6f01c'::bytea, '\\x'::bytea, 1001, NULL :: snapshot_target[])\n        "
DEBUG:swh.core.statsd:Error submitting statsd packet. Dropping the packet and closing the socket.
ERROR:root:canceling statement due to statement timeout
Traceback (most recent call last):
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/anlambert/.virtualenvs/swh/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "<decorator-gen-68>", line 2, in snapshot_get_branches
  File "/home/anlambert/swh/swh-environment/swh-core/swh/core/api/negotiation.py", line 147, in _negotiate
    return f.negotiator(*args, **kwargs)
  File "/home/anlambert/swh/swh-environment/swh-core/swh/core/api/negotiation.py", line 81, in __call__
    result = self.func(*args, **kwargs)
  File "/home/anlambert/swh/swh-environment/swh-core/swh/core/api/__init__.py", line 456, in _f
    return obj_meth(**kw)
  File "/home/anlambert/swh/swh-environment/swh-storage/swh/storage/metrics.py", line 24, in d
    return f(*a, **kw)
  File "/home/anlambert/swh/swh-environment/swh-core/swh/core/db/common.py", line 62, in _meth
    return meth(self, *args, db=db, cur=cur, **kwargs)
  File "/home/anlambert/swh/swh-environment/swh-storage/swh/storage/postgresql/storage.py", line 785, in snapshot_get_branches
    cur=cur,
  File "/home/anlambert/swh/swh-environment/swh-storage/swh/storage/postgresql/db.py", line 273, in snapshot_get_by_id
    cur.execute(query, (snapshot_id, branches_from, branches_count, target_types))
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout

Nevertheless, next queries will not timeout as cache will be filled.

swh/web/common/archive.py
1121–1130

I agree this would be better/clearer

Update: Rebase and synchronize with updated snapshot_resolve_alias signature.

Build is green

Patch application report for D4356 (id=15657)

Rebasing onto 8daf7a8fda...

Current branch diff-target is up to date.
Changes applied before test
commit 3c59879e2fe6379a4ebeae4cb4449ce02221d608
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:34:46 2020 +0100

    browse/snapshot_context: Put processed snapshot sizes in cache
    
    Put processed snapshot sizes in cache to avoid sending the same set of
    database queries each time a page is loaded when browsing the archive
    in the context of a specific snapshot.
    
    Closes T2734

commit c469412a897ac980ccab72ba5c6d71b159e6a672
Author: Antoine Lambert <antoine.lambert@inria.fr>
Date:   Mon Oct 26 18:26:20 2020 +0100

    common/archive: Avoid db timeouts in lookup_snapshot_sizes
    
    When querying all branch aliases in a snapshot, the underlying database
    query can timeout as it is not properly indexable.
    
    That commit stops to try resolving aliases when getting snapshot sizes
    and only resolve an alias when required for display.
    
    Related to T2734

See https://jenkins.softwareheritage.org/job/DWAPPS/job/tests-on-diff/455/ for more details.

This revision is now accepted and ready to land.Nov 5 2020, 9:15 PM