
Investigate timeout issues when querying snapshot data
Closed, Migrated

Description

Last Friday, timeout issues were reported by the web application in production when retrieving snapshot data from the archive.

As a consequence, the archive could no longer be browsed in a snapshot context, as an error 500 was raised in that case.

This calls for an investigation of these timeout issues to avoid hitting such a denial of service again in production.

Event Timeline

anlambert created this task.
09:16:44          +olasd ╡ in the storage backend, there's a provision to filter snapshot branches by target type; these queries aren't indexable properly because the table containing which branches have which type is completely separate from the join table between snapshots and branches
09:17:11               ⤷ ╡ so really this filter should be removed altogether, or rewritten to force reading the full list of branches of the snapshot first (indexed query) then filtering out the results
09:17:43               ⤷ ╡ (I'm not sure when this was introduced or why; it's much safer to do this filtering client-side)
09:19:49               ⤷ ╡ what happened starting friday is that postgresql decided to use the deduplication index on snapshot branches to try to do the filtering of branches by type; so it's basically pulling all branches of type aliases, and matching that to the list of branches of the snapshot we're trying to look at
09:20:13               ⤷ ╡ I guess the amount of alias branches passed a threshold that made the planner change its course
09:22:45               ⤷ ╡ (or the amount of queries for snapshots was low enough that the table got kicked off the cache somehow)

I have generated a flame graph for the Django view rendering a directory in a snapshot context, and we can clearly see that the lookup_snapshot_sizes function accounts for a large share of the overall processing time.
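
For reference, a comparable picture can be obtained without a dedicated flame graph tool using the standard cProfile module; this is only a generic sketch (not the tool used to produce the flame graph above), to be wrapped around whatever callable renders the view:

import cProfile
import pstats

def profile_call(func, *args, **kwargs):
    # profile a single call and print the 20 most expensive functions by
    # cumulative time; applied to the view callable, the lookup_snapshot_sizes
    # frames are expected to dominate
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        return func(*args, **kwargs)
    finally:
        profiler.disable()
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)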

The performance issue comes from the piece of code that resolves a branch alias to its target type (revision or release).
The purpose is to update the branch and release counters so that the display stays consistent.

As @olasd explained, the instruction that is slow to execute is the one that retrieves all branch aliases of a snapshot:

if "alias" in snapshot_sizes:
    aliases = lookup_snapshot(
        snapshot_id, branches_count=snapshot_sizes["alias"], target_types=["alias"]
    )
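
For context, the counter update this lookup feeds into can be pictured as follows; this is only an illustrative sketch (the helper and its exact semantics are hypothetical, not the actual swh.web code), assuming each alias ultimately resolves to either a revision or a release:

def update_snapshot_sizes(snapshot_sizes, resolved_alias_target_types):
    # snapshot_sizes maps target types ("revision", "release", "alias", ...)
    # to branch counts; fold each alias into the counter of its resolved target
    sizes = dict(snapshot_sizes)
    for target_type in resolved_alias_target_types:
        sizes[target_type] = sizes.get(target_type, 0) + 1
    # aliases are now accounted for under their resolved target types
    sizes.pop("alias", None)
    return sizes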

For instance, for the latest snapshot of the Linux kernel, the underlying PostgreSQL query analysis shows that the execution time is far from optimal:

softwareheritage=> explain analyze SELECT snapshot_id, name, target, target_type FROM swh_snapshot_get_by_id('\xc7beb2432b7e93c4cf6ab09cd194c7c1998df2f9'::bytea, '\x'::bytea, 2, ARRAY['alias'] :: snapshot_target[]);
                                                                                           QUERY PLAN                                                                                            
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=134458.68..134458.71 rows=2 width=75) (actual time=413.101..413.105 rows=1 loops=1)
   ->  Sort  (cost=134458.68..134458.77 rows=34 width=75) (actual time=397.577..397.581 rows=1 loops=1)
         Sort Key: snapshot_branch.name
         Sort Method: quicksort  Memory: 25kB
         InitPlan 1 (returns $0)
           ->  Index Scan using snapshot_id_idx on snapshot  (cost=0.57..2.59 rows=1 width=8) (actual time=0.019..0.021 rows=1 loops=1)
                 Index Cond: ((id)::bytea = ('\xc7beb2432b7e93c4cf6ab09cd194c7c1998df2f9'::bytea)::bytea)
         ->  Hash Join  (cost=6541.89..134455.23 rows=34 width=75) (actual time=178.624..397.565 rows=1 loops=1)
               Hash Cond: (snapshot_branch.object_id = snapshot_branches.branch_id)
               ->  Index Scan using snapshot_branch_target_type_target_name_idx on snapshot_branch  (cost=0.70..127317.96 rows=227043 width=51) (actual time=0.024..355.077 rows=322338 loops=1)
                     Index Cond: ((target_type = ANY ('{alias}'::snapshot_target[])) AND (name >= '\x'::bytea))
               ->  Hash  (cost=5899.47..5899.47 rows=51338 width=8) (actual time=0.614..0.615 rows=1448 loops=1)
                     Buckets: 65536  Batches: 1  Memory Usage: 569kB
                     ->  Index Only Scan using snapshot_branches_pkey on snapshot_branches  (cost=0.58..5899.47 rows=51338 width=8) (actual time=0.040..0.420 rows=1448 loops=1)
                           Index Cond: (snapshot_id = $0)
                           Heap Fetches: 1448
 Planning Time: 1.200 ms
 JIT:
   Functions: 19
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 4.928 ms, Inlining 0.000 ms, Optimization 1.257 ms, Emission 14.028 ms, Total 20.214 ms
 Execution Time: 418.197 ms
(22 rows)

Compared to the execution time of the query retrieving all branches without any filtering on the target type, the difference is night and day:

softwareheritage=> explain analyze SELECT snapshot_id, name, target, target_type FROM swh_snapshot_get_by_id('\xc7beb2432b7e93c4cf6ab09cd194c7c1998df2f9'::bytea, '\x'::bytea, 10000, NULL :: snapshot_target[]);
                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=141412.79..141537.79 rows=10000 width=75) (actual time=18.965..19.115 rows=1448 loops=1)
   ->  Sort  (cost=141412.79..141541.14 rows=51338 width=75) (actual time=5.487..5.554 rows=1448 loops=1)
         Sort Key: snapshot_branch.name
         Sort Method: quicksort  Memory: 252kB
         InitPlan 1 (returns $0)
           ->  Index Scan using snapshot_id_idx on snapshot  (cost=0.57..2.59 rows=1 width=8) (actual time=0.026..0.028 rows=1 loops=1)
                 Index Cond: ((id)::bytea = ('\xc7beb2432b7e93c4cf6ab09cd194c7c1998df2f9'::bytea)::bytea)
         ->  Nested Loop  (cost=1.15..137393.58 rows=51338 width=75) (actual time=0.074..4.782 rows=1448 loops=1)
               ->  Index Only Scan using snapshot_branches_pkey on snapshot_branches  (cost=0.58..5899.47 rows=51338 width=8) (actual time=0.057..0.428 rows=1448 loops=1)
                     Index Cond: (snapshot_id = $0)
                     Heap Fetches: 1448
               ->  Index Scan using snapshot_branch_pkey on snapshot_branch  (cost=0.57..2.56 rows=1 width=51) (actual time=0.003..0.003 rows=1 loops=1448)
                     Index Cond: (object_id = snapshot_branches.branch_id)
                     Filter: (name >= '\x'::bytea)
 Planning Time: 1.276 ms
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 4.082 ms, Inlining 0.000 ms, Optimization 0.978 ms, Emission 12.285 ms, Total 17.345 ms
 Execution Time: 23.378 ms
(20 rows)

So retrieving all branches and filtering on target types client-side seems a better approach here, avoiding a costly query that might time out.
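
A minimal sketch of that client-side approach, reusing lookup_snapshot as in the snippet above but without the target_types argument (the branches_count value and the snapshot dict layout, i.e. a "branches" mapping from names to {"target", "target_type"} entries or None for dangling branches, are assumptions here):

# fetch all branches through the fast, indexed query path (second plan above)
snapshot = lookup_snapshot(snapshot_id, branches_count=10000)
# then filter the aliases in Python instead of in the storage backend
aliases = {
    name: branch
    for name, branch in snapshot["branches"].items()
    if branch is not None and branch["target_type"] == "alias"
}

This keeps the cheap nested-loop plan shown in the second EXPLAIN output and moves the target type filtering out of PostgreSQL entirely.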