Event Timeline
It seems there are some differences in term of indexes between the main and replica databases.
On the replica, only the primary keys are present on the snapshot_branches and the snapshot_branch tables. Perhaps the query optimizer is confused by something and is doing a wrong choice somewhere.
The limit seems to be 10. The response are fast for any limit >=10 but slow when <10....
limit 2:
https://explain.depesz.com/s/UW9Z
softwareheritage=> explain analyze with filtered_snapshot_branches as ( select '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea as snapshot_id, name, target, target_type from snapshot_branches inner join snapshot_branch on snapshot_branches.branch_id = snapshot_branch.object_id where snapshot_id = (select object_id from snapshot where snapshot.id = '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea) and (NULL :: snapshot_target[] is null or target_type = any(NULL :: snapshot_target[])) ) select snapshot_id, name, target, target_type from filtered_snapshot_branches where name >= '\x'::bytea and (NULL is null or convert_from(name, 'utf-8') ilike NULL) and (NULL is null or convert_from(name, 'utf-8') not ilike NULL) order by name limit 2; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=1004.01..6764.11 rows=2 width=76) (actual time=172523.081..173555.673 rows=2 loops=1) InitPlan 1 (returns $0) -> Index Scan using snapshot_id_idx on snapshot (cost=0.57..2.59 rows=1 width=8) (actual time=0.028..0.036 rows=1 loops=1) Index Cond: ((id)::bytea = '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea) -> Gather Merge (cost=1001.43..168852423.27 rows=58628 width=76) (actual time=172523.079..173555.661 rows=2 loops=1) Workers Planned: 2 Params Evaluated: $0 Workers Launched: 2 -> Nested Loop (cost=1.40..168844656.12 rows=24428 width=76) (actual time=126442.320..167761.276 rows=2 loops=3) -> Parallel Index Scan using snapshot_branch_name_target_target_type_idx on snapshot_branch (cost=0.70..12612971.47 rows=154824599 width=52) (actual time=0.077..80926.811 rows=23123612 loops=3) Index Cond: (name >= '\x'::bytea) -> Index Only Scan using snapshot_branches_pkey on snapshot_branches (cost=0.70..1.01 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=69370837) Index Cond: ((snapshot_id = $0) AND (branch_id = snapshot_branch.object_id)) Heap Fetches: 5 Planning Time: 0.993 ms Execution Time: 173555.864 ms (16 rows) Time: 173667.434 ms (02:53.667)
limit 10:
https://explain.depesz.com/s/NbSU
softwareheritage=> explain analyze with filtered_snapshot_branches as ( select '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea as snapshot_id, name, target, target_type from snapshot_branches inner join snapshot_branch on snapshot_branches.branch_id = snapshot_branch.object_id where snapshot_id = (select object_id from snapshot where snapshot.id = '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea) and (NULL :: snapshot_target[] is null or target_type = any(NULL :: snapshot_target[])) ) select snapshot_id, name, target, target_type from filtered_snapshot_branches where name >= '\x'::bytea and (NULL is null or convert_from(name, 'utf-8') ilike NULL) and (NULL is null or convert_from(name, 'utf-8') not ilike NULL) order by name limit 10; Time: 895976.065 ms (14:55.976) QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=1004.01..29804.51 rows=10 width=76) (actual time=148561.091..895901.340 rows=10 loops=1) InitPlan 1 (returns $0) -> Index Scan using snapshot_id_idx on snapshot (cost=0.57..2.59 rows=1 width=8) (actual time=0.032..0.038 rows=1 loops=1) Index Cond: ((id)::bytea = '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea) -> Gather Merge (cost=1001.43..168852537.02 rows=58628 width=76) (actual time=148561.089..895901.327 rows=10 loops=1) Workers Planned: 2 Params Evaluated: $0 Workers Launched: 2 -> Nested Loop (cost=1.40..168844769.87 rows=24428 width=76) (actual time=112742.855..633535.913 rows=5 loops=3) -> Parallel Index Scan using snapshot_branch_name_target_target_type_idx on snapshot_branch (cost=0.70..12612971.47 rows=154824599 width=52) (actual time=0.042..360975.705 rows=78257338 loops=3) Index Cond: (name >= '\x'::bytea) -> Index Only Scan using snapshot_branches_pkey on snapshot_branches (cost=0.70..1.01 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=234772013) Index Cond: ((snapshot_id = $0) AND (branch_id = snapshot_branch.object_id)) Heap Fetches: 15 Planning Time: 7.092 ms Execution Time: 895901.535 ms (16 rows)
The query has cleary not the same impact with a limit of 10 or 2
What I do not understand is why the timeout does not occur in production webapp as the query with limit 2 is executed while browsing an origin to resolve the HEAD branch alias.
To be noted, that query is wrapped in a stored PGSQL procedure (swh_snapshot_get_by_id) so this may be related.
yep it's weird, but after looking at the code of the function, I realized it seems to be a known problem :
https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-funcs.sql$665-667
so yes indeed, perhaps increasing the minimum limit as you suggested can do the job
Ack. I guess I will handle that directly in the swh.storage.postgresql.storage.snapshot_get_branches and ensures a minimum of 10 for the limit value.