Page MenuHomeSoftware Heritage

Limit the number of origin_visit(_status) joined with origins before sorting.
AbandonedPublic

Authored by vlorentz on Jul 23 2020, 6:29 PM.

Details

Reviewers
None
Group Reviewers
Reviewers
Summary

Without the limit on the subquery, postgresql greatly overestimates the
number of items in the cardinal product, which causes it to prefer
doing an ordered scan on origins rather than filtering on sorting,
which is extremely slow.

Adding this limit bounds the size of the cardinal product to the number
of matches in the origin index, which postgresql knows to be low,
so it automatically prefers the index scan.

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 13924
Build 21363: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 21362: arc lint + arc unit

Event Timeline

18:32 guest@softwareheritage => explain SELECT url FROM origin o WHERE EXISTS ( SELECT 1 FROM origin_visit ov INNER JOIN origin_visit_status ovs ON ov.origin = ovs.origin AND ov.visit = ovs.visit INNER JOIN snapshot ON ovs.snapshot=snapshot.id WHERE ov.origin=o.id) AND url ~* 'jeIlyfish' ORDER BY id OFFSET 0 LIMIT 100;
                                                                 QUERY PLAN                                                                  
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=1436.59..44103123.25 rows=100 width=52)
   ->  Gather Merge  (cost=1436.59..55569561.78 rows=126 width=52)
         Workers Planned: 2
         ->  Nested Loop Semi Join  (cost=436.57..55568547.21 rows=52 width=52)
               ->  Parallel Index Scan using origin_pkey on origin o  (cost=0.57..11731897.62 rows=5574 width=52)
                     Filter: (url ~* 'jeIlyfish'::text)
               ->  Hash Join  (cost=436.00..8196.11 rows=17979 width=16)
                     Hash Cond: ((ovs.origin = ov.origin) AND (ovs.visit = ov.visit))
                     ->  Nested Loop  (cost=1.14..7744.70 rows=3152 width=16)
                           ->  Index Scan using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..2284.79 rows=3875 width=37)
                                 Index Cond: (origin = o.id)
                           ->  Index Only Scan using snapshot_id_idx on snapshot  (cost=0.57..1.41 rows=1 width=21)
                                 Index Cond: (id = (ovs.snapshot)::bytea)
                     ->  Hash  (cost=426.33..426.33 rows=569 width=16)
                           ->  Index Only Scan using origin_visit_pkey on origin_visit ov  (cost=0.57..426.33 rows=569 width=16)
                                 Index Cond: (origin = o.id)
(16 lignes)

Temps : 27,290 ms
18:33 guest@softwareheritage => explain SELECT url FROM origin o WHERE EXISTS ( SELECT 1 FROM origin_visit ov INNER JOIN origin_visit_status ovs ON ov.origin = ovs.origin AND ov.visit = ovs.visit INNER JOIN snapshot ON ovs.snapshot=snapshot.id WHERE ov.origin=o.id limit 1 ) AND url ~* 'jeIlyfish' ORDER BY id OFFSET 0 LIMIT 100;
                                                                 QUERY PLAN                                                                  
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=1436.59..44103123.25 rows=100 width=52)
   ->  Gather Merge  (cost=1436.59..55569561.78 rows=126 width=52)
         Workers Planned: 2
         ->  Nested Loop Semi Join  (cost=436.57..55568547.21 rows=52 width=52)
               ->  Parallel Index Scan using origin_pkey on origin o  (cost=0.57..11731897.62 rows=5574 width=52)
                     Filter: (url ~* 'jeIlyfish'::text)
               ->  Hash Join  (cost=436.00..8196.11 rows=17979 width=16)
                     Hash Cond: ((ovs.origin = ov.origin) AND (ovs.visit = ov.visit))
                     ->  Nested Loop  (cost=1.14..7744.70 rows=3152 width=16)
                           ->  Index Scan using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..2284.79 rows=3875 width=37)
                                 Index Cond: (origin = o.id)
                           ->  Index Only Scan using snapshot_id_idx on snapshot  (cost=0.57..1.41 rows=1 width=21)
                                 Index Cond: (id = (ovs.snapshot)::bytea)
                     ->  Hash  (cost=426.33..426.33 rows=569 width=16)
                           ->  Index Only Scan using origin_visit_pkey on origin_visit ov  (cost=0.57..426.33 rows=569 width=16)
                                 Index Cond: (origin = o.id)

Both these query plans are identical.

You really want to move the origin pattern filtering to a CTE to force the engine to use the proper index for it.

Build is green

Patch application report for D3606 (id=12699)

Rebasing onto d8583eb468...

Current branch diff-target is up to date.
Changes applied before test
commit 4f97be8838cd600742f475aca4d90c7604a14ec4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 23 18:29:15 2020 +0200

    Limit the number of origin_visit(_status) joined with origins before sorting.
    
    Without the limit on the subquery, postgresql greatly overestimates the
    number of items in the cardinal product, which causes it to prefer
    doing an ordered scan on origins rather than filtering on sorting,
    which is extremely slow.
    
    Adding this limit bounds the size of the cardinal product to the number
    of matches in the origin index, which postgresql knows to be low,
    so it automatically prefers the index scan.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/558/ for more details.