Limit the number of origin_visit(_status) joined with origins before sorting.
AbandonedPublic
Actions

Authored by vlorentz on Jul 23 2020, 6:29 PM.

Details

Reviewers

None

Group Reviewers

Summary

Without the limit on the subquery, postgresql greatly overestimates the
number of items in the cardinal product, which causes it to prefer
doing an ordered scan on origins rather than filtering on sorting,
which is extremely slow.

Adding this limit bounds the size of the cardinal product to the number
of matches in the origin index, which postgresql knows to be low,
so it automatically prefers the index scan.

Diff Detail

Repository

rDSTO Storage manager

Branch

master

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 13924
Build 21363: Phabricator diff pipeline on jenkins	Jenkins console · Jenkins
Build 21362: arc lint + arc unit

Event Timeline

vlorentz created this revision.Jul 23 2020, 6:29 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptJul 23 2020, 6:29 PM

vlorentz abandoned this revision.Jul 23 2020, 6:34 PM

18:32 guest@softwareheritage => explain SELECT url FROM origin o WHERE EXISTS ( SELECT 1 FROM origin_visit ov INNER JOIN origin_visit_status ovs ON ov.origin = ovs.origin AND ov.visit = ovs.visit INNER JOIN snapshot ON ovs.snapshot=snapshot.id WHERE ov.origin=o.id) AND url ~* 'jeIlyfish' ORDER BY id OFFSET 0 LIMIT 100;
                                                                 QUERY PLAN                                                                  
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=1436.59..44103123.25 rows=100 width=52)
   ->  Gather Merge  (cost=1436.59..55569561.78 rows=126 width=52)
         Workers Planned: 2
         ->  Nested Loop Semi Join  (cost=436.57..55568547.21 rows=52 width=52)
               ->  Parallel Index Scan using origin_pkey on origin o  (cost=0.57..11731897.62 rows=5574 width=52)
                     Filter: (url ~* 'jeIlyfish'::text)
               ->  Hash Join  (cost=436.00..8196.11 rows=17979 width=16)
                     Hash Cond: ((ovs.origin = ov.origin) AND (ovs.visit = ov.visit))
                     ->  Nested Loop  (cost=1.14..7744.70 rows=3152 width=16)
                           ->  Index Scan using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..2284.79 rows=3875 width=37)
                                 Index Cond: (origin = o.id)
                           ->  Index Only Scan using snapshot_id_idx on snapshot  (cost=0.57..1.41 rows=1 width=21)
                                 Index Cond: (id = (ovs.snapshot)::bytea)
                     ->  Hash  (cost=426.33..426.33 rows=569 width=16)
                           ->  Index Only Scan using origin_visit_pkey on origin_visit ov  (cost=0.57..426.33 rows=569 width=16)
                                 Index Cond: (origin = o.id)
(16 lignes)

Temps : 27,290 ms
18:33 guest@softwareheritage => explain SELECT url FROM origin o WHERE EXISTS ( SELECT 1 FROM origin_visit ov INNER JOIN origin_visit_status ovs ON ov.origin = ovs.origin AND ov.visit = ovs.visit INNER JOIN snapshot ON ovs.snapshot=snapshot.id WHERE ov.origin=o.id limit 1 ) AND url ~* 'jeIlyfish' ORDER BY id OFFSET 0 LIMIT 100;
                                                                 QUERY PLAN                                                                  
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=1436.59..44103123.25 rows=100 width=52)
   ->  Gather Merge  (cost=1436.59..55569561.78 rows=126 width=52)
         Workers Planned: 2
         ->  Nested Loop Semi Join  (cost=436.57..55568547.21 rows=52 width=52)
               ->  Parallel Index Scan using origin_pkey on origin o  (cost=0.57..11731897.62 rows=5574 width=52)
                     Filter: (url ~* 'jeIlyfish'::text)
               ->  Hash Join  (cost=436.00..8196.11 rows=17979 width=16)
                     Hash Cond: ((ovs.origin = ov.origin) AND (ovs.visit = ov.visit))
                     ->  Nested Loop  (cost=1.14..7744.70 rows=3152 width=16)
                           ->  Index Scan using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..2284.79 rows=3875 width=37)
                                 Index Cond: (origin = o.id)
                           ->  Index Only Scan using snapshot_id_idx on snapshot  (cost=0.57..1.41 rows=1 width=21)
                                 Index Cond: (id = (ovs.snapshot)::bytea)
                     ->  Hash  (cost=426.33..426.33 rows=569 width=16)
                           ->  Index Only Scan using origin_visit_pkey on origin_visit ov  (cost=0.57..426.33 rows=569 width=16)
                                 Index Cond: (origin = o.id)

Both these query plans are identical.

You really want to move the origin pattern filtering to a CTE to force the engine to use the proper index for it.

Build is green

Patch application report for D3606 (id=12699)

Rebasing onto d8583eb468...

Current branch diff-target is up to date.

Changes applied before test

commit 4f97be8838cd600742f475aca4d90c7604a14ec4
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Jul 23 18:29:15 2020 +0200

    Limit the number of origin_visit(_status) joined with origins before sorting.
    
    Without the limit on the subquery, postgresql greatly overestimates the
    number of items in the cardinal product, which causes it to prefer
    doing an ordered scan on origins rather than filtering on sorting,
    which is extremely slow.
    
    Adding this limit bounds the size of the cardinal product to the number
    of matches in the origin index, which postgresql knows to be low,
    so it automatically prefers the index scan.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/558/ for more details.

Harbormaster completed remote builds in B13924: Diff 12699.Jul 23 2020, 6:36 PM

ardumont mentioned this in D6305: opam: Install and maintain up-to-date shared opam root directories.Sep 20 2021, 3:11 PM