HomeSoftware Heritage

postgresql: Remove merge join with origin_visit in origin_visit_get_latest

Description

postgresql: Remove merge join with origin_visit in origin_visit_get_latest

I noticed that origin_visit_get_latest spends a lot of time doing index
scans on origin_visit_pkey:

swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                     Rows Removed by Filter: 198
               ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                     ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                           Index Cond: (id = $0)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
               Index Cond: (origin = $0)
 Planning Time: 0.577 ms
 Execution Time: 1432.532 ms
(18 lignes)

As far as I understand, this is because we do not have a FK to tell the
planner that every row in origin_visit_status does have a
corresponding row in origin_visit, so it checks every row from
origin_visit_status in this loop.

Therefore, I rewrote the query to use a LEFT JOIN, so it will spare
this check.

First, here is the original query:

swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)

Change columns to filter directly on the "materialized" fields in ovs
instead of those on those in ov (no actual change yet):

swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)

Then, reorder tables (obviously no change either):

swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)

Finally, replace INNER JOIN with LEFT JOIN:

swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..35.47 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
               Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 lignes)

This would also work with a subquery just to get the value of ov.date
and removing the actual join to ov entirely, but it was more annoying
to implement because the function reuses self.origin_visit_select_cols
as column list.

All these EXPLAIN queries were run on staging.