Page MenuHomeSoftware Heritage

postgresql: Remove merge join with origin_visit in origin_visit_get_latest
ClosedPublic

Authored by vlorentz on Sep 29 2022, 9:58 AM.

Details

Summary

I noticed that origin_visit_get_latest spends a lot of time doing index
scans on origin_visit_pkey:

swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                     Rows Removed by Filter: 198
               ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                     ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                           Index Cond: (id = $0)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
               Index Cond: (origin = $0)
 Planning Time: 0.577 ms
 Execution Time: 1432.532 ms
(18 lignes)

As far as I understand, this is because we do not have a FK to tell the
planner that every row in origin_visit_status does have a
corresponding row in origin_visit, so it checks every row from
origin_visit_status in this loop.

Therefore, I rewrote the query to use a LEFT JOIN, so it will spare
this check.

First, here is the original query:

swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)

Change columns to filter directly on the "materialized" fields in ovs
instead of those on those in ov (no actual change yet):

swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)

Then, reorder tables (obviously no change either):

swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)

Finally, replace INNER JOIN with LEFT JOIN:

swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..35.47 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
               Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 lignes)

This would also work with a subquery just to get the value of ov.date
and removing the actual join to ov entirely, but it was more annoying
to implement because the function reuses self.origin_visit_select_cols
as column list.

All these EXPLAIN queries were run on staging.

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Build has FAILED

Patch application report for D8574 (id=30930)

Rebasing onto 87d3f0d7b4...

Current branch diff-target is up to date.
Changes applied before test
commit 7a46d85742b0a9dcb070a9f7786f2495da4bb343
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Sep 29 09:47:39 2022 +0200

    postgresql: Remove merge join with origin_visit in origin_visit_get_latest
    
    I noticed that `origin_visit_get_latest` spends a lot of time doing index
    scans on `origin_visit_pkey`:
swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                     Rows Removed by Filter: 198
               ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                     ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                           Index Cond: (id = $0)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
               Index Cond: (origin = $0)
 Planning Time: 0.577 ms
 Execution Time: 1432.532 ms
(18 lignes)
```

As far as I understand, this is because we do not have a FK to tell the
planner that every row in `origin_visit_status` does have a
corresponding row in `origin_visit`, so it checks every row from
`origin_visit_status` in this loop.

Therefore, I rewrote the query to use a `LEFT JOIN`, so it will spare
this check.

First, here is the original query:

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Change columns to filter directly on the "materialized" fields in ovs
instead of those on those in ov (no actual change yet):

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Then, reorder tables (obviously no change either):

```
swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Finally, replace `INNER JOIN` with `LEFT JOIN`:

```
swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..35.47 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
               Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 lignes)
```

This would also work with a subquery just to get the value of `ov.date`
and removing the actual join to `ov` entirely, but it was more annoying
to implement because the function reuses `self.origin_visit_select_cols`
as column list.

All these EXPLAIN queries were run on staging.
Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1668/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1668/console
Harbormaster returned this revision to the author for changes because remote builds failed.Sep 29 2022, 10:01 AM
Harbormaster failed remote builds in B31902: Diff 30930!
Harbormaster returned this revision to the author for changes because remote builds failed.Sep 29 2022, 10:41 AM
Harbormaster failed remote builds in B31907: Diff 30935!

Build has FAILED

Patch application report for D8574 (id=30935)

Could not rebase; Attempt merge onto 87d3f0d7b4...

Updating 87d3f0d7..06157284
Fast-forward
 .coveragerc                       |  4 ++++
 swh/storage/postgresql/db.py      | 14 ++++++++------
 swh/storage/postgresql/storage.py |  3 +++
 swh/storage/tests/conftest.py     |  6 +++++-
 4 files changed, 20 insertions(+), 7 deletions(-)
 create mode 100644 .coveragerc
Changes applied before test
commit 0615728457738b1ba1994ccef27c40608fa4e69a
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Sep 29 09:47:39 2022 +0200

    postgresql: Remove merge join with origin_visit in origin_visit_get_latest
    
    I noticed that `origin_visit_get_latest` spends a lot of time doing index
    scans on `origin_visit_pkey`:
swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                     Rows Removed by Filter: 198
               ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                     ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                           Index Cond: (id = $0)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
               Index Cond: (origin = $0)
 Planning Time: 0.577 ms
 Execution Time: 1432.532 ms
(18 lignes)
```

As far as I understand, this is because we do not have a FK to tell the
planner that every row in `origin_visit_status` does have a
corresponding row in `origin_visit`, so it checks every row from
`origin_visit_status` in this loop.

Therefore, I rewrote the query to use a `LEFT JOIN`, so it will spare
this check.

First, here is the original query:

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Change columns to filter directly on the "materialized" fields in ovs
instead of those on those in ov (no actual change yet):

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Then, reorder tables (obviously no change either):

```
swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Finally, replace `INNER JOIN` with `LEFT JOIN`:

```
swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..35.47 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
               Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 lignes)
```

This would also work with a subquery just to get the value of `ov.date`
and removing the actual join to `ov` entirely, but it was more annoying
to implement because the function reuses `self.origin_visit_select_cols`
as column list.

All these EXPLAIN queries were run on staging.

commit c3598dd4351cc9de4647278b2304d9c0f624be19
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Sep 29 10:15:36 2022 +0200

conftest: Replace multiprocessing hack when pytest-cov >= 4 is installed

The hack crashes on >= 4 because 'pytest_cov.embed.multiprocessing_start'
is not in the hook list anymore.

https://pytest-cov.readthedocs.io/en/latest/changelog.html
Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1671/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1671/console

test failure is unrelated, I'm trying to solve it in D8576

Build is green

Patch application report for D8574 (id=30946)

Could not rebase; Attempt merge onto 87d3f0d7b4...

Updating 87d3f0d7..657d31f6
Fast-forward
 swh/storage/postgresql/db.py      | 14 ++++++++------
 swh/storage/postgresql/storage.py |  3 +++
 swh/storage/tests/conftest.py     |  3 ++-
 3 files changed, 13 insertions(+), 7 deletions(-)
Changes applied before test
commit 657d31f67555bd727a3412a36b53d569b135f276
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date:   Thu Sep 29 09:47:39 2022 +0200

    postgresql: Remove merge join with origin_visit in origin_visit_get_latest
    
    I noticed that `origin_visit_get_latest` spends a lot of time doing index
    scans on `origin_visit_pkey`:
swh=> explain analyze SELECT * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.14..29.33 rows=1 width=171) (actual time=1432.475..1432.479 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o_1  (cost=0.56..8.57 rows=1 width=8) (actual time=0.077..0.079 rows=1 loops=1)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.56..2208.37 rows=115 width=171) (actual time=1432.473..1432.476 rows=1 loops=1)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Nested Loop  (cost=1.00..1615.69 rows=93 width=143) (actual time=298.705..298.707 rows=1 loops=1)
               ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85) (actual time=298.658..298.658 rows=1 loops=1)
                     Index Cond: (origin = $0)
                     Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
                     Rows Removed by Filter: 198
               ->  Materialize  (cost=0.43..8.46 rows=1 width=58) (actual time=0.042..0.043 rows=1 loops=1)
                     ->  Index Scan using origin_pkey on origin o  (cost=0.43..8.45 rows=1 width=58) (actual time=0.038..0.038 rows=1 loops=1)
                           Index Cond: (id = $0)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28) (actual time=30.120..1133.650 rows=100 loops=1)
               Index Cond: (origin = $0)
 Planning Time: 0.577 ms
 Execution Time: 1432.532 ms
(18 lignes)
```

As far as I understand, this is because we do not have a FK to tell the
planner that every row in `origin_visit_status` does have a
corresponding row in `origin_visit`, so it checks every row from
`origin_visit_status` in this loop.

Therefore, I rewrote the query to use a `LEFT JOIN`, so it will spare
this check.

First, here is the original query:

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ov.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ov.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Change columns to filter directly on the "materialized" fields in ovs
instead of those on those in ov (no actual change yet):

```
swh=> explain SELECT * FROM origin_visit ov INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Then, reorder tables (obviously no change either):

```
swh=> explain SELECT * FROM origin_visit_status ovs INNER JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..28.82 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Merge Join  (cost=1.13..2198.75 rows=115 width=113)
         Merge Cond: (ovs.visit = ov.visit)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.56..590.92 rows=150 width=28)
               Index Cond: (origin = $0)
(11 lignes)
```

Finally, replace `INNER JOIN` with `LEFT JOIN`:

```
swh=> explain SELECT * FROM origin_visit_status ovs LEFT JOIN origin_visit ov USING (origin, visit) WHERE ovs.origin = (SELECT id FROM origin o WHERE o.url = 'https://pypi.org/project/simpleado/') AND ovs.snapshot is not null AND ovs.status = 'full' ORDER BY ovs.visit DESC, ovs.date DESC LIMIT 1;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.71..35.47 rows=1 width=113)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx on origin o  (cost=0.56..8.57 rows=1 width=8)
           Index Cond: (url = 'https://pypi.org/project/simpleado/'::text)
   ->  Nested Loop Left Join  (cost=1.13..2396.79 rows=93 width=113)
         ->  Index Scan Backward using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.57..1606.07 rows=93 width=85)
               Index Cond: (origin = $0)
               Filter: ((snapshot IS NOT NULL) AND (status = 'full'::origin_visit_state))
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.56..8.59 rows=1 width=28)
               Index Cond: ((origin = ovs.origin) AND (origin = $0) AND (visit = ovs.visit))
(10 lignes)
```

This would also work with a subquery just to get the value of `ov.date`
and removing the actual join to `ov` entirely, but it was more annoying
to implement because the function reuses `self.origin_visit_select_cols`
as column list.

All these EXPLAIN queries were run on staging.

commit 44616afe5109f495604b01cf72b412fbec8e94c0
Author: Valentin Lorentz <vlorentz@softwareheritage.org>
Date: Thu Sep 29 10:15:36 2022 +0200

conftest: Replace multiprocessing hack when pytest-cov >= 4 is installed

The hack crashes on >= 4 because 'pytest_cov.embed.multiprocessing_start'
is not in the hook list anymore.

https://pytest-cov.readthedocs.io/en/latest/changelog.html
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1675/ for more details.

That seems sensible, thanks.

This revision is now accepted and ready to land.Sep 30 2022, 4:29 PM