Page MenuHomeSoftware Heritage

postgresql: ensure origin_visit(_status) queries use index
ClosedPublic

Authored by olasd on Apr 12 2022, 1:59 PM.

Details

Summary

When using an inner join for the single origin value, instead of a
subquery, the query fails to use the (origin, visit) indexes and falls
back to fetching all the visits (or all the statuses) for the origin and
sorting them.

This breaks down for origins with a lot of visits, such as the ones that
are being used for end to end monitoring.

Using a subselect to generate a single origin id value ensures that the
queries can use the proper indexes.

Test Plan

unit tests pass unchanged

Comparison of postgresql query plans:

13:56 guest@softwareheritage => explain SELECT ov.visit, ov.date, ov.type FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin WHERE o.url = 'https://subversion.renater.fr/anonscm/svn/panda' ORDER BY ov.visit ASC LIMIT 1001;
                                                   QUERY PLAN
────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=612.53..612.54 rows=7 width=20)
   ->  Sort  (cost=612.53..612.54 rows=7 width=20)
         Sort Key: ov.visit
         ->  Nested Loop  (cost=0.58..612.43 rows=7 width=20)
               ->  Index Scan using origin_url_idx1 on origin o  (cost=0.00..2.02 rows=1 width=8)
                     Index Cond: (url = 'https://subversion.renater.fr/anonscm/svn/panda'::text)
               ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.58..604.24 rows=617 width=28)
                     Index Cond: (origin = o.id)
(8 lignes)

Temps : 10,580 ms
13:56 guest@softwareheritage => explain SELECT ov.visit, ov.date, ov.type FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin WHERE ov.origin = (select id from origin where url = 'https://subversion.renater.fr/anonscm/svn/panda') ORDER BY ov.visit ASC LIMIT 1001;
                                                QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=3.17..616.56 rows=617 width=20)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx1 on origin  (cost=0.00..2.02 rows=1 width=8)
           Index Cond: (url = 'https://subversion.renater.fr/anonscm/svn/panda'::text)
   ->  Nested Loop  (cost=1.15..614.54 rows=617 width=20)
         ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.58..604.24 rows=617 width=28)
               Index Cond: (origin = $0)
         ->  Materialize  (cost=0.57..2.59 rows=1 width=8)
               ->  Index Only Scan using origin_pkey on origin o  (cost=0.57..2.59 rows=1 width=8)
                     Index Cond: (id = $0)
(10 lignes)

Temps : 4,595 ms
13:56 guest@softwareheritage => explain SELECT ov.visit, ov.date, ov.type FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin WHERE ov.origin = (select id from origin where url = 'https://subversion.renater.fr/anonscm/svn/panda') ORDER BY ov.visit DESC LIMIT 1001;
                                                    QUERY PLAN
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=3.17..616.56 rows=617 width=20)
   InitPlan 1 (returns $0)
     ->  Index Scan using origin_url_idx1 on origin  (cost=0.00..2.02 rows=1 width=8)
           Index Cond: (url = 'https://subversion.renater.fr/anonscm/svn/panda'::text)
   ->  Nested Loop  (cost=1.15..614.54 rows=617 width=20)
         ->  Index Scan Backward using origin_visit_pkey on origin_visit ov  (cost=0.58..604.24 rows=617 width=28)
               Index Cond: (origin = $0)
         ->  Materialize  (cost=0.57..2.59 rows=1 width=8)
               ->  Index Only Scan using origin_pkey on origin o  (cost=0.57..2.59 rows=1 width=8)
                     Index Cond: (id = $0)
(10 lignes)

Temps : 4,347 ms
13:56 guest@softwareheritage =>

Actual query time before change: 15s; after change: 15*m*s!

Diff Detail

Repository
rDSTO Storage manager
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

(to preempt a comment, yes, we could re-jig the queries to avoid the now useless join on the origin table, but it's not really worth the trouble, the row will be hot in cache anyway)

Build is green

Patch application report for D7554 (id=27384)

Rebasing onto 7c9586a0bb...

Current branch diff-target is up to date.
Changes applied before test
commit 4e659b7b23cd69633a8a4a97c68e323e147ab741
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Apr 12 13:53:43 2022 +0200

    postgresql: ensure origin_visit(_status) queries use index
    
    When using an inner join for the single origin value, instead of a
    subquery, the query fails to use the (origin, visit) indexes and falls
    back to fetching all the visits (or all the statuses) for the origin and
    sorting them.
    
    This breaks down for origins with a lot of visits, such as the ones that
    are being used for end to end monitoring.
    
    Using a subselect to generate a single origin id value ensures that the
    queries can use the proper indexes.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1592/ for more details.

olasd requested review of this revision.Apr 12 2022, 2:09 PM

That's quite a speedup, thanks !

This revision is now accepted and ready to land.Apr 12 2022, 2:12 PM

Build is green

Patch application report for D7554 (id=27390)

Rebasing onto 035d4c1dd1...

Current branch diff-target is up to date.
Changes applied before test
commit 75aa07325d6c8b96a4d37194e4626dcbbace6b3d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Tue Apr 12 13:53:43 2022 +0200

    postgresql: ensure origin_visit(_status) queries use index
    
    When using an inner join for the single origin value, instead of a
    subquery, the query fails to use the (origin, visit) indexes and falls
    back to fetching all the visits (or all the statuses) for the origin and
    sorting them.
    
    This breaks down for origins with a lot of visits, such as the ones that
    are being used for end to end monitoring.
    
    Using a subselect to generate a single origin id value ensures that the
    queries can use the proper indexes.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/1593/ for more details.