Page MenuHomeSoftware Heritage

idx.storage.db: Improve content range queries to actually finish
ClosedPublic

Authored by ardumont on Feb 24 2020, 2:14 PM.

Details

Summary

In production, the previous query version takes too long. This ended up failing
with timeouts.

Counter-intuitively, specifying twice the range query to use (both in
content_mimetype and content_fossology_license), the execution now returns
immediately the expected results.

Closes T1788

Test Plan

tox

before (plus with the new index content_mimetype_id_idx created here), this does not finish in time:

softwareheritage-indexer=> explain select t.id from content_fossology_license t inner join indexer_configuration ic on ic.id=t.indexer_configuration_id inner join content_mimetype cm on (t.id=cm.id and cm.mimetype like 'text/%') where ic.id=1 and '\xbbb6ed677707a2ac700000000000000000000000'::bytea <= t.id and t.id <= '\xbbb6f52d234ebdf3f80000000000000000000000'::bytea order by t.indexer_configuration_id, t.id limit 1000;
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.55..16.66 rows=1000 width=40)
   ->  Merge Join  (cost=1.55..169248984.34 rows=11202576684 width=40)
         Merge Cond: ((t.id)::bytea = (cm.id)::bytea)
         ->  Nested Loop  (cost=0.84..547141.79 rows=93334 width=40)
               ->  Index Only Scan using content_fossology_license_pkey on content_fossology_license t  (cost=0.70..545972.96 rows=93334 width=40)
                     Index Cond: ((id >= '\xbbb6ed677707a2ac700000000000000000000000'::bytea) AND (id <= '\xbbb6f52d234ebdf3f80000000000000000000000'::bytea) AND (indexer_configuration_id = 1))
               ->  Materialize  (cost=0.14..2.16 rows=1 width=4)
                     ->  Index Only Scan using indexer_configuration_pkey on indexer_configuration ic  (cost=0.14..2.15 rows=1 width=4)
                           Index Cond: (id = 1)
         ->  Materialize  (cost=0.70..722972.33 rows=24005350 width=32)
               ->  Index Only Scan using content_mimetype_id_idx on content_mimetype cm  (cost=0.70..662958.95 rows=24005350 width=32)
(11 rows)

Now (with the content of this diff), this finishes:

softwareheritage-indexer=> explain select t.id from content_fossology_license t inner join content_mimetype cm on (t.id=cm.id and cm.mimetype like 'text/%' and '\xbbb6ed677707a2ac700000000000000000000000'::bytea <= cm.id and cm.id <= '\xbbb6f52d234ebdf3f80000000000000000000000'::bytea) where t.indexer_configuration_id=1 and '\xbbb6ed677707a2ac700000000000000000000000'::bytea <= t.id and t.id <= '\xbbb6f52d234ebdf3f80000000000000000000000'::bytea order by t.indexer_configuration_id, t.id limit 1000;
                                                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.41..26.23 rows=1000 width=40)
   ->  Merge Join  (cost=1.41..1390340.54 rows=56013000 width=40)
         Merge Cond: ((t.id)::bytea = (cm.id)::bytea)
         ->  Index Only Scan using content_fossology_license_pkey on content_fossology_license t  (cost=0.70..545972.96 rows=93334 width=40)
               Index Cond: ((id >= '\xbbb6ed677707a2ac700000000000000000000000'::bytea) AND (id <= '\xbbb6f52d234ebdf3f80000000000000000000000'::bytea) AND (indexer_configuration_id = 1))
         ->  Materialize  (cost=0.70..4239.31 rows=120027 width=32)
               ->  Index Only Scan using content_mimetype_id_idx on content_mimetype cm  (cost=0.70..3939.24 rows=120027 width=32)
                     Index Cond: ((id >= '\xbbb6ed677707a2ac700000000000000000000000'::bytea) AND (id <= '\xbbb6f52d234ebdf3f80000000000000000000000'::bytea))
(8 rows)

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10737
Build 16118: tox-on-jenkinsJenkins
Build 16117: arc lint + arc unit