Page MenuHomeSoftware Heritage

Use the index when doing metadata search.
ClosedPublic

Authored by vlorentz on Feb 5 2019, 3:55 PM.

Details

Diff Detail

Repository
rDCIDX Metadata indexer
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont added a subscriber: ardumont.

I'm not so good at guessing that this change actually makes it use the index.
Checking the current schema, i see the column metadata_tsvector has an index so from afar, the change seems to do as promised ;)

create index origin_intrinsic_metadata_fulltext_idx on origin_intrinsic_metadata using gin (metadata_tsvector);

ps: I think proposing an explain plan of before and after might be good (i know it's boring ;)

This revision is now accepted and ready to land.Feb 6 2019, 11:22 AM

Checking the current schema, i see the column metadata_tsvector has an index so from afar, the change seems to do as promised ;)

indeed

ps: I think proposing an explain plan of before and after might be good (i know it's boring ;)

Old query:

softwareheritage-indexer=> explain SELECT origin_id, metadata, from_revision, mappings, i.id as tool_id, tool_name, tool_version, tool_configuration FROM origin_intrinsic_metadata AS oim INNER JOIN indexer_configuration AS i ON oim.indexer_configuration_id=i.id JOIN LATERAL (SELECT plainto_tsquery('pg_catalog.simple', 'parmap')) AS s(tsq) ON true WHERE to_tsvector('pg_catalog.simple', metadata) @@ tsq ORDER BY ts_rank(oim.metadata_tsvector, tsq, 1) DESC LIMIT 10;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3547638.84..3547640.01 rows=10 width=299)
   ->  Gather Merge  (cost=3547638.84..3558902.16 rows=96536 width=299)
         Workers Planned: 2
         ->  Sort  (cost=3546638.82..3546759.49 rows=48268 width=299)
               Sort Key: (ts_rank(oim.metadata_tsvector, '''parmap'''::tsquery, 1)) DESC
               ->  Hash Join  (cost=5.03..3542883.86 rows=48268 width=299)
                     Hash Cond: (oim.indexer_configuration_id = i.id)
                     ->  Parallel Seq Scan on origin_intrinsic_metadata oim  (cost=0.00..3542625.50 rows=48268 width=300)
                           Filter: (to_tsvector('simple'::regconfig, metadata) @@ '''parmap'''::tsquery)
                     ->  Hash  (cost=3.90..3.90 rows=90 width=149)
                           ->  Seq Scan on indexer_configuration i  (cost=0.00..3.90 rows=90 width=149)
(11 rows)

new one:

softwareheritage-indexer=> explain SELECT origin_id, metadata, from_revision, mappings, i.id as tool_id, tool_name, tool_version, tool_configuration FROM origin_intrinsic_metadata AS oim INNER JOIN indexer_configuration AS i ON oim.indexer_configuration_id=i.id JOIN LATERAL (SELECT plainto_tsquery('pg_catalog.simple', 'parmap')) AS s(tsq) ON true WHERE metadata_tsvector @@ tsq ORDER BY ts_rank(oim.metadata_tsvector, tsq, 1) DESC LIMIT 10;                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4215.95..4215.98 rows=10 width=299)
   ->  Sort  (cost=4215.95..4225.61 rows=3862 width=299)
         Sort Key: (ts_rank(oim.metadata_tsvector, '''parmap'''::tsquery, 1)) DESC
         ->  Hash Join  (cost=208.95..4132.50 rows=3862 width=299)
               Hash Cond: (oim.indexer_configuration_id = i.id)
               ->  Bitmap Heap Scan on origin_intrinsic_metadata oim  (cost=203.93..4107.20 rows=3862 width=300)
                     Recheck Cond: (metadata_tsvector @@ '''parmap'''::tsquery)
                     ->  Bitmap Index Scan on origin_intrinsic_metadata_fulltext_idx  (cost=0.00..202.96 rows=3862 width=0)
                           Index Cond: (metadata_tsvector @@ '''parmap'''::tsquery)
               ->  Hash  (cost=3.90..3.90 rows=90 width=149)
                     ->  Seq Scan on indexer_configuration i  (cost=0.00..3.90 rows=90 width=149)
(11 rows)

In practice, that's several minutes vs less than a second.

This revision was automatically updated to reflect the committed changes.