Resolves T1517.
Details
- Reviewers
ardumont - Group Reviewers
Reviewers - Maniphest Tasks
- T1517: Metadata search is too slow
- Commits
- rDCIDXa02da7261f11: Use the index when doing metadata search.
Diff Detail
- Repository
- rDCIDX Metadata indexer
- Branch
- fulltext-index
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 4078 Build 5369: tox-on-jenkins Jenkins Build 5368: arc lint + arc unit
Event Timeline
Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/318/ for more details.
I'm not so good at guessing that this change actually makes it use the index.
Checking the current schema, i see the column metadata_tsvector has an index so from afar, the change seems to do as promised ;)
create index origin_intrinsic_metadata_fulltext_idx on origin_intrinsic_metadata using gin (metadata_tsvector);
ps: I think proposing an explain plan of before and after might be good (i know it's boring ;)
indeed
ps: I think proposing an explain plan of before and after might be good (i know it's boring ;)
Old query:
softwareheritage-indexer=> explain SELECT origin_id, metadata, from_revision, mappings, i.id as tool_id, tool_name, tool_version, tool_configuration FROM origin_intrinsic_metadata AS oim INNER JOIN indexer_configuration AS i ON oim.indexer_configuration_id=i.id JOIN LATERAL (SELECT plainto_tsquery('pg_catalog.simple', 'parmap')) AS s(tsq) ON true WHERE to_tsvector('pg_catalog.simple', metadata) @@ tsq ORDER BY ts_rank(oim.metadata_tsvector, tsq, 1) DESC LIMIT 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Limit (cost=3547638.84..3547640.01 rows=10 width=299)
-> Gather Merge (cost=3547638.84..3558902.16 rows=96536 width=299)
Workers Planned: 2
-> Sort (cost=3546638.82..3546759.49 rows=48268 width=299)
Sort Key: (ts_rank(oim.metadata_tsvector, '''parmap'''::tsquery, 1)) DESC
-> Hash Join (cost=5.03..3542883.86 rows=48268 width=299)
Hash Cond: (oim.indexer_configuration_id = i.id)
-> Parallel Seq Scan on origin_intrinsic_metadata oim (cost=0.00..3542625.50 rows=48268 width=300)
Filter: (to_tsvector('simple'::regconfig, metadata) @@ '''parmap'''::tsquery)
-> Hash (cost=3.90..3.90 rows=90 width=149)
-> Seq Scan on indexer_configuration i (cost=0.00..3.90 rows=90 width=149)
(11 rows)new one:
softwareheritage-indexer=> explain SELECT origin_id, metadata, from_revision, mappings, i.id as tool_id, tool_name, tool_version, tool_configuration FROM origin_intrinsic_metadata AS oim INNER JOIN indexer_configuration AS i ON oim.indexer_configuration_id=i.id JOIN LATERAL (SELECT plainto_tsquery('pg_catalog.simple', 'parmap')) AS s(tsq) ON true WHERE metadata_tsvector @@ tsq ORDER BY ts_rank(oim.metadata_tsvector, tsq, 1) DESC LIMIT 10; QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Limit (cost=4215.95..4215.98 rows=10 width=299)
-> Sort (cost=4215.95..4225.61 rows=3862 width=299)
Sort Key: (ts_rank(oim.metadata_tsvector, '''parmap'''::tsquery, 1)) DESC
-> Hash Join (cost=208.95..4132.50 rows=3862 width=299)
Hash Cond: (oim.indexer_configuration_id = i.id)
-> Bitmap Heap Scan on origin_intrinsic_metadata oim (cost=203.93..4107.20 rows=3862 width=300)
Recheck Cond: (metadata_tsvector @@ '''parmap'''::tsquery)
-> Bitmap Index Scan on origin_intrinsic_metadata_fulltext_idx (cost=0.00..202.96 rows=3862 width=0)
Index Cond: (metadata_tsvector @@ '''parmap'''::tsquery)
-> Hash (cost=3.90..3.90 rows=90 width=149)
-> Seq Scan on indexer_configuration i (cost=0.00..3.90 rows=90 width=149)
(11 rows)In practice, that's several minutes vs less than a second.
Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/370/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/370/console