Resolves T1517.
Details
- Reviewers
ardumont - Group Reviewers
Reviewers - Maniphest Tasks
- T1517: Metadata search is too slow
- Commits
- rDCIDXa02da7261f11: Use the index when doing metadata search.
Diff Detail
- Repository
- rDCIDX Metadata indexer
- Branch
- fulltext-index
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 4078 Build 5369: tox-on-jenkins Jenkins Build 5368: arc lint + arc unit
Event Timeline
Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/318/ for more details.
I'm not so good at guessing that this change actually makes it use the index.
Checking the current schema, i see the column metadata_tsvector has an index so from afar, the change seems to do as promised ;)
create index origin_intrinsic_metadata_fulltext_idx on origin_intrinsic_metadata using gin (metadata_tsvector);
ps: I think proposing an explain plan of before and after might be good (i know it's boring ;)
indeed
ps: I think proposing an explain plan of before and after might be good (i know it's boring ;)
Old query:
softwareheritage-indexer=> explain SELECT origin_id, metadata, from_revision, mappings, i.id as tool_id, tool_name, tool_version, tool_configuration FROM origin_intrinsic_metadata AS oim INNER JOIN indexer_configuration AS i ON oim.indexer_configuration_id=i.id JOIN LATERAL (SELECT plainto_tsquery('pg_catalog.simple', 'parmap')) AS s(tsq) ON true WHERE to_tsvector('pg_catalog.simple', metadata) @@ tsq ORDER BY ts_rank(oim.metadata_tsvector, tsq, 1) DESC LIMIT 10; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Limit (cost=3547638.84..3547640.01 rows=10 width=299) -> Gather Merge (cost=3547638.84..3558902.16 rows=96536 width=299) Workers Planned: 2 -> Sort (cost=3546638.82..3546759.49 rows=48268 width=299) Sort Key: (ts_rank(oim.metadata_tsvector, '''parmap'''::tsquery, 1)) DESC -> Hash Join (cost=5.03..3542883.86 rows=48268 width=299) Hash Cond: (oim.indexer_configuration_id = i.id) -> Parallel Seq Scan on origin_intrinsic_metadata oim (cost=0.00..3542625.50 rows=48268 width=300) Filter: (to_tsvector('simple'::regconfig, metadata) @@ '''parmap'''::tsquery) -> Hash (cost=3.90..3.90 rows=90 width=149) -> Seq Scan on indexer_configuration i (cost=0.00..3.90 rows=90 width=149) (11 rows)
new one:
softwareheritage-indexer=> explain SELECT origin_id, metadata, from_revision, mappings, i.id as tool_id, tool_name, tool_version, tool_configuration FROM origin_intrinsic_metadata AS oim INNER JOIN indexer_configuration AS i ON oim.indexer_configuration_id=i.id JOIN LATERAL (SELECT plainto_tsquery('pg_catalog.simple', 'parmap')) AS s(tsq) ON true WHERE metadata_tsvector @@ tsq ORDER BY ts_rank(oim.metadata_tsvector, tsq, 1) DESC LIMIT 10; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Limit (cost=4215.95..4215.98 rows=10 width=299) -> Sort (cost=4215.95..4225.61 rows=3862 width=299) Sort Key: (ts_rank(oim.metadata_tsvector, '''parmap'''::tsquery, 1)) DESC -> Hash Join (cost=208.95..4132.50 rows=3862 width=299) Hash Cond: (oim.indexer_configuration_id = i.id) -> Bitmap Heap Scan on origin_intrinsic_metadata oim (cost=203.93..4107.20 rows=3862 width=300) Recheck Cond: (metadata_tsvector @@ '''parmap'''::tsquery) -> Bitmap Index Scan on origin_intrinsic_metadata_fulltext_idx (cost=0.00..202.96 rows=3862 width=0) Index Cond: (metadata_tsvector @@ '''parmap'''::tsquery) -> Hash (cost=3.90..3.90 rows=90 width=149) -> Seq Scan on indexer_configuration i (cost=0.00..3.90 rows=90 width=149) (11 rows)
In practice, that's several minutes vs less than a second.
Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/370/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/370/console