Resolves T1517.
Details
- Reviewers
ardumont - Group Reviewers
Reviewers - Maniphest Tasks
- T1517: Metadata search is too slow
- Commits
- rDCIDXa02da7261f11: Use the index when doing metadata search.
Diff Detail
- Repository
- rDCIDX Metadata indexer
- Branch
- fulltext-index
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 4016 Build 5270: tox-on-jenkins Jenkins Build 5269: arc lint + arc unit
Event Timeline
Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/318/ for more details.
I'm not so good at guessing that this change actually makes it use the index.
Checking the current schema, i see the column metadata_tsvector has an index so from afar, the change seems to do as promised ;)
create index origin_intrinsic_metadata_fulltext_idx on origin_intrinsic_metadata using gin (metadata_tsvector);
ps: I think proposing an explain plan of before and after might be good (i know it's boring ;)
indeed
ps: I think proposing an explain plan of before and after might be good (i know it's boring ;)
Old query:
softwareheritage-indexer=> explain SELECT origin_id, metadata, from_revision, mappings, i.id as tool_id, tool_name, tool_version, tool_configuration FROM origin_intrinsic_metadata AS oim INNER JOIN indexer_configuration AS i ON oim.indexer_configuration_id=i.id JOIN LATERAL (SELECT plainto_tsquery('pg_catalog.simple', 'parmap')) AS s(tsq) ON true WHERE to_tsvector('pg_catalog.simple', metadata) @@ tsq ORDER BY ts_rank(oim.metadata_tsvector, tsq, 1) DESC LIMIT 10; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Limit (cost=3547638.84..3547640.01 rows=10 width=299) -> Gather Merge (cost=3547638.84..3558902.16 rows=96536 width=299) Workers Planned: 2 -> Sort (cost=3546638.82..3546759.49 rows=48268 width=299) Sort Key: (ts_rank(oim.metadata_tsvector, '''parmap'''::tsquery, 1)) DESC -> Hash Join (cost=5.03..3542883.86 rows=48268 width=299) Hash Cond: (oim.indexer_configuration_id = i.id) -> Parallel Seq Scan on origin_intrinsic_metadata oim (cost=0.00..3542625.50 rows=48268 width=300) Filter: (to_tsvector('simple'::regconfig, metadata) @@ '''parmap'''::tsquery) -> Hash (cost=3.90..3.90 rows=90 width=149) -> Seq Scan on indexer_configuration i (cost=0.00..3.90 rows=90 width=149) (11 rows)
new one:
softwareheritage-indexer=> explain SELECT origin_id, metadata, from_revision, mappings, i.id as tool_id, tool_name, tool_version, tool_configuration FROM origin_intrinsic_metadata AS oim INNER JOIN indexer_configuration AS i ON oim.indexer_configuration_id=i.id JOIN LATERAL (SELECT plainto_tsquery('pg_catalog.simple', 'parmap')) AS s(tsq) ON true WHERE metadata_tsvector @@ tsq ORDER BY ts_rank(oim.metadata_tsvector, tsq, 1) DESC LIMIT 10; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Limit (cost=4215.95..4215.98 rows=10 width=299) -> Sort (cost=4215.95..4225.61 rows=3862 width=299) Sort Key: (ts_rank(oim.metadata_tsvector, '''parmap'''::tsquery, 1)) DESC -> Hash Join (cost=208.95..4132.50 rows=3862 width=299) Hash Cond: (oim.indexer_configuration_id = i.id) -> Bitmap Heap Scan on origin_intrinsic_metadata oim (cost=203.93..4107.20 rows=3862 width=300) Recheck Cond: (metadata_tsvector @@ '''parmap'''::tsquery) -> Bitmap Index Scan on origin_intrinsic_metadata_fulltext_idx (cost=0.00..202.96 rows=3862 width=0) Index Cond: (metadata_tsvector @@ '''parmap'''::tsquery) -> Hash (cost=3.90..3.90 rows=90 width=149) -> Seq Scan on indexer_configuration i (cost=0.00..3.90 rows=90 width=149) (11 rows)
In practice, that's several minutes vs less than a second.
Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/370/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/370/console
You should use the --no-bindings option of tree-sitter generate, it will not generate node and rust bindings we do not need here.
Also can you add the generation of the wasm module for the parser in the generate script ?
It will enable to use the parser in swh-web using web-tree-sitter.