Details

Reviewers

ardumont
olasd

Group Reviewers

Reviewers

Maniphest Tasks

T1334: Add a method to swh.indexer.storage to search over origin intrinsic metadata
T1335: Add an index on origin_intrinsic_metadata.metadata

Commits

rDCIDXf7e2308fafce: Add fulltext search on origin intrinsic metadata.

Summary

I made several choices while writing that Diff that are open to discussion:

Using GIN instead of GiST. That seems the most appropriate choice when reading https://www.postgresql.org/docs/9.1/textsearch-indexes.html ; but these part got removed from the doc of pgsql 10: https://www.postgresql.org/docs/10/textsearch-indexes.html
Using pg_catalog.simple as dictionary. Since we're dealing with any language and proper names, it seemed best to use a dictionary with no stop word. Though, arguably, most of the data will be English, and stop words usually don't appear in names.
It only supports conjunctions of search terms. I could easily add support for arbitrary levels of nestings and disjunctions/negations. That can be done later if we deem it worth it.
It indexes JSON keys too. It is probably possible to fix this, at the expanse of complicated SQL code, or some postprocessing in Python.

Resolves T1334 and T1335.

Test Plan

tox

Diff Detail

Repository

rDCIDX Metadata indexer

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

Event Timeline

vlorentz created this revision.Nov 15 2018, 2:12 PM

Herald added a reviewer: Reviewers. · View Herald TranscriptNov 15 2018, 2:12 PM

vlorentz edited the summary of this revision. (Show Details)Nov 15 2018, 2:12 PM

vlorentz added tasks: T1335: Add an index on origin_intrinsic_metadata.metadata, T1334: Add a method to swh.indexer.storage to search over origin intrinsic metadata.

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/29/ for more details.

Harbormaster completed remote builds in B2395: Diff 2070.Nov 15 2018, 2:15 PM

vlorentz edited the summary of this revision. (Show Details)Nov 15 2018, 2:21 PM

vlorentz edited the summary of this revision. (Show Details)Nov 16 2018, 10:17 AM

That sounds good.

This revision is now accepted and ready to land.Nov 16 2018, 4:16 PM

Using GIN instead of GiST. That seems the most appropriate choice when reading https://www.postgresql.org/docs/9.1/textsearch-indexes.html ; but these part got removed from the doc of pgsql 10: https://www.postgresql.org/docs/10/textsearch-indexes.html

Yeah, GIN is fine for indexes on tsvectors.

Using pg_catalog.simple as dictionary. Since we're dealing with any language and proper names, it seemed best to use a dictionary with no stop word. Though, arguably, most of the data will be English, and stop words usually don't appear in names.

Sounds fine to me as a first step, although lack of stemming may generate some surprising results.

In a second step we could do some statistics on our tsvectors to generate our own list of stopwords

It only supports conjunctions of search terms. I could easily add support for arbitrary levels of nestings and disjunctions/negations. That can be done later if we deem it worth it.

I think that'll be fine for 99.99% of people.

It indexes JSON keys too. It is probably possible to fix this, at the expanse of complicated SQL code, or some postprocessing in Python.

It might make sense to materialize the tsvector on each row. This would be especially useful if we want to do some fancy stuff (like prioritize project names over project descriptions over the rest of the metadata) and adjust it on the fly.

It will also be critical if you want to do ranking using https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING, which you will need when you start limiting results to avoid DoSing ;).

The current approach will DoS if you use a broad query term. let's fix that before it lands?

This revision now requires changes to proceed.Nov 16 2018, 4:18 PM

In D658#13778, @olasd wrote:

It might make sense to materialize the tsvector on each row. This would be especially useful if we want to do some fancy stuff (like prioritize project names over project descriptions over the rest of the metadata) and adjust it on the fly.

How would materialization help? That's still one big list of words.

It will also be critical if you want to do ranking using https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING, which you will need when you start limiting results to avoid DoSing ;).

Will do!

In D658#13783, @vlorentz wrote:

In D658#13778, @olasd wrote:

It might make sense to materialize the tsvector on each row. This would be especially useful if we want to do some fancy stuff (like prioritize project names over project descriptions over the rest of the metadata) and adjust it on the fly.

How would materialization help? That's still one big list of words.

Well, depending on what the queries do, it's "one big list of words" computed once for the lifetime of the origin_metadata entry, vs. "one big list of words" you compute every time you call the to_tsvector function :)

If you use ts_rank, for instance, you'll have to re-make the tsvector for each row that will match your query.

D561 shows an approach we could use to keep a tsvector column updated on a given table.

Add 'limit' argument to origin_intrinsic_metadata_search_fulltext, and sort results.

Build has FAILED

Link to build: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/46/
See console output for more information: https://jenkins.softwareheritage.org/job/DCIDX/job/tox/46/console

Harbormaster failed remote builds in B2433: Diff 2107!Nov 16 2018, 5:11 PM

In D658#13793, @olasd wrote:

Well, depending on what the queries do, it's "one big list of words" computed once for the lifetime of the origin_metadata entry, vs. "one big list of words" you compute every time you call the to_tsvector function :)

If you use ts_rank, for instance, you'll have to re-make the tsvector for each row that will match your query.

Got it. I thought you meant materializing would help prioritizing.

Materialize origin_intrinsic_metadata.metadata_tsvector.

Harbormaster completed remote builds in B2449: Diff 2127.Nov 19 2018, 2:07 PM

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/53/ for more details.

Update DB migration.

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/54/ for more details.

Harbormaster completed remote builds in B2450: Diff 2128.Nov 19 2018, 2:10 PM

olasd requested changes to this revision.Nov 22 2018, 2:38 PM

olasd added inline comments.

swh/indexer/storage/__init__.py
662	I'd call this function directly from swh_origin_intrinsic_metadata_add, instead of doing a python round trip. It took me a few reads to notice the addition of this line.
swh/indexer/storage/db.py
355–363	You should do a `JOIN LATERAL (select plainto_tsquery('')...) AS s(q) ON true` and use `q` instead of passing the tsquery mess twice. (In the current query I wouldn't be surprised if postgres did the function call for each row). The `ts_rank` should use the tsvector column.

This revision now requires changes to proceed.Nov 22 2018, 2:38 PM

Call swh_origin_intrinsic_metadata_compute_tsvector directly from swh_origin_intrinsic_metadata_add.
Don't call plainto_tsquery twice.
Use metadata_tsvector to compute the rank.

vlorentz marked an inline comment as done.Nov 22 2018, 2:59 PM

vlorentz added inline comments.

swh/indexer/storage/db.py
355–363	Thanks!

vlorentz marked 2 inline comments as done.Nov 22 2018, 2:59 PM

Build is green
See https://jenkins.softwareheritage.org/job/DCIDX/job/tox/59/ for more details.

Harbormaster completed remote builds in B2532: Diff 2193.Nov 22 2018, 3:01 PM

olasd accepted this revision.Nov 22 2018, 9:01 PM

This revision is now accepted and ready to land.Nov 22 2018, 9:01 PM

Closed by commit rDCIDXf7e2308fafce: Add fulltext search on origin intrinsic metadata. (authored by vlorentz). · Explain WhyNov 23 2018, 11:06 AM

This revision was automatically updated to reflect the committed changes.

Add fulltext search on origin intrinsic metadata.
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 2212

sql/upgrades/117.sql

swh/indexer/sql/30-swh-schema.sql

swh/indexer/sql/40-swh-func.sql

swh/indexer/sql/60-swh-indexes.sql

swh/indexer/storage/init.py

swh/indexer/storage/db.py

swh/indexer/tests/storage/test_storage.py

Add fulltext search on origin intrinsic metadata.ClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 2212

sql/upgrades/117.sql

swh/indexer/sql/30-swh-schema.sql

swh/indexer/sql/40-swh-func.sql

swh/indexer/sql/60-swh-indexes.sql

swh/indexer/storage/__init__.py

swh/indexer/storage/db.py

swh/indexer/tests/storage/test_storage.py

Add fulltext search on origin intrinsic metadata.
ClosedPublic
Actions

Revision Contents
Changeset List

swh/indexer/storage/init.py