
Add fulltext search on origin intrinsic metadata.
ClosedPublic

Authored by vlorentz on Nov 15 2018, 2:12 PM.

Details

Summary

I made several choices while writing this Diff that are open to discussion (a sketch illustrating them follows the list):

  • Using GIN instead of GiST. That seems the most appropriate choice when reading https://www.postgresql.org/docs/9.1/textsearch-indexes.html; but this section was removed from the PostgreSQL 10 documentation: https://www.postgresql.org/docs/10/textsearch-indexes.html
  • Using pg_catalog.simple as dictionary. Since we're dealing with any language and proper names, it seemed best to use a dictionary with no stop words. Though, arguably, most of the data will be English, and stop words usually don't appear in names.
  • It only supports conjunctions of search terms. I could easily add support for arbitrary levels of nesting and disjunctions/negations. That can be done later if we deem it worth it.
  • It indexes JSON keys too. It is probably possible to fix this, at the expense of complicated SQL code, or some postprocessing in Python.
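
To make the discussion concrete, here is a minimal sketch of the approach described above; the table and column names are hypothetical, not the actual swh-indexer schema:

    -- Hypothetical table standing in for the origin intrinsic metadata storage.
    CREATE TABLE origin_intrinsic_metadata_demo (
        id       bigserial PRIMARY KEY,
        metadata jsonb NOT NULL
    );

    -- GIN index over the tsvector derived from the JSONB blob, using the
    -- 'simple' dictionary (no stop words, no stemming). Casting the whole
    -- jsonb value to text is what makes the JSON keys get indexed too.
    CREATE INDEX origin_intrinsic_metadata_demo_fulltext_idx
        ON origin_intrinsic_metadata_demo
        USING gin (to_tsvector('pg_catalog.simple', metadata::text));

    -- plainto_tsquery ANDs all the terms together, which matches the
    -- "conjunctions of search terms only" behaviour.
    SELECT id, metadata
    FROM origin_intrinsic_metadata_demo
    WHERE to_tsvector('pg_catalog.simple', metadata::text)
          @@ plainto_tsquery('pg_catalog.simple', 'foo bar');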

Resolves T1334 and T1335.

Test Plan

tox

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
metadata-search
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2395
Build 2925: tox-on-jenkins (Jenkins)
Build 2924: arc lint + arc unit

Event Timeline

ardumont added a subscriber: ardumont.

That sounds good.

This revision is now accepted and ready to land. Nov 16 2018, 4:16 PM

Yeah, GIN is fine for indexes on tsvectors.

  • Using pg_catalog.simple as dictionary. Since we're dealing with any language and proper names, it seemed best to use a dictionary with no stop words. Though, arguably, most of the data will be English, and stop words usually don't appear in names.

Sounds fine to me as a first step, although lack of stemming may generate some surprising results.

In a second step, we could do some statistics on our tsvectors to generate our own list of stop words.
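
For illustration, such statistics could be gathered with ts_stat, which aggregates lexeme frequencies over a set of tsvectors; the most frequent lexemes are natural stop-word candidates. The table name below is the hypothetical one from the summary sketch:

    -- Most frequent lexemes across all metadata entries: candidates for a
    -- custom stop-word list. ndoc = documents containing the lexeme,
    -- nentry = total number of occurrences.
    SELECT word, ndoc, nentry
    FROM ts_stat($$
        SELECT to_tsvector('pg_catalog.simple', metadata::text)
        FROM origin_intrinsic_metadata_demo
    $$)
    ORDER BY ndoc DESC, nentry DESC
    LIMIT 50;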

  • It only supports conjunctions of search terms. I could easily add support for arbitrary levels of nesting and disjunctions/negations. That can be done later if we deem it worth it.

I think that'll be fine for 99.99% of people.

  • It indexes JSON keys too. It is probably possible to fix this, at the expense of complicated SQL code, or some postprocessing in Python.

It might make sense to materialize the tsvector on each row. This would be especially useful if we want to do some fancy stuff (like prioritizing project names over project descriptions over the rest of the metadata) and adjust it on the fly.

It will also be critical if you want to do ranking using https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING, which you will need when you start limiting results to avoid DoSing ;).
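
For instance, one way this could look, assuming the hypothetical table from the summary sketch; the JSON field names 'name' and 'description' are assumptions about the metadata shape, not the actual schema:

    -- Materialize a weighted tsvector: project name weighted highest,
    -- description next, everything else lowest.
    ALTER TABLE origin_intrinsic_metadata_demo
        ADD COLUMN metadata_tsvector tsvector;

    UPDATE origin_intrinsic_metadata_demo
    SET metadata_tsvector =
          setweight(to_tsvector('pg_catalog.simple',
                                coalesce(metadata->>'name', '')), 'A')
       || setweight(to_tsvector('pg_catalog.simple',
                                coalesce(metadata->>'description', '')), 'B')
       || setweight(to_tsvector('pg_catalog.simple', metadata::text), 'D');

    -- A GIN index on the materialized column keeps searches fast.
    CREATE INDEX origin_intrinsic_metadata_demo_tsvector_idx
        ON origin_intrinsic_metadata_demo
        USING gin (metadata_tsvector);

    -- Ranked and limited query, so a broad term no longer returns everything.
    SELECT id,
           ts_rank(metadata_tsvector,
                   plainto_tsquery('pg_catalog.simple', 'foo bar')) AS rank
    FROM origin_intrinsic_metadata_demo
    WHERE metadata_tsvector @@ plainto_tsquery('pg_catalog.simple', 'foo bar')
    ORDER BY rank DESC
    LIMIT 100;

Note that this query spells the tsquery out twice; the inline review comment further down suggests a LATERAL join to avoid exactly that.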

olasd requested changes to this revision. Nov 16 2018, 4:18 PM

The current approach will DoS if you use a broad query term. Let's fix that before it lands?

This revision now requires changes to proceed. Nov 16 2018, 4:18 PM
In D658#13778, @olasd wrote:

It might make sense to materialize the tsvector on each row. This would be especially useful if we want to do some fancy stuff (like prioritizing project names over project descriptions over the rest of the metadata) and adjust it on the fly.

How would materialization help? That's still one big list of words.

It will also be critical if you want to do ranking using https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING, which you will need when you start limiting results to avoid DoSing ;).

Will do!

In D658#13778, @olasd wrote:

It might make sense to materialize the tsvector on each row. This would be especially useful if we want to do some fancy stuff (like prioritizing project names over project descriptions over the rest of the metadata) and adjust it on the fly.

How would materialization help? That's still one big list of words.

Well, depending on what the queries do, it's "one big list of words" computed once for the lifetime of the origin_metadata entry, vs. "one big list of words" you compute every time you call the to_tsvector function :)

If you use ts_rank, for instance, you'll have to recompute the tsvector for each row that matches your query.

D561 shows an approach we could use to keep a tsvector column updated on a given table.
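
For reference, one common pattern for this, not necessarily the exact approach in D561, is a trigger that recomputes the column on every write; names below are illustrative:

    -- Recompute the materialized tsvector whenever the metadata changes.
    CREATE OR REPLACE FUNCTION demo_metadata_tsvector_update() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        NEW.metadata_tsvector :=
            to_tsvector('pg_catalog.simple', NEW.metadata::text);
        RETURN NEW;
    END
    $$;

    CREATE TRIGGER demo_metadata_tsvector_trigger
        BEFORE INSERT OR UPDATE ON origin_intrinsic_metadata_demo
        FOR EACH ROW EXECUTE PROCEDURE demo_metadata_tsvector_update();

(The built-in tsvector_update_trigger only works on text columns, so a custom function is needed here for the jsonb cast.)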

  • Add 'limit' argument to origin_intrinsic_metadata_search_fulltext, and sort results.

In D658#13793, @olasd wrote:

Well, depending on what the queries do, it's "one big list of words" computed once for the lifetime of the origin_metadata entry, vs. "one big list of words" you compute every time you call the to_tsvector function :)

If you use ts_rank, for instance, you'll have to recompute the tsvector for each row that matches your query.

Got it. I thought you meant materializing would help prioritizing.

  • Materialize origin_intrinsic_metadata.metadata_tsvector.
olasd requested changes to this revision. Nov 22 2018, 2:38 PM
olasd added inline comments.
swh/indexer/storage/__init__.py
628

I'd call this function directly from swh_origin_intrinsic_metadata_add, instead of doing a Python round trip. It took me a few reads to notice the addition of this line.

swh/indexer/storage/db.py
345–353

You should do a JOIN LATERAL (select plainto_tsquery('')...) AS s(q) ON true and use q instead of passing the tsquery mess twice. (In the current query, I wouldn't be surprised if Postgres did the function call for each row.)

The ts_rank should use the tsvector column.
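
A sketch of the suggested rewrite, using the hypothetical table from earlier; 'foo bar' stands in for the parameterized search string:

    -- The tsquery is computed once in the lateral subquery and referenced
    -- as s.q everywhere else, instead of being spelled out per use.
    SELECT id, ts_rank(metadata_tsvector, s.q) AS rank
    FROM origin_intrinsic_metadata_demo
    JOIN LATERAL (
        SELECT plainto_tsquery('pg_catalog.simple', 'foo bar')
    ) AS s(q) ON true
    WHERE metadata_tsvector @@ s.q
    ORDER BY rank DESC
    LIMIT 100;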

This revision now requires changes to proceed. Nov 22 2018, 2:38 PM
  • Call swh_origin_intrinsic_metadata_compute_tsvector directly from swh_origin_intrinsic_metadata_add.
  • Don't call plainto_tsquery twice.
  • Use metadata_tsvector to compute the rank.
vlorentz added inline comments.
swh/indexer/storage/db.py
345–353

Thanks!

This revision is now accepted and ready to land. Nov 22 2018, 9:01 PM
This revision was automatically updated to reflect the committed changes.