
Origin search is *slow* when you look for very common words
Closed, Migrated

Description

Searching for very common words (linux, git, github, ...) slows the origin search down to a crawl.

This is because we sort results by origin id, which nullifies the advantage of the trigram index whenever the search matches a lot of rows.

We should try to do something a little smarter.
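
For context, a sketch of the setup and of the problematic query shape (the index definition is an assumption, inferred from the index name that shows up in the EXPLAIN plans quoted further down):

-- Assumed trigram index backing the origin URL search (name taken from
-- the EXPLAIN output below):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX origin_url_idx ON origin USING gin (url gin_trgm_ops);

-- The slow pattern: the index finds matches quickly, but the ORDER BY
-- forces fetching and sorting *all* of them before LIMIT can apply.
SELECT id, type, url, lister, project
FROM origin
WHERE url ~* 'linux'
ORDER BY id
OFFSET 0 LIMIT 200;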


Event Timeline

olasd triaged this task as High priority. Jun 26 2018, 11:30 AM
olasd created this task.

(pinging this issue, because it's 2018, and it really looks bad that we're apparently not capable of quickly returning results in our main search :-))

How about just *not* sorting by origin ID then? It's an arbitrary criterion, based on an arbitrary value, that is not guaranteed to put better results higher in the list.
Quality of results is improved by drilling down with more keywords, which users will do naturally when they don't find what they are looking for. (And even if that weren't the case, returning results faster would be a reasonable trade-off anyway.)

How about just *not* sorting by origin ID then?

It already is [1].


When checking this, @anlambert and I saw that the EXPLAIN is OK:
it uses the right index (the GIN one, origin_url_idx).
But the actual execution lags [2]:

softwareheritage=> explain SELECT id,type,url,lister,project FROM origin WHERE url ~* 'googlecode' ORDER BY id OFFSET 0 LIMIT 200;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit  (cost=9181.22..9181.72 rows=200 width=86)
   ->  Sort  (cost=9181.22..9202.42 rows=8480 width=86)
         Sort Key: id
         ->  Bitmap Heap Scan on origin  (cost=267.72..8814.72 rows=8480 width=86)
               Recheck Cond: (url ~* 'googlecode'::text)
               ->  Bitmap Index Scan on origin_url_idx  (cost=0.00..265.60 rows=8480 width=0)
                     Index Cond: (url ~* 'googlecode'::text)
(7 rows)

That index is not always the one used, and when it isn't, the query is fast:

softwareheritage=> explain SELECT id,type,url,lister,project FROM origin WHERE url ~* 'github' ORDER BY id OFFSET 0 LIMIT 200;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit  (cost=0.57..6.77 rows=200 width=86)
   ->  Index Scan using origin_pkey on origin  (cost=0.57..2627792.13 rows=84787381 width=86)
         Filter: (url ~* 'github'::text)
(3 rows)

Things I can think of to improve:

  • update the db statistics (a sketch below)
  • transform the db.py function call into a server-side query (stored function)
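
For the first item, a minimal sketch (plain ANALYZE; bumping the per-column statistics target is an extra option, not something tried here):

-- Refresh the planner's statistics for the origin table, so row-count
-- estimates (and therefore index choice) are up to date:
ANALYZE origin;

-- Optionally, collect finer-grained statistics on the url column first:
ALTER TABLE origin ALTER COLUMN url SET STATISTICS 1000;
ANALYZE origin;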

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/db.py$865-903

[2] I saw the same behavior on dbreplica0, and it's even slower there (the azure replica vm is half the size of somerset). The query is sometimes even slow to the point that it gets cancelled by the replication mechanism...

How about just *not* sorting by origin ID then?

Nice, I just missed the *not*, so my initial answer is irrelevant to your question
(the information provided is not wrong, though ;).

Now, to answer your question: you just cannot remove the ORDER BY.
It's there for pagination consistency.
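
To illustrate (a hypothetical two-page walk; without a deterministic sort key, OFFSET-based pages can skip or repeat rows between requests):

-- page 1
SELECT id, url FROM origin WHERE url ~* 'git' ORDER BY id OFFSET 0 LIMIT 200;
-- page 2: the stable ORDER BY is what guarantees this picks up exactly
-- where page 1 left off
SELECT id, url FROM origin WHERE url ~* 'git' ORDER BY id OFFSET 200 LIMIT 200;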

transform the db.py function call into a server-side query (stored function)

I tried that and saw no apparent improvement, so no:

-- Server-side variant of the origin search: builds the query dynamically
-- so that the visit filter and the regexp/ILIKE match can be toggled.
create or replace function swh_origin_search_regexp(
    url_pattern text, _offset integer, _limit integer,
    regexp boolean, with_visit boolean)
    returns setof origin
    language plpgsql
as $$
declare
    q text;
begin
    q := 'select id, type, url, lister, project from origin o where';
    if with_visit is true then
        -- only keep origins that have been visited at least once
        q := q || ' exists (select 1 from origin_visit where origin=o.id) and';
    end if;
    if regexp is true then
        q := q || ' url ~* ' || quote_literal(url_pattern);
    else
        q := q || ' url ILIKE ' || quote_literal(url_pattern);
    end if;
    -- same ORDER BY/OFFSET/LIMIT pagination as the client-side query
    q := q || ' order by o.id offset ' || quote_literal(_offset) || ' limit ' || quote_literal(_limit);

    return query execute q;
end
$$;

Worse, the EXPLAIN is no longer informative (nor does it reflect reality; the query is still slow when run).

11:24:19 *softwareheritage@somerset:5433=> explain select * from swh_origin_search_regexp('gitlab', 0, 200, true, true);
┌───────────────────────────────────────────────────────────────────────────────────┐
│                                    QUERY PLAN                                     │
├───────────────────────────────────────────────────────────────────────────────────┤
│ Function Scan on swh_origin_search_regexp  (cost=0.25..10.25 rows=1000 width=104) │
└───────────────────────────────────────────────────────────────────────────────────┘
(1 row)

The main issue with the current query is that the id sort happens after all the indexed results have been fetched, even though we only present a few of them to the user. When searching for gitbla, the bitmap index scan will find all entries containing the trigram git (so, almost everything), recheck them for an exact match, and only then sort the results.
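
pg_trgm's own debugging helper makes this concrete (show_trgm ships with the extension):

-- show_trgm() lists the trigrams extracted from a search term; for
-- 'gitbla' these include the extremely common 'git' trigram:
SELECT show_trgm('gitbla');
-- yields trigrams such as: "  g", " gi", git, itb, tbl, bla, "la "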

From my understanding, there are two sides to this "fast search" coin:

  • Do we need accurate, infinite pagination when someone looks for an overly broad term such as git?
  • How do we enhance the relevance of our search results so that people don't need the pagination?

The answer to the first question is very likely "no": we don't care about allowing infinite scrolling in general, considering the randomness of the current results. Assuming this, we can enhance the user experience by:

  • deciding on the maximum number of results we want to present, N
  • fetching N+1 results from the backend and stopping there
  • if we fetched exactly N+1 results, adding a message to the page saying "lots of results, only presenting N, you may need to use more precise search terms"
  • paginating the first N results

Dropping server-side pagination means the id sort is no longer needed, so search results can be returned immediately.
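
A sketch of the resulting backend query (N = 1000 is an arbitrary assumption):

-- No ORDER BY: rows can be streamed straight off the index scan, and the
-- scan stops as soon as N+1 = 1001 rows have been produced.
SELECT id, type, url, lister, project
FROM origin
WHERE url ~* 'git'
LIMIT 1001;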

Now the more interesting question is, how do we improve the relevance of the results we give people?

I have started toying with the PostgreSQL full text search functionality, which allows for more flexible indexes than just using raw trigrams. It also allows similarity search ("did you mean?") and scoring of results.

The main issue with using PostgreSQL's full text search engine directly is that the way documents are tokenized and parsed is hard-coded (well, it's extensible, but only via a C extension), and the handling of URLs is pretty limited: basically, you get a token for the protocol, a token for the domain name, a token for the URL path, and a token for the full URL.
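
This is easy to see with the built-in ts_debug helper:

-- The default parser lexes a URL into a handful of coarse tokens only:
SELECT alias, token FROM ts_debug('https://github.com/git/git');
-- roughly: protocol 'https://', url 'github.com/git/git',
--          host 'github.com', url_path '/git/git'
-- There is no token for the individual path components, so 'git' on its
-- own never becomes a searchable lexeme.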

If we want more precise lexing, for instance one token per URL component (what's between slashes), or even several tokens for each word inside a component, we need to do that ourselves.

My first approach was very crude, but kinda worked: replacing all non-alphanumeric characters with spaces :-) With some subtle application of Postgres full text search prioritization, the results weren't too bad (e.g. looking for git would put https://github.com/git/git very close to the top).
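
A minimal sketch of that crude approach (the exact regexp, the 'simple' configuration and the ranking call are assumptions, not the code actually used):

-- Squash the URL down to space-separated words, feed it to the full text
-- machinery, and rank matches by relevance (in practice the tsvector
-- would be precomputed and GIN-indexed rather than built per row):
SELECT url,
       ts_rank(
         to_tsvector('simple', regexp_replace(url, '[^[:alnum:]]+', ' ', 'g')),
         plainto_tsquery('simple', 'git')
       ) AS rank
FROM origin
ORDER BY rank DESC
LIMIT 20;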

If that sounds sensible I can keep working towards refining this approach.

Thanks for your analysis.

In T1117#22299, @olasd wrote:
  • Do we need accurate, infinite pagination when someone looks for an overly broad term such as git?
  • How do we enhance the relevance of our search results so that people don't need the pagination?

The answer to the first question is very likely "no": we don't care about allowing infinite scrolling in general, considering the randomness of the current results. Assuming this, we can enhance the user experience by:

  • deciding on the maximum number of results we want to present, N
  • fetching N+1 results from the backend and stopping there
  • if we fetched exactly N+1 results, adding a message to the page saying "lots of results, only presenting N, you may need to use more precise search terms"
  • paginating the first N results

Dropping server-side pagination means the id sort is no longer needed, so search results can be returned immediately.

Let's go for this.

My rationale: there is a bunch of HCI research supporting the fact that the vast majority of search engine users will *never* bother going past the first page of results. If the answer they're looking for is not there, they will either refine their search (by tweaking keywords) or simply give up.

Now the more interesting question is, how do we improve the relevance of the results we give people?

I have started toying with the PostgreSQL full text search functionality, which allows for more flexible indexes than just using raw trigrams. It also allows similarity search ("did you mean?") and scoring of results.

I looked at that a long time ago, and it was indeed cool. We should just make sure that the swh.storage API for this does not depend on postgres-specific stuff, but that should be easy here, given that a simple "full-text search" is implemented by a bunch of backends these days.

My first approach was very crude, but kinda worked: replacing all non-alphanumeric characters with spaces :-)

Eh, nice. I didn't even find it to be a horrible hack, but that's a very low bar :-)

olasd removed olasd as the assignee of this task. Sep 22 2020, 4:47 PM
olasd added a subscriber: vlorentz.

This is very probably superseded by @vlorentz's work on swh.search.

Closing this as resolved, now that the search feature uses Elasticsearch in production.