
Origin search is *slow* when you look for very common words
Closed, Migrated

Description

Searching for very common words (linux, git, github, ...) slows the origin search down to a crawl.

This is because we sort results by origin id, which nullifies the advantage of the trigram index whenever the search matches a lot of rows.

We should try to do something a little smarter.
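
For context, a sketch of the setup and of the problematic query shape (the index definition is an assumption, inferred from the index name that shows up in the EXPLAIN plans quoted further down):

-- Assumed trigram index backing the origin URL search (name taken from
-- the EXPLAIN output below):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX origin_url_idx ON origin USING gin (url gin_trgm_ops);

-- The slow pattern: the index finds matches quickly, but the ORDER BY
-- forces fetching and sorting *all* of them before LIMIT can apply.
SELECT id, type, url, lister, project
FROM origin
WHERE url ~* 'linux'
ORDER BY id
OFFSET 0 LIMIT 200;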


Event Timeline

olasd triaged this task as High priority. Jun 26 2018, 11:30 AM
olasd created this task.

(pinging this issue, because it's 2018, and it really looks bad that we're apparently not capable of quickly returning results in our main search :-))

How about just *not* sorting by origin ID then? It's an arbitrary criterion, based on an arbitrary value, that is not guaranteed to put better results higher in the list.
Quality of results is improved by drilling down with more keywords, which users will do naturally when they don't find what they are looking for. (And even if that weren't the case, returning results faster would be a reasonable trade-off anyway.)

How about just *not* sorting by origin ID then?

It already is [1].


When checking this, @anlambert and I saw that the EXPLAIN is OK:
it uses the right index (the GIN one, origin_url_idx).
But the actual execution lags [2]:

softwareheritage=> explain SELECT id,type,url,lister,project FROM origin WHERE url ~* 'googlecode' ORDER BY id OFFSET 0 LIMIT 200;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit  (cost=9181.22..9181.72 rows=200 width=86)
   ->  Sort  (cost=9181.22..9202.42 rows=8480 width=86)
         Sort Key: id
         ->  Bitmap Heap Scan on origin  (cost=267.72..8814.72 rows=8480 width=86)
               Recheck Cond: (url ~* 'googlecode'::text)
               ->  Bitmap Index Scan on origin_url_idx  (cost=0.00..265.60 rows=8480 width=0)
                     Index Cond: (url ~* 'googlecode'::text)
(7 rows)

That index is not always the one used, and when it isn't, the query is fast:

softwareheritage=> explain SELECT id,type,url,lister,project FROM origin WHERE url ~* 'github' ORDER BY id OFFSET 0 LIMIT 200;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit  (cost=0.57..6.77 rows=200 width=86)
   ->  Index Scan using origin_pkey on origin  (cost=0.57..2627792.13 rows=84787381 width=86)
         Filter: (url ~* 'github'::text)
(3 rows)

Things I can think of to improve:

  • update the db statistics (a sketch below)
  • transform the db.py function call into a server-side query (stored function)
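
For the first item, a minimal sketch (plain ANALYZE; bumping the per-column statistics target is an extra option, not something tried here):

-- Refresh the planner's statistics for the origin table, so row-count
-- estimates (and therefore index choice) are up to date:
ANALYZE origin;

-- Optionally, collect finer-grained statistics on the url column first:
ALTER TABLE origin ALTER COLUMN url SET STATISTICS 1000;
ANALYZE origin;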

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/db.py$865-903

[2] I saw the same behavior on dbreplica0, and it's even slower there (the azure replica vm is half the size of somerset). The query is sometimes even slow to the point that it gets cancelled by the replication mechanism...

How about just *not* sorting by origin ID then?

Nice, I just missed the *not*, so my initial answer is irrelevant to your question
(the information provided is not wrong, though ;).

Now, to answer your question: you just cannot remove the ORDER BY.
It's there for pagination consistency.
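
To illustrate (a hypothetical two-page walk; without a deterministic sort key, OFFSET-based pages can skip or repeat rows between requests):

-- page 1
SELECT id, url FROM origin WHERE url ~* 'git' ORDER BY id OFFSET 0 LIMIT 200;
-- page 2: the stable ORDER BY is what guarantees this picks up exactly
-- where page 1 left off
SELECT id, url FROM origin WHERE url ~* 'git' ORDER BY id OFFSET 200 LIMIT 200;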

transform the db.py function call into a server-side query (stored function)

I tried that and saw no apparent improvement, so no:

-- Server-side variant of the origin search: builds the query dynamically
-- so that the visit filter and the regexp/ILIKE match can be toggled.
create or replace function swh_origin_search_regexp(
    url_pattern text, _offset integer, _limit integer,
    regexp boolean, with_visit boolean)
    returns setof origin
    language plpgsql
as $$
declare
    q text;
begin
    q := 'select id, type, url, lister, project from origin o where';
    if with_visit is true then
        -- only keep origins that have been visited at least once
        q := q || ' exists (select 1 from origin_visit where origin=o.id) and';
    end if;
    if regexp is true then
        q := q || ' url ~* ' || quote_literal(url_pattern);
    else
        q := q || ' url ILIKE ' || quote_literal(url_pattern);
    end if;
    -- same ORDER BY/OFFSET/LIMIT pagination as the client-side query
    q := q || ' order by o.id offset ' || quote_literal(_offset) || ' limit ' || quote_literal(_limit);

    return query execute q;
end
$$;

Worse, the EXPLAIN is no longer informative (nor does it reflect reality; the query is still slow when run).

11:24:19 *softwareheritage@somerset:5433=> explain select * from swh_origin_search_regexp('gitlab', 0, 200, true, true);
┌───────────────────────────────────────────────────────────────────────────────────┐
│                                    QUERY PLAN                                     │
├───────────────────────────────────────────────────────────────────────────────────┤
│ Function Scan on swh_origin_search_regexp  (cost=0.25..10.25 rows=1000 width=104) │
└───────────────────────────────────────────────────────────────────────────────────┘
(1 row)

The main issue with the current query is that the id sort happens after all the indexed results have been fetched, even though we only present a few of them to the user. When searching for gitbla, the bitmap index scan will find all entries containing the trigram git (so, almost everything), recheck them for an exact match, and only then sort the results.
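
pg_trgm's own debugging helper makes this concrete (show_trgm ships with the extension):

-- show_trgm() lists the trigrams extracted from a search term; for
-- 'gitbla' these include the extremely common 'git' trigram:
SELECT show_trgm('gitbla');
-- yields trigrams such as: "  g", " gi", git, itb, tbl, bla, "la "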

From my understanding, there are two sides to this "fast search" coin:

  • Do we need accurate, infinite pagination when someone looks for an overly broad term such as git?
  • How do we enhance the relevance of our search results so that people don't need the pagination?

The answer to the first question is very likely "no": we don't care about allowing infinite scrolling in general, considering the randomness of the current results. Assuming this, we can enhance the user experience by:

  • deciding on the maximum number of results we want to present, N
  • fetching N+1 results from the backend and stopping there
  • if we fetched exactly N+1 results, adding a message to the page saying "lots of results, only presenting N, you may need to use more precise search terms"
  • paginating the first N results

Dropping server-side pagination means the id sort is no longer needed, so search results can be returned immediately.
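
A sketch of the resulting backend query (N = 1000 is an arbitrary assumption):

-- No ORDER BY: rows can be streamed straight off the index scan, and the
-- scan stops as soon as N+1 = 1001 rows have been produced.
SELECT id, type, url, lister, project
FROM origin
WHERE url ~* 'git'
LIMIT 1001;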

Now the more interesting question is, how do we improve the relevance of the results we give people?

I have started toying with the PostgreSQL full text search functionality, which allows for more flexible indexes than just using raw trigrams. It also allows similarity search ("did you mean?") and scoring of results.

The main issue with using PostgreSQL's full text search engine directly is that the way documents are tokenized and parsed is hard-coded (well, it's extensible, but only via a C extension), and the handling of URLs is pretty limited: basically, you get a token for the protocol, a token for the domain name, a token for the URL path, and a token for the full URL.
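
This is easy to see with the built-in ts_debug helper:

-- The default parser lexes a URL into a handful of coarse tokens only:
SELECT alias, token FROM ts_debug('https://github.com/git/git');
-- roughly: protocol 'https://', url 'github.com/git/git',
--          host 'github.com', url_path '/git/git'
-- There is no token for the individual path components, so 'git' on its
-- own never becomes a searchable lexeme.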

If we want more precise lexing, for instance one token per URL component (what's between slashes), or even several tokens for each word inside a component, we need to do that ourselves.

My first approach was very crude, but kinda worked: replacing all non-alphanumeric characters with spaces :-) With some subtle application of Postgres full text search prioritization, the results weren't too bad (e.g. looking for git would put https://github.com/git/git very close to the top).
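
A minimal sketch of that crude approach (the exact regexp, the 'simple' configuration and the ranking call are assumptions, not the code actually used):

-- Squash the URL down to space-separated words, feed it to the full text
-- machinery, and rank matches by relevance (in practice the tsvector
-- would be precomputed and GIN-indexed rather than built per row):
SELECT url,
       ts_rank(
         to_tsvector('simple', regexp_replace(url, '[^[:alnum:]]+', ' ', 'g')),
         plainto_tsquery('simple', 'git')
       ) AS rank
FROM origin
ORDER BY rank DESC
LIMIT 20;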

If that sounds sensible I can keep working towards refining this approach.

Thanks for your analysis.

In T1117#22299, @olasd wrote:
  • Do we need accurate, infinite pagination when someone looks for an overly broad term such as git?
  • How do we enhance the relevance of our search results so that people don't need the pagination?

The answer to the first question is very likely "no": we don't care about allowing infinite scrolling in general, considering the randomness of the current results. Assuming this, we can enhance the user experience by:

  • deciding on the maximum number of results we want to present, N
  • fetching N+1 results from the backend and stopping there
  • if we fetched exactly N+1 results, adding a message to the page saying "lots of results, only presenting N, you may need to use more precise search terms"
  • paginating the first N results

Dropping server-side pagination means the id sort is no longer needed, so search results can be returned immediately.

Let's go for this.

My rationale: there is a bunch of HCI research supporting the fact that the vast majority of search engine users will *never* bother going past the first page of results. If the answer they're looking for is not there, they will either refine their search (by tweaking keywords) or simply give up.

Now the more interesting question is, how do we improve the relevance of the results we give people?

I have started toying with the PostgreSQL full text search functionality, which allows for more flexible indexes than just using raw trigrams. It also allows similarity search ("did you mean?") and scoring of results.

I looked at that a long time ago, and it was indeed cool. We should just make sure that the swh.storage API for this does not depend on postgres-specific stuff, but that should be easy here, given that a simple "full-text search" is implemented by a bunch of backends these days.

My first approach was very crude, but kinda worked: replacing all non-alphanumeric characters with spaces :-)

Eh, nice. I didn't even find it to be a horrible hack, but that's a very low bar :-)

olasd removed olasd as the assignee of this task. Sep 22 2020, 4:47 PM
olasd added a subscriber: vlorentz.

This is very probably superseded by @vlorentz's work on swh.search.

Closing this as resolved, now that the search feature uses Elasticsearch in production.