Polish the swh-search QL
Closed, Migrated. Edits Locked.

Description

Make sure it's:

  • consistent
  • future-proof (we should avoid changing it after users start relying on it)
  • user-friendly
  • well-documented

Event Timeline

vlorentz triaged this task as Normal priority. Sep 6 2021, 10:37 AM
vlorentz created this task.
vlorentz updated the task description.
vlorentz added a subtask: Restricted Maniphest Task. Feb 16 2022, 9:54 AM

Hey @vlorentz @zack, I've been using sourcegraph.com for almost a year now and I feel that they have worked a lot on polishing their search query language. I think we can learn from them and adapt our language. Here are a few suggestions:

  • Instead of making the origin and metadata keywords mandatory, we can let users type bare terms without naming a field, and search those terms in both the origin (higher score) and metadata fields. This will allow users to write shorter, more effective queries:
    • django last_visit > 2022 instead of origin:django and last_visit > 2022
    • progval instead of metadata:progval
  • Make it faster to write array filters like language and license
    • language: python|go instead of language in [python, go]
  • It should be possible to negate any filter with -
    • -origin:XYZ should exclude origins containing the term XYZ (exact opposite of origin:XYZ)
  • Provide aliases for writing queries faster
    • o:xyz should be equivalent to origin:xyz
    • m:abc should be equivalent to metadata:abc
    • lang:python or l:python should be equivalent to language:python
  • Assume "and" between filters when no operator is provided.
    • origin:X metadata:Y instead of origin: X and metadata: Y

They are based on the following assumptions:

  • Search queries should be small and hence fast to type.
  • Search query languages should intelligently pick up the most common intention of the user while still allowing overriding the default behavior.

Thanks for investigating this and making a list of actionable suggestions!
Here is a case-by-case commentary below:

  • Instead of making the origin and metadata keywords mandatory, we can let users type bare terms without naming a field, and search those terms in both the origin (higher score) and metadata fields. This will allow users to write shorter, more effective queries:
    • django last_visit > 2022 instead of origin:django and last_visit > 2022
    • progval instead of metadata:progval

This one gives me pause, but only because we need to make sure it's not semantically ambiguous. Let's see if I'm getting it right:

  • if there are no qualifiers ("o:", "m:"), we search by default in both origin and metadata, and rank the results
  • if there are qualifiers we only search in the associated data

Correct?

If so, I'm fine with this, but we need to check how much worse performance gets.

Also, I'm not so sure the ranking criteria should be "origin hits win", maybe there's something smarter to be used there...
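To make the two rules above concrete, here is a minimal sketch of how a single query token could be routed to fields. Everything in it is an assumption for illustration: the `plan_term` helper, the qualifier table, and especially the boost values, since the ranking criteria are still an open question. It is not how swh-search resolves fields today.

```python
from typing import Dict, List, Tuple

# Hypothetical boosts: "origin hits win" is only one possible ranking choice.
DEFAULT_FIELD_BOOSTS: Dict[str, float] = {"origin": 2.0, "metadata": 1.0}

# Hypothetical qualifier table, including the short aliases from the proposal.
QUALIFIERS: Dict[str, str] = {
    "origin": "origin",
    "o": "origin",
    "metadata": "metadata",
    "m": "metadata",
}


def plan_term(token: str) -> List[Tuple[str, str, float]]:
    """Return (field, term, boost) triples describing where to search a token."""
    prefix, sep, rest = token.partition(":")
    if sep and rest and prefix in QUALIFIERS:
        # Qualified term: search only the associated field.
        return [(QUALIFIERS[prefix], rest, 1.0)]
    # Unqualified term: search every default field and let boosts drive ranking.
    return [(field, token, boost) for field, boost in DEFAULT_FIELD_BOOSTS.items()]
```

With this sketch, `plan_term("django")` targets both fields while `plan_term("m:progval")` targets metadata only, which is also where the performance cost of unqualified terms would show up.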

  • Make it faster to write array filters like language and license
    • language: python|go instead of language in [python, go]

LGTM
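One low-risk way to support the pipe shorthand would be to desugar it into the existing list syntax before parsing. A minimal sketch, assuming only the language and license filters take lists and that values contain no whitespace (the `expand_pipe_lists` name is hypothetical):

```python
import re

# Matches e.g. "language: python|go"; requires at least one "|" so that
# single-value filters are left untouched.
_PIPE_FILTER = re.compile(
    r"\b(language|license)\s*:\s*([\w.+-]+(?:\|[\w.+-]+)+)"
)


def expand_pipe_lists(query: str) -> str:
    """Rewrite 'field: a|b' into 'field in [a, b]'."""

    def repl(match: "re.Match[str]") -> str:
        values = match.group(2).split("|")
        return f"{match.group(1)} in [{', '.join(values)}]"

    return _PIPE_FILTER.sub(repl, query)
```

Desugaring keeps the grammar change small: the long form stays the canonical one and the shorthand is just input sugar.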

  • It should be possible to negate any filter with -
    • -origin:XYZ should exclude origins containing the term XYZ (exact opposite of origin:XYZ)

LGTM
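For negation to be the exact opposite of the positive filter, the negated form has to reject precisely the documents the positive form accepts. A small sketch of that property, with a hypothetical `Filter` type and a simplified case-insensitive substring match standing in for the real search backend:

```python
from typing import Mapping, NamedTuple


class Filter(NamedTuple):
    field: str
    value: str
    negated: bool


def parse_filter(token: str) -> Filter:
    """Parse 'field:value' or '-field:value' (hypothetical syntax)."""
    negated = token.startswith("-")
    body = token[1:] if negated else token
    field, sep, value = body.partition(":")
    if not sep or not field or not value:
        raise ValueError(f"not a field:value filter: {token!r}")
    return Filter(field, value, negated)


def matches(flt: Filter, doc: Mapping[str, str]) -> bool:
    """Simplified matching: substring test, inverted when the filter is negated."""
    hit = flt.value.lower() in str(doc.get(flt.field, "")).lower()
    return hit != flt.negated
```

Note that under this definition `-origin:XYZ` also matches documents with no origin field at all; whether that is desirable for every filter is a design choice to settle.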

  • Provide aliases for writing queries faster
    • o:xyz should be equivalent to origin:xyz
    • m:abc should be equivalent to metadata:abc
    • lang:python or l:python should be equivalent to language:python

OK, but again as long as they're not ambiguous.
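The ambiguity concern can be checked mechanically when the alias table is built. A minimal sketch, assuming aliases live in a plain mapping (the table layout and `build_alias_table` are hypothetical):

```python
from typing import Dict, List, Mapping

# Aliases taken from the proposal above.
ALIASES: Dict[str, List[str]] = {
    "origin": ["o"],
    "metadata": ["m"],
    "language": ["lang", "l"],
}


def build_alias_table(aliases: Mapping[str, List[str]]) -> Dict[str, str]:
    """Map every accepted name to its canonical field, rejecting collisions."""
    table: Dict[str, str] = {}
    for canonical, shorts in aliases.items():
        for name in [canonical, *shorts]:
            if name in table and table[name] != canonical:
                raise ValueError(
                    f"ambiguous alias {name!r}: {table[name]} vs {canonical}"
                )
            table[name] = canonical
    return table
```

For example, giving license the alias `l` as well would raise here, because `l` is already claimed by language. Single-letter aliases are the most likely to collide as new fields are added, which matters for the future-proofing goal of this task.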

  • Assume "and" between filters when no operator is provided.
    • origin:X metadata:Y instead of origin: X and metadata: Y

Hell yes!
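Implicit conjunction could also be handled as a rewrite step in front of the existing parser. A rough sketch, assuming a simplified token grammar (field:value filters, comparisons, `in [...]` lists, bare terms, and the and/or operators, with no parentheses or quoted strings); `insert_implicit_and` is a hypothetical helper:

```python
import re

# Simplified tokenizer: an explicit operator, a filter, or a bare term.
_TOKEN = re.compile(
    r"(?P<op>\b(?:and|or)\b)"
    r"|(?P<filter>\w+\s+in\s*\[[^\]]*\]|\w+\s*(?::|[<>]=?|=)\s*\S+)"
    r"|(?P<bare>\S+)",
    re.IGNORECASE,
)


def insert_implicit_and(query: str) -> str:
    """Insert 'and' between adjacent terms that have no operator between them."""
    out = []
    prev_was_term = False
    for match in _TOKEN.finditer(query):
        is_op = match.lastgroup == "op"
        if not is_op and prev_was_term:
            out.append("and")
        out.append(match.group(0))
        prev_was_term = not is_op
    return " ".join(out)
```

Queries that already spell out their operators pass through unchanged, so the long form keeps working.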

What about having a UI like the ones in GitHub or Phabricator to create an advanced query?
e.g.:
https://github.com/search/advanced

We can continue to support the query language, and the QL can be generated using the UI. This will help us to support saved searches and bookmarks.
We could have more search contexts (more than origin and metadata) in the future as we index more data.
It will be too hard to support different contexts with varying inputs using a QL alone. It can be done in a UI with some moving elements.
What do you think?
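To illustrate the UI-generates-QL idea, here is a minimal sketch that turns submitted form values into a query string in the current long-form syntax. The form shape and the `build_query` helper are hypothetical, and values containing whitespace would need quoting rules that this sketch does not cover:

```python
from typing import List, Mapping, Sequence, Union

FormValue = Union[str, Sequence[str]]


def build_query(form: Mapping[str, FormValue]) -> str:
    """Build a QL string from form fields, skipping empty inputs."""
    clauses: List[str] = []
    for field, value in form.items():
        if isinstance(value, str):
            if value:
                clauses.append(f"{field}:{value}")
        elif value:
            # Multi-select inputs become list filters.
            clauses.append(f"{field} in [{', '.join(value)}]")
    return " and ".join(clauses)
```

Because the output is a plain query string, it can double as the stored form of a saved search or bookmark.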

gitlab-migration changed the status of subtask Restricted Maniphest Task from Resolved to Migrated.