docs/query-language: Describe search query language syntax
ClosedPublic

Authored by KShivendu on Jul 17 2021, 9:40 PM.

Details

Summary

Documentation for search query language syntax

Related D5990

Diff Detail

Repository
rDSEA Archive search
Branch
query-language-docs
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 22704
Build 35409: Phabricator diff pipeline on jenkins
Build 35408: arc lint + arc unit

Event Timeline

Build has FAILED

Patch application report for D6005 (id=21680)

Rebasing onto fe7640f710...

Current branch diff-target is up to date.
Changes applied before test
commit 1bc86d5bce31637c876ee6bc71576ac2024122d5
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Jul 18 01:09:12 2021 +0530

    docs/query-language: Describe search query language syntax
    
    Summary: Documentation for search query language syntax
    
    Test Plan:
    
    Reviewers:
    
    Subscribers:

Link to build: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/217/
See console output for more information: https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/217/console

Harbormaster returned this revision to the author for changes because remote builds failed. Jul 17 2021, 9:44 PM
Harbormaster failed remote builds in B22648: Diff 21680!
This comment was removed by KShivendu.
docs/query-language.rst
12

I'll remove this extra "on"

Build is green

Patch application report for D6005 (id=21681)

Rebasing onto fe7640f710...

Current branch diff-target is up to date.
Changes applied before test
commit d4a246fe732fb360711d0155bb63168c56bd3934
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Jul 18 01:09:12 2021 +0530

    docs/query-language: Describe search query language syntax
    
    Documentation for search query language syntax

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/218/ for more details.

Some aspects of the query language that I'm still thinking about:

  1. Renaming some of the fields in the query language.
    • date_{modified,published,created} should be renamed to {modified,published,created}
    • last_{visit,eventful_visit,revision,release}_date should be renamed to last_{visit,eventful_visit,revision,release}
    • These changes make queries shorter, but at the same time I feel that autocomplete would make it easier to find/identify date fields if we keep "date" in the filter names.
    • Also, "programming_languages" can be changed to just "languages".
  2. Allowing OR between filters
    • Before: No option for OR
    • After: nb_visits > 5 OR (license in ["MIT"] programming_languages: ["python"])
    • Note the format of the above query: filter1 OR (filter2 filter3)
  3. Allowing searches in url and metadata fields without operators
    • Before: metadata: "keyword1 keyword2" url : "keyword3" visit_type : [pypi]
    • After: keyword1 keyword2 keyword3 visit_type : [pypi, git]
    • Also, after this change, Elasticsearch should search for keyword{1,2,3} in the url as well as the metadata and boost the origin if the url matches (i.e., give more priority to origin matches).
  4. Adopting a .. format for range filters instead of using range operators (< <= = != >= >)
    • nb_visits > 5 nb_visits < 10 => nb_visits: 5..10
    • nb_visits = 5 => nb_visits : 5
    • nb_visits > 5 => nb_visits : 5..
    • last_visit_date < 2021-01-01 => last_visit_date : ..2021-01-01
  5. Allowing exclusion of a keyword from url/metadata filter:
    • Before: no option to exclude a keyword
    • After: metadata: NOT "keyword" (I'm not sure about the format for this and would appreciate some suggestions)
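
For illustration, the `..` range syntax proposed above could be parsed along these lines. This is only a minimal sketch, not the parser used by swh-search: the names `Range` and `parse_range_filter` are hypothetical, values are kept as plain strings, and whether the bounds are inclusive or exclusive is left open, since the proposal does not settle it.

```python
import re
from typing import NamedTuple, Optional


class Range(NamedTuple):
    """Hypothetical parsed form of a range filter."""
    field: str
    lower: Optional[str]  # None means "no lower bound"
    upper: Optional[str]  # None means "no upper bound"


# A field name, a colon, then either "lo..hi" (either side may be empty)
# or a single value standing for equality.
_RANGE_RE = re.compile(r"^\s*(\w+)\s*:\s*(?:(\S*?)\.\.(\S*)|(\S+))\s*$")


def parse_range_filter(text: str) -> Range:
    match = _RANGE_RE.match(text)
    if match is None:
        raise ValueError(f"not a range filter: {text!r}")
    field, lower, upper, single = match.groups()
    if single is not None:
        # "field : 5" is represented as lower == upper.
        return Range(field, single, single)
    if not lower and not upper:
        raise ValueError(f"range needs at least one bound: {text!r}")
    return Range(field, lower or None, upper or None)
```

With this sketch, `parse_range_filter("nb_visits : 5..")` yields a range with only a lower bound, and `parse_range_filter("last_visit_date : ..2021-01-01")` yields one with only an upper bound.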

@zack @vlorentz @anlambert
Please share your opinions on these points. I'm all ears :)

zack requested changes to this revision. Jul 19 2021, 12:26 PM

thanks @KShivendu, this is a great start!

I'm requesting changes on various things.
Some are just stylistic in the doc itself.
Some are inconsistencies (e.g., ":" versus "=").
And some are usability improvements: the general principle to keep in mind is that a textual language is a user interface; as such, it should be as ergonomic as possible.

There is also an important open question about general boolean AND/OR queries.

docs/query-language.rst
5–6

What is the semantics of filter composition? I'm assuming if we specify multiple filters they are all going to be AND-ed together.

So are OR queries not possible?

Ideally we want to have both (although I do not know if that is supported by the search backend).

If we do have both AND and OR queries we will need:

  • different conjunction operators
  • an explicit statement in the doc about the precedence of the two operators
  • parentheses to disambiguate
9–10

Drop this note about how "intelligent" the parser is. Just say that spaces can be added around operators.
BTW, I suspect that in some cases they are mandatory, e.g., around the "in" operator, so I'm not sure this description is entirely correct.

19

I don't understand the point of the ":" operator in the concrete syntax (both here and many other places in the grammar).
It is used to mean "equal" and we already use the "=" operator in other context in the syntax.
I don't think we should have two different symbols for the same notion.

Even though ":" is more common in some search languages, we should then use "=" everywhere.

20

please use "quotation marks" which is a more common expression for "inverted commas"

26–28

"url" is a datatype in our context, not an information entity.
It would be better to call it "origin_url", but I would prefer even more to just use "origin", as it is what would make the most intuitive sense to users. The documentation will then explain that origins are identified by URLs.

29–30

I don't understand the semantics of these. Do they just mean that metadata should contain the provided string anywhere?

36

better: "visited"

49

better: "visits" ("number" is redundant and adds to visual clutter)

64

drop the trailing "s" in all of these and use "language" instead of "programming_language" (we are a source code archive, so it's the most natural notion of "language" we have)

82

drop the trailing "s"

86–97

except for "any", this list is going to become obsolete soon. We need to point here to an external list that is guaranteed to be more up-to-date, maybe swh-web can provide one? (I don't know if we have another good answer to this problem.)

105–112

these field names appear also elsewhere in the doc, we need to find a way to deduplicate the list, so that it does not appear in multiple places. (I'm gonna comment on the specific field names elsewhere.)

127–133

drop "date_" (or "_date") everywhere in these, it's clutter that doesn't add much

136

"YYYY-MM-DD" is the standard ISO date format, so we can drop everything starting with "or..." here?

145–146

you should briefly explain the semantics here, e.g., "limits the number of results to at most N"

this is a more general comment, each of the previous sections could benefit from one such sentence

This revision now requires changes to proceed. Jul 19 2021, 12:26 PM
docs/query-language.rst
5–6

only AND for now. We want to have a working prototype first

docs/query-language.rst
5–6

Well, but if you want to add OR later you need to think about how to fit it in the grammar now, or else you'll have to deal with backward incompatibility later.

For instance, you might want to add an explicit AND connector right now, even if it is the only option available.

KShivendu marked 2 inline comments as done.
  • docs: Update query-language specs
docs/query-language.rst
29–30

Yes. (It's what happens in archive search when you tick the "search in metadata" option)

Do you have anything better in mind ?

82

Done.

I was wondering if we should call it just type because it also specifies the type of an origin (which shouldn't change). We're searching for origins, so type should intuitively mean the type of origin to look for. Plus, it also makes the language cleaner.

For example :
An origin of type ftp is always going to be ftp no matter how many times we visit it and hence visit_type is a property of a visit as well as the origin.

But I'm not sure, because I haven't seen enough archived repos. Thoughts?

86–97

@anlambert any suggestions for how to implement this feature?

Build is green

Patch application report for D6005 (id=21713)

Rebasing onto fe7640f710...

Current branch diff-target is up to date.
Changes applied before test
commit 3e3f9f6909abb4667f4aa8869e84be7ac38f1b7d
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Jul 18 02:25:14 2021 +0530

    docs: Update query-language specs

commit d4a246fe732fb360711d0155bb63168c56bd3934
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Jul 18 01:09:12 2021 +0530

    docs/query-language: Describe search query language syntax
    
    Documentation for search query language syntax

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/219/ for more details.

docs/query-language.rst
86–97

Currently, we do not have an efficient way to retrieve all visit types dynamically; the list is also hardcoded in swh-web.

Maybe we could write an Elasticsearch query to get that list dynamically? I wanted to test such a query in our production cluster, but I cannot access it anymore due to new firewall rules recently put in place.
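
One possible shape for such a query is an Elasticsearch `terms` aggregation over the field holding the visit types. The sketch below only builds the request body and reads the response; it is untested against the production cluster, and the field name `visit_types`, the bucket limit, and both function names are assumptions.

```python
from typing import Any, Dict, List


def visit_types_query(field: str = "visit_types", max_types: int = 100) -> Dict[str, Any]:
    """Build a search body that returns no hits, only one bucket per distinct value.

    `field` is assumed to be mapped as a keyword field in the origin index.
    """
    return {
        "size": 0,  # skip the hits, we only want the aggregation
        "aggs": {
            "visit_types": {"terms": {"field": field, "size": max_types}}
        },
    }


def extract_visit_types(response: Dict[str, Any]) -> List[str]:
    """Pull the distinct values out of a terms-aggregation response."""
    buckets = response["aggregations"]["visit_types"]["buckets"]
    return [bucket["key"] for bucket in buckets]
```

The body returned by `visit_types_query()` would be sent to the index's `_search` endpoint, and `extract_visit_types` applied to the decoded JSON response.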

zack requested changes to this revision. Jul 21 2021, 11:42 AM
zack added inline comments.
docs/query-language.rst
5

Thanks for adding the explicit AND/OR operators. However, I think the C-like choice of && and || is bad for users. It's too programmer-oriented, and we will have users from other domains. They are also unusual in web search languages, where words are more common. Finally, they are hard to type on some keyboard layouts.

I propose to use explicit `and` and `or` instead (and, in the future, `not` for negation).

They are common stop words, so in most cases not being able to use them as search terms will not be a big deal.
But we need a way to search for them if someone really wants to. One way would be to say that when they appear in quotes they are not operators but terms, e.g., one will be allowed to write `"and" and "or"` in order to search for origins that contain both "and" and "or" as strings.

Alternatively we will need an explicit escape symbol, like \and, \or, \not.

I don't particularly care between these two alternatives.

5

Note that two points of my previous review are still relevant and not addressed yet:

  • you need to say what the precedence rules are, so that users can understand what `foo and bar or baz` means (it should mean `(foo and bar) or baz`)
  • you need to have explicit parentheses in the language so that users can override default precedence rules (e.g., write `foo and (bar or baz)`)
This revision now requires changes to proceed. Jul 21 2021, 11:42 AM
docs/query-language.rst
28

minor: missing space between "marks" and "("

29–30

No, that's fine. (But should be explained in the README.)

82

You're right that we're searching for origins, but in the future we might allow searching for other stuff, like source code files containing a specific search pattern (when we have full-text search). Since "type" is a very generic name, I think we should keep it for requesting the type of artifact one is searching for.

So let's avoid using "type" for now, we can always add it as a shorthand later on.

  • docs/query-language: Use 'and' and 'or'
  • Add details and examples for precedences
  • Fix typos

Build is green

Patch application report for D6005 (id=21733)

Rebasing onto d58705a0eb...

First, rewinding head to replay your work on top of it...
Applying: docs/query-language: Describe search query language syntax
Applying: docs: Update query-language specs
Applying: docs/query-language: Use 'and' and 'or'
Changes applied before test
commit 0a3c5ab8c1984fa469a8d71f3491f9b98d711651
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Thu Jul 22 10:33:10 2021 +0530

    docs/query-language: Use 'and' and 'or'

commit bdcb46bc73df214f16c216dfd7197d9d2de18478
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Jul 18 02:25:14 2021 +0530

    docs: Update query-language specs

commit 43cca34200299c576ccf8f2e8865d4b5e9d1bece
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Jul 18 01:09:12 2021 +0530

    docs/query-language: Describe search query language syntax
    
    Documentation for search query language syntax

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/225/ for more details.

LGTM.

I'm accepting this diff, but note that I've added a few suggestions for improved language above. Please integrate them before this is final.

docs/query-language.rst
19

better:

  • `and` has higher precedence than `or`. Therefore `foo or bar and baz` means `foo or (bar and baz)`
20

better:

  • Precedence can be overridden using parentheses: `(` and `)`. For example, you can override the default precedence in the previous query as: `(foo or bar) and baz`
21

better:

  • To actually search for `and` or `or` as strings, just put them within quotes. Example: `metadata : "vcs history and metadata"`, or even just `"and"` to search for the string `and`
This revision is now accepted and ready to land. Jul 22 2021, 9:29 AM
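
The three rules above (`and` binds tighter than `or`, parentheses override, quoted words are terms) can be sketched as a small recursive-descent parser. This is an illustration only and not the grammar implemented in swh-search: terms are simplified to bare or quoted words rather than full filters, and the names `tokenize` and `parse` are hypothetical.

```python
import re
from typing import List, Optional, Tuple, Union

# A node is either a term (str) or a tuple (operator, left, right).
Node = Union[str, Tuple[str, "Node", "Node"]]

# Quoted strings, single parentheses, or runs of other non-space characters.
_TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def tokenize(query: str) -> List[str]:
    return _TOKEN_RE.findall(query)


def parse(query: str) -> Node:
    tokens = tokenize(query)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def advance() -> str:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or() -> Node:
        # Lowest precedence: a chain of and-expressions joined by "or".
        node = parse_and()
        while peek() == "or":
            advance()
            node = ("or", node, parse_and())
        return node

    def parse_and() -> Node:
        # Higher precedence: a chain of atoms joined by "and".
        node = parse_atom()
        while peek() == "and":
            advance()
            node = ("and", node, parse_atom())
        return node

    def parse_atom() -> Node:
        token = peek()
        if token is None or token in ("and", "or", ")"):
            raise ValueError(f"unexpected token: {token!r}")
        advance()
        if token == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        # Quoted tokens such as '"and"' keep their quotes, so they never
        # compare equal to the bare operators above.
        return token

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return tree
```

Because `parse_or` calls `parse_and` for each operand, `foo or bar and baz` groups as `foo or (bar and baz)`, while `(foo or bar) and baz` keeps the grouping given by the parentheses.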
  • Changes suggested by @zack
  • Squash commits

Build is green

Patch application report for D6005 (id=21735)

Rebasing onto d58705a0eb...

First, rewinding head to replay your work on top of it...
Applying: docs/query-language: Describe search query language syntax
Changes applied before test
commit 68ebc3f7f751a7685e2774b9ab8b49bbae0870f6
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Jul 18 01:09:12 2021 +0530

    docs/query-language: Describe search query language syntax
    
    Documentation for the proposed search query language

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/226/ for more details.

This revision was landed with ongoing or failed builds. Jul 22 2021, 10:05 AM
This revision was automatically updated to reflect the committed changes.

Build is green

Patch application report for D6005 (id=21736)

Rebasing onto d58705a0eb...

Current branch diff-target is up to date.
Changes applied before test
commit 4e453304ade0c63657a913f515b66737ee69dfcb
Author: KShivendu <shivendu@iitbhilai.ac.in>
Date:   Sun Jul 18 01:09:12 2021 +0530

    docs/query-language: Describe search query language syntax
    
    Documentation for the proposed search query language

See https://jenkins.softwareheritage.org/job/DSEA/job/tests-on-diff/227/ for more details.