docs/query-language.rst
- This file was added.
Search Query Language Syntax
============================
Every query is composed of filters separated by whitespace.
zack: Thanks for adding the explicit AND/OR operators. However, I think the C-like choice of `&&` and…
zack (Not Done): Note that two points of my previous review are still relevant and not addressed yet:
- you need…
Each filter has three components, in this order: ``Name Operator Value``
zack (Not Done): What is the semantics of filter composition? I'm assuming if we specify multiple filters they are all going to be AND-ed together. So are OR queries not possible? Ideally we want to have both (although I do not know if that is supported by the search backend). If we do have both AND and OR queries we will need:
vlorentz (Not Done): Only AND for now. We want to have a working prototype first.
zack (Not Done): Well, but if you want to add OR later you need to think about how to fit it in the grammar now, or else you'll have to deal with backward incompatibility later. For instance, you might want to add an explicit AND connector right now, even if it is the only option available.
**Note:** It's not necessary to put whitespace between these three components.
The parser is intelligent enough to identify them even if the
whitespace is removed.
zack (Not Done): Drop this note about how "intelligent" the parser is. Just say that spaces can be added around operators.
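The ``Name Operator Value`` structure with optional whitespace can be illustrated with a minimal sketch. This is not the parser under review: ``split_filter`` and ``FILTER_RE`` are hypothetical names, and the sketch only covers the symbolic operators listed in this document (the word operators ``in`` and ``not in`` would need whitespace-aware handling).

```python
import re

# Symbolic operators listed in this document. Two-character operators come
# first so that "<=" is not read as "<" followed by "=".
FILTER_RE = re.compile(
    r"\s*(?P<name>[A-Za-z_]+)\s*(?P<op><=|>=|!=|<|>|=|:)\s*(?P<value>.+?)\s*"
)


def split_filter(text):
    """Split one filter into (name, operator, value); whitespace is optional."""
    match = FILTER_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a filter: {text!r}")
    return match["name"], match["op"], match["value"]


print(split_filter("nb_visits>2"))
print(split_filter("url : kubernetes"))
```

Both spellings, with and without spaces around the operator, yield the same three components.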
The filters are classified by the type of value they expect.
KShivendu (Done): I'll remove this extra "on"
Pattern filters
---------------
* Name: ``url`` ``metadata``
* Operator: ``:``
zack (Not Done): I don't understand the point of the ":" operator in the concrete syntax (both here and many other places in the grammar). Even though ":" is more common in some search languages, we should then use "=" everywhere.
zack (Not Done): better:
* ``and`` has higher precedence than ``or``. Therefore ``foo or bar and baz``…
* Value: String wrapped in inverted commas (``"`` or ``'``)
zack (Done): please use "quotation marks" which is a more common expression for "inverted commas"
zack (Not Done): better:
* Precedence can be overridden using parentheses: ``(`` and ``)``. For example…
zack (Not Done): better:
* To actually search for ``and`` or ``or`` as strings, just put them within quotes.
**Note:** If the string contains no whitespace, the inverted commas are optional.
**Examples:**
* ``url : https://github.com/Django/django``
* ``url : kubernetes``
* ``url : "github python"``
zack (Done): "url" is a datatype in our context, not an information entity. It would be better to call it…
zack (Not Done): minor: missing space between "marks" and "("
* ``metadata : orchestration``
* ``metadata : "javascript language"``
zack (Not Done): I don't understand the semantic of these. Do they just mean that metadata should contain the provided string anywhere?
KShivendu (Done): Yes. (It's what happens in archive search when you tick the "search in metadata" option.) Do you have anything better in mind?
zack (Not Done): No, that's fine. (But should be explained in the README.)
Boolean filters
---------------
* Name: ``with_visit``
zack (Not Done): better: "visited"
* Operator: ``:``
* Value: ``true`` or ``false``
**Examples:**
* ``with_visit : true``
* ``with_visit : false``
Numeric filters
---------------
* Name: ``nb_visits``
zack (Not Done): better: "visits" ("number" is redundant and adds to visual clutter)
* Operator: ``<`` ``<=`` ``=`` ``!=`` ``>`` ``>=``
* Value: Positive integer
**Examples:**
* ``nb_visits > 2``
* ``nb_visits = 5``
* ``nb_visits <= 10``
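As an illustration of the six comparison operators, the following sketch maps each operator string to a Python comparison. The names ``NUMERIC_OPERATORS`` and ``numeric_filter_matches`` are hypothetical, and the sketch evaluates the comparison in memory, which is an assumption made for clarity; the real search backend may translate these operators differently.

```python
import operator

# Operator strings from the "Numeric filters" section, mapped to the
# corresponding Python comparison functions.
NUMERIC_OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
}


def numeric_filter_matches(actual, op, expected):
    """Return True if `actual op expected` holds, e.g. nb_visits > 2."""
    return NUMERIC_OPERATORS[op](actual, expected)


print(numeric_filter_matches(3, ">", 2))
```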
Un-bounded List filters
-----------------------
* Name: ``programming_languages`` ``licenses`` ``keywords``
zack (Not Done): drop the trailing "s" in all of these and use "language" instead of "programming_language" (we are a source code archive, so it's the most natural notion of "language" we have)
* Operator: ``in`` ``not in``
* Value: Array of Strings (separated with ``,``)
**Note:** If a string contains no whitespace, the inverted commas are optional.
**Examples:**
* ``programming_languages in [python, js]``
* ``licenses in ["GPL 3.0 or later", MIT]``
* ``keywords in ["Software Heritage", swh]``
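The list syntax above, with optional quoting of items, could be read roughly as follows. This is a sketch under stated assumptions, not the grammar under review: ``parse_list`` and ``ITEM_RE`` are hypothetical names, and escaped quotes inside quoted items are not handled.

```python
import re

# An item is a double-quoted string, a single-quoted string, or a bare word
# without whitespace, commas, brackets, or quotes.
ITEM_RE = re.compile(r'"([^"]*)"|\'([^\']*)\'|([^,\s\[\]"\']+)')


def parse_list(text):
    """Parse `[a, "b c", 'd']` into a list of strings with quotes removed."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"not a list: {text!r}")
    items = []
    for match in ITEM_RE.finditer(text[1:-1]):
        # Exactly one of the three groups is set for each match.
        items.append(next(g for g in match.groups() if g is not None))
    return items


print(parse_list('["GPL 3.0 or later", MIT]'))
```

Quoted and unquoted items can be mixed freely, as in the ``licenses`` example.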
Bounded List filters
--------------------
**visit_types**
* Name: ``visit_types``
zack (Not Done): drop the trailing "s"
KShivendu (Done): Done. I was wondering if we should call it just `type` because it also specifies the type of an origin (which shouldn't change). We're searching for origins, so `type` should intuitively mean the type of origin to look for. Plus, it also makes the language cleaner. For example : But I'm not sure, because I haven't seen enough archived repos. Thoughts?
zack (Not Done): You're right that we're searching for origins, but in the future we might allow searching for other stuff, like source code files containing a specific search pattern (when we will have full-text search). `type` being a very generic name, I think we should keep it to request the type of artifact one is searching for. So let's avoid using "type" for now, we can always add it as a shorthand later on.
* Operator: ``:``
* Value: Array with elements
``any``
``cran``
``deb``
``deposit``
``ftp``
``hg``
``git``
``nixguix``
``npm``
``pypi``
``svn``
``tar``
zack (Not Done): except for "any", this list is going to become obsolete soon. We need to point here to an external list that is guaranteed to be more up-to-date, maybe swh-web can provide one? (I don't know if we have another good answer to this problem.)
KShivendu (Done): @anlambert any suggestions for how to implement this feature?
anlambert (Not Done): Currently, we do not have an efficient way to retrieve all visit types dynamically, the list is also hardcoded in swh-web. Maybe we could write an elasticsearch query to get that list dynamically? I wanted to test such a query in our production cluster but I cannot access it anymore due to new firewall rules recently put in place.
**sort_by**
* Name: ``sort_by``
* Operator: ``:``
* Value: Array with elements
``nb_visits``
``last_visit_date``
``last_eventful_visit_date``
``last_revision_date``
``last_release_date``
``date_created``
``date_modified``
``date_published``
zack (Not Done): these field names appear also elsewhere in the doc, we need to find a way to deduplicate the list, so that it does not appear in multiple places. (I'm gonna comment on the specific field names elsewhere.)
**Examples:**
* ``visit_types : [svn, npm]``
* ``visit_types : [nixguix, "ftp"]``
* ``sort_by : ["last_visit_date", date_created]``
* ``sort_by : [nb_visits, date_modified]``
Date filters
------------
* Name:
* ``last_visit_date``
* ``last_eventful_visit_date``
* ``last_revision_date``
* ``last_release_date``
* ``date_created``
* ``date_modified``
* ``date_published``
zack (Not Done): drop "date_" (or "_date") everywhere in these, it's clutter that doesn't add much
* Operator: ``<`` ``<=`` ``=`` ``!=`` ``>`` ``>=``
* Value: Date in ``YYYY-MM-DD`` or ``Standard ISO`` format
zack (Not Done): "YYYY-MM-DD" is the standard ISO date format, so we can drop everything starting with "or..." here?
**Examples:**
* ``last_visit_date > 2001-01-01 last_visit_date < 2001-01-01``
* ``last_revision_date = "2000-01-01 18:35Z"``
* ``last_release_date != "2021-07-17T18:35:00Z"``
* ``date_created <= "2021-07-17 18:35"``
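The date spellings used in the examples above could be accepted with a small set of explicit formats, as in this sketch. ``parse_date`` and ``DATE_FORMATS`` are hypothetical names, and treating values without a timezone suffix as UTC is an assumption; the document does not say how such values are interpreted.

```python
from datetime import datetime, timezone

# Formats covering the example values in this section: date only, or date
# plus time (with or without seconds), separated by a space or "T".
DATE_FORMATS = [
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M",
    "%Y-%m-%dT%H:%M:%S",
]


def parse_date(value):
    """Parse a date filter value into a timezone-aware UTC datetime."""
    text = value.strip().strip("\"'")
    if text.endswith("Z"):
        text = text[:-1]  # "Z" means UTC, which is also the assumed default
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {value!r}")


print(parse_date('"2021-07-17T18:35:00Z"').isoformat())
```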
Limit filter
------------
zack (Not Done): you should briefly explain the semantics here, e.g., "limits the number of results to at most N". This is a more general comment: each of the previous sections could benefit from one such sentence.
* Name: ``limit``
* Operator: ``=``
* Value: Positive Integer
**Examples:**
* ``limit = 1``
* ``limit = 15``
zack: Thanks for adding the explicit AND/OR operators. However, I think the C-like choice of `&&` and `||` is bad for users. It's too programmer-oriented, and we will have users from other domains. They are also unusual in web search languages, where words are more common. Finally, they are hard to type on some keyboard layouts.
I propose to use explicit `and` and `or` instead (and in the future `not` for negation).
They are common stop words, so in most cases not being able to use them as search terms will not be a big deal.
But we need a way to allow searching for them if someone really wants to. One way would be to say that when they appear in quotes they are not operators but terms, e.g., one will be allowed to search for `"and" and "or"` in order to search for origins that contain both "and" and "or" as strings.
Alternatively we will need an explicit escape symbol, like `\and`, `\or`, `\not`.
I don't particularly care between these two alternatives.
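The first alternative (quoted words are terms, bare words are operators) can be sketched with a small tokenizer. ``tokenize``, ``TOKEN_RE`` and ``OPERATOR_WORDS`` are hypothetical names for illustration only, and matching the operator words in lowercase only is an assumption.

```python
import re

# A token is a double-quoted string, a single-quoted string, or a run of
# non-whitespace characters.
TOKEN_RE = re.compile(r'"[^"]*"|\'[^\']*\'|\S+')
OPERATOR_WORDS = {"and", "or", "not"}


def tokenize(query):
    """Return (kind, text) pairs, where kind is "OP" or "TERM"."""
    tokens = []
    for raw in TOKEN_RE.findall(query):
        if len(raw) >= 2 and raw[0] in "\"'" and raw[-1] == raw[0]:
            # Quoted text is always a search term, even if it says "and".
            tokens.append(("TERM", raw[1:-1]))
        elif raw in OPERATOR_WORDS:
            tokens.append(("OP", raw))
        else:
            tokens.append(("TERM", raw))
    return tokens


print(tokenize('"and" and "or"'))
```

With this rule, the query `"and" and "or"` yields two terms joined by one operator, which matches the behaviour described in the comment.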