docs/query-language.rst
- This file was added.
Search Query Language
=====================
Every query is composed of filters separated by ``and`` or ``or``.
zack: Thanks for adding the explicit AND/OR operators. However, I think the C-like choice of `&&` and…
[Not Done] zack: Note that two points of my previous review are still relevant and not addressed yet: - you need…
Each filter has three components, in this order: ``Name Operator Value``
[Not Done] zack: What is the semantics of filter composition? I'm assuming if we specify multiple filters they are all going to be AND-ed together. So are OR queries not possible? Ideally we want to have both (although I do not know if that is supported by the search backend). If we do have both AND and OR queries we will need:
[Not Done] vlorentz: Only AND for now. We want to have a working prototype first.
[Not Done] zack: Well, but if you want to add OR later you need to think about how to fit it in the grammar now, or else you'll have to deal with backward incompatibility later. For instance, you might want to add an explicit AND connector right now, even if it is the only option available.
Some examples:
* ``origin = django and language in [python] and visits >= 5``
* ``last_revision > 2020-01-01 and limit = 10``
[Not Done] zack: Drop this note about how "intelligent" the parser is. Just say that spaces can be added around operators.
* ``last_visit > 2021-01-01 or last_visit < 2020-01-01``
* ``visited = false and metadata = "kubernetes" or origin = "minikube"``
[Done] KShivendu: I'll remove this extra "on"
* ``keyword in ["orchestration", "kubectl"] and language in ["go", "rust"]``
* ``(origin = debian or visit_type = ["deb"]) and license in ["GPL-3"]``
**Note**:
* Whitespace is optional between the three components of a filter.
* The conjunction operators are left-associative. Therefore ``foo and bar and baz`` means ``(foo and bar) and baz``
* ``and`` has higher precedence than ``or``. Therefore ``foo or bar and baz`` means ``foo or (bar and baz)``
[Not Done] zack: I don't understand the point of the ":" operator in the concrete syntax (both here and many other places in the grammar). Even though ":" is more common in some search languages, we should then use "=" everywhere.
[Not Done] zack: better: * ``and`` has higher precedence than ``or``. Therefore ``foo or bar and baz``…
* Precedence can be overridden using parentheses: ``(`` and ``)``. For example, you can override the default precedence in the previous query as: ``(foo or bar) and baz``
[Done] zack: please use "quotation marks" which is a more common expression for "inverted commas"
[Not Done] zack: better: * Precedence can be overridden using parentheses: ``(`` and ``)``. For example…
* To search for ``and`` or ``or`` as strings, put them within quotation marks. Example: ``metadata = "vcs history and metadata"``, or just ``metadata = "and"`` to search for the string ``and`` in the metadata
[Not Done] zack: better: * To actually search for ``and`` or ``or`` as strings, just put them within quotes.
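The associativity and precedence rules in the notes above can be illustrated with a small recursive-descent parser. This is only a sketch, not the parser used by the project: the ``parse`` function and its tuple output are hypothetical, and each filter is reduced to a single placeholder word such as ``foo``, so quoted strings containing spaces are not handled.

```python
import re

# A token is a parenthesis or a run of characters with no whitespace/parentheses.
TOKEN = re.compile(r"[()]|[^\s()]+")


def parse(query):
    """Parse placeholder filters joined by and/or into nested tuples."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        # Lowest precedence: a chain of "and" groups joined by "or".
        node = parse_and()
        while peek() == "or":
            take()
            node = ("or", node, parse_and())  # left-associative
        return node

    def parse_and():
        # Higher precedence: a chain of atoms joined by "and".
        node = parse_atom()
        while peek() == "and":
            take()
            node = ("and", node, parse_atom())  # left-associative
        return node

    def parse_atom():
        if peek() is None:
            raise ValueError("unexpected end of query")
        tok = take()
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        return tok

    tree = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token: " + peek())
    return tree


print(parse("foo or bar and baz"))    # ('or', 'foo', ('and', 'bar', 'baz'))
print(parse("(foo or bar) and baz"))  # ('and', ('or', 'foo', 'bar'), 'baz')
```

Because a quoted token such as ``"and"`` keeps its quotation marks, it never compares equal to the bare operator ``and`` and is treated as an ordinary term, which mirrors the quoting rule in the last note.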
The filters are classified by the type of value they expect.
Pattern filters
---------------
Returns origins having the given keywords in their URL or intrinsic metadata
[Done] zack: "url" is a datatype in our context, not an information entity. It would be better to call it…
[Not Done] zack: minor: missing space between "marks" and "("
* Name:
[Not Done] zack: I don't understand the semantic of these. Do they just mean that metadata should contain the provided string anywhere?
[Done] KShivendu: Yes. (It's what happens in archive search when you tick the "search in metadata" option) Do you have anything better in mind?
[Not Done] zack: No, that's fine. (But should be explained in the README.)
  * ``origin``: Keywords from the origin URL
  * ``metadata``: Keywords from all the intrinsic metadata fields
* Operator: ``=``
* Value: String wrapped in quotation marks (``"`` or ``'``)
**Note:** If a string has no whitespace then the quotation marks become optional.
[Not Done] zack: better: "visited"
**Examples:**
* ``origin = https://github.com/Django/django``
* ``origin = kubernetes``
* ``origin = "github python"``
* ``metadata = orchestration``
* ``metadata = "javascript language"``
Boolean filters
---------------
Returns origins whose boolean attributes are equal to the given values
[Not Done] zack: better: "visits" ("number" is redundant and adds to visual clutter)
* Name: ``visited`` : Whether the origin has been visited
* Operator: ``=``
* Value: ``true`` or ``false``
**Examples:**
* ``visited = true``
* ``visited = false``
Numeric filters
---------------
Returns origins whose numeric attributes fall in the given range
* Name: ``visits`` : Number of visits of an origin
[Not Done] zack: drop the trailing "s" in all of these and use "language" instead of "programming_language" (we are a source code archive, so it's the most natural notion of "language" we have)
* Operator: ``<`` ``<=`` ``=`` ``!=`` ``>`` ``>=``
* Value: Positive integer
**Examples:**
* ``visits > 2``
* ``visits = 5``
* ``visits <= 10``
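As an illustration of the ``Name Operator Value`` shape with optional whitespace, a numeric filter could be split and evaluated roughly as follows. This is a sketch under stated assumptions: ``parse_numeric_filter``, ``matches`` and the dictionary representation of an origin are hypothetical names chosen for this example, not part of the actual implementation.

```python
import operator
import re

# The six numeric operators listed above, mapped to Python comparisons.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
}

# Two-character operators come first so "<=" is not read as "<" followed by "=".
FILTER = re.compile(r"\s*(\w+)\s*(<=|>=|!=|<|>|=)\s*(\d+)\s*")


def parse_numeric_filter(text):
    """Split a numeric filter into (name, operator, integer value)."""
    match = FILTER.fullmatch(text)
    if match is None:
        raise ValueError("not a numeric filter: " + repr(text))
    name, op, value = match.groups()
    return name, op, int(value)


def matches(filter_text, origin):
    """Check a filter against an origin given as a plain dict (an assumption)."""
    name, op, value = parse_numeric_filter(filter_text)
    return OPS[op](origin[name], value)


print(parse_numeric_filter("visits>=5"))       # ('visits', '>=', 5)
print(matches("visits <= 10", {"visits": 3}))  # True
```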
Un-bounded List filters
-----------------------
Returns origins that satisfy the criteria based on a given list
* Name:
  * ``language`` : Programming languages used
[Not Done] zack: drop the trailing "s"
[Done] KShivendu: Done. I was wondering if we should call it just `type` because it also specifies the type of an origin (which shouldn't change). We're searching for origins, so type should intuitively mean the type of origin to look for. Plus, it also makes the language cleaner. For example : But I'm not sure, because I haven't seen enough archived repos. Thoughts?
[Not Done] zack: You're right that we're searching for origins, but in the future we might allow searching for other stuff, like source code files containing a specific search pattern (when we will have full-text search). type being a very generic name, I think we should keep it to request the type of artifact one is searching for. So let's avoid using "type" for now, we can always add it as a shorthand later on.
  * ``license`` : License used
  * ``keyword`` : Keywords (often the same as tags) or description (includes README) from the metadata
* Operator: ``in`` ``not in``
* Value: Array of strings
**Note:**
* If a string has no whitespace then the quotation marks become optional.
* The ``keyword`` filter gives more priority to the keywords field of intrinsic metadata than to the description field. So origins having the queried term in their intrinsic metadata keywords will appear first.
**Examples:**
* ``language in [python, js]``
* ``license in ["GPL 3.0 or later", MIT]``
[Not Done] zack: except for "any", this list is going to become obsolete soon. We need to point here to an external list that is guaranteed to be more up-to-date, maybe swh-web can provide one? (I don't know if we have another good answer to this problem.)
[Done] KShivendu: @anlambert any suggestions for how to implement this feature?
[Not Done] anlambert: Currently, we do not have an efficient way to retrieve all visit types dynamically, the list is also hardcoded in swh-web. Maybe we could write an elasticsearch query to get that list dynamically? I wanted to test such a query in our production cluster but I cannot access it anymore due to new firewall rules recently put in place.
* ``keyword in ["Software Heritage", swh]``
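The array syntax in the examples above mixes quoted and unquoted strings. A minimal sketch of how such a value could be turned into a list of strings is shown below; ``parse_list`` is a hypothetical helper written for this illustration, and it deliberately ignores escape sequences inside quoted strings.

```python
import re

# An item is a double-quoted string, a single-quoted string, or a bare word
# that contains no whitespace, commas, brackets or quotation marks.
ITEM = re.compile(r'"([^"]*)"|\'([^\']*)\'|([^\s,\[\]"\']+)')


def parse_list(text):
    """Turn '[a, "b c"]' into ['a', 'b c']."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a list in square brackets: " + repr(text))
    # Each match fills exactly one of the three groups; keep the filled one.
    return [dq or sq or bare for dq, sq, bare in ITEM.findall(text[1:-1])]


print(parse_list("[python, js]"))                # ['python', 'js']
print(parse_list('["GPL 3.0 or later", MIT]'))   # ['GPL 3.0 or later', 'MIT']
```

Note that the ``or`` inside ``"GPL 3.0 or later"`` stays part of the string because it sits within quotation marks.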
Bounded List filters
--------------------
Returns origins that satisfy the criteria based on a list of fixed options
**visit_type**
* Name: ``visit_type`` : Returns only origins with at least one of the specified visit types
* Operator: ``=``
* Value: Array of the following values
  ``any``
[Not Done] zack: these field names appear also elsewhere in the doc, we need to find a way to deduplicate the list, so that it does not appear in multiple places. (I'm gonna comment on the specific field names elsewhere.)
  ``cran``
  ``deb``
  ``deposit``
  ``ftp``
  ``hg``
  ``git``
  ``nixguix``
  ``npm``
  ``pypi``
  ``svn``
  ``tar``
**sort_by**
* Name: ``sort_by`` : Sorts origins based on the given list of origin attributes
* Operator: ``=``
* Value: Array of the following values
  ``visits``
  ``last_visit``
  ``last_eventful_visit``
[Not Done] zack: drop "date_" (or "_date") everywhere in these, it's clutter that doesn't add much
  ``last_revision``
  ``last_release``
  ``created``
[Not Done] zack: "YYYY-MM-DD" is the standard ISO date format, so we can drop everything starting with "or..." here?
  ``modified``
  ``published``
**Examples:**
* ``visit_type = [svn, npm]``
* ``visit_type = [nixguix, "ftp"]``
* ``sort_by = ["last_visit", created]``
* ``sort_by = [visits, modified]``
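What distinguishes bounded list filters is that each value must come from a fixed set. A sketch of that check follows; the two sets are copied from the lists above (which, as noted in the review, may go out of date), and ``check_values`` is a hypothetical helper rather than an existing function.

```python
# Allowed values, copied from the lists in this section.
VISIT_TYPES = {
    "any", "cran", "deb", "deposit", "ftp", "hg",
    "git", "nixguix", "npm", "pypi", "svn", "tar",
}
SORT_KEYS = {
    "visits", "last_visit", "last_eventful_visit", "last_revision",
    "last_release", "created", "modified", "published",
}


def check_values(values, allowed):
    """Return the values unchanged, or raise if any is outside the fixed set."""
    unknown = [value for value in values if value not in allowed]
    if unknown:
        raise ValueError("unsupported values: " + ", ".join(unknown))
    return values


print(check_values(["svn", "npm"], VISIT_TYPES))        # ['svn', 'npm']
print(check_values(["last_visit", "created"], SORT_KEYS))
```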
[Not Done] zack: you should briefly explain the semantics here, e.g., "limits the number of results to at most N". This is a more general comment: each of the previous sections could benefit from one such sentence.
Date filters
------------
Returns origins whose date attributes fall in the given range
* Name:
  * ``last_visit`` : Latest visit date
  * ``last_eventful_visit`` : Latest visit date where a new snapshot was detected
  * ``last_revision`` : Latest commit date
  * ``last_release`` : Latest release date
  * ``created`` : Creation date
  * ``modified`` : Modification date
  * ``published`` : Publication date
* Operator: ``<`` ``<=`` ``=`` ``!=`` ``>`` ``>=``
* Value: Date in standard ISO format
**Note:** The last three date filters are based on metadata that has to be manually entered by the repository authors, so they might not be correct or up-to-date.
**Examples:**
* ``last_visit > 2001-01-01 and last_visit < 2101-01-01``
* ``last_revision = "2000-01-01 18:35Z"``
* ``last_release != "2021-07-17T18:35:00Z"``
* ``created <= "2021-07-17 18:35"``
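The date values in the examples above come in several shapes (date only, date with time, with or without a trailing ``Z``). The sketch below shows one way these four shapes could be read with the Python standard library. It is an assumption about how such values might be interpreted, not a description of the actual parser, and the ``FORMATS`` tuple only covers the shapes that appear in the examples.

```python
from datetime import datetime

# One strptime pattern per shape used in the examples above.
# "%z" accepts a literal "Z" as UTC on Python 3.7 and later.
FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d %H:%M%z",
    "%Y-%m-%dT%H:%M:%S%z",
)


def parse_date(text):
    """Try each known shape in turn and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + repr(text))


for value in ("2001-01-01", "2000-01-01 18:35Z",
              "2021-07-17T18:35:00Z", "2021-07-17 18:35"):
    print(value, "->", parse_date(value).isoformat())
```

Values without a ``Z`` are returned without timezone information, so a real implementation would need to decide on a default timezone before comparing them with timezone-aware values.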
Limit filter
------------
Limits the number of results to at most N
* Name: ``limit``
* Operator: ``=``
* Value: Positive integer
**Note:** The default value of the limit is 50
**Examples:**
* ``limit = 1``
* ``limit = 15``
zack: Thanks for adding the explicit AND/OR operators. However, I think the C-like choice of `&&` and `||` is bad for users. It's too programmer-oriented, and we will have users from other domains. They are also unusual in web search languages, where words are more common. Finally, they are hard to type on some keyboard layouts.
I propose to use explicit `and` and `or` instead (and in the future `not` for negation).
They are common stop words, so in most cases not being able to use them as search terms will not be a big deal.
But we need a way to allow searching for them if someone really wants to. One way would be to say that when they appear in quotes they are not operators but terms, e.g., one will be allowed to search for `"and" and "or"` in order to search for origins that contain both "and" and "or" as strings.
Alternatively we will need an explicit escape symbol, like `\and`, `\or`, `\not`.
I don't have a strong preference between these two alternatives.