diff --git a/PKG-INFO b/PKG-INFO index b9d6de0..2c855c5 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,90 +1,90 @@ Metadata-Version: 2.1 Name: swh.search -Version: 0.13.0 +Version: 0.13.1 Summary: Software Heritage search service Home-page: https://forge.softwareheritage.org/diffusion/DSEA Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-search Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-search/ Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 3 - Alpha Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE License-File: AUTHORS swh-search ========== Search service for the Software Heritage archive. It is similar to swh-storage in what it contains, but provides different ways to query it: while swh-storage is mostly a key-value store that returns an object from a primary key, swh-search is focused on reverse indices, to allow finding objects that match some criteria; for example full-text search. Currently uses ElasticSearch, and provides only origin search (by URL and metadata) ## Dependencies - Python tests for this module include tests that cannot be run without a local ElasticSearch instance, so you need the ElasticSearch server executable on your machine (no need to have a running ElasticSearch server). - Debian-like host The elasticsearch package is required. As it's not part of debian-stable, [another debian repository is required to be configured](https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html#deb-repo) - Non Debian-like host The tests expect: - `/usr/share/elasticsearch/jdk/bin/java` to exist. - `org.elasticsearch.bootstrap.Elasticsearch` to be in java's classpath. - Emscripten is required for generating tree-sitter WASM module. The following commands need to be executed for the setup: ```bash cd /opt && git clone https://github.com/emscripten-core/emsdk.git && cd emsdk && \ ./emsdk install latest && ./emsdk activate latest PATH="${PATH}:/opt/emsdk/upstream/emscripten" ``` **Note:** If emsdk isn't found in the PATH, the tree-sitter cli automatically pulls `emscripten/emsdk` image from docker hub when `make ts-build-wasm` or `make ts-build` is used. ## Make targets Below is the list of available make targets that can be executed from the root directory of swh-search in order to build and/or execute the swh-search under various configurations: * **ts-install**: Install node_modules and emscripten SDK required for TreeSitter * **ts-generate**: Generate parser files(C and JSON) from the grammar * **ts-repl**: Starts a web based playground for the TreeSitter grammar. It's the recommended way for developing TreeSitter grammar. * **ts-dev**: Parse the `query_language/sample_query` and print the corresponding syntax expression along with the start and end positions of all the nodes. * **ts-dev sanitize=1**: Same as **ts-dev** but without start and end position of the nodes. This format is expected by TreeSitter's native test command. `sanitize=1` cleans the output of **ts-dev** using `sed` to achieve the desired format. * **ts-test**: executes TreeSitter's native tests * **ts-build-so**: Generates `swh_ql.so` file from the previously generated parser using py-tree-sitter * **ts-build-so**: Generates `swh_ql.wasm` file from the previously generated parser using emscripten * **ts-build**: Executes both **ts-build-so** and **ts-build-so** diff --git a/docs/query-language.rst b/docs/query-language.rst index 11dba04..b538845 100644 --- a/docs/query-language.rst +++ b/docs/query-language.rst @@ -1,190 +1,199 @@ Search Query Language ===================== Every query is composed of filters separated by ``and`` or ``or``. These filters have 3 components in the order : ``Name Operator Value`` Some of the examples are : * ``origin : plasma and language in [python] and visits >= 5`` - * ``last_revision > 2020-01-01 and limit = 10`` * ``last_visit > 2021-01-01 or last_visit < 2020-01-01`` * ``visited = false and metadata = "kubernetes" or origin : "minikube"`` - * ``keyword in ["orchestration", "kubectl"] and language in ["go", "rust"]`` + * ``keyword in ["orchestration", "kubectl"] and license in ["GPLv3+", "GPLv3"]`` * ``(origin : debian or visit_type = ["deb"]) and license in ["GPL-3"]`` + +.. this one is currently disabled, because it is too expansive to add + the 'last_revision' in swh-search: + + * ``last_revision > 2020-01-01 and limit = 10`` + +.. and this one does not work yet because we don't collect enough language metadata: + + * ``keyword in ["orchestration", "kubectl"] and language in ["go", "rust"]`` + **Note**: * Whitespaces are optional between the three components of a filter. * The conjunction operators have left precedence. Therefore ``foo and bar and baz`` means ``(foo and bar) and baz`` * ``and`` has higher precedence than ``or``. Therefore ``foo or bar and baz`` means ``foo or (bar and baz)`` * Precedence can be overridden using parentheses: ``(`` and ``)``. For example, you can override the default precedence in the previous query as: ``(foo or bar) and baz`` * To actually search for ``and`` or ``or`` as strings, just put them within quotes. Example : ``metadata : "vcs history and metadata"``, or even just ``metadata : "and"`` to search for the string ``and`` in the metadata The filters have been classified based on the type of value that they expects. Pattern filters --------------- Returns origins having the given keywords in their url or intrinsic metadata * Name: * ``origin``: Keywords from the origin url * ``metadata``: Keywords from all the intrinsic metadata fields * Operator: ``:`` * Value: String wrapped in quotation marks(``"`` or ``'``) **Note:** If a string has no whitespace then the quotation marks become optional. **Examples:** * ``origin : https://github.com/Django/django`` * ``origin : kubernetes`` * ``origin : "github python"`` * ``metadata : orchestration`` * ``metadata : "javascript language"`` Boolean filters --------------- Returns origins having their boolean type values equal to given values * Name: ``visited`` : Whether the origin has been visited * Operator: ``=`` * Value: ``true`` or ``false`` **Examples:** * ``visited = true`` * ``visited = false`` Numeric filters --------------- Returns origins having their numeric type values in the given range * Name: ``visits`` : Number of visits of an origin * Operator: ``<`` ``<=`` ``=`` ``!=`` ``>`` ``>=`` * Value: Positive integer **Examples:** * ``visits > 2`` * ``visits = 5`` * ``visits <= 10`` Un-bounded List filters ----------------------- Returns origins that satisfy the criteria based on a given list * Name: * ``language`` : Programming languages used * ``license`` : License used * ``keyword`` : keywords (often same as tags) or description (includes README) from the metadata * Operator: ``in`` ``not in`` * Value: Array of strings **Note:** * If a string has no whitespace then the quotation marks become optional. * The ``keyword`` filter gives more priority to the keywords field of intrinsic metadata than the description field. So origins having the queried term in their intrinsic metadata keyword will appear first. **Examples:** * ``language in [python, js]`` * ``license in ["GPL 3.0 or later", MIT]`` * ``keyword in ["Software Heritage", swh]`` Bounded List filters -------------------- Returns origins that satisfy the criteria based on a list of fixed options **visit_type** * Name: ``visit_type`` : Returns only origins with at least one of the specified visit types * Operator: ``=`` * Value: Array of the following values ``any`` ``cran`` ``deb`` ``deposit`` ``ftp`` ``hg`` ``git`` ``nixguix`` ``npm`` ``pypi`` ``svn`` ``tar`` **sort_by** * Name: ``sort_by`` : Sorts origins based on the given list of origin attributes * Operator: ``=`` * Value: Array of the following values ``visits`` ``last_visit`` ``last_eventful_visit`` ``last_revision`` ``last_release`` ``created`` ``modified`` ``published`` **Examples:** * ``visit_type = [svn, npm]`` * ``visit_type = [nixguix, "ftp"]`` * ``sort_by = ["last_visit", created]`` * ``sort_by = [visits, modified]`` Date filters ------------ Returns origins having their date type values in the given range * Name: * ``last_visit`` : Latest visit date * ``last_eventful_visit`` : Latest visit date where a new snapshot was detected * ``last_revision`` : Latest commit date * ``last_release`` : Latest release date * ``created`` Creation date * ``modified`` Modification date * ``published`` Published date * Operator: ``<`` ``<=`` ``=`` ``!=`` ``>`` ``>=`` * Value: Date in ``Standard ISO`` format **Note:** The last three date filters are based on metadata that has to be manually entered by the repository authors. So they might not be correct or up-to-date. **Examples:** * ``last_visit > 2001-01-01 and last_visit < 2101-01-01`` * ``last_revision = "2000-01-01 18:35Z"`` * ``last_release != "2021-07-17T18:35:00Z"`` * ``created <= "2021-07-17 18:35"`` Limit filter ------------ Limits the number of results to at most N * Name: ``limit`` * Operator: ``=`` * Value: Positive Integer **Note:** The default value of the limit is 50 **Examples:** * ``limit = 1`` * ``limit = 15`` diff --git a/swh.search.egg-info/PKG-INFO b/swh.search.egg-info/PKG-INFO index b9d6de0..2c855c5 100644 --- a/swh.search.egg-info/PKG-INFO +++ b/swh.search.egg-info/PKG-INFO @@ -1,90 +1,90 @@ Metadata-Version: 2.1 Name: swh.search -Version: 0.13.0 +Version: 0.13.1 Summary: Software Heritage search service Home-page: https://forge.softwareheritage.org/diffusion/DSEA Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-search Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-search/ Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 3 - Alpha Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE License-File: AUTHORS swh-search ========== Search service for the Software Heritage archive. It is similar to swh-storage in what it contains, but provides different ways to query it: while swh-storage is mostly a key-value store that returns an object from a primary key, swh-search is focused on reverse indices, to allow finding objects that match some criteria; for example full-text search. Currently uses ElasticSearch, and provides only origin search (by URL and metadata) ## Dependencies - Python tests for this module include tests that cannot be run without a local ElasticSearch instance, so you need the ElasticSearch server executable on your machine (no need to have a running ElasticSearch server). - Debian-like host The elasticsearch package is required. As it's not part of debian-stable, [another debian repository is required to be configured](https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html#deb-repo) - Non Debian-like host The tests expect: - `/usr/share/elasticsearch/jdk/bin/java` to exist. - `org.elasticsearch.bootstrap.Elasticsearch` to be in java's classpath. - Emscripten is required for generating tree-sitter WASM module. The following commands need to be executed for the setup: ```bash cd /opt && git clone https://github.com/emscripten-core/emsdk.git && cd emsdk && \ ./emsdk install latest && ./emsdk activate latest PATH="${PATH}:/opt/emsdk/upstream/emscripten" ``` **Note:** If emsdk isn't found in the PATH, the tree-sitter cli automatically pulls `emscripten/emsdk` image from docker hub when `make ts-build-wasm` or `make ts-build` is used. ## Make targets Below is the list of available make targets that can be executed from the root directory of swh-search in order to build and/or execute the swh-search under various configurations: * **ts-install**: Install node_modules and emscripten SDK required for TreeSitter * **ts-generate**: Generate parser files(C and JSON) from the grammar * **ts-repl**: Starts a web based playground for the TreeSitter grammar. It's the recommended way for developing TreeSitter grammar. * **ts-dev**: Parse the `query_language/sample_query` and print the corresponding syntax expression along with the start and end positions of all the nodes. * **ts-dev sanitize=1**: Same as **ts-dev** but without start and end position of the nodes. This format is expected by TreeSitter's native test command. `sanitize=1` cleans the output of **ts-dev** using `sed` to achieve the desired format. * **ts-test**: executes TreeSitter's native tests * **ts-build-so**: Generates `swh_ql.so` file from the previously generated parser using py-tree-sitter * **ts-build-so**: Generates `swh_ql.wasm` file from the previously generated parser using emscripten * **ts-build**: Executes both **ts-build-so** and **ts-build-so** diff --git a/swh.search.egg-info/entry_points.txt b/swh.search.egg-info/entry_points.txt index 28b8966..2bca812 100644 --- a/swh.search.egg-info/entry_points.txt +++ b/swh.search.egg-info/entry_points.txt @@ -1,4 +1,2 @@ - - [swh.cli.subcommands] - search=swh.search.cli - \ No newline at end of file +[swh.cli.subcommands] +search = swh.search.cli