diff --git a/PKG-INFO b/PKG-INFO index 2c855c5..04786f3 100644 --- a/PKG-INFO +++ b/PKG-INFO @@ -1,90 +1,90 @@ Metadata-Version: 2.1 Name: swh.search -Version: 0.13.1 +Version: 0.13.2 Summary: Software Heritage search service Home-page: https://forge.softwareheritage.org/diffusion/DSEA Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-search Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-search/ Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 3 - Alpha Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE License-File: AUTHORS swh-search ========== Search service for the Software Heritage archive. It is similar to swh-storage in what it contains, but provides different ways to query it: while swh-storage is mostly a key-value store that returns an object from a primary key, swh-search is focused on reverse indices, to allow finding objects that match some criteria; for example full-text search. Currently uses ElasticSearch, and provides only origin search (by URL and metadata) ## Dependencies - Python tests for this module include tests that cannot be run without a local ElasticSearch instance, so you need the ElasticSearch server executable on your machine (no need to have a running ElasticSearch server). - Debian-like host The elasticsearch package is required. As it's not part of debian-stable, [another debian repository is required to be configured](https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html#deb-repo) - Non Debian-like host The tests expect: - `/usr/share/elasticsearch/jdk/bin/java` to exist. - `org.elasticsearch.bootstrap.Elasticsearch` to be in java's classpath. - Emscripten is required for generating tree-sitter WASM module. The following commands need to be executed for the setup: ```bash cd /opt && git clone https://github.com/emscripten-core/emsdk.git && cd emsdk && \ ./emsdk install latest && ./emsdk activate latest PATH="${PATH}:/opt/emsdk/upstream/emscripten" ``` **Note:** If emsdk isn't found in the PATH, the tree-sitter cli automatically pulls `emscripten/emsdk` image from docker hub when `make ts-build-wasm` or `make ts-build` is used. ## Make targets Below is the list of available make targets that can be executed from the root directory of swh-search in order to build and/or execute the swh-search under various configurations: * **ts-install**: Install node_modules and emscripten SDK required for TreeSitter * **ts-generate**: Generate parser files(C and JSON) from the grammar * **ts-repl**: Starts a web based playground for the TreeSitter grammar. It's the recommended way for developing TreeSitter grammar. * **ts-dev**: Parse the `query_language/sample_query` and print the corresponding syntax expression along with the start and end positions of all the nodes. * **ts-dev sanitize=1**: Same as **ts-dev** but without start and end position of the nodes. This format is expected by TreeSitter's native test command. `sanitize=1` cleans the output of **ts-dev** using `sed` to achieve the desired format. 
* **ts-test**: Executes TreeSitter's native tests * **ts-build-so**: Generates `swh_ql.so` file from the previously generated parser using py-tree-sitter * **ts-build-wasm**: Generates `swh_ql.wasm` file from the previously generated parser using emscripten * **ts-build**: Executes both **ts-build-so** and **ts-build-wasm** diff --git a/pytest.ini b/pytest.ini index b712d00..e7f139e 100644 --- a/pytest.ini +++ b/pytest.ini @@ -1,2 +1,2 @@ [pytest] -norecursedirs = docs .* +norecursedirs = build docs .* diff --git a/swh.search.egg-info/PKG-INFO b/swh.search.egg-info/PKG-INFO index 2c855c5..04786f3 100644 --- a/swh.search.egg-info/PKG-INFO +++ b/swh.search.egg-info/PKG-INFO @@ -1,90 +1,90 @@ Metadata-Version: 2.1 Name: swh.search -Version: 0.13.1 +Version: 0.13.2 Summary: Software Heritage search service Home-page: https://forge.softwareheritage.org/diffusion/DSEA Author: Software Heritage developers Author-email: swh-devel@inria.fr License: UNKNOWN Project-URL: Bug Reports, https://forge.softwareheritage.org/maniphest Project-URL: Funding, https://www.softwareheritage.org/donate Project-URL: Source, https://forge.softwareheritage.org/source/swh-search Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-search/ Platform: UNKNOWN Classifier: Programming Language :: Python :: 3 Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3) Classifier: Operating System :: OS Independent Classifier: Development Status :: 3 - Alpha Requires-Python: >=3.7 Description-Content-Type: text/markdown Provides-Extra: testing License-File: LICENSE License-File: AUTHORS swh-search ========== Search service for the Software Heritage archive. It is similar to swh-storage in what it contains, but provides different ways to query it: while swh-storage is mostly a key-value store that returns an object from a primary key, swh-search is focused on reverse indices, to allow finding objects that match some criteria; for example full-text search. Currently uses ElasticSearch, and provides only origin search (by URL and metadata) ## Dependencies - Python tests for this module include tests that cannot be run without a local ElasticSearch instance, so you need the ElasticSearch server executable on your machine (no need to have a running ElasticSearch server). - Debian-like host The elasticsearch package is required. As it's not part of debian-stable, [another debian repository is required to be configured](https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html#deb-repo) - Non Debian-like host The tests expect: - `/usr/share/elasticsearch/jdk/bin/java` to exist. - `org.elasticsearch.bootstrap.Elasticsearch` to be in java's classpath. - Emscripten is required for generating tree-sitter WASM module. The following commands need to be executed for the setup: ```bash cd /opt && git clone https://github.com/emscripten-core/emsdk.git && cd emsdk && \ ./emsdk install latest && ./emsdk activate latest PATH="${PATH}:/opt/emsdk/upstream/emscripten" ``` **Note:** If emsdk isn't found in the PATH, the tree-sitter cli automatically pulls `emscripten/emsdk` image from docker hub when `make ts-build-wasm` or `make ts-build` is used.
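As a quick illustration of the origin search workflow this package provides, here is a minimal sketch of driving the ElasticSearch backend directly from Python. It assumes a local ElasticSearch node reachable on `localhost:9200`; the index and alias names mirror the package defaults, and the example URLs are made up. The calls used (`get_search`, `initialize`, `origin_update`, `flush`, `origin_search`) are the same ones exercised by the test suite.

```python
from swh.search import get_search

# Backend instantiation; assumes a local ElasticSearch node on port 9200.
# The index/alias names below match the package defaults and are illustrative.
search = get_search(
    "elasticsearch",
    hosts=["localhost:9200"],
    indexes={
        "origin": {
            "index": "origin",
            "read_alias": "origin-read",
            "write_alias": "origin-write",
        }
    },
)
search.initialize()  # create the index, aliases and mappings if missing

# Index (or update) a couple of origins, then refresh so they are searchable.
search.origin_update(
    [
        {"url": "https://github.com/example/project", "has_visits": True},
        {"url": "https://gitlab.com/example/other"},
    ]
)
search.flush()

# Search by URL pattern; each page holds a list of {"url": ...} results.
page = search.origin_search(url_pattern="example", with_visit=True)
print([result["url"] for result in page.results])
```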
## Make targets Below is the list of available make targets that can be executed from the root directory of swh-search in order to build and/or execute the swh-search under various configurations: * **ts-install**: Install node_modules and emscripten SDK required for TreeSitter * **ts-generate**: Generate parser files(C and JSON) from the grammar * **ts-repl**: Starts a web based playground for the TreeSitter grammar. It's the recommended way for developing TreeSitter grammar. * **ts-dev**: Parse the `query_language/sample_query` and print the corresponding syntax expression along with the start and end positions of all the nodes. * **ts-dev sanitize=1**: Same as **ts-dev** but without start and end position of the nodes. This format is expected by TreeSitter's native test command. `sanitize=1` cleans the output of **ts-dev** using `sed` to achieve the desired format. * **ts-test**: Executes TreeSitter's native tests * **ts-build-so**: Generates `swh_ql.so` file from the previously generated parser using py-tree-sitter * **ts-build-wasm**: Generates `swh_ql.wasm` file from the previously generated parser using emscripten * **ts-build**: Executes both **ts-build-so** and **ts-build-wasm** diff --git a/swh/search/api/server.py b/swh/search/api/server.py index 112c903..25d491a 100644 --- a/swh/search/api/server.py +++ b/swh/search/api/server.py @@ -1,100 +1,106 @@ # Copyright (C) 2019-2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import logging import os from typing import Any, Dict from swh.core import config from swh.core.api import RPCServerApp from swh.core.api import encode_data_server as encode_data from swh.core.api import error_handler from swh.search.metrics import timed from .. import get_search +from ..exc import SearchException from ..interface import SearchInterface logger = logging.getLogger(__name__) def _get_search(): global search if not search: search = get_search(**app.config["search"]) return search app = RPCServerApp(__name__, backend_class=SearchInterface, backend_factory=_get_search) search = None +@app.errorhandler(SearchException) +def search_error_handler(exception): + return error_handler(exception, encode_data, status_code=400) + + @app.errorhandler(Exception) def my_error_handler(exception): return error_handler(exception, encode_data) @app.route("/") @timed def index(): return "SWH Search API server" @app.before_first_request def initialized_index(): search = _get_search() logger.info("Initializing indexes with configuration: ", search.origin_config) search.initialize() api_cfg = None def load_and_check_config(config_file: str) -> Dict[str, Any]: """Check the minimal configuration is set to run the api or raise an error explanation. Args: config_file: Path to the configuration file to load type: configuration type. For 'local' type, more checks are done. Raises: Error if the setup is not as expected Returns: configuration as a dict """ if not config_file: raise EnvironmentError("Configuration file must be defined") if not os.path.exists(config_file): raise FileNotFoundError("Configuration file %s does not exist" % (config_file,)) cfg = config.read(config_file) if "search" not in cfg: raise KeyError("Missing 'search' configuration") return cfg def make_app_from_configfile(): """Run the WSGI app from the webserver, loading the configuration from a configuration file.
SWH_CONFIG_FILENAME environment variable defines the configuration path to load. """ global api_cfg if not api_cfg: config_file = os.environ.get("SWH_CONFIG_FILENAME") api_cfg = load_and_check_config(config_file) app.config.update(api_cfg) handler = logging.StreamHandler() app.logger.addHandler(handler) return app diff --git a/swh/search/elasticsearch.py b/swh/search/elasticsearch.py index 5cc0451..e83053e 100644 --- a/swh/search/elasticsearch.py +++ b/swh/search/elasticsearch.py @@ -1,555 +1,554 @@ # Copyright (C) 2019-2021 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import base64 from collections import Counter import logging import pprint from textwrap import dedent from typing import Any, Dict, Iterable, List, Optional from elasticsearch import Elasticsearch, helpers import msgpack from swh.indexer import codemeta from swh.model import model from swh.model.hashutil import hash_to_hex from swh.search.interface import ( SORT_BY_OPTIONS, MinimalOriginDict, OriginDict, PagedResult, ) from swh.search.metrics import send_metric, timed from swh.search.translator import Translator from swh.search.utils import escape, get_expansion, parse_and_format_date logger = logging.getLogger(__name__) INDEX_NAME_PARAM = "index" READ_ALIAS_PARAM = "read_alias" WRITE_ALIAS_PARAM = "write_alias" ORIGIN_DEFAULT_CONFIG = { INDEX_NAME_PARAM: "origin", READ_ALIAS_PARAM: "origin-read", WRITE_ALIAS_PARAM: "origin-write", } +ORIGIN_MAPPING = { + "dynamic_templates": [ + { + "booleans_as_string": { + # All fields stored as string in the metadata + # even the booleans + "match_mapping_type": "boolean", + "path_match": "intrinsic_metadata.*", + "mapping": {"type": "keyword"}, + } + } + ], + "date_detection": False, + "properties": { + # sha1 of the URL; used as the document id + "sha1": {"type": "keyword", "doc_values": True,}, + # Used both to search URLs, and as the result to return + # as a response to queries + "url": { + "type": "text", + # To split URLs into token on any character + # that is not alphanumerical + "analyzer": "simple", + # 2-gram and partial-3-gram search (ie. with the end of the + # third word potentially missing) + "fields": { + "as_you_type": {"type": "search_as_you_type", "analyzer": "simple",} + }, + }, + "visit_types": {"type": "keyword"}, + # used to filter out origins that were never visited + "has_visits": {"type": "boolean",}, + "nb_visits": {"type": "integer"}, + "snapshot_id": {"type": "keyword"}, + "last_visit_date": {"type": "date"}, + "last_eventful_visit_date": {"type": "date"}, + "last_release_date": {"type": "date"}, + "last_revision_date": {"type": "date"}, + "intrinsic_metadata": { + "type": "nested", + "properties": { + "@context": { + # don't bother indexing tokens in these URIs, as the + # are used as namespaces + "type": "keyword", + }, + "http://schema": { + "properties": { + "org/dateCreated": { + "properties": {"@value": {"type": "date",}} + }, + "org/dateModified": { + "properties": {"@value": {"type": "date",}} + }, + "org/datePublished": { + "properties": {"@value": {"type": "date",}} + }, + } + }, + }, + }, + # Has this origin been taken down? 
+ "blocklisted": {"type": "boolean",}, + }, +} + +# painless script that will be executed when updating an origin document +ORIGIN_UPDATE_SCRIPT = dedent( + """ + // utility function to get and parse date + ZonedDateTime getDate(def ctx, String date_field) { + String default_date = "0001-01-01T00:00:00Z"; + String date = ctx._source.getOrDefault(date_field, default_date); + return ZonedDateTime.parse(date); + } + + // backup current visit_types field value + List visit_types = ctx._source.getOrDefault("visit_types", []); + int nb_visits = ctx._source.getOrDefault("nb_visits", 0); + + ZonedDateTime last_visit_date = getDate(ctx, "last_visit_date"); + + String snapshot_id = ctx._source.getOrDefault("snapshot_id", ""); + ZonedDateTime last_eventful_visit_date = + getDate(ctx, "last_eventful_visit_date"); + ZonedDateTime last_revision_date = getDate(ctx, "last_revision_date"); + ZonedDateTime last_release_date = getDate(ctx, "last_release_date"); + + // update origin document with new field values + ctx._source.putAll(params); + + // restore previous visit types after visit_types field overriding + if (ctx._source.containsKey("visit_types")) { + for (int i = 0; i < visit_types.length; ++i) { + if (!ctx._source.visit_types.contains(visit_types[i])) { + ctx._source.visit_types.add(visit_types[i]); + } + } + } + + // Undo overwrite if incoming nb_visits is smaller + if (ctx._source.containsKey("nb_visits")) { + int incoming_nb_visits = ctx._source.getOrDefault("nb_visits", 0); + if(incoming_nb_visits < nb_visits){ + ctx._source.nb_visits = nb_visits; + } + } + + // Undo overwrite if incoming last_visit_date is older + if (ctx._source.containsKey("last_visit_date")) { + ZonedDateTime incoming_last_visit_date = getDate(ctx, "last_visit_date"); + int difference = + // returns -1, 0 or 1 + incoming_last_visit_date.compareTo(last_visit_date); + if(difference < 0){ + ctx._source.last_visit_date = last_visit_date; + } + } + + // Undo update of last_eventful_date and snapshot_id if + // snapshot_id hasn't changed OR incoming_last_eventful_visit_date is older + if (ctx._source.containsKey("snapshot_id")) { + String incoming_snapshot_id = ctx._source.getOrDefault("snapshot_id", ""); + ZonedDateTime incoming_last_eventful_visit_date = + getDate(ctx, "last_eventful_visit_date"); + int difference = + // returns -1, 0 or 1 + incoming_last_eventful_visit_date.compareTo(last_eventful_visit_date); + if(snapshot_id == incoming_snapshot_id || difference < 0){ + ctx._source.snapshot_id = snapshot_id; + ctx._source.last_eventful_visit_date = last_eventful_visit_date; + } + } + + // Undo overwrite if incoming last_revision_date is older + if (ctx._source.containsKey("last_revision_date")) { + ZonedDateTime incoming_last_revision_date = + getDate(ctx, "last_revision_date"); + int difference = + // returns -1, 0 or 1 + incoming_last_revision_date.compareTo(last_revision_date); + if(difference < 0){ + ctx._source.last_revision_date = last_revision_date; + } + } + + // Undo overwrite if incoming last_release_date is older + if (ctx._source.containsKey("last_release_date")) { + ZonedDateTime incoming_last_release_date = + getDate(ctx, "last_release_date"); + // returns -1, 0 or 1 + int difference = incoming_last_release_date.compareTo(last_release_date); + if(difference < 0){ + ctx._source.last_release_date = last_release_date; + } + } + """ +) + def _sanitize_origin(origin): origin = origin.copy() # Whitelist fields to be saved in Elasticsearch res = {"url": origin.pop("url")} for field_name in ( "blocklisted", 
"has_visits", "intrinsic_metadata", "visit_types", "nb_visits", "snapshot_id", "last_visit_date", "last_eventful_visit_date", "last_revision_date", "last_release_date", ): if field_name in origin: res[field_name] = origin.pop(field_name) # Run the JSON-LD expansion algorithm # # to normalize the Codemeta metadata. # This is required as Elasticsearch will needs each field to have a consistent # type across documents to be searchable; and non-expanded JSON-LD documents # can have various types in the same field. For example, all these are # equivalent in JSON-LD: # * {"author": "Jane Doe"} # * {"author": ["Jane Doe"]} # * {"author": {"@value": "Jane Doe"}} # * {"author": [{"@value": "Jane Doe"}]} # and JSON-LD expansion will convert them all to the last one. if "intrinsic_metadata" in res: intrinsic_metadata = res["intrinsic_metadata"] for date_field in ["dateCreated", "dateModified", "datePublished"]: if date_field in intrinsic_metadata: date = intrinsic_metadata[date_field] # If date{Created,Modified,Published} value isn't parsable # It gets rejected and isn't stored (unlike other fields) formatted_date = parse_and_format_date(date) if formatted_date is None: intrinsic_metadata.pop(date_field) else: intrinsic_metadata[date_field] = formatted_date res["intrinsic_metadata"] = codemeta.expand(intrinsic_metadata) return res def token_encode(index_to_tokenize: Dict[bytes, Any]) -> str: """Tokenize as string an index page result from a search""" page_token = base64.b64encode(msgpack.dumps(index_to_tokenize)) return page_token.decode() def token_decode(page_token: str) -> Dict[bytes, Any]: """Read the page_token""" return msgpack.loads(base64.b64decode(page_token.encode()), raw=True) class ElasticSearch: def __init__(self, hosts: List[str], indexes: Dict[str, Dict[str, str]] = {}): self._backend = Elasticsearch(hosts=hosts) self._translator = Translator() # Merge current configuration with default values origin_config = indexes.get("origin", {}) self.origin_config = {**ORIGIN_DEFAULT_CONFIG, **origin_config} def _get_origin_index(self) -> str: return self.origin_config[INDEX_NAME_PARAM] def _get_origin_read_alias(self) -> str: return self.origin_config[READ_ALIAS_PARAM] def _get_origin_write_alias(self) -> str: return self.origin_config[WRITE_ALIAS_PARAM] @timed def check(self): return self._backend.ping() def deinitialize(self) -> None: """Removes all indices from the Elasticsearch backend""" self._backend.indices.delete(index="*") def initialize(self) -> None: """Declare Elasticsearch indices, aliases and mappings""" if not self._backend.indices.exists(index=self._get_origin_index()): self._backend.indices.create(index=self._get_origin_index()) if not self._backend.indices.exists_alias(name=self._get_origin_read_alias()): self._backend.indices.put_alias( index=self._get_origin_index(), name=self._get_origin_read_alias() ) if not self._backend.indices.exists_alias(name=self._get_origin_write_alias()): self._backend.indices.put_alias( index=self._get_origin_index(), name=self._get_origin_write_alias() ) self._backend.indices.put_mapping( - index=self._get_origin_index(), - body={ - "dynamic_templates": [ - { - "booleans_as_string": { - # All fields stored as string in the metadata - # even the booleans - "match_mapping_type": "boolean", - "path_match": "intrinsic_metadata.*", - "mapping": {"type": "keyword"}, - } - } - ], - "date_detection": False, - "properties": { - # sha1 of the URL; used as the document id - "sha1": {"type": "keyword", "doc_values": True,}, - # Used both to search URLs, and 
as the result to return - # as a response to queries - "url": { - "type": "text", - # To split URLs into token on any character - # that is not alphanumerical - "analyzer": "simple", - # 2-gram and partial-3-gram search (ie. with the end of the - # third word potentially missing) - "fields": { - "as_you_type": { - "type": "search_as_you_type", - "analyzer": "simple", - } - }, - }, - "visit_types": {"type": "keyword"}, - # used to filter out origins that were never visited - "has_visits": {"type": "boolean",}, - "nb_visits": {"type": "integer"}, - "snapshot_id": {"type": "keyword"}, - "last_visit_date": {"type": "date"}, - "last_eventful_visit_date": {"type": "date"}, - "last_release_date": {"type": "date"}, - "last_revision_date": {"type": "date"}, - "intrinsic_metadata": { - "type": "nested", - "properties": { - "@context": { - # don't bother indexing tokens in these URIs, as the - # are used as namespaces - "type": "keyword", - }, - "http://schema": { - "properties": { - "org/dateCreated": { - "properties": {"@value": {"type": "date",}} - }, - "org/dateModified": { - "properties": {"@value": {"type": "date",}} - }, - "org/datePublished": { - "properties": {"@value": {"type": "date",}} - }, - } - }, - }, - }, - # Has this origin been taken down? - "blocklisted": {"type": "boolean",}, - }, - }, + index=self._get_origin_index(), body=ORIGIN_MAPPING ) @timed def flush(self) -> None: self._backend.indices.refresh(index=self._get_origin_write_alias()) @timed def origin_update(self, documents: Iterable[OriginDict]) -> None: write_index = self._get_origin_write_alias() documents = map(_sanitize_origin, documents) documents_with_sha1 = ( (hash_to_hex(model.Origin(url=document["url"]).id), document) for document in documents ) - # painless script that will be executed when updating an origin document - update_script = dedent( - """ - // utility function to get and parse date - ZonedDateTime getDate(def ctx, String date_field) { - String default_date = "0001-01-01T00:00:00Z"; - String date = ctx._source.getOrDefault(date_field, default_date); - return ZonedDateTime.parse(date); - } - - // backup current visit_types field value - List visit_types = ctx._source.getOrDefault("visit_types", []); - int nb_visits = ctx._source.getOrDefault("nb_visits", 0); - - ZonedDateTime last_visit_date = getDate(ctx, "last_visit_date"); - - String snapshot_id = ctx._source.getOrDefault("snapshot_id", ""); - ZonedDateTime last_eventful_visit_date = - getDate(ctx, "last_eventful_visit_date"); - ZonedDateTime last_revision_date = getDate(ctx, "last_revision_date"); - ZonedDateTime last_release_date = getDate(ctx, "last_release_date"); - - // update origin document with new field values - ctx._source.putAll(params); - - // restore previous visit types after visit_types field overriding - if (ctx._source.containsKey("visit_types")) { - for (int i = 0; i < visit_types.length; ++i) { - if (!ctx._source.visit_types.contains(visit_types[i])) { - ctx._source.visit_types.add(visit_types[i]); - } - } - } - - // Undo overwrite if incoming nb_visits is smaller - if (ctx._source.containsKey("nb_visits")) { - int incoming_nb_visits = ctx._source.getOrDefault("nb_visits", 0); - if(incoming_nb_visits < nb_visits){ - ctx._source.nb_visits = nb_visits; - } - } - - // Undo overwrite if incoming last_visit_date is older - if (ctx._source.containsKey("last_visit_date")) { - ZonedDateTime incoming_last_visit_date = getDate(ctx, "last_visit_date"); - int difference = - // returns -1, 0 or 1 - 
incoming_last_visit_date.compareTo(last_visit_date); - if(difference < 0){ - ctx._source.last_visit_date = last_visit_date; - } - } - - // Undo update of last_eventful_date and snapshot_id if - // snapshot_id hasn't changed OR incoming_last_eventful_visit_date is older - if (ctx._source.containsKey("snapshot_id")) { - String incoming_snapshot_id = ctx._source.getOrDefault("snapshot_id", ""); - ZonedDateTime incoming_last_eventful_visit_date = - getDate(ctx, "last_eventful_visit_date"); - int difference = - // returns -1, 0 or 1 - incoming_last_eventful_visit_date.compareTo(last_eventful_visit_date); - if(snapshot_id == incoming_snapshot_id || difference < 0){ - ctx._source.snapshot_id = snapshot_id; - ctx._source.last_eventful_visit_date = last_eventful_visit_date; - } - } - - // Undo overwrite if incoming last_revision_date is older - if (ctx._source.containsKey("last_revision_date")) { - ZonedDateTime incoming_last_revision_date = - getDate(ctx, "last_revision_date"); - int difference = - // returns -1, 0 or 1 - incoming_last_revision_date.compareTo(last_revision_date); - if(difference < 0){ - ctx._source.last_revision_date = last_revision_date; - } - } - - // Undo overwrite if incoming last_release_date is older - if (ctx._source.containsKey("last_release_date")) { - ZonedDateTime incoming_last_release_date = - getDate(ctx, "last_release_date"); - // returns -1, 0 or 1 - int difference = incoming_last_release_date.compareTo(last_release_date); - if(difference < 0){ - ctx._source.last_release_date = last_release_date; - } - } - """ # noqa - ) actions = [ { "_op_type": "update", "_id": sha1, "_index": write_index, "scripted_upsert": True, "upsert": {**document, "sha1": sha1,}, "retry_on_conflict": 10, "script": { - "source": update_script, + "source": ORIGIN_UPDATE_SCRIPT, "lang": "painless", "params": document, }, } for (sha1, document) in documents_with_sha1 ] indexed_count, errors = helpers.bulk(self._backend, actions, index=write_index) assert isinstance(errors, List) # Make mypy happy send_metric("document:index", count=indexed_count, method_name="origin_update") send_metric( "document:index_error", count=len(errors), method_name="origin_update" ) @timed def origin_search( self, *, query: str = "", url_pattern: Optional[str] = None, metadata_pattern: Optional[str] = None, with_visit: bool = False, visit_types: Optional[List[str]] = None, min_nb_visits: int = 0, min_last_visit_date: str = "", min_last_eventful_visit_date: str = "", min_last_revision_date: str = "", min_last_release_date: str = "", min_date_created: str = "", min_date_modified: str = "", min_date_published: str = "", programming_languages: Optional[List[str]] = None, licenses: Optional[List[str]] = None, keywords: Optional[List[str]] = None, sort_by: Optional[List[str]] = None, page_token: Optional[str] = None, limit: int = 50, ) -> PagedResult[MinimalOriginDict]: query_clauses: List[Dict[str, Any]] = [] query_filters = [] if url_pattern: query_filters.append(f"origin : {escape(url_pattern)}") if metadata_pattern: query_filters.append(f"metadata : {escape(metadata_pattern)}") # if not query_clauses: # raise ValueError( # "At least one of url_pattern and metadata_pattern must be provided." 
# ) if with_visit: query_filters.append(f"visited = {'true' if with_visit else 'false'}") if min_nb_visits: query_filters.append(f"visits >= {min_nb_visits}") if min_last_visit_date: query_filters.append( f"last_visit >= {min_last_visit_date.replace('Z', '+00:00')}" ) if min_last_eventful_visit_date: query_filters.append( "last_eventful_visit >= " f"{min_last_eventful_visit_date.replace('Z', '+00:00')}" ) if min_last_revision_date: query_filters.append( f"last_revision >= {min_last_revision_date.replace('Z', '+00:00')}" ) if min_last_release_date: query_filters.append( f"last_release >= {min_last_release_date.replace('Z', '+00:00')}" ) if keywords: query_filters.append(f"keyword in {escape(keywords)}") if licenses: query_filters.append(f"license in {escape(licenses)}") if programming_languages: query_filters.append(f"language in {escape(programming_languages)}") if min_date_created: query_filters.append( f"created >= {min_date_created.replace('Z', '+00:00')}" ) if min_date_modified: query_filters.append( f"modified >= {min_date_modified.replace('Z', '+00:00')}" ) if min_date_published: query_filters.append( f"published >= {min_date_published.replace('Z', '+00:00')}" ) if visit_types is not None: query_filters.append(f"visit_type = {escape(visit_types)}") combined_filters = " and ".join(query_filters) if combined_filters and query: query = f"{combined_filters} and {query}" else: query = combined_filters or query parsed_query = self._translator.parse_query(query) query_clauses.append(parsed_query["filters"]) field_map = { "visits": "nb_visits", "last_visit": "last_visit_date", "last_eventful_visit": "last_eventful_visit_date", "last_revision": "last_revision_date", "last_release": "last_release_date", "created": "date_created", "modified": "date_modified", "published": "date_published", } if "sortBy" in parsed_query: if sort_by is None: sort_by = [] for sort_by_option in parsed_query["sortBy"]: if sort_by_option[0] == "-": sort_by.append("-" + field_map[sort_by_option[1:]]) else: sort_by.append(field_map[sort_by_option]) if parsed_query.get("limit", 0): limit = parsed_query["limit"] sorting_params: List[Dict[str, Any]] = [] if sort_by: for field in sort_by: order = "asc" if field and field[0] == "-": field = field[1:] order = "desc" if field in ["date_created", "date_modified", "date_published"]: sorting_params.append( { get_expansion(field, "."): { "nested_path": "intrinsic_metadata", "order": order, } } ) elif field in SORT_BY_OPTIONS: sorting_params.append({field: order}) sorting_params.extend( [{"_score": "desc"}, {"sha1": "asc"},] ) body = { "query": { "bool": { "must": query_clauses, "must_not": [{"term": {"blocklisted": True}}], } }, "sort": sorting_params, } if page_token: # TODO: use ElasticSearch's scroll API? 
page_token_content = token_decode(page_token) body["search_after"] = [ page_token_content[b"score"], page_token_content[b"sha1"].decode("ascii"), ] if logger.isEnabledFor(logging.DEBUG): formatted_body = pprint.pformat(body) logger.debug("Search query body: %s", formatted_body) res = self._backend.search( index=self._get_origin_read_alias(), body=body, size=limit ) hits = res["hits"]["hits"] next_page_token: Optional[str] = None if len(hits) == limit: # There are more results after this page; return a pagination token # to get them in a future query last_hit = hits[-1] next_page_token_content = { b"score": last_hit["_score"], b"sha1": last_hit["_source"]["sha1"], } next_page_token = token_encode(next_page_token_content) assert len(hits) <= limit return PagedResult( results=[{"url": hit["_source"]["url"]} for hit in hits], next_page_token=next_page_token, ) def visit_types_count(self) -> Counter: body = { "aggs": { "not_blocklisted": { "filter": {"bool": {"must_not": [{"term": {"blocklisted": True}}]}}, "aggs": { "visit_types": {"terms": {"field": "visit_types", "size": 1000}} }, } } } res = self._backend.search( index=self._get_origin_read_alias(), body=body, size=0 ) buckets = ( res.get("aggregations", {}) .get("not_blocklisted", {}) .get("visit_types", {}) .get("buckets", []) ) return Counter({bucket["key"]: bucket["doc_count"] for bucket in buckets}) diff --git a/swh/search/tests/test_api_client.py b/swh/search/tests/test_api_client.py index b33fe0e..c8cf385 100644 --- a/swh/search/tests/test_api_client.py +++ b/swh/search/tests/test_api_client.py @@ -1,62 +1,64 @@ # Copyright (C) 2019-2020 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See top-level LICENSE file for more information import unittest import pytest from swh.core.api.tests.server_testing import ServerTestFixture from swh.search import get_search from swh.search.api.server import app -from .test_search import CommonSearchTest +from .test_elasticsearch import CommonElasticsearchSearchTest -class TestRemoteSearch(CommonSearchTest, ServerTestFixture, unittest.TestCase): +class TestRemoteSearch( + CommonElasticsearchSearchTest, ServerTestFixture, unittest.TestCase +): @pytest.fixture(autouse=True) def _instantiate_search(self, elasticsearch_host): self._elasticsearch_host = elasticsearch_host def setUp(self): self.config = { "search": { "cls": "elasticsearch", "args": { "hosts": [self._elasticsearch_host], "indexes": { "origin": { "index": "test", "read_alias": "test-read", "write_alias": "test-write", } }, }, } } self.app = app super().setUp() self.reset() self.search = get_search("remote", url=self.url(),) def reset(self): search = get_search( "elasticsearch", hosts=[self._elasticsearch_host], indexes={ "origin": { "index": "test", "read_alias": "test-read", "write_alias": "test-write", } }, ) search.deinitialize() search.initialize() @pytest.mark.skip( "Elasticsearch also returns close matches, so this test would fail" ) def test_origin_url_paging(self, count): pass diff --git a/swh/search/tests/test_elasticsearch.py b/swh/search/tests/test_elasticsearch.py index 7e460d5..147317d 100644 --- a/swh/search/tests/test_elasticsearch.py +++ b/swh/search/tests/test_elasticsearch.py @@ -1,278 +1,281 @@ # Copyright (C) 2019-2022 The Software Heritage developers # See the AUTHORS file at the top-level directory of this distribution # License: GNU General Public License version 3, or any later version # See 
top-level LICENSE file for more information from datetime import datetime, timedelta, timezone from textwrap import dedent import types import unittest from elasticsearch.helpers.errors import BulkIndexError import pytest from swh.search.exc import SearchQuerySyntaxError from swh.search.metrics import OPERATIONS_METRIC from .test_search import CommonSearchTest now = datetime.now(tz=timezone.utc).isoformat() now_minus_5_days = (datetime.now(tz=timezone.utc) - timedelta(days=5)).isoformat() now_plus_5_days = (datetime.now(tz=timezone.utc) + timedelta(days=5)).isoformat() ORIGINS = [ { "url": "http://foobar.1.com", "nb_visits": 1, "last_visit_date": now_minus_5_days, "last_eventful_visit_date": now_minus_5_days, }, { "url": "http://foobar.2.com", "nb_visits": 2, "last_visit_date": now, "last_eventful_visit_date": now, }, { "url": "http://foobar.3.com", "nb_visits": 3, "last_visit_date": now_plus_5_days, "last_eventful_visit_date": now_minus_5_days, }, { "url": "http://barbaz.4.com", "nb_visits": 3, "last_visit_date": now_plus_5_days, "last_eventful_visit_date": now_minus_5_days, }, ] -class BaseElasticsearchTest(unittest.TestCase): - @pytest.fixture(autouse=True) - def _instantiate_search(self, swh_search, elasticsearch_host, mocker): - self._elasticsearch_host = elasticsearch_host - self.search = swh_search - self.mocker = mocker - - # override self.search.origin_update to catch painless script errors - # and pretty print them - origin_update = self.search.origin_update - - def _origin_update(self, *args, **kwargs): - script_error = False - error_detail = "" - try: - origin_update(*args, **kwargs) - except BulkIndexError as e: - error = e.errors[0].get("update", {}).get("error", {}).get("caused_by") - if error and "script_stack" in error: - script_error = True - error_detail = dedent( - f""" - Painless update script failed ({error.get('reason')}). 
- error type: {error.get('caused_by', {}).get('type')} - error reason: {error.get('caused_by', {}).get('reason')} - script stack: - - """ - ) - error_detail += "\n".join(error["script_stack"]) - else: - raise e - assert script_error is False, error_detail[1:] - - self.search.origin_update = types.MethodType(_origin_update, self.search) - - def reset(self): - self.search.deinitialize() - self.search.initialize() - - -class TestElasticsearchSearch(CommonSearchTest, BaseElasticsearchTest): - def test_metrics_update_duration(self): - mock = self.mocker.patch("swh.search.metrics.statsd.timing") - - for url in ["http://foobar.bar", "http://foobar.baz"]: - self.search.origin_update([{"url": url}]) - - assert mock.call_count == 2 - - def test_metrics_search_duration(self): - mock = self.mocker.patch("swh.search.metrics.statsd.timing") - - for url_pattern in ["foobar", "foobaz"]: - self.search.origin_search(url_pattern=url_pattern, with_visit=True) - - assert mock.call_count == 2 - - def test_metrics_indexation_counters(self): - mock_es = self.mocker.patch("elasticsearch.helpers.bulk") - mock_es.return_value = 2, ["error"] - - mock_metrics = self.mocker.patch("swh.search.metrics.statsd.increment") - - self.search.origin_update([{"url": "http://foobar.baz"}]) - - assert mock_metrics.call_count == 2 - - mock_metrics.assert_any_call( - OPERATIONS_METRIC, - 2, - tags={ - "endpoint": "origin_update", - "object_type": "document", - "operation": "index", - }, - ) - mock_metrics.assert_any_call( - OPERATIONS_METRIC, - 1, - tags={ - "endpoint": "origin_update", - "object_type": "document", - "operation": "index_error", - }, - ) - - def test_write_alias_usage(self): - mock = self.mocker.patch("elasticsearch.helpers.bulk") - mock.return_value = 2, ["result"] - - self.search.origin_update([{"url": "http://foobar.baz"}]) - - assert mock.call_args[1]["index"] == "test-write" - - def test_read_alias_usage(self): - mock = self.mocker.patch("elasticsearch.Elasticsearch.search") - mock.return_value = {"hits": {"hits": []}} - - self.search.origin_search(url_pattern="foobar.baz") - - assert mock.call_args[1]["index"] == "test-read" +class CommonElasticsearchSearchTest(CommonSearchTest): + """Tests shared between this module (direct ES backend test) and test_api_client.py + (ES backend via HTTP test)""" def test_sort_by_and_limit_query(self): self.search.origin_update(ORIGINS) self.search.flush() def _check_results(query, origin_indices): page = self.search.origin_search(url_pattern="foobar", query=query) results = [r["url"] for r in page.results] assert results == [ORIGINS[index]["url"] for index in origin_indices] _check_results("sort_by = [-visits]", [2, 1, 0]) _check_results("sort_by = [last_visit]", [0, 1, 2]) _check_results("sort_by = [-last_eventful_visit, visits]", [1, 0, 2]) _check_results("sort_by = [last_eventful_visit,-last_visit]", [2, 0, 1]) _check_results("sort_by = [-visits] limit = 1", [2]) _check_results("sort_by = [last_visit] and limit = 2", [0, 1]) _check_results("sort_by = [-last_eventful_visit, visits] limit = 3", [1, 0, 2]) def test_search_ql_simple(self): self.search.origin_update(ORIGINS) self.search.flush() results = { r["url"] for r in self.search.origin_search(query='origin : "foobar"').results } assert results == { "http://foobar.1.com", "http://foobar.2.com", "http://foobar.3.com", } def test_search_ql_datetimes(self): self.search.origin_update(ORIGINS) self.search.flush() now_minus_5_minutes = ( datetime.now(tz=timezone.utc) - timedelta(minutes=5) ).isoformat() now_plus_5_minutes = ( 
datetime.now(tz=timezone.utc) + timedelta(minutes=5) ).isoformat() results = { r["url"] for r in self.search.origin_search( query=( f"last_visit < {now_minus_5_minutes} " f"or last_visit > {now_plus_5_minutes}" ) ).results } assert results == { "http://foobar.1.com", "http://foobar.3.com", "http://barbaz.4.com", } def test_search_ql_dates(self): self.search.origin_update(ORIGINS) self.search.flush() now_minus_2_days = ( (datetime.now(tz=timezone.utc) - timedelta(days=2)).date().isoformat() ) now_plus_2_days = ( (datetime.now(tz=timezone.utc) + timedelta(days=2)).date().isoformat() ) results = { r["url"] for r in self.search.origin_search( query=( f"last_visit < {now_minus_2_days} " f"or last_visit > {now_plus_2_days}" ) ).results } assert results == { "http://foobar.1.com", "http://foobar.3.com", "http://barbaz.4.com", } def test_search_ql_visited(self): self.search.origin_update( [ { "url": "http://foobar.1.com", "has_visits": True, "nb_visits": 1, "last_visit_date": now_minus_5_days, "last_eventful_visit_date": now_minus_5_days, }, {"url": "http://foobar.2.com",}, {"url": "http://foobar.3.com", "has_visits": False,}, ] ) self.search.flush() assert { r["url"] for r in self.search.origin_search(query="visited = true").results } == {"http://foobar.1.com"} assert { r["url"] for r in self.search.origin_search(query="visited = false").results } == {"http://foobar.2.com", "http://foobar.3.com"} assert ( self.search.origin_search( query="visited = true and visited = false" ).results == [] ) assert ( self.search.origin_search(query="visited = false", with_visit=True).results == [] ) def test_query_syntax_error(self): self.search.origin_update(ORIGINS) self.search.flush() with pytest.raises(SearchQuerySyntaxError): self.search.origin_search(query="foobar") + + +class TestElasticsearchSearch(CommonElasticsearchSearchTest, unittest.TestCase): + @pytest.fixture(autouse=True) + def _instantiate_search(self, swh_search, elasticsearch_host, mocker): + self._elasticsearch_host = elasticsearch_host + self.search = swh_search + self.mocker = mocker + + # override self.search.origin_update to catch painless script errors + # and pretty print them + origin_update = self.search.origin_update + + def _origin_update(self, *args, **kwargs): + script_error = False + error_detail = "" + try: + origin_update(*args, **kwargs) + except BulkIndexError as e: + error = e.errors[0].get("update", {}).get("error", {}).get("caused_by") + if error and "script_stack" in error: + script_error = True + error_detail = dedent( + f""" + Painless update script failed ({error.get('reason')}). 
+ error type: {error.get('caused_by', {}).get('type')} + error reason: {error.get('caused_by', {}).get('reason')} + script stack: + + """ + ) + error_detail += "\n".join(error["script_stack"]) + else: + raise e + assert script_error is False, error_detail[1:] + + self.search.origin_update = types.MethodType(_origin_update, self.search) + + def reset(self): + self.search.deinitialize() + self.search.initialize() + + def test_metrics_update_duration(self): + mock = self.mocker.patch("swh.search.metrics.statsd.timing") + + for url in ["http://foobar.bar", "http://foobar.baz"]: + self.search.origin_update([{"url": url}]) + + assert mock.call_count == 2 + + def test_metrics_search_duration(self): + mock = self.mocker.patch("swh.search.metrics.statsd.timing") + + for url_pattern in ["foobar", "foobaz"]: + self.search.origin_search(url_pattern=url_pattern, with_visit=True) + + assert mock.call_count == 2 + + def test_metrics_indexation_counters(self): + mock_es = self.mocker.patch("elasticsearch.helpers.bulk") + mock_es.return_value = 2, ["error"] + + mock_metrics = self.mocker.patch("swh.search.metrics.statsd.increment") + + self.search.origin_update([{"url": "http://foobar.baz"}]) + + assert mock_metrics.call_count == 2 + + mock_metrics.assert_any_call( + OPERATIONS_METRIC, + 2, + tags={ + "endpoint": "origin_update", + "object_type": "document", + "operation": "index", + }, + ) + mock_metrics.assert_any_call( + OPERATIONS_METRIC, + 1, + tags={ + "endpoint": "origin_update", + "object_type": "document", + "operation": "index_error", + }, + ) + + def test_write_alias_usage(self): + mock = self.mocker.patch("elasticsearch.helpers.bulk") + mock.return_value = 2, ["result"] + + self.search.origin_update([{"url": "http://foobar.baz"}]) + + assert mock.call_args[1]["index"] == "test-write" + + def test_read_alias_usage(self): + mock = self.mocker.patch("elasticsearch.Elasticsearch.search") + mock.return_value = {"hits": {"hits": []}} + + self.search.origin_search(url_pattern="foobar.baz") + + assert mock.call_args[1]["index"] == "test-read"
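Two things are worth noting about this refactoring as a whole. The shared behavioural tests now live in `CommonElasticsearchSearchTest`, so they run both against the backend directly (`TestElasticsearchSearch`) and through the RPC API (`TestRemoteSearch`). And thanks to the new `SearchException` handler registered in `swh/search/api/server.py`, a malformed query reaches remote clients as the original exception (served as an HTTP 400) rather than a generic server error, which is what lets `test_query_syntax_error` pass over HTTP. A minimal client-side sketch, assuming a swh-search RPC server is already running at a hypothetical `http://localhost:5010/`:

```python
from swh.search import get_search
from swh.search.exc import SearchQuerySyntaxError

# Hypothetical URL of a running swh-search RPC server.
search = get_search("remote", url="http://localhost:5010/")

try:
    # "foobar" on its own is not valid query-language syntax (the same input
    # used by test_query_syntax_error), so the server responds with a 400 and
    # the RPC client re-raises it as SearchQuerySyntaxError.
    search.origin_search(query="foobar")
except SearchQuerySyntaxError as exc:
    print("invalid search query:", exc)
```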