
Deploy swh.search v0.10/v0.11
Closed, Resolved · Public

Description

  • First in staging, then in production.
  • Upgrade from v0.9 to v0.10.

For v0.10, as the schema was updated, a new index needs to be created and then backfilled to populate it correctly.

Rough plan to install and backfill the new index:

  • Redo the tagging to v0.10.0 (it is currently tagged 0.10.0 [2])
  • stop puppet on nodes running the journal clients and swh-search
  • stop the objects and metadata journal clients so they stop populating the future "old" index
  • upgrade the debian packages
  • restart swh-search to declare the new mappings in the old index [1]
  • restart puppet
  • manually launch a journal client configured to index into an origin-v0.10 index
  • reset the offsets of the journal clients' consumer group on the origin_visit_status topics
  • wait for the end of the reindexation (journal client: no more lag)
  • upgrade the new swh-search and journal client configurations in puppet to use the new index (done for webapp1)

[1] It is actually not certain that this is the way to do it in our case; we may
have to do it ourselves manually in another way.

[2] That does not agree with the packaging build
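The offset reset step in the plan can be done with Kafka's stock tooling. A hedged sketch (the broker, group, and topic names below are assumptions to be checked against the actual deployment; the command is printed for review rather than executed):

```shell
#!/bin/bash
# Assumed names, not verified production values.
BROKER=journal0.internal.staging.swh.network:9092
GROUP=swh.search.journal_client-v0.10.0
TOPIC=swh.journal.objects.origin_visit_status

# Rewinding the consumer group to the earliest offset makes the backfill
# re-read the whole topic into the new index.
CMD="kafka-consumer-groups.sh --bootstrap-server $BROKER --group $GROUP --topic $TOPIC --reset-offsets --to-earliest --execute"

# Printed instead of run, so it can be reviewed first.
echo "$CMD"
```

Alternatively, switching the journal clients to a fresh group_id (as was later done with the version-suffixed group ids) achieves the same "start from the beginning" effect without touching any offsets.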

Event Timeline

ardumont triaged this task as Normal priority. (Edited) Jul 19 2021, 2:52 PM
ardumont created this task.

Related to T3083 T3398 T3391

We've done the following:

  • tag v0.10.0 instead of 0.10.0; wait for the package build

on staging search0:

  • disable puppet
  • upgrade Debian packages
  • disable swh.search journal clients
  • reboot (to perform the pending kernel + systemd update)
  • noticed the new storage entry needed in the config; updated the config and restarted the backend, which created the new origin-v0.10.0 index with the new mapping
  • updated the journal clients configs to use the RPC backend instead of direct elasticsearch access
  • updated the write index with the following script
#!/bin/bash

ES_SERVER=search-esnode0:9200
OLD_INDEX=origin-v0.9.0
INDEX=origin-v0.10.0

curl -XPOST -H 'Content-Type: application/json' http://$ES_SERVER/_aliases -d '
 {
   "actions" : [
     { "remove" : { "index" : "'$OLD_INDEX'", "alias" : "origin-write" } },
     { "add" : { "index" : "'$INDEX'", "alias" : "origin-write" } }
   ]
 }'

TMP_CONFIG=$(mktemp)
cat >$TMP_CONFIG <<EOF
{
   "index" : {
      "translog.sync_interval" : "60s",
      "translog.durability": "async",
      "refresh_interval": "60s"
   }
 }
EOF
curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @$TMP_CONFIG
rm $TMP_CONFIG
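Side note on the alias swap in the script above: because the remove and the add ride in a single _aliases request, the switch is atomic, so origin-write never dangles between indexes. A tiny helper to build that payload (a sketch; the function name is ours):

```shell
#!/bin/bash
# Build the _aliases payload: removing the alias from the old index and adding
# it to the new one in one request makes the switch atomic.
make_alias_swap() {
  local old_index=$1 new_index=$2 alias=$3
  printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}\n' \
    "$old_index" "$alias" "$new_index" "$alias"
}

PAYLOAD=$(make_alias_swap origin-v0.9.0 origin-v0.10.0 origin-write)
echo "$PAYLOAD"
```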

We've then restarted the journal clients to fill the new index.

We had to fix a few "real world data" issues for this to actually work; the fixes have now landed (D6011, D6012, D6014). (We deployed the journal_client.py file directly on search0 to do so.)

After spawning 7 more journal clients, and waiting for grafana/prometheus data to settle, the filling of the index was expected to take around 3 weeks (compared to a few hours for the processing done when deploying 0.9.0).

We've pinpointed the issue to the fetching of snapshots (and associated revisions/releases), for computing the latest_release/revision_date fields. These storage operations take substantially more time than just pushing lightly doctored origin_visit_status dicts (directly pulled from the journal) to elasticsearch.

For now, we've dropped these fields from the working copy of journal_client.py on search0.internal.staging, so that the processing can complete and search functionality can be restored.

We will need to discuss a more targeted plan to pull this data, one which doesn't involve "pulling all snapshots, and associated objects, from the archive one by one", which is what processing the whole origin_visit_status topic eventually amounts to. This approach would not have worked very well for staging, and it definitely won't work with the production data (we have 1.5 billion visit statuses!).

To solve that problem, we have a few ideas:

  • making the snapshot fetch conditional, so the complete reindexing can avoid it
  • enabling the snapshot fetch again once the initial reindex is done (and making sure that the journal client can actually keep up)
  • and, eventually, we will need to start writing actual targeted index migration scripts, rather than let the journal client start again from scratch every time we change a mapping or add a new field.

Writing index migration scripts will become more critical, specifically if we generalize the use of data that can't be pulled directly from journal events in swh.search: these extra data fetches can generate substantial load, and a lot of them are useless as we could just as well pull the data for the latest known snapshot for each origin (instead of *every* known snapshot for each origin). But even when only changing a data type in the mapping, doing a bulk reindex from within elasticsearch would be much more efficient than reindexing all statuses for all visits of all origins.
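For the last point, when only a mapping changes, the in-cluster bulk reindex could go through Elasticsearch's _reindex API rather than a journal replay. A hedged sketch (the index names are examples; the command is printed for review, not executed):

```shell
#!/bin/bash
ES_SERVER=search-esnode0:9200
SRC=origin-v0.10.0
DST=origin-v0.11

# Server-side copy from SRC to DST; wait_for_completion=false runs it as a
# background task inside the cluster instead of holding the HTTP connection.
BODY='{"source": {"index": "'$SRC'"}, "dest": {"index": "'$DST'"}}'
CMD="curl -s -XPOST -H 'Content-Type: application/json' 'http://$ES_SERVER/_reindex?wait_for_completion=false' -d '$BODY'"
echo "$CMD"
```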

olasd changed the task status from Open to Work in Progress.Jul 21 2021, 5:28 PM
ardumont renamed this task from Deploy swh.search v0.10 to Deploy swh.search v0.10 on staging.Jul 21 2021, 5:31 PM

Another idea: move this fetching to a new indexer, and make it write to a new topic, which the swh-search journal client can read from.

Pros:

  • We already use this architecture with metadata
  • We can keep using replays for migrations

Cons:

  • One more moving part in the search pipeline.

By the way, here is a small list of caveats encountered when deploying search (not to fix
immediately, just mentioning them).

  • to run the service: the "swhstorage" user could/should be "swhsearch", or we could even use the systemd dynamic user mechanism
  • cli: fix the journal_process_objects docstring, which was not updated accordingly
  • avoid index initialization on read-only swh-search clients (e.g. a configuration option to inhibit this behavior)
  • the bugs encountered and fixed were due to the swh.search implementation being too optimistic about snapshots being complete (a snapshot can be empty)

A new swh.search v0.11 got tagged (this deactivates the current blocking point).
That is a workaround, though. I've opened a task [1] to avoid forgetting about the
conclusion of the discussion started above.

In the meantime, we can deploy v0.11 and realign both staging and production on it.

[1] T3479

ardumont renamed this task from Deploy swh.search v0.10 on staging to Deploy swh.search ~~v0.10~~ v0.11 on staging.Aug 11 2021, 11:24 AM
ardumont renamed this task from Deploy swh.search ~~v0.10~~ v0.11 on staging to Deploy swh.search v0.10/v0.11 on staging.

Deployment of version v0.11.4 in staging:
On search0:

  • puppet stopped
  • stop and disable the journal clients and search backend
  • update the swh-search configuration to use an origin-v0.11 index
root@search0:/etc/softwareheritage/search# diff -U2 /tmp/server.yml server.yml 
--- /tmp/server.yml	2021-09-01 13:42:29.347951302 +0000
+++ server.yml	2021-09-01 13:42:35.739953523 +0000
@@ -7,5 +7,5 @@
   indexes:
     origin:
-      index: origin-v0.10.0
+      index: origin-v0.11
       read_alias: origin-read
       write_alias: origin-write
  • update the journal-clients to use a group id swh.search.journal_client.[indexed|object]-v0.11
root@search0:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_objects.yml journal_client_objects.yml 
--- /tmp/journal_client_objects.yml	2021-09-01 13:44:49.843999978 +0000
+++ journal_client_objects.yml	2021-09-01 13:45:03.972004852 +0000
@@ -5,7 +5,7 @@
 journal:
   brokers:
   - journal0.internal.staging.swh.network
-  group_id: swh.search.journal_client-v0.10.0
+  group_id: swh.search.journal_client-v0.11
   prefix: swh.journal.objects
   object_types:
   - origin
root@search0:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_indexed.yml journal_client_indexed.yml 
--- /tmp/journal_client_indexed.yml	2021-09-01 13:44:44.847998252 +0000
+++ journal_client_indexed.yml	2021-09-01 13:44:57.020002454 +0000
@@ -5,7 +5,7 @@
 journal:
   brokers:
   - journal0.internal.staging.swh.network
-  group_id: swh.search.journal_client.indexed-v0.10.0
+  group_id: swh.search.journal_client.indexed-v0.11
   prefix: swh.journal.indexed
   object_types:
   - origin_intrinsic_metadata
  • perform a system upgrade; a reboot was not required
  • enable and start swh-search backend
  • An error occurs after the restart:
Sep 01 14:19:12 search0 python3[4066688]: 2021-09-01 14:19:12 [4066688] root:ERROR command 'cc' failed with exit status 1
                                          Traceback (most recent call last):
                                            File "/usr/lib/python3.7/distutils/unixccompiler.py", line 118, in _compile
                                              extra_postargs)
                                            File "/usr/lib/python3.7/distutils/ccompiler.py", line 909, in spawn
                                              spawn(cmd, dry_run=self.dry_run)
                                            File "/usr/lib/python3.7/distutils/spawn.py", line 36, in spawn
                                              _spawn_posix(cmd, search_path, dry_run=dry_run)
                                            File "/usr/lib/python3.7/distutils/spawn.py", line 159, in _spawn_posix
                                              % (cmd, exit_status))
                                          distutils.errors.DistutilsExecError: command 'cc' failed with exit status 1
                                          
                                          During handling of the above exception, another exception occurred:
                                          
                                          Traceback (most recent call last):
                                            File "/usr/lib/python3/dist-packages/flask/app.py", line 2292, in wsgi_app
                                              response = self.full_dispatch_request()
                                            File "/usr/lib/python3/dist-packages/flask/app.py", line 1808, in full_dispatch_request
                                              self.try_trigger_before_first_request_functions()
                                            File "/usr/lib/python3/dist-packages/flask/app.py", line 1855, in try_trigger_before_first_request_functions
                                              func()
                                            File "/usr/lib/python3/dist-packages/swh/search/api/server.py", line 48, in initialized_index
                                              search = _get_search()
                                            File "/usr/lib/python3/dist-packages/swh/search/api/server.py", line 25, in _get_search
                                              search = get_search(**app.config["search"])
                                            File "/usr/lib/python3/dist-packages/swh/search/__init__.py", line 54, in get_search
                                              return Search(**kwargs)
                                            File "/usr/lib/python3/dist-packages/swh/search/elasticsearch.py", line 104, in __init__
                                              self._translator = Translator()
                                            File "/usr/lib/python3/dist-packages/swh/search/translator.py", line 34, in __init__
                                              Language.build_library(ql_path, [source_path])
                                            File "/usr/lib/python3/dist-packages/tree_sitter/__init__.py", line 69, in build_library
                                              extra_preargs=flags,
                                            File "/usr/lib/python3.7/distutils/ccompiler.py", line 574, in compile
                                              self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
                                            File "/usr/lib/python3.7/distutils/unixccompiler.py", line 120, in _compile
                                              raise CompileError(msg)
                                          distutils.errors.CompileError: command 'cc' failed with exit status 1
  • the python3-swh.search package was upgraded to version 0.11.4-2, which fixes the problem
  • the new index was created correctly:
root@search0:/# curl -s http://search-esnode0:9200/_cat/indices\?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-v0.11                HljzsdD9SmKI7-8ekB_q3Q  80   0          0            0      4.2kb          4.2kb
green  close  origin                      HthJj42xT5uO7w3Aoxzppw  80   0                                                  
green  close  origin-v0.9.0               o7FiYJWnTkOViKiAdCXCuA  80   0                                                  
green  open   origin-v0.10.0              -fvf4hK9QDeN8qYTJBBlxQ  80   0    1981623       559384      2.3gb          2.3gb
green  close  origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0                                                  
green  close  origin-v0.5.0               SGplSaqPR_O9cPYU4ZsmdQ  80   0
  • journal clients enabled and restarted
  • the journal clients' lag should recover in less than 12h
  • waiting some time to estimate the duration with only one journal client per type

The diff to apply this configuration through puppet will come soon.

The lag recovered in ~12 hours.
The content of the index looks good (spot-checked a couple of origins).
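For the record, the lag can also be followed outside grafana/prometheus with Kafka's stock tooling. A sketch (broker and group names are assumptions; the command is printed rather than run):

```shell
#!/bin/bash
# Assumed names for the staging deployment.
BROKER=journal0.internal.staging.swh.network:9092
GROUP=swh.search.journal_client-v0.11

# Describing the consumer group shows per-partition offsets; the LAG column
# reaching 0 everywhere means the journal client has caught up.
CMD="kafka-consumer-groups.sh --bootstrap-server $BROKER --group $GROUP --describe"
echo "$CMD"
```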

Let's prepare the production deployment now.

  • puppet configuration deployed in staging
  • read index updated with this script:
#!/bin/bash

ES_SERVER=search-esnode0:9200
OLD_INDEX=origin-v0.10.0
INDEX=origin-v0.11

curl -XPOST -H 'Content-Type: application/json' http://$ES_SERVER/_aliases -d '
 {
   "actions" : [
     { "remove" : { "index" : "'$OLD_INDEX'", "alias" : "origin-read" } },
     { "add" : { "index" : "'$INDEX'", "alias" : "origin-read" } }
   ]
 }'

TMP_CONFIG=$(mktemp)
cat >$TMP_CONFIG <<EOF
{
   "index" : {
      "translog.sync_interval" : null,
      "translog.durability": null,
      "refresh_interval":null 
   }
 }
EOF
curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @$TMP_CONFIG
rm $TMP_CONFIG
  • swh-search updated and restarted on webapp.staging
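Setting the three values to null, as the script above does, reverts them to the index defaults; this can be double-checked afterwards (sketch; command printed for review):

```shell
#!/bin/bash
ES_SERVER=search-esnode0:9200
INDEX=origin-v0.11

# include_defaults=true also shows settings that fell back to their defaults
# after the explicit overrides were nulled out.
CMD="curl -s 'http://$ES_SERVER/$INDEX/_settings?include_defaults=true&flat_settings=true'"
echo "$CMD"
```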

production deployment:

  • disable puppet
  • stop and disable the journal clients and the search backend
  • update the swh-search configuration to change the index name to origin-v0.11
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/server.yml server.yml
--- /tmp/server.yml	2021-09-03 14:06:07.896137122 +0000
+++ server.yml	2021-09-03 14:05:47.072081879 +0000
@@ -10,7 +10,7 @@
     port: 9200
   indexes:
     origin:
-      index: origin-production
+      index: origin-v0.11
       read_alias: origin-read
       write_alias: origin-write
  • update the journal-clients to use a group id swh.search.journal_client.[indexed|object]-v0.11
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_objects.yml journal_client_objects.yml 
--- /tmp/journal_client_objects.yml	2021-09-03 14:06:52.660255797 +0000
+++ journal_client_objects.yml	2021-09-03 14:07:10.684303568 +0000
@@ -8,7 +8,7 @@
   - kafka2.internal.softwareheritage.org
   - kafka3.internal.softwareheritage.org
   - kafka4.internal.softwareheritage.org
-  group_id: swh.search.journal_client
+  group_id: swh.search.journal_client-v0.11
   prefix: swh.journal.objects
   object_types:
   - origin
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_indexed.yml journal_client_indexed.yml 
--- /tmp/journal_client_indexed.yml	2021-09-03 14:06:52.660255797 +0000
+++ journal_client_indexed.yml	2021-09-03 14:07:25.760343512 +0000
@@ -8,7 +8,7 @@
   - kafka2.internal.softwareheritage.org
   - kafka3.internal.softwareheritage.org
   - kafka4.internal.softwareheritage.org
-  group_id: swh.search.journal_client.indexed
+  group_id: swh.search.journal_client.indexed-v0.11
   prefix: swh.journal.indexed
   object_types:
   - origin_intrinsic_metadata
  • perform a system upgrade
root@search1:/etc/softwareheritage/search# apt dist-upgrade -V
...
The following NEW packages will be installed:
   python3-tree-sitter (0.19.0-1+swh1~bpo10+1)
The following packages will be upgraded:
   libnss-systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libpam-systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libsystemd0 (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libudev1 (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   python3-swh.core (0.14.3-1~swh1~bpo10+1 => 0.14.5-1~swh1~bpo10+1)
   python3-swh.model (2.6.1-1~swh1~bpo10+1 => 2.8.0-1~swh1~bpo10+1)
   python3-swh.scheduler (0.15.0-1~swh1~bpo10+1 => 0.18.0-1~swh1~bpo10+1)
   python3-swh.search (0.9.0-1~swh1~bpo10+1 => 0.11.4-2~swh1~bpo10+1)
   python3-swh.storage (0.30.1-1~swh1~bpo10+1 => 0.36.0-1~swh1~bpo10+1)
   systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   systemd-sysv (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   systemd-timesyncd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   udev (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
13 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
...

There is no need to reboot

  • enable and restart the swh-search backend
  • check the new index creation
root@search1:/etc/softwareheritage/search# curl ${ES_SERVER}/_cat/indices\?v
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-v0.11      XOUR_jKcTtWKjlPk_8EAlA  90   1          0            0     34.3kb         18.2kb
green  open   origin-v0.9.0     TH9xlECuS4CcJTDw0Fqieg  90   1  175001478     36494554      293gb        146.9gb
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1  176722078     56232582      311gb        155.1gb
  • update the write index alias
root@search1:~/T3433# ./update-write-alias.sh 
{"acknowledged":true}{"acknowledged":true}root@search1:~/T3433# 
root@search1:~/T3433# curl ${ES_SERVER}/_cat/aliases\?v
alias               index             filter routing.index routing.search is_write_index
origin-write        origin-v0.11      -      -             -              -
origin-read-v0.9.0  origin-v0.9.0     -      -             -              -
origin-v0.9.0-read  origin-v0.9.0     -      -             -              -
origin-v0.9.0-write origin-v0.9.0     -      -             -              -
origin-write-v0.9.0 origin-v0.9.0     -      -             -              -
origin-read         origin-production -      -             -              -

All the v0.9.0 leftovers will be cleared once the migration to v0.11 is done.
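That cleanup could boil down to deleting the old index, which also drops every alias attached to it (sketch; index deletion is irreversible, hence printing the command instead of running it):

```shell
#!/bin/bash
ES_SERVER=search-esnode0:9200
OLD_INDEX=origin-v0.9.0

# Deleting the index removes origin-read-v0.9.0, origin-v0.9.0-read, etc.
# along with it; double-check that no alias still in use points at it first.
CMD="curl -XDELETE 'http://$ES_SERVER/$OLD_INDEX'"
echo "$CMD"
```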

  • restart the journal clients
root@search1:~# systemctl enable swh-search-journal-client@objects
Created symlink /etc/systemd/system/multi-user.target.wants/swh-search-journal-client@objects.service → /etc/systemd/system/swh-search-journal-client@.service.
root@search1:~# systemctl enable swh-search-journal-client@indexed
Created symlink /etc/systemd/system/multi-user.target.wants/swh-search-journal-client@indexed.service → /etc/systemd/system/swh-search-journal-client@.service.
root@search1:~# systemctl start swh-search-journal-client@objects
root@search1:~# systemctl start swh-search-journal-client@indexed
  • wait for the lag to recover, create additional journal clients if necessary
  • update the read index alias
  • land D6182, D6183, D6197
  • Update swh-web configuration to support the new way to configure the metadata search backend (D6202)
  • deploy them on webapp1 and moma
vsellier claimed this task.

Everything is deployed and looks functional.

vsellier renamed this task from Deploy swh.search v0.10/v0.11 on staging to Deploy swh.search v0.10/v0.11.Sep 8 2021, 3:21 PM