
Deploy swh.search v0.10/v0.11
Closed, Resolved · Public

Description

  • First in staging, then in production.
  • Upgrade from v0.9 to v0.10.

For v0.10, as the schema was updated, a new index needs to be created and then backfilled to populate it correctly.

Rough plan to install and backfill the new index:

  • Redo the tagging to v0.10.0 (it is currently tagged 0.10.0 [2])
  • stop puppet on nodes running the journal clients and swh-search
  • stop the objects and metadata journal clients so they stop populating the future "old" index
  • upgrade the debian packages
  • restart swh-search to declare the new mappings in the old index [1]
  • restart puppet
  • manually launch a journal client configured to index into an origin-v0.10 index
  • reset the offsets of the journal clients' consumer group on the origin_visit_status topics
  • wait for the end of the reindexation (journal client: no more lag)
  • upgrade the new swh-search and journal client configurations in puppet to use the new index (done for webapp1)

[1] It is actually not certain that this is the way to do it in our case; we may
have to do it ourselves manually in another way.

[2] That does not agree with the packaging build
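The offset reset step in the plan can be done with Kafka's stock tooling. A hedged sketch (the broker, group, and topic names below are assumptions to be checked against the actual deployment; the command is printed for review rather than executed):

```shell
#!/bin/bash
# Assumed names, not verified production values.
BROKER=journal0.internal.staging.swh.network:9092
GROUP=swh.search.journal_client-v0.10.0
TOPIC=swh.journal.objects.origin_visit_status

# Rewinding the consumer group to the earliest offset makes the backfill
# re-read the whole topic into the new index.
CMD="kafka-consumer-groups.sh --bootstrap-server $BROKER --group $GROUP --topic $TOPIC --reset-offsets --to-earliest --execute"

# Printed instead of run, so it can be reviewed first.
echo "$CMD"
```

Alternatively, switching the journal clients to a fresh group_id (as was later done with the version-suffixed group ids) achieves the same "start from the beginning" effect without touching any offsets.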

Event Timeline

ardumont triaged this task as Normal priority. (Edited) Jul 19 2021, 2:52 PM
ardumont created this task.

Related to T3083 T3398 T3391

We've done the following:

  • tag v0.10.0 instead of 0.10.0; wait for the package build

on staging search0:

  • disable puppet
  • upgrade Debian packages
  • disable swh.search journal clients
  • reboot (to perform the pending kernel + systemd update)
  • noticed the new storage entry needed in the config; updated the config and restarted the backend, which created the new origin-v0.10.0 index with the new mapping
  • updated the journal clients configs to use the RPC backend instead of direct elasticsearch access
  • updated the write index with the following script
#!/bin/bash

ES_SERVER=search-esnode0:9200
OLD_INDEX=origin-v0.9.0
INDEX=origin-v0.10.0

curl -XPOST -H 'Content-Type: application/json' http://$ES_SERVER/_aliases -d '
 {
   "actions" : [
     { "remove" : { "index" : "'$OLD_INDEX'", "alias" : "origin-write" } },
     { "add" : { "index" : "'$INDEX'", "alias" : "origin-write" } }
   ]
 }'

TMP_CONFIG=$(mktemp)
cat >$TMP_CONFIG <<EOF
{
   "index" : {
      "translog.sync_interval" : "60s",
      "translog.durability": "async",
      "refresh_interval": "60s"
   }
 }
EOF
curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @$TMP_CONFIG
rm $TMP_CONFIG
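Side note on the alias swap in the script above: because the remove and the add ride in a single _aliases request, the switch is atomic, so origin-write never dangles between indexes. A tiny helper to build that payload (a sketch; the function name is ours):

```shell
#!/bin/bash
# Build the _aliases payload: removing the alias from the old index and adding
# it to the new one in one request makes the switch atomic.
make_alias_swap() {
  local old_index=$1 new_index=$2 alias=$3
  printf '{"actions":[{"remove":{"index":"%s","alias":"%s"}},{"add":{"index":"%s","alias":"%s"}}]}\n' \
    "$old_index" "$alias" "$new_index" "$alias"
}

PAYLOAD=$(make_alias_swap origin-v0.9.0 origin-v0.10.0 origin-write)
echo "$PAYLOAD"
```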

We've then restarted the journal clients to fill the new index.

We had to fix a few "real world data" issues for this to actually work; the fixes have now landed (D6011, D6012, D6014). (We deployed the journal_client.py file directly on search0 to do so.)

After spawning 7 more journal clients, and waiting for grafana/prometheus data to settle, the filling of the index was expected to take around 3 weeks (compared to a few hours for the processing done when deploying 0.9.0).

We've pinpointed the issue to the fetching of snapshots (and associated revisions/releases), for computing the latest_release/revision_date fields. These storage operations take substantially more time than just pushing lightly doctored origin_visit_status dicts (directly pulled from the journal) to elasticsearch.

For now, we've dropped these fields from the working copy of journal_client.py on search0.internal.staging, so that the processing can complete and search functionality can be restored.

We will need to discuss a more targeted plan to pull this data, one which doesn't involve "pulling all snapshots, and associated objects, from the archive one by one", which is what processing the whole origin_visit_status topic eventually amounts to. This approach would not have worked very well for staging, and it definitely won't work with the production data (we have 1.5 billion visit statuses!).

To solve that problem, we have a few ideas:

  • making the snapshot fetch conditional, so the complete reindexing can avoid it
  • enabling the snapshot fetch again once the initial reindex is done (and making sure that the journal client can actually keep up)
  • and, eventually, we will need to start writing actual targeted index migration scripts, rather than let the journal client start again from scratch every time we change a mapping or add a new field.

Writing index migration scripts will become more critical, specifically if we generalize the use of data that can't be pulled directly from journal events in swh.search: these extra data fetches can generate substantial load, and a lot of them are useless as we could just as well pull the data for the latest known snapshot for each origin (instead of *every* known snapshot for each origin). But even when only changing a data type in the mapping, doing a bulk reindex from within elasticsearch would be much more efficient than reindexing all statuses for all visits of all origins.
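For the last point, when only a mapping changes, the in-cluster bulk reindex could go through Elasticsearch's _reindex API rather than a journal replay. A hedged sketch (the index names are examples; the command is printed for review, not executed):

```shell
#!/bin/bash
ES_SERVER=search-esnode0:9200
SRC=origin-v0.10.0
DST=origin-v0.11

# Server-side copy from SRC to DST; wait_for_completion=false runs it as a
# background task inside the cluster instead of holding the HTTP connection.
BODY='{"source": {"index": "'$SRC'"}, "dest": {"index": "'$DST'"}}'
CMD="curl -s -XPOST -H 'Content-Type: application/json' 'http://$ES_SERVER/_reindex?wait_for_completion=false' -d '$BODY'"
echo "$CMD"
```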

olasd changed the task status from Open to Work in Progress.Jul 21 2021, 5:28 PM
ardumont renamed this task from Deploy swh.search v0.10 to Deploy swh.search v0.10 on staging.Jul 21 2021, 5:31 PM

Another idea: move this fetching to a new indexer, and make it write to a new topic, which the swh-search journal client can read from.

Pros:

  • We already use this architecture with metadata
  • We can keep using replays for migrations

Cons:

  • One more moving part in the search pipeline.

By the way, here is a small list of caveats encountered when deploying search (not to fix
immediately, just mentioning them).

  • to run the service: the "swhstorage" user could/should be "swhsearch", or we could even use the systemd dynamic user mechanism
  • cli: fix the journal_process_objects docstring, which was not updated accordingly
  • avoid index initialization on read-only swh-search clients (e.g. a configuration option to inhibit this behavior)
  • the bugs encountered and fixed were due to the swh.search implementation being too optimistic about snapshots being complete (a snapshot can be empty)

A new swh.search v0.11 got tagged (this deactivates the current blocking point).
That is a workaround, though. I've opened a task [1] to avoid forgetting about the
conclusion of the discussion started above.

In the meantime, we can deploy v0.11 and realign both staging and production on it.

[1] T3479

ardumont renamed this task from Deploy swh.search v0.10 on staging to Deploy swh.search ~~v0.10~~ v0.11 on staging.Aug 11 2021, 11:24 AM
ardumont renamed this task from Deploy swh.search ~~v0.10~~ v0.11 on staging to Deploy swh.search v0.10/v0.11 on staging.

Deployment of version v0.11.4 in staging:
On search0:

  • puppet stopped
  • stop and disable the journal clients and search backend
  • update the swh-search configuration to use an origin-v0.11 index
root@search0:/etc/softwareheritage/search# diff -U2 /tmp/server.yml server.yml 
--- /tmp/server.yml	2021-09-01 13:42:29.347951302 +0000
+++ server.yml	2021-09-01 13:42:35.739953523 +0000
@@ -7,5 +7,5 @@
   indexes:
     origin:
-      index: origin-v0.10.0
+      index: origin-v0.11
       read_alias: origin-read
       write_alias: origin-write
  • update the journal-clients to use a group id swh.search.journal_client.[indexed|object]-v0.11
root@search0:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_objects.yml journal_client_objects.yml 
--- /tmp/journal_client_objects.yml	2021-09-01 13:44:49.843999978 +0000
+++ journal_client_objects.yml	2021-09-01 13:45:03.972004852 +0000
@@ -5,7 +5,7 @@
 journal:
   brokers:
   - journal0.internal.staging.swh.network
-  group_id: swh.search.journal_client-v0.10.0
+  group_id: swh.search.journal_client-v0.11
   prefix: swh.journal.objects
   object_types:
   - origin
root@search0:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_indexed.yml journal_client_indexed.yml 
--- /tmp/journal_client_indexed.yml	2021-09-01 13:44:44.847998252 +0000
+++ journal_client_indexed.yml	2021-09-01 13:44:57.020002454 +0000
@@ -5,7 +5,7 @@
 journal:
   brokers:
   - journal0.internal.staging.swh.network
-  group_id: swh.search.journal_client.indexed-v0.10.0
+  group_id: swh.search.journal_client.indexed-v0.11
   prefix: swh.journal.indexed
   object_types:
   - origin_intrinsic_metadata
  • perform a system upgrade; a reboot was not required
  • enable and start swh-search backend
  • An error occurs after the restart:
Sep 01 14:19:12 search0 python3[4066688]: 2021-09-01 14:19:12 [4066688] root:ERROR command 'cc' failed with exit status 1
                                          Traceback (most recent call last):
                                            File "/usr/lib/python3.7/distutils/unixccompiler.py", line 118, in _compile
                                              extra_postargs)
                                            File "/usr/lib/python3.7/distutils/ccompiler.py", line 909, in spawn
                                              spawn(cmd, dry_run=self.dry_run)
                                            File "/usr/lib/python3.7/distutils/spawn.py", line 36, in spawn
                                              _spawn_posix(cmd, search_path, dry_run=dry_run)
                                            File "/usr/lib/python3.7/distutils/spawn.py", line 159, in _spawn_posix
                                              % (cmd, exit_status))
                                          distutils.errors.DistutilsExecError: command 'cc' failed with exit status 1
                                          
                                          During handling of the above exception, another exception occurred:
                                          
                                          Traceback (most recent call last):
                                            File "/usr/lib/python3/dist-packages/flask/app.py", line 2292, in wsgi_app
                                              response = self.full_dispatch_request()
                                            File "/usr/lib/python3/dist-packages/flask/app.py", line 1808, in full_dispatch_request
                                              self.try_trigger_before_first_request_functions()
                                            File "/usr/lib/python3/dist-packages/flask/app.py", line 1855, in try_trigger_before_first_request_functions
                                              func()
                                            File "/usr/lib/python3/dist-packages/swh/search/api/server.py", line 48, in initialized_index
                                              search = _get_search()
                                            File "/usr/lib/python3/dist-packages/swh/search/api/server.py", line 25, in _get_search
                                              search = get_search(**app.config["search"])
                                            File "/usr/lib/python3/dist-packages/swh/search/__init__.py", line 54, in get_search
                                              return Search(**kwargs)
                                            File "/usr/lib/python3/dist-packages/swh/search/elasticsearch.py", line 104, in __init__
                                              self._translator = Translator()
                                            File "/usr/lib/python3/dist-packages/swh/search/translator.py", line 34, in __init__
                                              Language.build_library(ql_path, [source_path])
                                            File "/usr/lib/python3/dist-packages/tree_sitter/__init__.py", line 69, in build_library
                                              extra_preargs=flags,
                                            File "/usr/lib/python3.7/distutils/ccompiler.py", line 574, in compile
                                              self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
                                            File "/usr/lib/python3.7/distutils/unixccompiler.py", line 120, in _compile
                                              raise CompileError(msg)
                                          distutils.errors.CompileError: command 'cc' failed with exit status 1
  • the python3-swh.search package was upgraded to version 0.11.4-2, which fixes the problem
  • the new index was created correctly:
root@search0:/# curl -s http://search-esnode0:9200/_cat/indices\?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-v0.11                HljzsdD9SmKI7-8ekB_q3Q  80   0          0            0      4.2kb          4.2kb
green  close  origin                      HthJj42xT5uO7w3Aoxzppw  80   0                                                  
green  close  origin-v0.9.0               o7FiYJWnTkOViKiAdCXCuA  80   0                                                  
green  open   origin-v0.10.0              -fvf4hK9QDeN8qYTJBBlxQ  80   0    1981623       559384      2.3gb          2.3gb
green  close  origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0                                                  
green  close  origin-v0.5.0               SGplSaqPR_O9cPYU4ZsmdQ  80   0
  • journal clients enabled and restarted
  • the journal clients' lag should recover in less than 12h
  • waiting some time to estimate the duration with only one journal client per type

The diff to apply this configuration through puppet will come soon.

The lag recovered in ~12 hours.
The content of the index looks good (spot-checked a couple of origins).
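For the record, the lag can also be followed outside grafana/prometheus with Kafka's stock tooling. A sketch (broker and group names are assumptions; the command is printed rather than run):

```shell
#!/bin/bash
# Assumed names for the staging deployment.
BROKER=journal0.internal.staging.swh.network:9092
GROUP=swh.search.journal_client-v0.11

# Describing the consumer group shows per-partition offsets; the LAG column
# reaching 0 everywhere means the journal client has caught up.
CMD="kafka-consumer-groups.sh --bootstrap-server $BROKER --group $GROUP --describe"
echo "$CMD"
```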

Let's prepare the production deployment now.

  • puppet configuration deployed in staging
  • read index updated with this script:
#!/bin/bash

ES_SERVER=search-esnode0:9200
OLD_INDEX=origin-v0.10.0
INDEX=origin-v0.11

curl -XPOST -H 'Content-Type: application/json' http://$ES_SERVER/_aliases -d '
 {
   "actions" : [
     { "remove" : { "index" : "'$OLD_INDEX'", "alias" : "origin-read" } },
     { "add" : { "index" : "'$INDEX'", "alias" : "origin-read" } }
   ]
 }'

TMP_CONFIG=$(mktemp)
cat >$TMP_CONFIG <<EOF
{
   "index" : {
      "translog.sync_interval" : null,
      "translog.durability": null,
      "refresh_interval":null 
   }
 }
EOF
curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @$TMP_CONFIG
rm $TMP_CONFIG
  • swh-search updated and restarted on webapp.staging
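Setting the three values to null, as the script above does, reverts them to the index defaults; this can be double-checked afterwards (sketch; command printed for review):

```shell
#!/bin/bash
ES_SERVER=search-esnode0:9200
INDEX=origin-v0.11

# include_defaults=true also shows settings that fell back to their defaults
# after the explicit overrides were nulled out.
CMD="curl -s 'http://$ES_SERVER/$INDEX/_settings?include_defaults=true&flat_settings=true'"
echo "$CMD"
```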

production deployment:

  • disable puppet
  • stop and disable the journal clients and the search backend
  • update the swh-search configuration to change the index name to origin-v0.11
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/server.yml server.yml
--- /tmp/server.yml	2021-09-03 14:06:07.896137122 +0000
+++ server.yml	2021-09-03 14:05:47.072081879 +0000
@@ -10,7 +10,7 @@
     port: 9200
   indexes:
     origin:
-      index: origin-production
+      index: origin-v0.11
       read_alias: origin-read
       write_alias: origin-write
  • update the journal-clients to use a group id swh.search.journal_client.[indexed|object]-v0.11
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_objects.yml journal_client_objects.yml 
--- /tmp/journal_client_objects.yml	2021-09-03 14:06:52.660255797 +0000
+++ journal_client_objects.yml	2021-09-03 14:07:10.684303568 +0000
@@ -8,7 +8,7 @@
   - kafka2.internal.softwareheritage.org
   - kafka3.internal.softwareheritage.org
   - kafka4.internal.softwareheritage.org
-  group_id: swh.search.journal_client
+  group_id: swh.search.journal_client-v0.11
   prefix: swh.journal.objects
   object_types:
   - origin
root@search1:/etc/softwareheritage/search# diff -U3 /tmp/journal_client_indexed.yml journal_client_indexed.yml 
--- /tmp/journal_client_indexed.yml	2021-09-03 14:06:52.660255797 +0000
+++ journal_client_indexed.yml	2021-09-03 14:07:25.760343512 +0000
@@ -8,7 +8,7 @@
   - kafka2.internal.softwareheritage.org
   - kafka3.internal.softwareheritage.org
   - kafka4.internal.softwareheritage.org
-  group_id: swh.search.journal_client.indexed
+  group_id: swh.search.journal_client.indexed-v0.11
   prefix: swh.journal.indexed
   object_types:
   - origin_intrinsic_metadata
  • perform a system upgrade
root@search1:/etc/softwareheritage/search# apt dist-upgrade -V
...
The following NEW packages will be installed:
   python3-tree-sitter (0.19.0-1+swh1~bpo10+1)
The following packages will be upgraded:
   libnss-systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libpam-systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libsystemd0 (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   libudev1 (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   python3-swh.core (0.14.3-1~swh1~bpo10+1 => 0.14.5-1~swh1~bpo10+1)
   python3-swh.model (2.6.1-1~swh1~bpo10+1 => 2.8.0-1~swh1~bpo10+1)
   python3-swh.scheduler (0.15.0-1~swh1~bpo10+1 => 0.18.0-1~swh1~bpo10+1)
   python3-swh.search (0.9.0-1~swh1~bpo10+1 => 0.11.4-2~swh1~bpo10+1)
   python3-swh.storage (0.30.1-1~swh1~bpo10+1 => 0.36.0-1~swh1~bpo10+1)
   systemd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   systemd-sysv (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   systemd-timesyncd (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
   udev (247.3-3~bpo10+1 => 247.3-6~bpo10+1)
13 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
...

There is no need to reboot

  • enable and restart the swh-search backend
  • check the new index creation
root@search1:/etc/softwareheritage/search# curl ${ES_SERVER}/_cat/indices\?v
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-v0.11      XOUR_jKcTtWKjlPk_8EAlA  90   1          0            0     34.3kb         18.2kb
green  open   origin-v0.9.0     TH9xlECuS4CcJTDw0Fqieg  90   1  175001478     36494554      293gb        146.9gb
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1  176722078     56232582      311gb        155.1gb
  • update the write index alias
root@search1:~/T3433# ./update-write-alias.sh 
{"acknowledged":true}{"acknowledged":true}root@search1:~/T3433# 
root@search1:~/T3433# curl ${ES_SERVER}/_cat/aliases\?v
alias               index             filter routing.index routing.search is_write_index
origin-write        origin-v0.11      -      -             -              -
origin-read-v0.9.0  origin-v0.9.0     -      -             -              -
origin-v0.9.0-read  origin-v0.9.0     -      -             -              -
origin-v0.9.0-write origin-v0.9.0     -      -             -              -
origin-write-v0.9.0 origin-v0.9.0     -      -             -              -
origin-read         origin-production -      -             -              -

All the v0.9.0 leftovers will be cleared once the migration to v0.11 is done.
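That cleanup could boil down to deleting the old index, which also drops every alias attached to it (sketch; index deletion is irreversible, hence printing the command instead of running it):

```shell
#!/bin/bash
ES_SERVER=search-esnode0:9200
OLD_INDEX=origin-v0.9.0

# Deleting the index removes origin-read-v0.9.0, origin-v0.9.0-read, etc.
# along with it; double-check that no alias still in use points at it first.
CMD="curl -XDELETE 'http://$ES_SERVER/$OLD_INDEX'"
echo "$CMD"
```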

  • restart the journal clients
root@search1:~# systemctl enable swh-search-journal-client@objects
Created symlink /etc/systemd/system/multi-user.target.wants/swh-search-journal-client@objects.service → /etc/systemd/system/swh-search-journal-client@.service.
root@search1:~# systemctl enable swh-search-journal-client@indexed
Created symlink /etc/systemd/system/multi-user.target.wants/swh-search-journal-client@indexed.service → /etc/systemd/system/swh-search-journal-client@.service.
root@search1:~# systemctl start swh-search-journal-client@objects
root@search1:~# systemctl start swh-search-journal-client@indexed
  • wait for the lag to recover, create additional journal clients if necessary
  • update the read index alias
  • land D6182, D6183, D6197
  • Update swh-web configuration to support the new way to configure the metadata search backend (D6202)
  • deploy them on webapp1 and moma
vsellier claimed this task.

Everything is deployed and looks functional.

vsellier renamed this task from Deploy swh.search v0.10/v0.11 on staging to Deploy swh.search v0.10/v0.11.Sep 8 2021, 3:21 PM