
Deploy swh.search v0.10 on staging
Work in Progress, Normal, Public

Description

  • First in staging then in production.
  • from v0.9 to v0.10.

For v0.10, as the schema was updated, a new index needs to be created and then
backfilled so that it is populated correctly.

Rough plan to install and backfill the new index:

  • Redo the tagging as v0.10.0 (it is currently 0.10.0 [2])
  • stop puppet on nodes running the journal clients and swh-search
  • stop the objects and metadata journal clients so they stop populating the future "old" index
  • upgrade the debian packages
  • restart swh-search to declare the new mappings in the old index [1]
  • restart puppet
  • manually launch a journal client configured to index into an origin-v0.10 index
  • reset offsets on the origin_visit_status topics for the journal clients' consumer group (a sketch of the command follows the notes below)
  • wait for the end of the reindexation (journal client: no more lag)
  • update the swh-search and journal client configurations in puppet to use the new index (done for webapp1)

[1] It is actually not entirely clear whether that is the way to do it in our case. We
may have to do this manually in another way.

[2] That does not match what the packaging build expects.
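
For the offset-reset step, a minimal sketch of what the Kafka-side command could look
like, assuming the journal clients are stopped first; the broker address, consumer
group and topic names below are placeholders, not the actual staging values:

# Rewind the consumer group so the whole origin_visit_status topic gets
# replayed into the new index. The group must have no active members,
# hence stopping the journal clients beforehand.
kafka-consumer-groups.sh \
  --bootstrap-server kafka1.internal.staging:9092 \
  --group swh.search.journal_client \
  --topic swh.journal.objects.origin_visit_status \
  --reset-offsets --to-earliest --execute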

Event Timeline

ardumont triaged this task as Normal priority (edited). Mon, Jul 19, 2:52 PM
ardumont created this task.

Related to: T3083, T3398, T3391

We've done the following:

  • tagged v0.10.0 instead of 0.10.0; waited for the package build

on staging search0:

  • disabled puppet
  • upgraded the Debian packages
  • disabled the swh.search journal clients
  • rebooted (to apply the pending kernel + systemd updates)
  • noticed the new storage entry needed in the config; updated the config and restarted the backend, which created the new origin-v0.10.0 index with the new mapping
  • updated the journal clients' configs to use the RPC backend instead of direct elasticsearch access
  • updated the origin-write alias to point to the new index with the following script:
#!/bin/bash

ES_SERVER=search-esnode0:9200
OLD_INDEX=origin-v0.9.0
INDEX=origin-v0.10.0

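# Atomically switch the origin-write alias from the old index to the new one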
curl -XPOST -H 'Content-Type: application/json' http://$ES_SERVER/_aliases -d '
 {
   "actions" : [
     { "remove" : { "index" : "'$OLD_INDEX'", "alias" : "origin-write" } },
     { "add" : { "index" : "'$INDEX'", "alias" : "origin-write" } }
   ]
 }'

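# Relax refresh and translog durability on the new index to speed up the
# initial bulk indexing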
TMP_CONFIG=$(mktemp)
cat >$TMP_CONFIG <<EOF
{
   "index" : {
      "translog.sync_interval" : "60s",
      "translog.durability": "async",
      "refresh_interval": "60s"
   }
 }
EOF
curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @$TMP_CONFIG
rm $TMP_CONFIG
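
To sanity-check the switch afterwards, one can ask elasticsearch which index backs the
alias and compare document counts; a quick verification one could run (not necessarily
what was done at the time), using the same host and index names as the script above:

# Show which index the origin-write alias currently points to
curl -s http://search-esnode0:9200/_alias/origin-write
# Compare document counts between the old and new indices
curl -s 'http://search-esnode0:9200/_cat/indices/origin-v0.*?v'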

We then restarted the journal clients to fill the new index.

To actually make that work, we had to fix a few "real world data" issues, which are now landed (D6011, D6012, D6014); we deployed the fixed journal_client.py file directly on search0 in the meantime.

After spawning 7 more journal clients and waiting for the grafana/prometheus data to settle, we estimated that filling the index would take around 3 weeks (compared to a few hours for the processing done when deploying 0.9.0).

We pinpointed the issue to the fetching of snapshots (and associated revisions/releases) needed to compute the latest_release/revision_date fields. These storage operations take substantially more time than just pushing lightly doctored origin_visit_status dicts (pulled directly from the journal) to elasticsearch.

For now, we've dropped these fields from the working copy of journal_client.py on search0.internal.staging, so that the processing can complete and search functionality can be restored.

We will need to discuss a more targeted plan to pull this data, one that doesn't amount to "pulling all snapshots, and associated objects, from the archive one by one", which is what processing the whole origin_visit_status topic eventually boils down to. This approach would not have worked very well for staging, and it definitely won't work with the production data (we have 1.5 billion visit statuses!).

To solve that problem, we have a few ideas:

  • making the snapshot fetch conditional, so the complete reindexing can avoid it
  • enabling the snapshot fetch again once the initial reindex is done (and making sure that the journal client can actually keep up; a lag-check sketch follows this list)
  • and, eventually, we will need to start writing actual targeted index migration scripts, rather than letting the journal client start again from scratch every time we change a mapping or add a new field.
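
For the "keep up" part, consumer lag can also be watched from the Kafka side (in
addition to grafana/prometheus); a sketch using the same placeholder broker and group
names as the reset command earlier:

# Per-partition lag for the search journal client's consumer group;
# a LAG column trending towards 0 means the client keeps up with the topic.
kafka-consumer-groups.sh \
  --bootstrap-server kafka1.internal.staging:9092 \
  --describe --group swh.search.journal_client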

Writing index migration scripts will become more critical, specifically if we generalize the use in swh.search of data that can't be pulled directly from journal events: these extra data fetches generate substantial load, and many of them are useless, since we could just as well pull the data for the latest known snapshot of each origin (instead of *every* known snapshot of each origin). But even when only changing a data type in the mapping, doing a bulk reindex from within elasticsearch would be much more efficient than reindexing all statuses for all visits of all origins.
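
To illustrate that last point, a mapping-only migration could lean on elasticsearch's
_reindex API instead of replaying the journal; a rough sketch reusing the index names
from above (the shape of it, not a tested migration):

# Server-side copy of all documents from the old index into the new one,
# created beforehand with the updated mapping; no journal replay involved.
curl -s -H 'Content-Type: application/json' \
  -XPOST 'http://search-esnode0:9200/_reindex?wait_for_completion=false' -d '
{
  "source": { "index": "origin-v0.9.0" },
  "dest":   { "index": "origin-v0.10.0" }
}'
# The call returns a task id whose progress can be followed via the _tasks API.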

olasd changed the task status from Open to Work in Progress. Wed, Jul 21, 5:28 PM
ardumont renamed this task from Deploy swh.search v0.10 to Deploy swh.search v0.10 on staging. Wed, Jul 21, 5:31 PM

Another idea: move this fetching to a new indexer, and make it write to a new topic, which the swh-search journal client can read from.

Pros:

  • We already use this architecture with metadata
  • We can keep using replays for migrations

Cons:

  • One more moving part in the search pipeline.

By the way, here is a small list of caveats encountered while deploying search (not to
fix immediately, just to note them):

  • to run the service: the "swhstorage" user could/should be swhsearch, or we could even use the systemd dynamic user mechanism
  • cli: fix the journal_process_objects docstring, which was not updated accordingly
  • avoid index initialization on read-only swh-search clients (e.g. a configuration option to inhibit this behavior)
  • the bugs encountered and fixed were due to the swh.search implementation being too optimistic about snapshots being complete (a snapshot can be empty).