swh-search: Deploy visit_types indexation in production
Closed, Migrated

Description

The visit_types indexation was added in swh-search 0.6.0.

The mapping of the production index needs to be updated (for the visit_types and metadata date fields).

The deployment needs to be rehearsed beforehand in order to limit the search downtime.

Event Timeline

vsellier changed the task status from Open to Work in Progress. Feb 19 2021, 2:38 PM
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier moved this task from Backlog to in-progress on the System administration board.
  • A reindex of the origin index to a backup index is in progress, to evaluate how long such an operation takes with the production volume
  • For this migration, we are lucky: the changes are only new field declarations. The metadata are not yet ingested in production, so the existing documents don't have to be converted
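A minimal sketch of how such a backup reindex could be submitted, assuming the standard Elasticsearch `_reindex` endpoint is used. The host and the `origin-backup` index name are placeholders, not the names used in production, and the function only builds the request without sending it.

```python
import json
import urllib.request


def build_reindex_request(es_url, source, dest):
    """Build (without sending) a POST _reindex request copying `source` into `dest`."""
    body = {"source": {"index": source}, "dest": {"index": dest}}
    return urllib.request.Request(
        # wait_for_completion=false asks Elasticsearch to run the copy as a
        # background task and to return a task id immediately.
        url=f"{es_url}/_reindex?wait_for_completion=false",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and destination index, for illustration only.
request = build_reindex_request("http://localhost:9200", "origin", "origin-backup")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit it; the duration can then be read from the task once it completes.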

The mapping is correctly updated when the initialize command is run.
For example, for a migration tested in docker (with origins and metadata ingested):

❯ diff -U30 /tmp/mapping-0.5.0.json /tmp/mapping-0.6.1.json
--- /tmp/mapping-0.5.0.json	2021-02-19 15:06:31.222879537 +0100
+++ /tmp/mapping-0.6.1.json	2021-02-19 15:33:58.465084816 +0100
@@ -1,33 +1,34 @@
 {
   "origin" : {
     "mappings" : {
+      "date_detection" : false,
       "properties" : {
...
             },
             "http://schema" : { <------ Automatic mappings are indeed present
               "properties" : {
                 "org/author" : {
                   "properties" : {
                     "@list" : {
                       "properties" : {
                         "@type" : {
                           "type" : "text",
                           "fields" : {
                             "keyword" : {
                               "type" : "keyword",
@@ -125,35 +126,38 @@
...
         "sha1" : {
           "type" : "keyword"
         },
         "url" : {
           "type" : "text",
           "fields" : {
             "as_you_type" : {
               "type" : "search_as_you_type",
               "doc_values" : false,
               "analyzer" : "simple",
               "max_shingle_size" : 3
             }
           },
           "analyzer" : "simple"
+        },
+        "visit_types" : {
+          "type" : "keyword"
         }
       }
     }
   }
 }
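The same check can be automated rather than read off a diff. The sketch below is illustrative only: it assumes the two mappings were loaded from the JSON found under `.origin.mappings`, it ignores multi-fields declared under `fields`, and the `OLD`/`NEW` values are trimmed-down stand-ins for the real mappings.

```python
def flatten(properties, prefix=""):
    """Map dotted field paths to their declared type for an ES 'properties' dict."""
    fields = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "type" in spec:
            fields[path] = spec["type"]
        # Object fields nest their children under "properties".
        fields.update(flatten(spec.get("properties", {}), prefix=f"{path}."))
    return fields


def mapping_changes(old, new):
    """Return (added, removed, retyped) field paths between two mappings."""
    old_fields = flatten(old.get("properties", {}))
    new_fields = flatten(new.get("properties", {}))
    added = sorted(set(new_fields) - set(old_fields))
    removed = sorted(set(old_fields) - set(new_fields))
    retyped = sorted(
        path for path in set(old_fields) & set(new_fields)
        if old_fields[path] != new_fields[path]
    )
    return added, removed, retyped


# Trimmed-down stand-ins for the 0.5.0 and 0.6.1 mappings.
OLD = {"properties": {"sha1": {"type": "keyword"}, "url": {"type": "text"}}}
NEW = {
    "date_detection": False,
    "properties": {
        "sha1": {"type": "keyword"},
        "url": {"type": "text"},
        "visit_types": {"type": "keyword"},
    },
}

print(mapping_changes(OLD, NEW))  # (['visit_types'], [], [])
```

An empty `removed` and `retyped` list is what makes the migration additive, so existing documents do not need to be converted.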

So the current migration can be performed with the following actions:

  • stop the journal client and the service
  • upgrade the packages on search1
  • launch the initialize command
  • reset the offset of the objects journal client to the beginning for the origin_visit topic
  • restart the service and the journal_client
  • update the webapps
  • wait for the lag to be recovered
  • journal-client and swh-search service stopped
  • package upgraded
root@search1:/etc/systemd/system# apt list --upgradable
Listing... Done
python3-swh.search/unknown 0.6.1-1~swh1~bpo10+1 all [upgradable from: 0.5.0-1~swh1~bpo10+1]
python3-swh.storage/unknown 0.23.2-1~swh1~bpo10+1 all [upgradable from: 0.23.1-1~swh1~bpo10+1]
root@search1:/etc/systemd/system# apt dist-upgrade
  • new mapping applied and checked:
    • before
% curl -s http://${ES_SERVER}/origin/_mapping\?pretty | jq '.origin.mappings' > mapping-v0.5.0.json
  • upgrade
swhstorage@search1:~$  /usr/bin/swh search --config-file /etc/softwareheritage/search/server.yml initialize
INFO:elasticsearch:HEAD http://search-esnode1.internal.softwareheritage.org:9200/origin [status:200 request:0.036s]
INFO:elasticsearch:PUT http://search-esnode2.internal.softwareheritage.org:9200/origin/_mapping [status:200 request:0.196s]
Done.
  • after
% curl -s http://${ES_SERVER}/origin/_mapping\?pretty | jq '.origin.mappings' > mapping-v0.6.1.json
  • check
% diff -U3 mapping-v0.5.0.json mapping-v0.6.1.json 
--- mapping-v0.5.0.json	2021-02-19 15:10:23.336628008 +0000
+++ mapping-v0.6.1.json	2021-02-19 15:12:50.660635267 +0000
@@ -1,4 +1,5 @@
 {
+  "date_detection": false,
   "properties": {
     "has_visits": {
       "type": "boolean"
@@ -25,6 +26,9 @@
         }
       },
       "analyzer": "simple"
+    },
+    "visit_types": {
+      "type": "keyword"
     }
   }
 }
  • reset the offsets
% /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --topic swh.journal.objects.origin_visit --to-earliest --group swh.search.journal_client --execute

GROUP                          TOPIC                          PARTITION  NEW-OFFSET     
swh.search.journal_client      swh.journal.objects.origin_visit 16         0              
swh.search.journal_client      swh.journal.objects.origin_visit 10         0              
swh.search.journal_client      swh.journal.objects.origin_visit 66         0              
...
  • restart the service and the journal client
root@search1:/etc/systemd/system# systemctl start gunicorn-swh-search.service 
root@search1:/etc/systemd/system# systemctl start swh-search-journal-client@objects.service 
root@search1:/etc/systemd/system# puppet agent --enable
  • swh-search packages updated on webapp1 and moma
  • waiting for the journal_client lag to recover
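One way to follow the recovery is to sum the LAG column printed by `kafka-consumer-groups.sh --describe --group swh.search.journal_client`. The sketch below assumes the tool's usual whitespace-separated table with a header row containing `LAG` (the exact columns vary between Kafka versions); the sample rows and their numbers are made up for illustration.

```python
def total_lag(describe_output):
    """Sum the LAG column of `kafka-consumer-groups.sh --describe` text output."""
    lag_col = None
    total = 0
    for line in describe_output.splitlines():
        cols = line.split()
        if "LAG" in cols:
            # Header row: remember the position of the LAG column.
            lag_col = cols.index("LAG")
        elif lag_col is not None and len(cols) > lag_col and cols[lag_col].isdigit():
            # Data row; rows whose lag is "-" (no committed offset) are skipped.
            total += int(cols[lag_col])
    return total


# Hypothetical output, not taken from the production cluster.
SAMPLE = """
GROUP                     TOPIC                            PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
swh.search.journal_client swh.journal.objects.origin_visit 16        1200           1500           300 -           -    -
swh.search.journal_client swh.journal.objects.origin_visit 10        900            900            0   -           -    -
"""

print(total_lag(SAMPLE))  # 300
```

A total of 0 means the journal client has caught up with the end of every partition.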

The lag has recovered, so the index should now contain the visit_types for all origins.
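This could be spot-checked by counting the documents that still lack the field. The sketch below assumes the standard Elasticsearch `_count` endpoint with an `exists` query; the host is a placeholder and `count_documents` is a hypothetical helper that is defined but not called here. Origins that were never visited may legitimately have no visit_types, so a non-zero count is not necessarily an error.

```python
import json
import urllib.request


def missing_visit_types_request(es_url, index="origin"):
    """Build a _count request for documents that have no visit_types value."""
    query = {"query": {"bool": {"must_not": {"exists": {"field": "visit_types"}}}}}
    return urllib.request.Request(
        url=f"{es_url}/{index}/_count",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_documents(request):
    """Send the request and return the "count" value of the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["count"]


# Placeholder host, for illustration only; nothing is sent at this point.
request = missing_visit_types_request("http://localhost:9200")
print(request.full_url)
```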