
Enable the swh-search environment in staging
Closed, ResolvedPublic

Description

Actions:

  • Deploy a dedicated elasticsearch node (1 node to start, we'll see if a cluster is needed)
  • Deploy a rpc search service
  • webapp.staging uses the search service
  • Deploy a search journal client (read from kafka / Write to ES)
  • Deploy a search journal client for indexed metadata topic [1]

[1] Deployed but currently stuck due to T2876

Related Objects

Event Timeline

vsellier renamed this task from Enable the swh-search in staging to Enable the swh-search environment in staging.Nov 26 2020, 5:58 PM
vsellier changed the task status from Open to Work in Progress.
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier updated the task description. (Show Details)

dedicated ES node for staging deployed (search-esnode0.internal.staging.swh.network) with D4658 and D4651

We added a 100GiB volume to search-esnode0 through terraform (D4663),
so /srv/elasticsearch can be mounted as a ZFS volume.

root@search-esnode0:~# apt install linux-headers-`uname -r`  # otherwise, zfs-dkms install is not happy
root@search-esnode0:~# apt install zfs-dkms
## reboot

root@search-esnode0:~# zpool create -f elasticsearch-volume -m /srv/elasticsearch /dev/vdb
root@search-esnode0:~# zfs set relatime=on elasticsearch-volume

root@search-esnode0:~# zpool list
NAME                   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-volume  99.5G   184K  99.5G        -         -     0%     0%  1.00x    ONLINE  -

root@search-esnode0:~# puppet agent --test  # to rectify the permissions adequately
Info: Loading facts
Info: Caching catalog for search-esnode0.internal.staging.swh.network
Info: Applying configuration version '1607081929'
Notice: /Stage[main]/Profile::Elasticsearch/File[/srv/elasticsearch]/owner: owner changed 'root' to 'elasticsearch'
Notice: /Stage[main]/Profile::Elasticsearch/File[/srv/elasticsearch]/group: group changed 'root' to 'elasticsearch'
Notice: /Stage[main]/Profile::Elasticsearch/File[/srv/elasticsearch]/mode: mode changed '0755' to '2755'
Notice: /Stage[main]/Profile::Elasticsearch/Service[elasticsearch]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Profile::Elasticsearch/Service[elasticsearch]: Unscheduling refresh on Service[elasticsearch]
Notice: Applied catalog in 15.19 seconds
root@search-esnode0:~#
root@search-esnode0:~# systemctl status elasticsearch.service
● elasticsearch.service - Elasticsearch
   Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/elasticsearch.service.d
           └─elasticsearch.conf
   Active: active (running) since Fri 2020-12-04 11:39:08 UTC; 4s ago
     Docs: https://www.elastic.co

# reboot to be sure

# status: everything is fine!

Interesting note on how to size the shards of an index: https://www.elastic.co/guide/en/elasticsearch/reference/7.x//size-your-shards.html

We'll start with a fairly large number of shards (80) to test the node's behavior. It matches the advised rules of fewer than 20 shards per GB of heap and shards smaller than 50GB each.
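A quick sanity check of that rule against the chosen shard count (a minimal sketch; the 8GB heap value is an assumption, not taken from the log):

```shell
# Check the 80-shard choice against the advised limit of
# fewer than 20 shards per GB of heap.
HEAP_GB=8          # assumed ES heap size; adjust to the node's actual -Xmx
SHARDS=80
MAX_SHARDS=$((HEAP_GB * 20))
echo "using $SHARDS shards, budget is $MAX_SHARDS"
[ "$SHARDS" -le "$MAX_SHARDS" ] && echo "within the heap budget"
```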

We'll use this template before creating the index:

curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER:9200/_index_template/origin\?pretty -d '{"index_patterns": "origin", "template": {"settings": { "index": { "number_of_replicas":0, "number_of_shards": 80 } } } } '

Applying it:

 curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER:9200/_index_template/origin\?pretty -d '{"index_patterns": "origin", "template": {"settings": { "index": { "number_of_replicas":0, "number_of_shards": 80 } } } } '
{
  "acknowledged" : true
}

A dashboard to monitor the ES cluster's behavior has been created in Grafana [1].
It will be improved during the swh-search tests.

[1]: https://grafana.softwareheritage.org/d/Hk5mBWJMz/elasticsearch?orgId=1&var-environment=staging&var-cluster=swh-search

The search rpc backend and the journal client listening on origin and origin_visit topics are deployed.
The inventory is up to date for both hosts [1][2]

[1] search0.internal.staging.swh.network: https://inventory.internal.softwareheritage.org/virtualization/virtual-machines/89/
[2] search-esnode0.internal.staging.swh.network: https://inventory.internal.softwareheritage.org/virtualization/virtual-machines/88/

In staging, the backlog was ingested in a few minutes.

health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin 5M3bUlmgRoGmx-USuHVuEQ  80   0     176306          396    122.8mb        122.8mb
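As a back-of-the-envelope check, the staging index is small relative to the shard count:

```shell
# Average documents per shard, using the _cat/indices numbers above.
DOCS=176306
SHARDS=80
echo "$((DOCS / SHARDS)) docs per shard on average"   # ~2200
```

At roughly 2200 docs and under 2MB per shard, 80 shards is clearly oversized for staging data alone; the count was chosen with the production-sized index in mind.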

In case some performance tests are needed, a copy of the production index to the staging host was started. It needs some temporary adjustments to the ES configuration, and the firewall must be opened between search-esnode0 and the production cluster:

root@search-esnode0:~# puppet agent --disable "add temporary es configuration to reindex origin index from production"

root@search-esnode0:/etc/elasticsearch# diff -U3 /tmp/elasticsearch.yml elasticsearch.yml 
--- /tmp/elasticsearch.yml      2020-12-09 08:06:54.024000000 +0000
+++ elasticsearch.yml   2020-12-09 08:07:12.136000000 +0000
@@ -10,3 +10,4 @@
 http.port: 9200
 prometheus.indices: true
 network.host: 192.168.130.80
+reindex.remote.whitelist: 192.168.100.61:9200

root@search-esnode0:/etc/elasticsearch# systemctl restart elasticsearch


root@search-esnode0:~# export ES_SERVER=192.168.130.80:9200
root@search-esnode0:~# export NEW_INDEX=origin-production

root@search-esnode0:~# curl -s http://${ES_SERVER}/origin/_mapping\?pretty | jq '.origin.mappings' > /tmp/mapping.json 

root@search-esnode0:/etc/elasticsearch# curl -XDELETE http://$ES_SERVER/$NEW_INDEX
{"acknowledged":false}

root@search-esnode0:/etc/elasticsearch# curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER/$NEW_INDEX -d '{"settings": {"index": {"number_of_shards" : 80, "number_of_replicas": 0}}}'              
{"acknowledged":true,"shards_acknowledged":true,"index":"origin-production"}

root@search-esnode0:~# curl -XPUT -H "Content-Type: application/json" http://${ES_SERVER}/${NEW_INDEX}/_mapping -d @/tmp/mapping.json                                                                              
{"acknowledged":true}

# Remote reindexing of the production index
root@search-esnode0:~# cat > /tmp/reindex-production.json <<EOF
{
  "source": {
    "remote": {
      "host": "http://192.168.100.61:9200"
    },
    "index": "origin"
  },
  "dest": {
    "index": "origin-production"
  }
}
EOF

root@search-esnode0:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&wait_for_completion=true -d @/tmp/reindex-production.json
### In progress ###

The production index import failed because the 90% used-disk-space limit was reached at some point, before falling back to around 60G after a compaction.
Progress was at 80M documents out of 91M.
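In relative terms, that corresponds to roughly:

```shell
# Progress when the import failed (document counts as reported:
# 80M copied out of the 91517657 total seen in the later full run).
COPIED=80000000
TOTAL=91517657
echo "$((COPIED * 100 / TOTAL))% of documents copied"
```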

root@search-esnode0:~# curl -s  http://${ES_SERVER}/_cat/indices/origin-production\?v
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-production uNGPhykaQ-2Nu6MX9tfxmw  80   0   83828000            0     59.3gb         59.3gb

We will increase the disk size to hold the full index. The memory of the node will also be increased to try to improve the performance of search queries on this index.

  • Partition and memory extended with terraform.
  • The disk resize needed some console actions to complete:
root@search-esnode0:~# zpool set autoexpand=on elasticsearch-volume

root@search-esnode0:~# parted /dev/vdb
GNU Parted 3.2
Using /dev/vdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) resizepart                                                       
Partition number? 1                                                       
End?  [215GB]?                                                            

root@search-esnode0:~# zpool list                                    
NAME                   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
elasticsearch-volume   200G  59.9G   140G        -         -    19%    30%  1.00x    ONLINE  -

root@search-esnode0:~# df -h | grep elasticsearch-volume
elasticsearch-volume                 194G   60G  134G  32% /srv/elasticsearch
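With the resized pool, the observed index size now sits comfortably below the disk limit that stopped the first import:

```shell
# Headroom check after the resize: ES stops allocating shards to a node
# once disk usage crosses the high watermark (90% by default).
POOL_GB=200
INDEX_GB=60        # observed origin-production size, rounded up
LIMIT_GB=$((POOL_GB * 90 / 100))
echo "watermark at ${LIMIT_GB}G, index uses ${INDEX_GB}G"
```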

The copy of the production index was restarted.
To speed up the copy, the index settings were tuned to reduce disk pressure (this is a temporary configuration and should not be used in normal operation, as it is not safe):

cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF

curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/origin-production/_settings -d @/tmp/config.json

The bottleneck is now the CPU.

The graphs show the different behaviors:

  • 10:45 -> 10:50: copy with the default configuration: medium CPU load and high I/O pressure
  • 10:50 -> after: configuration updated: high CPU load / low I/O pressure
vsellier claimed this task.

The deployment manifests are OK and deployed in staging, so this task can be resolved.
We will work on reactivating the search-journal-client for the metadata in another task, once T2876 is resolved.

FYI: the origin index was recreated with the "official" mapping and a backfill was performed (necessary after the test of the flattened mapping).

The production origin index was correctly copied from the production cluster, but apparently without the configuration that optimizes the copy.
We keep this one and will try a new optimized copy, to check whether the server still crashes with an OOM given the new CPU and memory settings.

root@search-esnode0:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @/tmp/reindex-production.json                                                             
{
  "took" : 35702805,
  "timed_out" : false,
  "total" : 91517657,
  "updated" : 0,
  "created" : 91517657,
  "deleted" : 0,
  "batches" : 91518,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

"took" : 35702805, ~= 10h

With the "optimized" configuration, the import is quite faster :

root@search-esnode0:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @/tmp/reindex-production.json    
{
  "took" : 10215280,
  "timed_out" : false,
  "total" : 91517657,
  "updated" : 0,
  "created" : 91517657,
  "deleted" : 0,
  "batches" : 91518,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

"took" : 10215280, => 2h45

The default configuration was restored with :

root@search-esnode0:/tmp# cat config-rollback.json 
{
  "index" : {
    "translog.sync_interval" : null,
    "translog.durability": null,
    "refresh_interval": null
  }
}
root@search-esnode0:/tmp# curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/$NEW_INDEX/_settings -d @/tmp/config-rollback.json
{"acknowledged":true}

root@search-esnode0:/tmp# curl http://$ES_SERVER/_all/_settings\?pretty
{
  "origin-production2" : {
    "settings" : {
      "index" : {
        "creation_date" : "1607677617631",
        "number_of_shards" : "80",
        "number_of_replicas" : "0",
        "uuid" : "P5qDO-jmQsmUOY1CI6hUcQ",
        "version" : {
          "created" : "7090399"
        },
        "provided_name" : "origin-production2"
      }
    }
  },
  "origin" : {
    "settings" : {
      "index" : {
        "creation_date" : "1607611095095",
        "number_of_shards" : "80",
        "number_of_replicas" : "0",
        "uuid" : "T0AItBcEQqKbS7FlmOs6yA",
        "version" : {
          "created" : "7090399"
        },
        "provided_name" : "origin"
      }
    }
  }
}