Disk usage warnings popped up for the main elasticsearch cluster.
Maintenance is required to clean up disk space.
Initially written in the wrong place, in T3060#59291.
On Tue 23 Feb 2021, a warning about disk usage popped up (the 80% used disk threshold was
hit) [2]
Connecting to one of the cluster nodes showed a high number of deleted docs:
root@search-esnode1:~# curl http://192.168.100.81:9200/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1  152755840    123305839    391.1gb        195.8gb
So a force merge was triggered to clean those up and reclaim some space.
curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/origin/_forcemerge
But that did not work immediately.
Instead, the load on all 3 server nodes increased.
So the swh-search-journal-client@objects service (on search1.internal.softwareheritage.org)
was stopped so elasticsearch had a chance to clean up whatever it needed to do.
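(The exact stop invocation wasn't captured in this log; presumably something along the lines of the following on search1, mirroring the start command used further down:)

root@search1:~# systemctl stop swh-search-journal-client@objects.service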
It worked, the load subsided.
And the number of deleted docs is actually decreasing (and disk space is being freed).
The initial goal was reached (freeing disk space ;)
root@search-esnode1:~# date; curl http://$ES_SERVER/_cat/indices\?v
Tue 23 Feb 2021 08:57:09 AM UTC
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1  152756605     40806220    273.1gb        137.4gb
It's still in progress.
Further discussion about the force merge action ensued among the system administrators:
09:39 <+olasd> Force merge should only be called against an index after you have finished writing to it. Force merge can cause very large (>5GB) segments to be produced, and if you continue to write to such an index then the automatic merge policy will never consider these segments for future merges until they mostly consist of deleted documents.
...
09:40 <+olasd> force merge should only be called against an index after you're *completely done* writing to it
09:41 <+olasd> if you're going to update the documents ever again, the merged segments won't be automatically cleaned, until most documents have been updated
09:41 <+olasd> because they're gigantic
...
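For context, segment sizes and per-segment deleted-doc counts can be inspected with the _cat/segments API; a sketch (the column selection is illustrative, this is not what was run at the time):

curl -s "http://${ES_SERVER}/_cat/segments/origin?v&h=shard,segment,size,docs.count,docs.deleted"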
So the next action is to discontinue the use of that index, which is tainted by
our use of the force_merge action.
I'll create a new one out of the old as soon as disk space usage allows it (more than
50% free disk space).
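For reference, free space per node can also be checked from the cluster side with the _cat/allocation API (a sketch, assuming ES_SERVER points at one of the nodes as in the commands below; this complements the icinga disk checks):

curl -s "http://${ES_SERVER}/_cat/allocation?v"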
[1]
06:32 <swhbot`> icinga PROBLEM: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is WARNING: DISK WARNING - free space: /srv/elasticsearch/nodes 41310 MB (20% inode=99%);
06:46 <swhbot`> icinga RECOVERY: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is OK: DISK OK - free space: /srv/elasticsearch/nodes 41590 MB (21% inode=99%);
06:49 <swhbot`> icinga PROBLEM: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is WARNING: DISK WARNING - free space: /srv/elasticsearch/nodes 40615 MB (20% inode=99%);
06:53 <swhbot`> icinga RECOVERY: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is OK: DISK OK - free space: /srv/elasticsearch/nodes 41737 MB (21% inode=99%);
07:05 <swhbot`> icinga PROBLEM: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is WARNING: DISK WARNING - free space: /srv/elasticsearch/nodes 41305 MB (20% inode=99%);
08:06 <swhbot`> icinga RECOVERY: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is OK: DISK OK - free space: /srv/elasticsearch/nodes 41527 MB (21% inode=99%);
08:11 <swhbot`> icinga PROBLEM: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is WARNING: DISK WARNING - free space: /srv/elasticsearch/nodes 41135 MB (20% inode=99%);
08:12 <swhbot`> icinga PROBLEM: service disk /srv/elasticsearch/nodes on search-esnode2.internal.softwareheritage.org is WARNING: DISK WARNING - free space: /srv/elasticsearch/nodes 41403 MB (20% inode=99%);
08:28 <swhbot`> icinga RECOVERY: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is OK: DISK OK - free space: /srv/elasticsearch/nodes 41620 MB (21% inode=99%);
08:33 <swhbot`> icinga PROBLEM: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is WARNING: DISK WARNING - free space: /srv/elasticsearch/nodes 41248 MB (20% inode=99%);
08:44 <swhbot`> icinga RECOVERY: service disk /srv/elasticsearch/nodes on search-esnode1.internal.softwareheritage.org is OK: DISK OK - free space: /srv/elasticsearch/nodes 42165 MB (21% inode=99%);
08:48 <swhbot`> icinga RECOVERY: service disk /srv/elasticsearch/nodes on search-esnode2.internal.softwareheritage.org is OK: DISK OK - free space: /srv/elasticsearch/nodes 56738 MB (28% inode=99%);
Create new index out of the old
export ES_SERVER=192.168.100.81:9200
cat >reindex-origin.json <<EOF
{
  "source": {
    "index": "origin"
  },
  "dest": {
    "index": "origin-backup-20210223-1017"
  }
}
EOF
curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @reindex-origin.json
in-progress ^
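Since the _reindex call above blocks until completion, the progress can also be followed out-of-band through the task API; a sketch (not what was run at the time):

curl -s "http://${ES_SERVER}/_tasks?detailed=true&actions=*reindex" | jq .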
puppet agent got disabled on search1.internal.softwareheritage.org so puppet does not
start back the swh-search-journal-client@objects service (which writes to the index) [1]
Status:
root@search-esnode1:~# watch -n 10 "curl -s http://$ES_SERVER/_cat/nodes\?v; echo ; curl -s http://$ES_SERVER/_cat/indices\?v ; echo ; df -h | grep elastic"
Every 10.0s: curl -s http://192.168.100.81:9200/_cat/nodes\?v; echo ; curl -s http://192.168.100.81:9200/_cat/indices\?v ; echo ; df -h | grep elastic    search-esnode1: Tue Feb 23 09:54:10 2021

ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.81           54          84  25    3.89    4.82     4.77 dilmrt    -      search-esnode1
192.168.100.83           61          89  30    7.28    7.20     6.80 dilmrt    -      search-esnode3
192.168.100.82           51          83  12    4.23    3.89     3.45 dilmrt    *      search-esnode2

health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin                      yFaqPPCnRFCnc5AA6Ah8lw  90   1  152756759     41962187      261gb        130.7gb
green  open   origin-backup-20210223-1017 Aa4lYwOVRRezkDP1UL7lBw  90   1      82108            0      1.5gb        766.5mb

elasticsearch-data  193G   89G  105G  47% /srv/elasticsearch/nodes
[1]
puppet agent --disable "T3067: elasticsearch maintenance, (clean up docs deleted, create new index)"
Copy just finished:
root@search-esnode1:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @reindex-origin.json
{
  "took" : 91121031,
  "timed_out" : false,
  "total" : 152756759,
  "updated" : 0,
  "created" : 152756759,
  "deleted" : 0,
  "batches" : 152757,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
root@search-esnode1:~# curl -s http://$ES_SERVER/_cat/indices\?v ; echo ; df -h | grep elastic
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin                      yFaqPPCnRFCnc5AA6Ah8lw  90   1  152756759     41962187      261gb        130.7gb
green  open   origin-backup-20210223-1017 Aa4lYwOVRRezkDP1UL7lBw  90   1  152756759            0     92.9gb         46.5gb

elasticsearch-data  193G  119G   75G  62% /srv/elasticsearch/nodes
On to actually switching to the new index and dropping the old one.
So it turns out the mapping initialization step was missing. So clean up, rinse, repeat...
without forgetting the mapping initialization step this time...
Preparing the new index (with a better name [1]) appropriately (the settings were fine):
# close misconfigured index [2]
root@search-esnode1:~# curl -s -XPOST $ES_SERVER/$BAD_INDEX/_close

# Retrieve current mapping from the main index
root@search-esnode1:~# curl -s $ES_SERVER/origin/_mapping | jq . > /tmp/origin-backup-mappings.json

# Create new index origin-production with the good mapping
root@search-esnode1:~# curl -s -XPUT -H 'Content-Type: application/json' $ES_SERVER/origin-production -d @/tmp/origin-backup-mappings.json | jq .
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "origin-production"
}

# the right mapping is installed ok:
root@search-esnode1:~# curl -s $ES_SERVER/origin-production/_mapping | jq .
{
  "origin-production": {
    "mappings": {
      "date_detection": false,
      "properties": {
        "has_visits": {
          "type": "boolean"
        },
        "intrinsic_metadata": {
          "type": "nested",
          "properties": {
            "@context": {
              "type": "keyword"
            }
          }
        },
        "sha1": {
          "type": "keyword"
        },
        "url": {
          "type": "text",
          "fields": {
            "as_you_type": {
              "type": "search_as_you_type",
              "doc_values": false,
              "analyzer": "simple",
              "max_shingle_size": 3
            }
          },
          "analyzer": "simple"
        },
        "visit_types": {
          "type": "keyword"
        }
      }
    }
  }
}
And we start the copy again:
cat >reindex-origin.json <<EOF
{
  "source": {
    "index": "origin"
  },
  "dest": {
    "index": "origin-production"
  }
}
EOF
curl -XPOST -H "Content-Type: application/json" ${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @reindex-origin.json
And now we wait, following the stats a bit:
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin            yFaqPPCnRFCnc5AA6Ah8lw  90   1  152756759     41962187      261gb        130.7gb
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1    1021300            0      1.1gb        917.9mb

elasticsearch-data  193G   88G  106G  46% /srv/elasticsearch/nodes
[1] I'll install an alias "origin" on the new "origin-production" index, keeping the
name "origin" since the index name is currently hard-coded in the swh-search code.
[2] We don't have that much space so cleaning up the wrong copy:
ardumont@search-esnode1:~% curl -s -X DELETE $ES_SERVER/origin-backup-20210223-1017
{"acknowledged":true}%
Status, still in progress:
Every 10.0s: curl -s http://192.168.100.81:9200/_cat/nodes\?v; echo ; curl -s http://192.168.100.81:9200/_cat/indices\?v ; echo ; df -h | grep elastic    search-esnode1: Wed Feb 24 16:14:58 2021

ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.81           71          89  50    5.80    5.57     5.17 dilmrt    -      search-esnode1
192.168.100.83           74          83  63    7.41    7.89     7.74 dilmrt    -      search-esnode3
192.168.100.82           74          90  22    3.13    3.74     4.06 dilmrt    *      search-esnode2

health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin            yFaqPPCnRFCnc5AA6Ah8lw  90   1  152756759     41962187      261gb        130.7gb
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1   32715334            0     58.1gb         27.6gb

elasticsearch-data  193G  111G   83G  58% /srv/elasticsearch/nodes
Finally finished:
root@search-esnode1:~# curl -XPOST -H "Content-Type: application/json" ${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @reindex-origin.json
{
  "took" : 115296461,
  "timed_out" : false,
  "total" : 152756759,
  "updated" : 0,
  "created" : 152756759,
  "deleted" : 0,
  "batches" : 152757,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
And the disk usage did not explode:
$ curl -s http://192.168.100.81:9200/_cat/indices\?v ; echo ; df -h | grep elastic
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin            yFaqPPCnRFCnc5AA6Ah8lw  90   1  152756759     41962187      261gb        130.7gb
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1  152756759            0      217gb        108.6gb

elasticsearch-data  193G  160G   34G  83% /srv/elasticsearch/nodes
Next is installing an alias "origin" on the "origin-production" index, after dropping the old
"origin" index (but tomorrow).
Install alias "origin" on "origin-production" index:
As expected, we can't add an alias with the same name as an existing index:
ardumont@search-esnode1:~% curl -s -XPOST $ES_SERVER/origin-production/_alias/origin
{"error":{"root_cause":[{"type":"invalid_alias_name_exception","reason":"Invalid alias name [origin], an index exists with the same name as the alias","index_uuid":"yFaqPPCnRFCnc5AA6Ah8lw","index":"origin"}],"type":"invalid_alias_name_exce
So first close and delete the old index to reclaim space:
# Close the index
ardumont@search-esnode1:~% curl -s -XPOST $ES_SERVER/origin/_close
{"acknowledged":true,"shards_acknowledged":true,"indices":{"origin":{"closed":true}}}%
ardumont@search-esnode1:~% curl -s -XDELETE $ES_SERVER/origin
{"acknowledged":true}%
Install the new alias:
ardumont@search-esnode1:~% curl -s -XPOST $ES_SERVER/origin-production/_alias/origin
{"acknowledged":true}%

# check on alias
ardumont@search-esnode1:~% curl -s -XGET $ES_SERVER/origin/_alias | jq .
{
  "origin-production": {
    "aliases": {
      "origin": {}
    }
  }
}
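Note for next time: the index deletion and the alias creation could presumably be done in a single atomic call with the _aliases API, avoiding the window where "origin" resolves to nothing. A sketch (not what was run here), assuming the same index and alias names:

curl -s -XPOST -H 'Content-Type: application/json' $ES_SERVER/_aliases -d '
{
  "actions": [
    { "remove_index": { "index": "origin" } },
    { "add": { "index": "origin-production", "alias": "origin" } }
  ]
}'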
Status:
ardumont@search-esnode1:~% curl -s $ES_SERVER/_cat/indices\?v
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1  152756759            0      217gb        108.6gb

ardumont@search-esnode1:~% curl -s -XPOST $ES_SERVER/origin/_search | jq . > /tmp/status-search.json
{
  "took": 11,
  "timed_out": false,
  "_shards": { "total": 90, "successful": 90, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 10000, "relation": "gte" },
    "max_score": 1,
    "hits": [
      { "_index": "origin-production", "_type": "_doc", "_id": "714a5653925512789ddc516148f4048cfc3c6b98", "_score": 1,
        "_source": { "url": "https://github.com/usarvesh1994/node_weather", "sha1": "714a5653925512789ddc516148f4048cfc3c6b98", "has_visits": true } },
      { "_index": "origin-production", "_type": "_doc", "_id": "6634fbdeff872f7fa4d806be1ddb0bf43e494cea", "_score": 1,
        "_source": { "url": "https://github.com/sschimper/Wrecking_Ball", "sha1": "6634fbdeff872f7fa4d806be1ddb0bf43e494cea" } },
      { "_index": "origin-production", "_type": "_doc", "_id": "d8ea246217f38442c78ab920d64830fb108b3456", "_score": 1,
        "_source": { "url": "https://github.com/malefique/alex", "sha1": "d8ea246217f38442c78ab920d64830fb108b3456" } },
      { "_index": "origin-production", "_type": "_doc", "_id": "4da7307b060260e63a4ace3019d0ded5386d521a", "_score": 1,
        "_source": { "url": "https://github.com/kk50576/hello100", "sha1": "4da7307b060260e63a4ace3019d0ded5386d521a" } },
      { "_index": "origin-production", "_type": "_doc", "_id": "833fbcf594895ad6b60007f45db5f1c005aca41f", "_score": 1,
        "_source": { "url": "https://github.com/andrespardo/proyecto", "sha1": "833fbcf594895ad6b60007f45db5f1c005aca41f" } },
      { "_index": "origin-production", "_type": "_doc", "_id": "a470347d9632f966548b5ba178dc5de6ee55342a", "_score": 1,
        "_source": { "url": "https://github.com/rajkiranchiluvuri/Javascript", "sha1": "a470347d9632f966548b5ba178dc5de6ee55342a" } },
      { "_index": "origin-production", "_type": "_doc", "_id": "f49e99c83ff2957e605955cf45eba85e8d2f6672", "_score": 1,
        "_source": { "url": "https://github.com/hikmetguler/H-G", "sha1": "f49e99c83ff2957e605955cf45eba85e8d2f6672" } },
      { "_index": "origin-production", "_type": "_doc", "_id": "d0d2aaff25afb603a4843afdc9f854fe1c1b1010", "_score": 1,
        "_source": { "url": "https://github.com/jburton91/solid-octo-enigma", "sha1": "d0d2aaff25afb603a4843afdc9f854fe1c1b1010" } },
      { "_index": "origin-production", "_type": "_doc", "_id": "37ed1702cd0c395974105c6bfe6bb92d4c2bb5c4", "_score": 1,
        "_source": { "url": "https://github.com/trivialmatters/crawler-site-screenshot-grabber", "sha1": "37ed1702cd0c395974105c6bfe6bb92d4c2bb5c4" } },
      { "_index": "origin-production", "_type": "_doc", "_id": "a8e6431a4c3f0b83f6b1dcd7c3c70c29bc305bec", "_score": 1,
        "_source": { "url": "https://github.com/kaganozturk/test", "sha1": "a8e6431a4c3f0b83f6b1dcd7c3c70c29bc305bec" } }
    ]
  }
}
We can now start the swh-search-journal-client@objects service back up.
Done:
root@search1:~# systemctl start swh-search-journal-client@objects.service
root@search1:~# systemctl status swh-search-journal-client@objects.service
● swh-search-journal-client@objects.service - Software Heritage Search Journal Client objects
     Loaded: loaded (/etc/systemd/system/swh-search-journal-client@.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-02-26 09:12:44 UTC; 44s ago
   Main PID: 1594719 (swh)
      Tasks: 8 (limit: 19212)
     Memory: 540.2M
        CPU: 6.207s
     CGroup: /system.slice/system-swh\x2dsearch\x2djournal\x2dclient.slice/swh-search-journal-client@objects.service
             └─1594719 /usr/bin/python3 /usr/bin/swh search --config-file /etc/softwareheritage/search/journal_client_objects.yml journal-client objects

Feb 26 09:13:26 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode2.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.438s]
Feb 26 09:13:27 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode1.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.205s]
Feb 26 09:13:27 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode3.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.158s]
Feb 26 09:13:27 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode2.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.139s]
Feb 26 09:13:27 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode1.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.288s]
Feb 26 09:13:27 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode3.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.155s]
Feb 26 09:13:28 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode2.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.136s]
Feb 26 09:13:28 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode1.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.154s]
Feb 26 09:13:28 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode3.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.168s]
Feb 26 09:13:28 search1 swh[1594719]: INFO:elasticsearch:POST http://search-esnode2.internal.softwareheritage.org:9200/origin/_bulk [status:200 request:0.356s]
And puppet reenabled:
root@search1:~# puppet agent --enable
root@search1:~# puppet agent --test
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for search1.internal.softwareheritage.org
Info: Applying configuration version '1614330842'
Notice: Applied catalog in 8.18 seconds
As expected after the previous action, the number of documents started growing again.
green open origin-production hZfuv0lVRImjOjO_rYgDzg 90 1 152795694 297907 217.6gb 109.2gb
So we're finally good with this.
Closing.
The backfill / reindexation looks aggressive for the cluster and for search: there are a lot of timeouts on the webapp's search:
File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 249, in perform_request raise ConnectionTimeout("TIMEOUT", str(e), e) elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='search-esnode3.internal.softwareheritage.org', port=9200): Read timed out. (read timeout=10))
Changing the index configuration to use async translog I/O and a less frequent refresh interval should help:
% cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
% export NEW_INDEX=origin-production
% curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${NEW_INDEX}/_settings -d @/tmp/config.json
{"acknowledged":true}
It seems to help, as the search queries are now responding in less than a second:
Mar 01 08:41:23 moma python3[3491946]: 2021-03-01 08:41:23 [3491946] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:0.795s]
Mar 01 08:46:30 moma python3[3491945]: 2021-03-01 08:46:30 [3491945] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:0.738s]
Mar 01 08:49:26 moma python3[3491944]: 2021-03-01 08:49:26 [3491944] elasticsearch:INFO GET http://search-esnode2.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:0.565s]
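Once the backfill settles down, these tuned settings can presumably be reverted to their defaults by resetting them to null (a sketch reusing the same ES_SERVER/NEW_INDEX variables as above; not something run yet):

cat >/tmp/config-reset.json <<EOF
{
  "index" : {
    "translog.sync_interval" : null,
    "translog.durability": null,
    "refresh_interval": null
  }
}
EOF
curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${NEW_INDEX}/_settings -d @/tmp/config-reset.json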