Oh, and now that we have user profile pages, we should also show a list of "my" save code now requests with their status in the user profile, for those who want to check the status of their requests synchronously (and might have disabled email notifications).
Apr 15 2021
It would be desirable to provide the user with feedback that helps fix the issue.
Apr 12 2021
- swh-site: Deploy one systemd unit (per worker) able to handle all the existing save code now request types and subscribed to the single high priority queue. The loaders are loader-git, loader-svn and loader-mercurial for now.
A script is regularly executed to close indexes older than 30 days: P1004
It should be added to puppet and scheduled in a cron job.
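P1004 itself is not reproduced here; as a rough sketch (the script path, the schedule and the exact index naming are assumptions on my part), the cron-driven cleanup could look like this:

#!/bin/bash
# Hypothetical /usr/local/bin/close-old-indexes.sh:
# close open systemlogs-* indexes older than 30 days, using the same
# _cat/indices + _close calls as the manual commands in the Apr 8 entry below.
set -e
ES_NODE=${ES_NODE:?set to an elasticsearch node, e.g. host:9200}
CUTOFF=$(date -d '30 days ago' +%Y.%m.%d)
curl -s "http://${ES_NODE}/_cat/indices?s=index" \
  | grep -v close | grep systemlogs | awk '{print $3}' \
  | while read -r index; do
      # index names look like systemlogs-2021.04.08, so a lexicographic compare is enough
      suffix=${index#systemlogs-}
      if [[ "$suffix" < "$CUTOFF" ]]; then
        curl -s -XPOST "http://${ES_NODE}/${index}/_close"
      fi
    done

with a puppet-managed cron entry along the lines of (time of day arbitrary):

0 2 * * * root ES_NODE=<node>:9200 /usr/local/bin/close-old-indexes.sh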
Apr 8 2021
The cluster is configured with the default value for cluster.max_shards_per_node [1], so it can have at most 3000 shards open (1000 * 3 nodes).
I temporarily unblocked the ingestion by closing the systemlogs indexes created before 2020-07-01:
curl -s http://$ES_NODE/_cat/indices\?s=index | grep -v close | grep systemlogs | awk '{print $3}' | grep 2020.05 | xargs -n1 -t -i{} curl -XPOST http://${ES_NODE}/{}/_close
curl -s http://$ES_NODE/_cat/indices\?s=index | grep -v close | grep systemlogs | awk '{print $3}' | grep 2020.06 | xargs -n1 -t -i{} curl -XPOST http://${ES_NODE}/{}/_close
It seems we have reached a limit on the cluster (from the logstash logs):
Apr 08 10:30:24 logstash0 logstash[1605158]: [2021-04-08T10:30:24,052][WARN ][logstash.outputs.elasticsearch][main][62d11c4234b8981da77a97955da92ac9de92b9a6dcd4582f407face31fd5c664] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"systemlogs-2021.04.08", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x2ec8df34>], :response=>{"index"=>{"_index"=>"systemlogs-2021.04.08", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"validation_exception", "reason"=>"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [3000]/[3000] maximum shards open;"}}}}
(the same warning is repeated at 10:30:24 for the following events)
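Not something that was done at the time, but for reference the open shard count and the limit can be checked, and the (dynamic) per-node limit raised as an alternative to closing indexes, with the standard cluster APIs:

# current number of active shards on the cluster
curl -s http://$ES_NODE/_cluster/health\?pretty | grep active_shards
# raise the per-node shard limit (trade-off: more shards per node means more heap and overhead per node)
curl -s -H "Content-Type: application/json" -XPUT http://$ES_NODE/_cluster/settings -d '
{ "persistent": { "cluster.max_shards_per_node": 1500 } }'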
Apr 7 2021
Operationally, there are two axes we can play with:
@ardumont we briefly discussed this a while ago with @olasd. I think the proposed solution was indeed to have a separate queue (and workers) for "save code now" requests, but not necessarily one separate queue per loader, because the current priority system wasn't considered "fast enough". Maybe we can discuss this briefly with him and synthesize here what you come up with?
We already have a priority queue system in place in the scheduler. For example, the archive schedules save code now requests with a high priority [1].
Apr 6 2021
Apr 2 2021
Mar 17 2021
Mar 8 2021
Mar 7 2021
Mar 5 2021
Mar 4 2021
Feb 2 2021
Jan 13 2021
Jan 5 2021
With the new configuration, after some time without searches, the first queries take a while before response times stabilize back to the old values:
❯ ./random_search.sh
The index configuration was reset to its defaults:
cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : null,
    "translog.durability": null,
    "refresh_interval": null
  }
}
EOF
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "90",
        "translog" : {
          "sync_interval" : "60s",
          "durability" : "async"
        },
        "provided_name" : "origin",
        "creation_date" : "1608761881782",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        }
      }
    }
  }
}
❯ curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d @/tmp/config.json
{
  "acknowledged" : true
}
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "creation_date" : "1608761881782",
        "number_of_shards" : "90",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        },
        "provided_name" : "origin"
      }
    }
  }
}
A *simple* search doesn't look impacted (it's not a real benchmark):
❯ ./random_search.sh
Jan 4 2021
The backfill was done in a couple of days.
Dec 23 2020
search1.internal.softwareheritage.org vm deployed.
The configuration of the index was automatically performed by puppet during the initial provisioning.
Index template created in elasticsearch with 1 replica and 90 shards to have the same number of shards on each node:
export ES_SERVER=192.168.100.81:9200
curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER/_index_template/origin\?pretty -d '
{
  "index_patterns": "origin",
  "template": {
    "settings": {
      "index": {
        "number_of_replicas": 1,
        "number_of_shards": 90
      }
    }
  }
}'
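To double-check that the template was taken into account before the index gets created, it can be queried back (a verification step added here for completeness, not part of the original notes):

curl -s http://$ES_SERVER/_index_template/origin\?pretty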
search-esnode[1-3] installed with zfs configured:
apt update && apt install linux-image-amd64 linux-headers-amd64
# reboot to upgrade the kernel
apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
systemctl stop elasticsearch
rm -rf /srv/elasticsearch/nodes/0
zpool create -O atime=off -m /srv/elasticsearch/nodes elasticsearch-data /dev/vdb
chown elasticsearch: /srv/elasticsearch/nodes
The inventory was updated to reserve the elasticsearch vms:
- search-esnode[1-3].internal.softwareheritage.org
- ips : 192.168.100.8[1-3]/24
The webapp is available at https://webapp1.internal.softwareheritage.org
In preparation for the deployment, the production index present on staging's elasticsearch was renamed from origin-production2 to production_origin (a clone operation will be used [1]; the original index will be left in place).
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clone-index.html
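The clone commands themselves are not recorded in the notes; following the clone API documented in [1], the rename would look roughly like this (ES_SERVER pointing at the staging elasticsearch is an assumption): the source index must be made read-only first, then cloned, then the write block can be lifted since the original index stays in place.

# put a write block on the source index, as required by the clone API
curl -s -H "Content-Type: application/json" -XPUT http://$ES_SERVER/origin-production2/_settings -d '
{ "settings": { "index.blocks.write": true } }'
# clone it under its new name
curl -s -XPOST http://$ES_SERVER/origin-production2/_clone/production_origin\?pretty
# lift the write block on the original index, which is left in place
curl -s -H "Content-Type: application/json" -XPUT http://$ES_SERVER/origin-production2/_settings -d '
{ "settings": { "index.blocks.write": null } }'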
Dec 21 2020
Dec 14 2020
With the "optimized" configuration, the import is quite faster :
root@search-esnode0:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @/tmp/reindex-production.json
{
  "took" : 10215280,
  "timed_out" : false,
  "total" : 91517657,
  "updated" : 0,
  "created" : 91517657,
  "deleted" : 0,
  "batches" : 91518,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
"took" : 10215280, => 2h45
Dec 11 2020
The production index origin was correctly copied from the production cluster, but apparently without the configuration that optimizes the copy.
We keep this one and try a new optimized copy, to check whether the server still crashes with an OOM given the new cpu and memory settings.
Dec 10 2020
FYI: The origin index was recreated with the "official" mapping and a backfill was performed (necessary after the test of the flattened mapping)
The deployment manifests are ok and deployed in staging, so this task can be resolved.
We will work on reactivating search-journal-client for the metadata in another task when T2876 is resolved
The copy of the production index is restarted.
To improve the speed of the copy, the index was tuned to reduce disk pressure (it's a temporary configuration and should not be used in normal operation as it's not safe):
cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
- Partition and memory extended with terraform.
- The disk resize needed some console actions to complete the extension:
The production index import failed because the 90% used disk space limit was reached at some point, before usage fell back to around 60G after a compaction.
The progress was at 80M documents out of 91M.
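As a side note (not from the original log), per-node disk usage and the allocation watermark settings that trigger this kind of failure can be inspected with:

# disk.percent and free space per node
curl -s http://$ES_SERVER/_cat/allocation\?v
# low / high / flood_stage watermark settings (defaults included)
curl -s "http://$ES_SERVER/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep disk.watermark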
Dec 9 2020
The search rpc backend and the journal client listening on origin and origin_visit topics are deployed.
The inventory is up to date for both hosts [1][2]
Dec 8 2020
A dashboard to monitor the ES cluster behavior has been created on grafana [1]
It will be improved during the swh-search tests
Dec 7 2020
Interesting note about how to size the shards of an index: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/size-your-shards.html
Dec 4 2020
We added a 100GiB volume to search-esnode0 through terraform (D4663), so that we could mount /srv/elasticsearch as a zfs volume.
Dec 3 2020
Dec 2 2020
Nov 27 2020
The swh-indexer stack is deployed on staging and the initial loading is done.
The data volumes are quite low:
Nov 26 2020
T2814 needs to be released beforehand.