I did too much here, but the swh-indexer -> swh-search pipeline is finished on staging (so that's good nonetheless).
Feb 10 2021
Note that the "docs.count" grew though (from 496619 to 786803) and the reasons are
unclear. The same index is used to store the metadata out of the indexer, with the same origin url
as key [1], and we are computing index metadata on origins already seen (thus already present
in the index afaiui). So I would have expected the docs.count to stay roughly (or even
exactly?) the same as before.
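One hedged hypothesis: the docs.count reported by _cat/indices counts low-level Lucene documents, including nested ones, so indexing intrinsic metadata with nested fields could grow it without adding new origins. Comparing with the _count API, which only counts top-level documents, would confirm or rule this out:

# compare top-level document count with the Lucene-level docs.count
curl -s http://${ES_SERVER}/origin/_count?pretty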
swh-search-journal-client@indexed kept up with its topic:
GROUP                              TOPIC                                           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                    HOST             CLIENT-ID
swh.search.journal_client.indexed  swh.journal.indexed.origin_intrinsic_metadata  0          13653216        13653216        0    rdkafka-7c45245c-814f-47f1-ba67-041e4f426373  /192.168.130.90  rdkafka
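(The output above presumably comes from kafka-consumer-groups.sh; the exact invocation below is an assumption:)

./kafka-consumer-groups.sh --bootstrap-server kafka1.internal.softwareheritage.org:9092 \
    --describe --group swh.search.journal_client.indexed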
Feb 9 2021
tl;dr: deployed on staging and it seems OK.
I just mentioned some on T2912#58067 but it's unclear whether that's actually true or me misremembering things.
Feb 5 2021
Feb 4 2021
This is a duplicate of T75, the history of which would probably be useful to take into account (I suspect it can be closed).
Feb 3 2021
Is there some remaining blocker on this?
(If not, I'll attend to it next week.)
Feb 2 2021
Feb 1 2021
The backfill is done.
Jan 29 2021
The journal_client has almost ingested the topics[1] it listens to. It took some more time because a backfill of origin_visit_status was launched for T2993.
It should be done by the end of the day.
Jan 27 2021
Let's consider it done.
To decrease the time needed to recover the lag, several journal clients were launched in parallel with:
/usr/bin/swh search --config-file /etc/softwareheritage/search/journal_client_objects.yml journal-client objects
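A sketch of such a parallel launch (not necessarily the exact commands used): since all clients join the same Kafka consumer group, the topic partitions are rebalanced across them automatically.

# run N journal clients in the background; Kafka spreads partitions among them
for i in $(seq 1 4); do
    /usr/bin/swh search --config-file /etc/softwareheritage/search/journal_client_objects.yml \
        journal-client objects &
done
wait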
Jan 26 2021
Back on this: the plan is now to make swh-journal not depend on the actual model definition, a dependency currently mostly due to the presence of journal_data.py in swh-journal. The plan is therefore to move this file into swh-model so it is kept up to date with swh-model, even if it is mostly used for testing other packages (like swh-journal).
Upgrading the index configuration to speed up the indexation:
% cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
% export ES_SERVER=192.168.100.81:9200
% export INDEX=origin
% curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @/tmp/config.json
{"acknowledged":true}
Production
- puppet disabled
- Services stopped:
root@search1:~# systemctl stop swh-search-journal-client@objects.service
root@search1:~# systemctl stop gunicorn-swh-search
- Index deleted and recreated
% export ES_SERVER=search-esnode1.internal.softwareheritage.org:9200
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin Mq8dnlpuRXO4yYoC6CTuQw  90   1  151716299     38861934    260.8gb          131gb
% curl -XDELETE http://$ES_SERVER/origin
{"acknowledged":true}
% swh search --config-file /etc/softwareheritage/search/server.yml initialize
INFO:elasticsearch:PUT http://search-esnode1.internal.softwareheritage.org:9200/origin [status:200 request:2.216s]
INFO:elasticsearch:PUT http://search-esnode3.internal.softwareheritage.org:9200/origin/_mapping [status:200 request:0.151s]
Done.
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1          0            0     36.5kb         18.2kb
- journal client's consumer group deleted:
% export SERVER=kafka1.internal.softwareheritage.org:9092
% ./kafka-consumer-groups.sh --bootstrap-server ${SERVER} --delete --group swh.search.journal_client
Deletion of requested consumer groups ('swh.search.journal_client') was successful.
- journal client restarted
- puppet enabled
The filter on visited origins is working correctly on staging. The has_visit flag looks good.
For example, for the https://www.npmjs.com/package/@ehmicky/dev-tasks origin:
{
  "_index" : "origin",
  "_type" : "_doc",
  "_id" : "019bd314416108304165e82dd92e00bc9ea85a53",
  "_score" : 60.56421,
  "_source" : {
    "url" : "https://www.npmjs.com/package/@ehmicky/dev-tasks",
    "sha1" : "019bd314416108304165e82dd92e00bc9ea85a53"
  },
  "sort" : [
    60.56421,
    "019bd314416108304165e82dd92e00bc9ea85a53"
  ]
}

swh=> select * from origin join origin_visit_status on id=origin where id=469380;
   id   |                       url                        | origin | visit |             date              | status  | metadata |                  snapshot                  | type
--------+--------------------------------------------------+--------+-------+-------------------------------+---------+----------+--------------------------------------------+------
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks | 469380 |     1 | 2021-01-25 13:30:47.221937+00 | created |          |                                            | npm
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks | 469380 |     1 | 2021-01-25 13:41:59.435579+00 | partial |          | \xe3f24413d81fd3e9c309686fcfb6c8f5eb549acf | npm
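For illustration, a hedged sketch of the kind of query that exercises this filter (the field name follows the has_visit flag mentioned above; the exact query built by swh-search may differ):

curl -s -H "Content-Type: application/json" "http://${ES_SERVER}/origin/_search?pretty" -d '
{
  "query": {
    "bool": {
      "must": { "match": { "url": "ehmicky dev-tasks" } },
      "filter": { "term": { "has_visit": true } }
    }
  }
}'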
Jan 25 2021
Staging
We are proceeding with a complete rebuild of the index.
Regarding the index rebuilding process, a naive approach using an alias over both the old and the new index[1] returns duplicated results when a search is performed.
Using an alias pointing only to the old index, rebuilding a new index, and then switching the alias to the new index[2] can be a first approach, with the caveat that the old index will not be updated until the alias is switched to the new one.
It also requires that the swh-search code be able to use different names for the read and write operations.
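For illustration, the alias switch itself can be done atomically with the _aliases endpoint (index names origin-v1/origin-v2 are hypothetical):

curl -s -H "Content-Type: application/json" -XPOST http://${ES_SERVER}/_aliases -d '
{
  "actions": [
    { "remove": { "index": "origin-v1", "alias": "origin" } },
    { "add":    { "index": "origin-v2", "alias": "origin" } }
  ]
}'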
Jan 21 2021
Jan 13 2021
I'm closing this issue as there is no more action to perform at the moment.
Diagnosis and eventual fixes will be followed up in dedicated issues.
Jan 11 2021
Jan 7 2021
Version v0.4.1 created with the latest commit (rDSEA47db624364d4e781f8fa157b2d72d0eb9929b7a0).
Oh right, they were wrongfully set to True. I guess we can write a small script to set them all to False before we re-consume the statuses.
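Such a script could boil down to a single _update_by_query call (a sketch, assuming the flag is the has_visit field on the origin index):

# reset every has_visit flag to false before re-consuming the statuses
curl -s -H "Content-Type: application/json" \
    -XPOST "http://${ES_SERVER}/origin/_update_by_query?conflicts=proceed" -d '
{
  "script": { "source": "ctx._source.has_visit = false", "lang": "painless" },
  "query": { "term": { "has_visit": true } }
}'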
> how doing it without killing all the search
It depends on what will be implemented in T2936, but a new reindex will probably have to be done to fix the search. It will be the opportunity to think about how to do it without killing the whole search.
Yes indeed. swh-search was written before we had origin visit statuses, and I forgot to update it.
@vlorentz I was checking some differences between swh-search and the current search. Does the journal client have to listen to the origin_visit topic? It seems that `origin_visit_status` should be enough to match the behavior of the search in the webapp.
Jan 6 2021
webapp1 is now plugged into the real live production index.
Let's monitor the behavior with real searches.
First observation: the search retrieves all the documents and is not as progressive as the random search script.
The response times are longer than expected:
Jan 06 09:59:46 search1 python3[813]: 2021-01-06 09:59:46 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:3.399s]
Jan 06 10:06:18 search1 python3[848]: 2021-01-06 10:06:18 [848] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:7.422s]
Jan 06 10:06:21 search1 python3[813]: 2021-01-06 10:06:21 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:5.077s]
Jan 06 10:07:32 search1 python3[813]: 2021-01-06 10:07:32 [813] elasticsearch:INFO GET http://search-esnode2.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:4.819s]
Jan 06 10:08:06 search1 python3[813]: 2021-01-06 10:08:06 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.700s]
Jan 06 10:08:15 search1 python3[813]: 2021-01-06 10:08:15 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.414s]
The performance looks acceptable as is for a small number of parallel searches (~10). Let's now try with real searches; it will also help to adapt the cluster configuration and validate the behavior.
Jan 5 2021
In the new configuration, after some time without searches, the first ones take a while before stabilizing back to the old values:
❯ ./random_search.sh
The index configuration was reset to its defaults:
cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : null,
    "translog.durability": null,
    "refresh_interval": null
  }
}
EOF

❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "90",
        "translog" : {
          "sync_interval" : "60s",
          "durability" : "async"
        },
        "provided_name" : "origin",
        "creation_date" : "1608761881782",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        }
      }
    }
  }
}
❯ curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d @/tmp/config.json
{
  "acknowledged" : true
}
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "creation_date" : "1608761881782",
        "number_of_shards" : "90",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        },
        "provided_name" : "origin"
      }
    }
  }
}
A *simple* search doesn't look impacted (it's not a real benchmark):
❯ ./random_search.sh
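The script itself is not shown in this log; a hypothetical sketch of such a random search:

#!/usr/bin/env bash
# hypothetical sketch: time a search for a random word against the origin index
ES_SERVER=192.168.100.81:9200
word=$(shuf -n1 /usr/share/dict/words)
curl -s -o /dev/null -w "search for '${word}': %{time_total}s\n" \
    -H "Content-Type: application/json" \
    "http://${ES_SERVER}/origin/_search?size=100" \
    -d "{\"query\": {\"match\": {\"url\": \"${word}\"}}}"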
Jan 4 2021
The backfill was done in a couple of days.
Dec 23 2020
search1.internal.softwareheritage.org VM deployed.
The configuration of the index was automatically performed by puppet during the initial provisioning.
Index template created in elasticsearch with 1 replica and 90 shards to have the same number of shards on each node:
export ES_SERVER=192.168.100.81:9200
curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER/_index_template/origin\?pretty -d '
{
  "index_patterns": "origin",
  "template": {
    "settings": {
      "index": {
        "number_of_replicas": 1,
        "number_of_shards": 90
      }
    }
  }
}'
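The resulting template can be verified with:

curl -s http://$ES_SERVER/_index_template/origin\?pretty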
search-esnode[1-3] installed with ZFS configured:
apt update && apt install linux-image-amd64 linux-headers-amd64
# reboot to upgrade the kernel
apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
systemctl stop elasticsearch
rm -rf /srv/elasticsearch/nodes/0
zpool create -O atime=off -m /srv/elasticsearch/nodes elasticsearch-data /dev/vdb
chown elasticsearch: /srv/elasticsearch/nodes
Inventory was updated to reserve the elasticsearch VMs:
- search-esnode[1-3].internal.softwareheritage.org
- IPs: 192.168.100.8[1-3]/24
The webapp is available at https://webapp1.internal.softwareheritage.org
In preparation for the deployment, the production index present on the staging elasticsearch was renamed from origin-production2 to production_origin (a clone operation will be used [1]; the original index will be left in place).
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clone-index.html
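For reference, the clone per [1] would go roughly like this (the source index must be write-blocked before cloning):

# block writes on the source index, then clone it under the new name
curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/origin-production2/_settings -d '
{ "settings": { "index.blocks.write": true } }'
curl -s -XPOST http://${ES_SERVER}/origin-production2/_clone/production_origin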
Dec 22 2020
Dec 21 2020
Dec 14 2020
With the "optimized" configuration, the import is quite faster :
root@search-esnode0:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @/tmp/reindex-production.json { "took" : 10215280, "timed_out" : false, "total" : 91517657, "updated" : 0, "created" : 91517657, "deleted" : 0, "batches" : 91518, "version_conflicts" : 0, "noops" : 0, "retries" : { "bulk" : 0, "search" : 0 }, "throttled_millis" : 0, "requests_per_second" : -1.0, "throttled_until_millis" : 0, "failures" : [ ] }
"took" : 10215280, => 2h45
Dec 11 2020
The production index origin was correctly copied from the production cluster, but seemingly without the configuration to optimize the copy.
We keep this one and try a new optimized copy to check whether the server still crashes with an OOM given the new CPU and memory settings.
Dec 10 2020
FYI: the origin index was recreated with the "official" mapping and a backfill was performed (necessary after the test of the flattened mapping).
The deployment manifests are OK and deployed in staging, so this task can be resolved.
We will work on reactivating the search-journal-client for the metadata in another task, once T2876 is resolved.
The copy of the production index has been restarted.
To improve the speed of the copy, the index was tuned to reduce disk pressure (a temporary configuration that should not be used in the normal case, as it is not safe):
cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
- Partition and memory extended with terraform.
- The disk resize needed some console actions to be completed:
The production index import failed because the limit of 90% of used disk space was reached at some point, falling back to around 60G after a compaction.
The progress was at 80M documents out of 91M.
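For reference, the 90% limit corresponds to Elasticsearch's default high disk watermark (cluster.routing.allocation.disk.watermark.high); per-node disk usage can be checked with:

curl -s "http://${ES_SERVER}/_cat/allocation?v"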