The JSON document associated with an origin in ES has a has_visit field; closing this as invalid.
Mar 1 2021
The backfill is done; the search on metadata seems to work correctly.
The backfill / reindexation is aggressive for the cluster and the search: there are a lot of timeouts on the webapp's search:
```
  File "/usr/lib/python3/dist-packages/elasticsearch/connection/http_urllib3.py", line 249, in perform_request
    raise ConnectionTimeout("TIMEOUT", str(e), e)
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='search-esnode3.internal.softwareheritage.org', port=9200): Read timed out. (read timeout=10))
```
Feb 26 2021
As expected after the previous action, the number of documents started growing again:
```
green open origin-production hZfuv0lVRImjOjO_rYgDzg 90 1 152795694 297907 217.6gb 109.2gb
```
We can restart the swh-search-journal-client@objects service.
Install alias "origin" on "origin-production" index:
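The log doesn't show the exact command used; a minimal sketch of the request body for the Elasticsearch _aliases API (POSTed to ${ES_SERVER}/_aliases; the index and alias names are taken from the line above):

```json
{
  "actions": [
    { "add": { "index": "origin-production", "alias": "origin" } }
  ]
}
```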
Feb 25 2021
Finally finished:
```
root@search-esnode1:~# curl -XPOST -H "Content-Type: application/json" ${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @reindex-origin.json
{
  "took" : 115296461,
  "timed_out" : false,
  "total" : 152756759,
  "updated" : 0,
  "created" : 152756759,
  "deleted" : 0,
  "batches" : 152757,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
```
Feb 24 2021
Status, still in progress:
```
Every 10.0s: curl -s http://192.168.100.81:9200/_cat/nodes\?v; echo ; curl -s http://192.168.100.81:9200/_cat/indices\?v ; echo ; df -h | grep elastic
search-esnode1: Wed Feb 24 16:14:58 2021
```
It turns out the mapping initialization step was missing. So: clean up, rinse, repeat...
without forgetting the mapping initialization step this time...
Copy just finished:
```
root@search-esnode1:~# curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/_reindex\?pretty\&refresh=true\&requests_per_second=-1\&\&wait_for_completion=true -d @reindex-origin.json
{
  "took" : 91121031,
  "timed_out" : false,
  "total" : 152756759,
  "updated" : 0,
  "created" : 152756759,
  "deleted" : 0,
  "batches" : 152757,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
```
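The reindex-origin.json body is not reproduced in the log; a _reindex request of roughly this shape would match the command above (the source and destination index names here are assumptions, not copied from the task):

```json
{
  "source": { "index": "origin" },
  "dest": { "index": "origin-production" }
}
```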
Feb 23 2021
Create a new index out of the old one.
Initially wrongly written in T3060#59291.
Comment discarded (unrelated to this task) and reported in a dedicated task [1]
Feb 19 2021
- stop the journal client
```
root@search0:~# systemctl stop swh-search-journal-client@objects.service
root@search0:~# puppet agent --disable "stop search journal client to reset offsets"
```
- reset the offset for the swh.journal.objects.origin_visit topic:
```
vsellier@journal0 ~ % /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server $SERVER --reset-offsets --topic swh.journal.objects.origin_visit --to-earliest --group swh.search.journal_client --execute
```
Regarding the missing visit_type, one of the topics containing the visit_type needs to be replayed to populate the field for all the origins.
As the index was restored from the backup, the field was only set for the visits done in the last 15 days.
The offset will be reset only for the origin_visit topic to limit the work.
Regarding the index size, it seems to be due to a huge number of deleted documents (probably because of the backlog and a document update at each change):
```
% curl -s http://${ES_SERVER}/_cat/indices\?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin                      HthJj42xT5uO7w3Aoxzppw  80   0     868634      8577610     10.5gb         10.5gb
green  close  origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0
green  open   origin-v0.5.0               SGplSaqPR_O9cPYU4ZsmdQ  80   0     868121            0    987.7mb        987.7mb
green  open   origin-toremove             PL7WEs3FTJSQy4dgGIwpeQ  80   0     868610            0    987.5mb        987.5mb  <-- a clean copy of the origin index has almost the same size as yesterday
```
Forcing a merge seems to restore a decent size:
```
% curl -XPOST -H "Content-Type: application/json" http://${ES_SERVER}/origin/_forcemerge
{"_shards":{"total":80,"successful":80,"failed":0}}
```
```
% curl -s http://${ES_SERVER}/_cat/indices\?v
health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin                      HthJj42xT5uO7w3Aoxzppw  80   0     868684         3454        1gb            1gb
green  close  origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg  80   0
green  open   origin-v0.5.0               SGplSaqPR_O9cPYU4ZsmdQ  80   0     868121            0    987.7mb        987.7mb
green  open   origin-toremove             PL7WEs3FTJSQy4dgGIwpeQ  80   0     868610            0    987.5mb        987.5mb
```
It will probably be something to schedule regularly on the production index if size matters.
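One way such a schedule could look, as a cron.d fragment (a sketch only: the file path, schedule, and host are hypothetical, not a deployed setup; only_expunge_deletes limits the merge to reclaiming deleted documents):

```
# /etc/cron.d/es-origin-forcemerge (hypothetical)
# Expunge deleted documents from the origin index every Sunday at 03:00
0 3 * * 0  root  curl -s -XPOST "http://search-esnode1.internal.softwareheritage.org:9200/origin/_forcemerge?only_expunge_deletes=true"
```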
The journal clients recovered, so the index is up-to-date.
Let's check some points before closing:
- The index size looks huge (~10g) compared to before the deployment
- it seems some documents do not have origin_visit_type populated as they should:
```
swh=> select * from origin where url='deb://Debian/packages/node-response-time';
  id   |                   url
-------+------------------------------------------
 15552 | deb://Debian/packages/node-response-time
(1 row)
```
Feb 18 2021
- Copy the backup of the index done in T2780
- delete current index
Stop the journal clients and swh-search:
```
root@search0:~# puppet agent --disable "swh-search upgrade"
root@search0:~# systemctl stop swh-search-journal-client@objects.service
root@search0:~# systemctl stop swh-search-journal-client@indexed.service
root@search0:~# systemctl stop gunicorn-swh-search.service
```
Update the packages:
```
root@search0:~# apt update && apt list --upgradable
...
python3-swh.search/unknown 0.6.0-1~swh1~bpo10+1 all [upgradable from: 0.5.0-1~swh1~bpo10+1]
...
```
The dashboard was moved to the system directory: the new url is https://grafana.softwareheritage.org/goto/uBHBojEGz
swh-search v0.5.0 is deployed in all the environments; the metrics are correctly gathered by Prometheus.
Let's create a real dashboard now [1]
This is the mapping of the origin index with the metadata: P953
Feb 17 2021
Feb 16 2021
Feb 15 2021
Feb 12 2021
A basic dashboard [1] was created on Grafana based on the number of log lines.
It's too limited, as it's not possible to isolate the logs per environment: the information is not available.
It will be added in T3043
Feb 11 2021
Done scheduling:
T3041 needs to be done before this one (for the production environment)
D5063 is applied, the main webapp is now using swh-search by default.
Running:
```
swhscheduler@saatchi:~$ /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/backend.yml task schedule_origins --storage-url http://saam.internal.softwareheritage.org:5002 --batch-size 20 index-origin-metadata | tee /tmp/schedule-origins.txt
```
@ardumont no, OriginMetadataIndexer lacks a filter step.
Although, now i'm wondering something.
Is that enough to write what's not in the topics?
Ah no! I misused the CLI; with the right flags:
This needs storage access, so a dedicated configuration file must be edited.
The main webapp search can be switched from the SQL search to swh-search, as all the tests performed on staging and https://webapp1.internal.softwareheritage.org are OK.
That's it! [1]