Today
The fix is deployed on webapp1 and solved the problem.
The storage version v0.21.1 is deployed in staging; the problem looks fixed:
❯ curl -s https://webapp.staging.swh.network/api/1/origin/https://gitlab.com/miwc/miwc.github.io.git/visit/latest/\?require_snapshot\=true | jq ''
{
  "origin": "https://gitlab.com/miwc/miwc.github.io.git",
  "date": "2020-12-07T18:21:58.967952+00:00",
  "type": "git",
  "visit": 1,
  "status": "full",
  "snapshot": "759b36e0e3e81e8cbf601181829571daa645b5d2",
  "metadata": {},
  "origin_url": "https://webapp.staging.swh.network/api/1/origin/https://gitlab.com/miwc/miwc.github.io.git/get/",
  "snapshot_url": "https://webapp.staging.swh.network/api/1/snapshot/759b36e0e3e81e8cbf601181829571daa645b5d2/"
}
Yesterday
This is an attempt to generate a global schema of the staging environment (P929):
It seems to be ok :)
Thanks :)
Use an exception to validate that a repo page can be accessed
rebase
Restore missing log when the date can't be parsed
rebase
Remove useless variable
- Reorder methods
- Adapt date parsing according to the review
To decrease the time needed to recover the lag, several journal clients were launched in parallel with:
/usr/bin/swh search --config-file /etc/softwareheritage/search/journal_client_objects.yml journal-client objects
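A minimal sketch of what launching several clients in parallel could look like from a shell (the instance count of 4 is arbitrary; the clients share a single Kafka consumer group, so the topic partitions get balanced across them):
% for i in 1 2 3 4; do
>   /usr/bin/swh search --config-file /etc/softwareheritage/search/journal_client_objects.yml journal-client objects &
> done
% wait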
Tue, Jan 26
Inline unnecessary indirection
Add missing test coverage
Upgrading the index configuration to speed up the indexation:
% cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
% export ES_SERVER=192.168.100.81:9200
% export INDEX=origin
% curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @/tmp/config.json
{"acknowledged":true}
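These settings trade durability and refresh latency for bulk-indexing throughput. A sketch of reverting to the stock Elasticsearch defaults once the backfill is done (assuming no other custom values need to be preserved):
% cat >/tmp/config-reset.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "5s",
    "translog.durability": "request",
    "refresh_interval": "1s"
  }
}
EOF
% curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @/tmp/config-reset.json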
Production
- puppet disabled
- Services stopped:
root@search1:~# systemctl stop swh-search-journal-client@objects.service
root@search1:~# systemctl stop gunicorn-swh-search
- Index deleted and recreated
% export ES_SERVER=search-esnode1.internal.softwareheritage.org:9200
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin Mq8dnlpuRXO4yYoC6CTuQw  90   1  151716299     38861934    260.8gb          131gb
% curl -XDELETE http://$ES_SERVER/origin
{"acknowledged":true}
% swh search --config-file /etc/softwareheritage/search/server.yml initialize
INFO:elasticsearch:PUT http://search-esnode1.internal.softwareheritage.org:9200/origin [status:200 request:2.216s]
INFO:elasticsearch:PUT http://search-esnode3.internal.softwareheritage.org:9200/origin/_mapping [status:200 request:0.151s]
Done.
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1          0            0     36.5kb         18.2kb
- journal client's consumer group deleted:
% export SERVER=kafka1.internal.softwareheritage.org:9092
% ./kafka-consumer-groups.sh --bootstrap-server ${SERVER} --delete --group swh.search.journal_client
Deletion of requested consumer groups ('swh.search.journal_client') was successful.
- journal client restarted (restart commands sketched after this list)
- puppet enabled
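For reference, restarting mirrors the stop commands above; a minimal sketch using the same unit names:
root@search1:~# systemctl start swh-search-journal-client@objects.service
root@search1:~# systemctl start gunicorn-swh-search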
The filter on visited origins is working correctly on staging. The has_visit flag looks good.
For example, for the https://www.npmjs.com/package/@ehmicky/dev-tasks origin:
{ "_index" : "origin", "_type" : "_doc", "_id" : "019bd314416108304165e82dd92e00bc9ea85a53", "_score" : 60.56421, "_source" : { "url" : "https://www.npmjs.com/package/@ehmicky/dev-tasks", "sha1" : "019bd314416108304165e82dd92e00bc9ea85a53" }, "sort" : [ 60.56421, "019bd314416108304165e82dd92e00bc9ea85a53" ] }
swh=> select * from origin join origin_visit_status on id=origin where id=469380;
   id   |                        url                        | origin | visit |             date              | status  | metadata |                  snapshot                  | type
--------+---------------------------------------------------+--------+-------+-------------------------------+---------+----------+--------------------------------------------+------
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks  | 469380 |     1 | 2021-01-25 13:30:47.221937+00 | created |          |                                            | npm
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks  | 469380 |     1 | 2021-01-25 13:41:59.435579+00 | partial |          | \xe3f24413d81fd3e9c309686fcfb6c8f5eb549acf | npm
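For reference, a hand-written Elasticsearch query of the same kind (a sketch only; the match text is arbitrary, and it assumes the origin documents carry a boolean has_visit field next to url/sha1):
% cat >/tmp/query.json <<EOF
{
  "query": {
    "bool": {
      "must": { "match": { "url": "ehmicky dev-tasks" } },
      "filter": { "term": { "has_visit": true } }
    }
  }
}
EOF
% curl -s -H "Content-Type: application/json" http://$ES_SERVER/origin/_search -d @/tmp/query.json | jq '.hits.hits'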
Mon, Jan 25
Staging
We are proceeding with a complete index rebuild.
Regarding the index rebuilding process, a naive approach using an alias over both the old and the new index[1] returns duplicated results when a search is performed.
Using an alias pointing only to the old index, rebuilding a new index, then switching the alias to the new index[2] can be a first approach, with the drawback that the old index will not be updated until the alias is switched to the new one.
It also requires that the swh-search code be able to use different index names for read and write operations.
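A minimal sketch of the alias switch itself (the origin_v1/origin_v2 index names are hypothetical; both actions are applied in a single atomic request, so reads move to the new index in one step):
% cat >/tmp/switch-alias.json <<EOF
{
  "actions": [
    { "remove": { "index": "origin_v1", "alias": "origin" } },
    { "add": { "index": "origin_v2", "alias": "origin" } }
  ]
}
EOF
% curl -s -H "Content-Type: application/json" -XPOST http://$ES_SERVER/_aliases -d @/tmp/switch-alias.json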
LGTM
- rebase
- update tests according to the review feedback
It seems Redis has a HyperLogLog functionality[1] that can match the requirements (bloom filter / limited deviation / small memory footprint / efficiency).
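A minimal sketch of the primitives involved, via redis-cli (the key name and values are made up; PFCOUNT returns an approximate cardinality with a standard error of roughly 0.81%):
% redis-cli PFADD seen:origins https://example.org/repo1 https://example.org/repo2
(integer) 1
% redis-cli PFADD seen:origins https://example.org/repo1
(integer) 0
% redis-cli PFCOUNT seen:origins
(integer) 2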
Sat, Jan 23
Fri, Jan 22
Thu, Jan 21
rebase
rebase
Test with "several" database upserts as it's more realistic
split the long entrypoint's command line
rebase