Build is green
Jan 26 2021
Build is green
Build is green
rebase
rebase
rebase
rebase
- rebase (including documentation of RAM usage)
- document _visit_times
- rename get_current_snapshot to get_current_snapshot_id
Build is green
In D4914#124243, @douardda wrote:
> Something I don't understand: why do you need to keep both _visit_times and latest_snapshots in "caches" when a snapshot is derived from this visit time (and visit type and origin)?
Build is green
Build is green
rebase
rebase
rebase
add doc on the origin model
If I'm implementing this nonetheless, I'll go with the header (bearer) implementation.
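A minimal sketch of the header (bearer) approach, assuming a personal or OAuth token; the helper name and the example instance URL are illustrative, not part of the actual lister code. GitLab's API accepts `Authorization: Bearer <token>` (for OAuth tokens) or the `PRIVATE-TOKEN` header (for personal access tokens), which is why the header approach is preferable to Basic auth here:

```python
def gitlab_auth_headers(token: str, use_private_token: bool = False) -> dict:
    """Build auth headers for a GitLab API request.

    OAuth tokens go in the Authorization header as a bearer token;
    personal access tokens can alternatively use PRIVATE-TOKEN.
    """
    if use_private_token:
        return {"PRIVATE-TOKEN": token}
    return {"Authorization": f"Bearer {token}"}


# Hypothetical usage with `requests` against an example instance:
#
#   import requests
#   resp = requests.get(
#       "https://gitlab.example.org/api/v4/projects",
#       headers=gitlab_auth_headers("my-token"),
#   )
```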
- stop the scheduler runner to avoid blocking queries
some suggestions inline.
In any case, bravo for the effort on this tough lister!
Referencing the specification is enough, rather than a .xsd file.
On a related note, I found this list of Community-Hosted GitLab Instances. Most of them have public access and could be added to the set of Gitlab instances listed by Software Heritage.
- Stop the workers:
$ clush -b -w @swh-workers 'puppet agent --disable "Deploy new storage version"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
Upgrading the index configuration to speed up indexing:
% cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
% export ES_SERVER=192.168.100.81:9200
% export INDEX=origin
% curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @/tmp/config.json
{"acknowledged":true}
Bravo, I did not check. I trusted the old code.
It's doing what was done in the earlier version [1]
I'll check now.
And once again, this "cache" behavior makes the simulator unable to run "forever" (it will eat RAM). Maybe it's an assumed design choice, but please document it somewhere.
Something I don't understand: why do you need to keep both _visit_times and latest_snapshots in "caches" when a snapshot is derived from this visit time (and visit type and origin)?
Production
- puppet disabled
- Services stopped:
root@search1:~# systemctl stop swh-search-journal-client@objects.service
root@search1:~# systemctl stop gunicorn-swh-search
- Index deleted and recreated
% export ES_SERVER=search-esnode1.internal.softwareheritage.org:9200
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin Mq8dnlpuRXO4yYoC6CTuQw  90   1  151716299     38861934    260.8gb          131gb
% curl -XDELETE http://$ES_SERVER/origin
{"acknowledged":true}
% swh search --config-file /etc/softwareheritage/search/server.yml initialize
INFO:elasticsearch:PUT http://search-esnode1.internal.softwareheritage.org:9200/origin [status:200 request:2.216s]
INFO:elasticsearch:PUT http://search-esnode3.internal.softwareheritage.org:9200/origin/_mapping [status:200 request:0.151s]
Done.
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1          0            0     36.5kb         18.2kb
- Journal client's consumer group deleted:
% export SERVER=kafka1.internal.softwareheritage.org:9092
% ./kafka-consumer-groups.sh --bootstrap-server ${SERVER} --delete --group swh.search.journal_client
Deletion of requested consumer groups ('swh.search.journal_client') was successful.
- journal client restarted
- puppet enabled
Isn't there some inherent limitation with this lister_process (gradually eating RAM) that should be documented (maybe)?
Note that I still think there should be something in docs/simulator.rst also...
In D4909#123949, @vlorentz wrote:
> We're not claiming this is a realistic model. We only tried to do something that isn't completely naive, and exercises simple edge cases. Making it realistic is hard, and will probably be most of @olasd's work this week.
I do not see any section regarding basic HTTP authentication for API requests in the GitLab API doc. Are you sure it is working?
The filter on visited origins is working correctly on staging. The has_visit flag looks good.
For example, for the https://www.npmjs.com/package/@ehmicky/dev-tasks origin:
{
  "_index" : "origin",
  "_type" : "_doc",
  "_id" : "019bd314416108304165e82dd92e00bc9ea85a53",
  "_score" : 60.56421,
  "_source" : {
    "url" : "https://www.npmjs.com/package/@ehmicky/dev-tasks",
    "sha1" : "019bd314416108304165e82dd92e00bc9ea85a53"
  },
  "sort" : [
    60.56421,
    "019bd314416108304165e82dd92e00bc9ea85a53"
  ]
}
swh=> select * from origin join origin_visit_status on id=origin where id=469380;
   id   |                       url                        | origin | visit |             date              | status  | metadata |                  snapshot                  | type
--------+--------------------------------------------------+--------+-------+-------------------------------+---------+----------+--------------------------------------------+------
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks | 469380 |     1 | 2021-01-25 13:30:47.221937+00 | created |          |                                            | npm
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks | 469380 |     1 | 2021-01-25 13:41:59.435579+00 | partial |          | \xe3f24413d81fd3e9c309686fcfb6c8f5eb549acf | npm
Jan 25 2021
Now that we have a cgit lister, this should be a no-brainer.
If that's the case, we need it up and running quickly.
Sorry, I accepted the diff before seeing you used Basic Auth, which I think is not working.
I do not see any section regarding basic HTTP authentication for API requests in the GitLab API doc. Are you sure it is working?
You should add a test checking that the last_update field value in the scheduler database is not None.
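A sketch of what such a check could look like. The helper below is self-contained; the commented pytest fixture names (`swh_scheduler`, `lister`) follow the usual swh test conventions but are assumptions here, not the actual test suite:

```python
def check_last_update(listed_origins) -> None:
    """Assert every listed origin carries a non-None last_update."""
    for origin in listed_origins:
        assert origin.last_update is not None, origin.url


# In a pytest test this might look like (hypothetical fixtures):
#
#   def test_lister_sets_last_update(swh_scheduler, lister):
#       lister.run()
#       origins = swh_scheduler.get_listed_origins(
#           lister.lister_obj.id).results
#       check_last_update(origins)
```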
Build is green
Build is green
Update only the diff with authentication code
Build has FAILED