use redis-server package instead of a metapackage
Mar 12 2021
In D5235#133003, @anlambert wrote:...
Looks simpler to me but there might be a reason to not use apt-key.
update commit message
Add task link on the commit message
Update the commit message
Fix review feedback
- All workers and journal clients stopped before upgrading storage1 and db1
Mar 11 2021
swh-search0
- stopping writes
```
root@search0:~# systemctl stop swh-search-journal-client@objects
root@search0:~# systemctl stop swh-search-journal-client@indexed
root@search0:~# puppet agent --disable "zfs upgrade"
```
- package upgrades
- `swh-search0` rebooted
- all services are up and running
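The restart of the services is not detailed in this entry; presumably it was the reverse of the shutdown above, along the lines of the following sketch (unit names taken from the stop commands, puppet re-enabled and re-applied):
```
# re-enable and re-apply puppet (restores the managed configuration)
root@search0:~# puppet agent --enable
root@search0:~# puppet agent --test
# restart the journal clients stopped before the upgrade
root@search0:~# systemctl start swh-search-journal-client@objects
root@search0:~# systemctl start swh-search-journal-client@indexed
```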
Add tests
Mar 10 2021
Mail sent to the DSI to request the installation of 2 of the new disks.
Overview of the system:
- 2 slots available (10 slots occupied out of a total of 12)
- system installed on 2 SSD disks (wwn-0x500a075122f366e4 and wwn-0x500a075122f357f1)
- 2 zfs pools
```
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
hdd   29.1T  22.8T  6.29T        -         -    20%    78%  1.00x  ONLINE  -
ssd   10.3T  7.91T  2.44T        -         -    24%    76%  1.00x  ONLINE  -
```
```
root@granet:~# zpool status -v hdd
  pool: hdd
 state: ONLINE
  scan: scrub repaired 0B in 0 days 15:42:24 with 0 errors on Sun Feb 14 16:06:26 2021
config:
```
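Not part of the log above: once the 2 new disks are installed, extending the pool could look like the following sketch, assuming they are added as an additional mirror vdev to the hdd pool (the device ids are placeholders):
```
# list the new drives by stable id (placeholder wwn ids below)
ls -l /dev/disk/by-id/ | grep wwn
# add them to the hdd pool as a new mirrored vdev
zpool add hdd mirror /dev/disk/by-id/wwn-0xNEWDISK1 /dev/disk/by-id/wwn-0xNEWDISK2
zpool status -v hdd
```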
Mar 8 2021
Mar 5 2021
Thanks for the feedback
- Repository created: https://forge.softwareheritage.org/source/swh-counters/
- Jenkins jobs configured: https://jenkins.softwareheritage.org/job/DCNT/
Let's start the subject ;)
I forgot one step: cleaning the previous alias origin -> origin_production, which is not needed anymore:
```
vsellier@search-esnode1 ~ % curl -s http://$ES_SERVER/_cat/indices\?v && echo && curl -s http://$ES_SERVER/_cat/aliases\?v && echo && curl -s http://$ES_SERVER/_cat/health\?v
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1  153130652     26701625    273.4gb        137.3gb
```
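The exact command used for the cleanup is not in the entry; a sketch of removing such an alias with the standard Elasticsearch _aliases API, assuming the old alias was `origin` on the `origin-production` index listed above:
```
# remove the obsolete alias (alias/index names assumed from the entry above)
curl -s -XPOST -H "Content-Type: application/json" "http://$ES_SERVER/_aliases" -d '
{
  "actions": [
    { "remove": { "index": "origin-production", "alias": "origin" } }
  ]
}'
```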
The new configuration is deployed; swh-search is now using the alias, which should help with future upgrades.
Deployment in production:
- puppet stopped
- configuration updated to declare the index; this needs to be done so that swh-search initializes the aliases before the journal clients start (not guaranteed with a puppet apply)
- package updated
- gunicorn-swh-search service restarted:
```
Mar 05 09:08:46 search1 python3[1881743]: 2021-03-05 09:08:46 [1881743] gunicorn.error:INFO Starting gunicorn 19.9.0
Mar 05 09:08:46 search1 python3[1881743]: 2021-03-05 09:08:46 [1881743] gunicorn.error:INFO Listening at: unix:/run/gunicorn/swh-search/gunicorn.sock (1881743)
Mar 05 09:08:46 search1 python3[1881743]: 2021-03-05 09:08:46 [1881743] gunicorn.error:INFO Using worker: sync
Mar 05 09:08:46 search1 python3[1881748]: 2021-03-05 09:08:46 [1881748] gunicorn.error:INFO Booting worker with pid: 1881748
Mar 05 09:08:46 search1 python3[1881749]: 2021-03-05 09:08:46 [1881749] gunicorn.error:INFO Booting worker with pid: 1881749
Mar 05 09:08:46 search1 python3[1881750]: 2021-03-05 09:08:46 [1881750] gunicorn.error:INFO Booting worker with pid: 1881750
Mar 05 09:08:46 search1 python3[1881751]: 2021-03-05 09:08:46 [1881751] gunicorn.error:INFO Booting worker with pid: 1881751
Mar 05 09:08:53 search1 python3[1881750]: 2021-03-05 09:08:53 [1881750] swh.search.api.server:INFO Initializing indexes with configuration:
Mar 05 09:08:53 search1 python3[1881750]: 2021-03-05 09:08:53 [1881750] elasticsearch:INFO HEAD http://search-esnode2.internal.softwareheritage.org:9200/origin-production [status:200 request:0.023s]
Mar 05 09:08:54 search1 python3[1881750]: 2021-03-05 09:08:54 [1881750] elasticsearch:INFO PUT http://search-esnode1.internal.softwareheritage.org:9200/origin-production/_alias/origin-read [status:200 request:0.487s]
Mar 05 09:08:54 search1 python3[1881750]: 2021-03-05 09:08:54 [1881750] elasticsearch:INFO PUT http://search-esnode3.internal.softwareheritage.org:9200/origin-production/_alias/origin-write [status:200 request:0.152s]
Mar 05 09:08:54 search1 python3[1881750]: 2021-03-05 09:08:54 [1881750] elasticsearch:INFO PUT http://search-esnode1.internal.softwareheritage.org:9200/origin-production/_mapping [status:200 request:0.009s]
```
```
vsellier@search-esnode1 ~ % curl -s http://$ES_SERVER/_cat/indices\?v && echo && curl -s http://$ES_SERVER/_cat/aliases\?v && echo && curl -s http://$ES_SERVER/_cat/health\?v
health status index             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin-production hZfuv0lVRImjOjO_rYgDzg  90   1  153097672    144224208    288.1gb          149gb
```
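For reference on why the aliases "should help with future upgrades" (see above): a new index can be populated offline and then swapped in atomically. A sketch using the standard Elasticsearch _aliases API; the `origin-production-v2` index name is hypothetical:
```
# atomically repoint the read/write aliases to a newly built index
curl -s -XPOST -H "Content-Type: application/json" "http://$ES_SERVER/_aliases" -d '
{
  "actions": [
    { "remove": { "index": "origin-production",    "alias": "origin-read"  } },
    { "add":    { "index": "origin-production-v2", "alias": "origin-read"  } },
    { "remove": { "index": "origin-production",    "alias": "origin-write" } },
    { "add":    { "index": "origin-production-v2", "alias": "origin-write" } }
  ]
}'
```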
Mar 4 2021
swh-search:v0.7.1 deployed in staging according to the defined plan.
The aliases are correctly created and used by the services:
```
vsellier@search-esnode0 ~ % curl -XGET -H "Content-Type: application/json" http://192.168.130.80:9200/_cat/indices
green open  origin                      HthJj42xT5uO7w3Aoxzppw 80 0 929692 137147 4gb 4gb
green close origin-backup-20210209-1736 P1CKjXW0QiWM5zlzX46-fg 80 0
green close origin-v0.5.0               SGplSaqPR_O9cPYU4ZsmdQ 80 0
vsellier@search-esnode0 ~ % curl -XGET -H "Content-Type: application/json" http://192.168.130.80:9200/_cat/aliases
origin-read  origin - - - -
origin-write origin - - - -
```
Journal clients:
```
Mar 04 16:22:40 search0 swh[3598137]: INFO:elasticsearch:POST http://search-esnode0.internal.staging.swh.network:9200/origin-write/_bulk [status:200 request:0.013s]
Mar 04 16:22:41 search0 swh[3598137]: INFO:elasticsearch:POST http://search-esnode0.internal.staging.swh.network:9200/origin-write/_bulk [status:200 request:0.012s]
```
Search:
```
Mar 04 15:40:20 search0 python3[3598040]: 2021-03-04 15:40:20 [3598040] swh.search.api.server:INFO Initializing indexes with configuration:
Mar 04 15:40:20 search0 python3[3598040]: 2021-03-04 15:40:20 [3598040] elasticsearch:INFO HEAD http://search-esnode0.internal.staging.swh.network:9200/origin [status:200 request:0.005s]
Mar 04 15:40:20 search0 python3[3598040]: 2021-03-04 15:40:20 [3598040] elasticsearch:INFO HEAD http://search-esnode0.internal.staging.swh.network:9200/origin-read/_alias [status:200 request:0.001s]
Mar 04 15:40:20 search0 python3[3598040]: 2021-03-04 15:40:20 [3598040] elasticsearch:INFO HEAD http://search-esnode0.internal.staging.swh.network:9200/origin-write/_alias [status:200 request:0.001s]
Mar 04 15:40:20 search0 python3[3598040]: 2021-03-04 15:40:20 [3598040] elasticsearch:INFO PUT http://search-esnode0.internal.staging.swh.network:9200/origin/_mapping [status:200 request:0.006s]
Mar 04 16:19:27 search0 python3[3598042]: 2021-03-04 16:19:27 [3598042] elasticsearch:INFO GET http://search-esnode0.internal.staging.swh.network:9200/origin-read/_search?size=100 [status:200 request:0.076s]
```
Remove the tests because the flask application is not reinitialized
between 2 unit tests, and testing the ElasticSearch class instantiation
with different configurations via flask does not work.
The 4 files seem to be accessible without errors, which looks like good news ;):
root@belvedere:~# time cp /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.317 /dev/null
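Only one of the reads is shown here; the same check can be looped over the 4 files flagged by zfs status (the other paths are placeholders, as they are not listed in this entry):
```
# placeholder list: replace with the 4 paths reported by `zpool status -v`
for f in /srv/softwareheritage/postgres/11/indexer/base/16406/774467031.317 \
         /srv/softwareheritage/postgres/11/indexer/base/16406/OTHER_FILE ; do
    time cp "$f" /dev/null || echo "read error on $f"
done
```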
I have found some interesting pointers related to the management of small files in HDFS (found them while looking for other, unrelated stuff). Is it something you have identified and excluded from the scope due to some blockers?
Isn't this around when we've restarted production after expanding the storage pool?
The loaders were restarted in late November, but perhaps more of them were launched at that point.
Mar 3 2021
The disk was tested completely with read/write operations (interrupted on the 2nd pass).
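The tool used for the test is not named in the entry; a full read/write pass of this kind is typically run with something like badblocks (a sketch only; the device name is a placeholder and the -w mode is destructive):
```
# destructive read/write test, multiple patterns per pass; /dev/sdX is a placeholder
badblocks -wsv /dev/sdX
```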
- fix wrong error log level
- fix typo on the commit message
For the record, it seems the 4 impacted files are related to the primary key of the softwareheritage-indexer.content_mimetype table
Yep, it's weird, but after looking at the code of the function, I realized it seems to be a known problem:
https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/40-funcs.sql$665-667
limit 2:
https://explain.depesz.com/s/UW9Z
```
softwareheritage=> explain analyze
with filtered_snapshot_branches as (
    select '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea as snapshot_id, name, target, target_type
    from snapshot_branches
    inner join snapshot_branch on snapshot_branches.branch_id = snapshot_branch.object_id
    where snapshot_id = (select object_id from snapshot where snapshot.id = '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea)
      and (NULL :: snapshot_target[] is null or target_type = any(NULL :: snapshot_target[]))
)
select snapshot_id, name, target, target_type
from filtered_snapshot_branches
where name >= '\x'::bytea
  and (NULL is null or convert_from(name, 'utf-8') ilike NULL)
  and (NULL is null or convert_from(name, 'utf-8') not ilike NULL)
order by name
limit 2;
                                                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1004.01..6764.11 rows=2 width=76) (actual time=172523.081..173555.673 rows=2 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using snapshot_id_idx on snapshot  (cost=0.57..2.59 rows=1 width=8) (actual time=0.028..0.036 rows=1 loops=1)
           Index Cond: ((id)::bytea = '\xdfea9cb3249b932235b1cd60ed49c5e316a03147'::bytea)
   ->  Gather Merge  (cost=1001.43..168852423.27 rows=58628 width=76) (actual time=172523.079..173555.661 rows=2 loops=1)
         Workers Planned: 2
         Params Evaluated: $0
         Workers Launched: 2
         ->  Nested Loop  (cost=1.40..168844656.12 rows=24428 width=76) (actual time=126442.320..167761.276 rows=2 loops=3)
               ->  Parallel Index Scan using snapshot_branch_name_target_target_type_idx on snapshot_branch  (cost=0.70..12612971.47 rows=154824599 width=52) (actual time=0.077..80926.811 rows=23123612 loops=3)
                     Index Cond: (name >= '\x'::bytea)
               ->  Index Only Scan using snapshot_branches_pkey on snapshot_branches  (cost=0.70..1.01 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=69370837)
                     Index Cond: ((snapshot_id = $0) AND (branch_id = snapshot_branch.object_id))
                     Heap Fetches: 5
 Planning Time: 0.993 ms
 Execution Time: 173555.864 ms
(16 rows)
```
It seems there are some differences in terms of indexes between the main and replica databases.
On the replica, only the primary keys are present on the snapshot_branches and snapshot_branch tables. Perhaps the query optimizer is confused by something and is making a wrong choice somewhere.
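A quick way to compare is to run the same catalog query on the main and the replica (standard pg_indexes view; connection parameters are placeholders):
```
# list the indexes defined on the two tables involved in the slow query
psql softwareheritage -c "
  select tablename, indexname, indexdef
    from pg_indexes
   where tablename in ('snapshot_branch', 'snapshot_branches')
   order by tablename, indexname;"
```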
No problems are detected on the iDRAC, and smartctl on the drives looks ok.
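For completeness, the per-drive check can be scripted over the drives flagged in the journal entries below (a sketch; depending on the controller, smartctl may need an extra -d option):
```
for d in sdi sdk sdl sdm sdo sdp; do
    echo "=== /dev/$d ==="
    smartctl -H -l error /dev/$d    # overall health assessment + error log
done
```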
The lag has recovered, so the index should contain the visit_type for all origins now.
(not related, or not directly related, to the issue) Looking at some potential issues with disk I/O, I discovered a weird behavior change in the I/O on belvedere after 2020-12-31:
There are no errors in the postgresql logs related to the files listed in the zfs status, but I'm not sure the indexer database is read.
It seems there have been some recurring alerts in the system journal about several disks for some time:
```
Feb 24 01:33:36 belvedere.internal.softwareheritage.org kernel: sd 0:0:14:0: [sdi] tag#808 Sense Key : Recovered Error [current] [descriptor]
Feb 24 01:33:36 belvedere.internal.softwareheritage.org kernel: sd 0:0:14:0: [sdi] tag#808 Add. Sense: Defect list not found
Feb 24 01:33:39 belvedere.internal.softwareheritage.org kernel: sd 0:0:16:0: [sdk] tag#650 Sense Key : Recovered Error [current] [descriptor]
Feb 24 01:33:39 belvedere.internal.softwareheritage.org kernel: sd 0:0:16:0: [sdk] tag#650 Add. Sense: Defect list not found
Feb 24 01:33:41 belvedere.internal.softwareheritage.org kernel: sd 0:0:17:0: [sdl] tag#669 Sense Key : Recovered Error [current] [descriptor]
Feb 24 01:33:41 belvedere.internal.softwareheritage.org kernel: sd 0:0:17:0: [sdl] tag#669 Add. Sense: Defect list not found
Feb 24 01:33:43 belvedere.internal.softwareheritage.org kernel: sd 0:0:18:0: [sdm] tag#682 Sense Key : Recovered Error [current] [descriptor]
Feb 24 01:33:43 belvedere.internal.softwareheritage.org kernel: sd 0:0:18:0: [sdm] tag#682 Add. Sense: Defect list not found
Feb 24 01:33:44 belvedere.internal.softwareheritage.org kernel: sd 0:0:21:0: [sdo] tag#668 Sense Key : Recovered Error [current] [descriptor]
Feb 24 01:33:44 belvedere.internal.softwareheritage.org kernel: sd 0:0:21:0: [sdo] tag#668 Add. Sense: Defect list not found
Feb 24 01:33:44 belvedere.internal.softwareheritage.org kernel: sd 0:0:22:0: [sdp] tag#682 Sense Key : Recovered Error [current] [descriptor]
Feb 24 01:33:44 belvedere.internal.softwareheritage.org kernel: sd 0:0:22:0: [sdp] tag#682 Add. Sense: Defect list not found
...
```
```
root@belvedere:/var/log# journalctl -k --since=yesterday | awk '{print $8}' | sort | uniq -c
    274 [sdi]
    274 [sdk]
    274 [sdl]
    274 [sdm]
    274 [sdo]
    274 [sdp]
```
Configure the indexes with a Dict with an entry per index type
Mar 2 2021
lgtm