Can this task be closed since the subject was addressed in T2620?
Jan 5 2021
In the new configuration, after some time without searches, the first ones take some time before stabilizing to the old values:
❯ ./random_search.sh 12:36:37
The index configuration was reset to its default:
cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : null,
    "translog.durability": null,
    "refresh_interval": null
  }
}
EOF
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "60s",
        "number_of_shards" : "90",
        "translog" : {
          "sync_interval" : "60s",
          "durability" : "async"
        },
        "provided_name" : "origin",
        "creation_date" : "1608761881782",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        }
      }
    }
  }
}
❯ curl -s -H "Content-Type: application/json" -XPUT http://192.168.100.81:9200/origin/_settings\?pretty -d @/tmp/config.json
{
  "acknowledged" : true
}
❯ curl -s http://192.168.100.81:9200/origin/_settings\?pretty
{
  "origin" : {
    "settings" : {
      "index" : {
        "creation_date" : "1608761881782",
        "number_of_shards" : "90",
        "number_of_replicas" : "1",
        "uuid" : "Mq8dnlpuRXO4yYoC6CTuQw",
        "version" : {
          "created" : "7090399"
        },
        "provided_name" : "origin"
      }
    }
  }
}
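To double-check which values are effectively applied once the explicit settings are removed, the settings API can also return the defaults (a small verification command, not part of the original log; the grep is only there for readability):

curl -s "http://192.168.100.81:9200/origin/_settings?include_defaults=true&pretty" \
  | grep -E 'refresh_interval|sync_interval|durability'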
A *simple* search doesn't look impacted (it's not a real benchmark):
❯ ./random_search.sh
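For reference, the content of random_search.sh is not reproduced in this task; a minimal sketch of what such a script could look like (the term list, the field name and the ES_SERVER default are assumptions):

#!/usr/bin/env bash
# hypothetical sketch: pick a random word and time a match query on the origin index
ES_SERVER=${ES_SERVER:-192.168.100.81:9200}
TERM=$(shuf -n1 /usr/share/dict/words)
time curl -s "http://$ES_SERVER/origin/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d "{\"query\": {\"match\": {\"url\": \"$TERM\"}}}" >/dev/null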
Jan 4 2021
Closing this task as all the direct work is done.
The documentation will be addressed in T2920
The backfill was done in a couple of days.
Dec 23 2020
search1.internal.softwareheritage.org vm deployed.
The configuration of the index was automatically performed by puppet during the initial provisioning.
Index template created in elasticsearch with 1 replica and 90 shards to have the same number of shards on each node:
export ES_SERVER=192.168.100.81:9200
curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER/_index_template/origin\?pretty -d '
{
  "index_patterns": "origin",
  "template": {
    "settings": {
      "index": {
        "number_of_replicas": 1,
        "number_of_shards": 90
      }
    }
  }
}'
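The resulting template can be checked with the matching GET endpoint (a quick sanity check, not part of the original log):

curl -s http://$ES_SERVER/_index_template/origin\?pretty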
search-esnode[1-3] installed with zfs configured:
apt update && apt install linux-image-amd64 linux-headers-amd64
# reboot to upgrade the kernel
apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
systemctl stop elasticsearch
rm -rf /srv/elasticsearch/nodes/0
zpool create -O atime=off -m /srv/elasticsearch/nodes elasticsearch-data /dev/vdb
chown elasticsearch: /srv/elasticsearch/nodes
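The pool layout and the atime setting can be verified afterwards (a verification sketch, not part of the original log):

zpool status elasticsearch-data
zfs get atime,mountpoint elasticsearch-data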
Inventory was updated to reserve the elasticsearch vms:
- search-esnode[1-3].internal.softwareheritage.org
- ips: 192.168.100.8[1-3]/24
The webapp is available at https://webapp1.internal.softwareheritage.org
In preparation for the deployment, the production index present on the staging elasticsearch was renamed from origin-production2 to production_origin (a clone operation will be used [1]; the original index will be left in place)
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clone-index.html
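Based on the clone API documented in [1], the rename would look roughly like this (a sketch; $ES_SERVER stands for the staging elasticsearch here, and the index names are the ones mentioned above):

# the source index must be made read-only before cloning
curl -XPUT -H "Content-Type: application/json" http://$ES_SERVER/origin-production2/_settings -d '
{ "settings": { "index.blocks.write": true } }'
# clone it under the new name; the original index is left in place
curl -XPOST http://$ES_SERVER/origin-production2/_clone/production_origin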
Remove useless fixture declaration
thanks, I'll change that
Use a prefix instead of changing the index name.
Make it optional to avoid having to rename the index on the instances already deployed
The shard reallocation is still in progress:
~ ❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/shards\?h\=prirep,node | sort | uniq -c        09:40:21
   1216 p esnode1
   1183 p esnode2
      1 p esnode2 -> 192.168.100.61 t4iSb7f1RZmEwpH4O_OoGw esnode1
   1840 p esnode3
      1 p esnode3 -> 192.168.100.61 t4iSb7f1RZmEwpH4O_OoGw esnode1
   1208 r esnode1
   1845 r esnode2
   1188 r esnode3
p: primary shard
r: replica shard
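The relocations themselves can be watched with the recovery cat API (an extra monitoring command, not part of the original log):

curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/recovery\?v\&active_only=true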
Dec 22 2020
atime was activated by default. I switched to relatime:
root@esnode1:~# zfs get all | grep time
elasticsearch-data  atime     on   default
elasticsearch-data  relatime  off  default
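The switch itself presumably comes down to enabling the relatime property on the dataset (a sketch of the zfs commands involved):

zfs set relatime=on elasticsearch-data
zfs get atime,relatime elasticsearch-data   # confirm the change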
- puppet executed
- esnode1 is back in the cluster but still not selected to receive shards due to a configuration rule (see the check after the output below):
~ ❯ curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/nodes\?v; echo; curl -s http://esnode3.internal.softwareheritage.org:9200/_cat/health\?v        16:02:37
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.100.61            3          57   0    0.35    0.25     0.12 dilmrt    -      esnode1
192.168.100.63           35          97   1    0.68    0.65     0.70 dilmrt    *      esnode3
192.168.100.62           35          96   2    0.66    0.75     0.82 dilmrt    -      esnode2
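If that rule is a shard allocation exclusion set before the maintenance (an assumption), it can be inspected and cleared through the cluster settings API:

# show any transient/persistent allocation filters
curl -s http://esnode3.internal.softwareheritage.org:9200/_cluster/settings\?pretty
# clear an ip-based exclusion, if that is what keeps shards away from esnode1
curl -s -XPUT -H "Content-Type: application/json" http://esnode3.internal.softwareheritage.org:9200/_cluster/settings -d '
{ "transient": { "cluster.routing.allocation.exclude._ip": null } }'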
As puppet can't be restarted without elasticsearch restarting before zfs is configured, zfs was manually installed:
Replicate disk sda partitioning on all disks
root@esnode1:~# sfdisk -l /dev/sda
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HGST HUS726020AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 543964DA-9ECA-4222-952D-BA8A90FAB2B9
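Replicating the sda layout on the other disks can be done by dumping and replaying the partition table (a sketch; the target device name is an assumption):

sfdisk -d /dev/sda | sfdisk /dev/sdb
sgdisk -G /dev/sdb   # optional: randomize the GPT identifiers copied from sda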
old raid cleanup
root@esnode1:~# umount /srv/elasticsearch
root@esnode1:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab  2020-12-22 11:37:17.318967701 +0000
+++ /etc/fstab  2020-12-22 11:37:28.687049499 +0000
@@ -11,5 +11,3 @@
 UUID=AE23-D5B8 /boot/efi vfat umask=0077 0 1
 # swap was on /dev/sda3 during installation
 UUID=3eaaa22d-e1d2-4dde-9a45-d2fa22696cdf none swap sw 0 0
-UUID=6adb1e63-e709-4efb-8be1-76818b1b4751 /srv/kafka ext4 errors=remount-ro 0 0
-/dev/md127 /srv/elasticsearch xfs defaults,noatime 0 0
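The leftover md array itself can then be dismantled (a sketch; the member partitions are assumptions and should be read from /proc/mdstat first):

cat /proc/mdstat                    # identify the members of md127
mdadm --stop /dev/md127
mdadm --zero-superblock /dev/sdX1   # repeat for each former member partition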
The fix is deployed in staging and production
The disks can't be replaced before the beginning of January because the logistics service is closed
Dell was notified about the delay for the disk replacement. The next package retrieval attempt by UPS is scheduled for *2021-01-11*
Tested in staging with a manual change in the code to force an assertion; it works well
Everything looks good, let's try to add some documentation before closing the issue
Tested locally, it looks good. I just added a small comment about the installation directory, which is usually /opt instead of the user home dir.
Dec 21 2020
- A new vm objstorage0.internal.staging.swh.network is configured with a read-only object storage service
- It's exposed to the internet via the reverse proxy at https://objstorage.staging.swh.network (it's quite different from the usual objstorage:5003 url, but it allows exposing the service without new network configuration; a quick reachability check is shown after this list)
- DNS entry added on gandi
- Inventory updated
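A quick reachability check of the exposed endpoint (an illustrative command; it only verifies that the reverse proxy answers, not the objstorage API itself):

curl -sI https://objstorage.staging.swh.network/ | head -n1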
- before:
root@riverside:~# pvscan
  PV /dev/sda1   VG riverside-vg   lvm2 [<63.98 GiB / 0    free]
  Total: 1 [<63.98 GiB] / in use: 1 [<63.98 GiB] / in no VG: 0 [0   ]
root@riverside:~# df -h /
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/riverside--vg-root   60G   56G  1.4G  98% /
(2% free; some cleanup seems to have occurred since the creation of the task :) )
- disk extended by 16GiB on proxmox
(extract of dmesg of riverside)
[350521.461023] sd 2:0:0:0: Capacity data has changed
[350521.461339] sd 2:0:0:0: [sda] 167772160 512-byte logical blocks: (85.9 GB/80.0 GiB)
[350521.461484] sda: detected capacity change from 68719476736 to 85899345920
- partition resized:
root@riverside:~# parted /dev/sda
GNU Parted 3.2
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print free
Model: QEMU QEMU HARDDISK (scsi)
Disk /dev/sda: 85.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
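The remaining steps to make the new space usable typically are growing the partition, the PV and the root LV (a sketch, assuming partition 1 is the PV shown by pvscan above and the root logical volume is riverside-vg/root, as suggested by the df output):

# inside the same parted session, grow partition 1 to the end of the disk
(parted) resizepart 1 100%
(parted) quit
# then propagate the new size to LVM and the root filesystem
pvresize /dev/sda1
lvextend -r -l +100%FREE /dev/riverside-vg/root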
Remove out of scope changes
A user was correctly configured and a read test performed:
The network configuration is done. The server is now accessible from the internet at broker0.journal.staging.swh.network:9093
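A basic check that the port is reachable from outside (an illustrative command; it does not validate the TLS/SASL configuration of the listener):

nc -zv broker0.journal.staging.swh.network 9093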
lgtm, I have a doubt about the numa activation, but as it's also activated for kelvingrove, I assume it's correct
Changing to high priority (@ardumont's recommendation)
Dec 18 2020
The request to expose the journal to the internet was sent to the dsi this afternoon.
To eliminate another possible root cause, a test was done in a temporary project with the latest version of the python library; it doesn't work either
Dec 17 2020
lgtm
We have followed the event's path through the consumer code without finding anything suspicious.
As a last try, we have fully rebooted the vm, but as expected, it changed nothing at all.
@olasd, if you have some details of the version upgrades you performed yesterday, perhaps they could help with the diagnosis.