The interfaces on VLAN1330 and VLAN440 were already configured
Jan 19 2021
Remove an erroneous file removal
The package python3-swh.icingaplugins:v0.4.3 is released and deployed on pergamon
The shard reallocation is done:
❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  2935        2.9tb       3tb      3.7tb      6.7tb           44 192.168.100.61 192.168.100.61 esnode1
  2936        2.9tb       3tb      3.7tb      6.7tb           44 192.168.100.62 192.168.100.62 esnode2
  2935        2.9tb     2.9tb      3.8tb      6.7tb           43 192.168.100.63 192.168.100.63 esnode3
Jan 18 2021
Rework the SQL query to use the USING keyword for the join
esnode3 configured with the same procedure as esnode2 (check the previous comments)
esnode3 is ready to be migrated:
❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.61 192.168.100.61 esnode1
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.62 192.168.100.62 esnode2
     0           0b     5.9gb      5.4tb      5.4tb            0 192.168.100.63 192.168.100.63 esnode3
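To follow the reallocation onto esnode3 while it runs, the shard copies currently in flight can be watched with the standard _cat/recovery endpoint; a minimal sketch, reusing the node address from the command above:
❯ curl -s "http://192.168.100.63:9200/_cat/recovery?v&active_only=true"   # active_only hides the already-completed recoveries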
Jan 15 2021
Remove type from the clustering key of OriginVisitStatus
rebase
The cluster is stabilized:
❯ curl -s http://192.168.100.63:9200/_cat/health\?v
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1610689991 05:53:11  swh-logging-prod green           3         3   8758 4379    0    0        0             0                  -                100.0%
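For scripting this kind of check, the health endpoint can also block until the cluster reaches the expected state; a sketch using the standard wait_for_status parameter (the 10 minute timeout is an arbitrary example value):
❯ curl -s "http://192.168.100.63:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty"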
Jan 14 2021
After a reboot, a "Failed to start Import ZFS pools by cache file" message is displayed on the server console and the pool is not mounted. It seems this can be caused by using /dev/sd* disk names directly.
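A possible remediation, sketched below with a hypothetical pool name (elasticsearch-data), is to re-import the pool through the stable /dev/disk/by-id paths and regenerate the cache file consumed by zfs-import-cache.service:
root@esnode2:~# zpool export elasticsearch-data
root@esnode2:~# zpool import -d /dev/disk/by-id elasticsearch-data   # import by persistent device ids instead of /dev/sd*
root@esnode2:~# zpool set cachefile=/etc/zfs/zpool.cache elasticsearch-data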
- installation and configuration of zfs on esnode2
- backport packages installed
- kernel upgraded to 5.9 (from buster-backports)
root@esnode2:~# apt update
root@esnode2:~# apt list --upgradable
Listing... Done
libnss-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libpam-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libsystemd0/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libudev1/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
linux-image-amd64/buster-backports 5.9.15-1~bpo10+1 amd64 [upgradable from: 4.19+105+deb10u8]
systemd-sysv/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd-timesyncd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
udev/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
root@esnode2:~# apt dist-upgrade
root@esnode2:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
root@esnode2:~# puppet agent --disable "zfs installation"
root@esnode2:~# shutdown -r now
- zfs installation
root@esnode2:~# apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
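The pool creation itself is not captured here; a minimal sketch with an illustrative pool name, mountpoint and placeholder disk ids (using /dev/disk/by-id paths to avoid the import problem described above):
root@esnode2:~# ls /dev/disk/by-id/ | grep -v part   # pick the stable ids of the data disks
root@esnode2:~# zpool create -o ashift=12 \
    -O mountpoint=/srv/elasticsearch -O atime=off -O compression=lz4 \
    elasticsearch-data \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3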
- kafka partition and old elasticsearch raid removed
root@esnode2:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab	2021-01-14 09:05:59.609906708 +0000
+++ /etc/fstab	2021-01-14 09:06:49.390527123 +0000
@@ -9,8 +9,5 @@
 UUID=3700082d-41e5-4c54-8667-46280f124b33 / ext4 errors=remount-ro 0 1
 # /boot/efi was on /dev/sda1 during installation
 UUID=0228-9320 /boot/efi vfat umask=0077 0 1
-#/srv/kafka was on /dev/sda4 during installation
-#UUID=c97780cb-378c-4963-ac31-59281410b2f9 /srv/kafka ext4 defaults 0 2
 # swap was on /dev/sda3 during installation
 UUID=3eea10c5-9913-44c1-aa85-a1e93ae12970 none swap sw 0 0
-/dev/md0 /srv/elasticsearch xfs defaults,noatime 0 0
- Removing the old raid:
root@esnode2:~# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Wed May 23 08:21:35 2018
        Raid Level : raid0
        Array Size : 5860150272 (5588.67 GiB 6000.79 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent
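The removal itself is not captured above; it would usually look like the following sketch (the member device names are placeholders, take the real ones from the mdadm output, and drop the matching ARRAY line from /etc/mdadm/mdadm.conf as well):
root@esnode2:~# mdadm --stop /dev/md0                                 # deactivate the array
root@esnode2:~# mdadm --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1 # wipe the md metadata on the former members (placeholder names)
root@esnode2:~# update-initramfs -u                                   # so the array is not reassembled at the next boot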
The cause of the problem was high write I/O pressure on esnode1 due to the index copy from esnode3.
Jan 13 2021
Interesting reading on how the cluster state is replicated/persisted: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state-publishing.html
It seems the shard reallocation puts a lot of pressure on the cluster. After the first timeout, esnode3 appears to be managing all the primary shards while esnode1 retries the recovery again and again until a new timeout occurs.
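If this happens again, the recovery pressure can be throttled through the cluster settings API; a sketch with standard settings and conservative example values (the hostname follows the naming used elsewhere in this log, the values are not the ones actually applied):
❯ export ES_NODE=esnode1.internal.softwareheritage.org:9200
❯ curl -s -H "Content-Type: application/json" -XPUT "http://${ES_NODE}/_cluster/settings?pretty" -d '{
  "transient" : {
    "cluster.routing.allocation.node_concurrent_recoveries" : 1,
    "indices.recovery.max_bytes_per_sec" : "40mb"
  }
}'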
Implemented in T2964
Add a comment to explain why the field type is optional
Remove unnecessary changes on tests
- Gently remove the node from the cluster:
❯ export ES_NODE=esnode3.internal.softwareheritage.org:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "192.168.100.62"
  }
}'
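Once the node reports zero shards in _cat/allocation and its maintenance is over, the transient exclusion has to be cleared again so shards can move back; a sketch of the follow-up call (setting the value to null removes the setting):
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : null
  }
}'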
I am closing this issue as there is no more action to perform at the moment.
Diagnosis and possible fixes will be followed up in dedicated issues.
After a week of observation, there are no visible differences in the system[1] and elasticsearch[2] monitoring.
Jan 12 2021
Rebase
Adapt according to review
The actions to replace the disk on esnode1 and stabilize the cluster are done, so the state of this task can be changed to resolved.
The other remaining tasks will be handled in dedicated ones.
Jan 11 2021
well well well