The interfaces on VLAN1330 and VLAN440 were already configured
Jan 19 2021
Remove an erroneous file removal
The package python3-swh.icingaplugins:v0.4.3 is released and deployed on pergamon
The shard reallocation is done:
❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  2935        2.9tb       3tb      3.7tb      6.7tb           44 192.168.100.61 192.168.100.61 esnode1
  2936        2.9tb       3tb      3.7tb      6.7tb           44 192.168.100.62 192.168.100.62 esnode2
  2935        2.9tb     2.9tb      3.8tb      6.7tb           43 192.168.100.63 192.168.100.63 esnode3
Jan 18 2021
Rework the SQL query to use the USING keyword for the join
esnode3 configured with the same procedure as esnode2 (check the previous comments)
esnode3 is ready to be migrated:
❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.61 192.168.100.61 esnode1
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.62 192.168.100.62 esnode2
     0           0b     5.9gb      5.4tb      5.4tb            0 192.168.100.63 192.168.100.63 esnode3
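To follow the reallocation onto esnode3 while it runs, the shard copies currently in flight can be watched with the standard _cat/recovery endpoint; a minimal sketch, reusing the node address from the command above:
❯ curl -s "http://192.168.100.63:9200/_cat/recovery?v&active_only=true"   # active_only hides the already-completed recoveries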
Jan 15 2021
Remove type from the clustering key of OriginVisitStatus
rebase
The cluster is stabilized:
❯ curl -s http://192.168.100.63:9200/_cat/health\?v
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1610689991 05:53:11  swh-logging-prod green           3         3   8758 4379    0    0        0             0                  -                100.0%
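For scripting this kind of check, the health endpoint can also block until the cluster reaches the expected state; a sketch using the standard wait_for_status parameter (the 10 minute timeout is an arbitrary example value):
❯ curl -s "http://192.168.100.63:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty"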
Jan 14 2021
After a reboot, a "Failed to start Import ZFS pools by cache file" message is displayed on the server console and the pool is not mounted. It seems this can be caused by using /dev/sd* disk names directly.
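A possible remediation, sketched below with a hypothetical pool name (elasticsearch-data), is to re-import the pool through the stable /dev/disk/by-id paths and regenerate the cache file consumed by zfs-import-cache.service:
root@esnode2:~# zpool export elasticsearch-data
root@esnode2:~# zpool import -d /dev/disk/by-id elasticsearch-data   # import by persistent device ids instead of /dev/sd*
root@esnode2:~# zpool set cachefile=/etc/zfs/zpool.cache elasticsearch-data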
- installation and configuration of zfs on esnode2
- backport packages installed
- kernel upgraded to 5.9 (from buster-backports)
root@esnode2:~# apt update
root@esnode2:~# apt list --upgradable
Listing... Done
libnss-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libpam-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libsystemd0/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libudev1/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
linux-image-amd64/buster-backports 5.9.15-1~bpo10+1 amd64 [upgradable from: 4.19+105+deb10u8]
systemd-sysv/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd-timesyncd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
udev/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
root@esnode2:~# apt dist-upgrade
root@esnode2:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
root@esnode2:~# puppet agent --disable "zfs installation"
root@esnode2:~# shutdown -r now
- zfs installation
root@esnode2:~# apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
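The pool creation itself is not captured here; a minimal sketch with an illustrative pool name, mountpoint and placeholder disk ids (using /dev/disk/by-id paths to avoid the import problem described above):
root@esnode2:~# ls /dev/disk/by-id/ | grep -v part   # pick the stable ids of the data disks
root@esnode2:~# zpool create -o ashift=12 \
    -O mountpoint=/srv/elasticsearch -O atime=off -O compression=lz4 \
    elasticsearch-data \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3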
- kafka partition and old elasticsearch raid removed
root@esnode2:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab	2021-01-14 09:05:59.609906708 +0000
+++ /etc/fstab	2021-01-14 09:06:49.390527123 +0000
@@ -9,8 +9,5 @@
 UUID=3700082d-41e5-4c54-8667-46280f124b33 / ext4 errors=remount-ro 0 1
 # /boot/efi was on /dev/sda1 during installation
 UUID=0228-9320 /boot/efi vfat umask=0077 0 1
-#/srv/kafka was on /dev/sda4 during installation
-#UUID=c97780cb-378c-4963-ac31-59281410b2f9 /srv/kafka ext4 defaults 0 2
 # swap was on /dev/sda3 during installation
 UUID=3eea10c5-9913-44c1-aa85-a1e93ae12970 none swap sw 0 0
-/dev/md0 /srv/elasticsearch xfs defaults,noatime 0 0
- Removing the old raid:
root@esnode2:~# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Wed May 23 08:21:35 2018
        Raid Level : raid0
        Array Size : 5860150272 (5588.67 GiB 6000.79 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent
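The removal itself is not captured above; it would usually look like the following sketch (the member device names are placeholders, take the real ones from the mdadm output, and drop the matching ARRAY line from /etc/mdadm/mdadm.conf as well):
root@esnode2:~# mdadm --stop /dev/md0                                 # deactivate the array
root@esnode2:~# mdadm --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1 # wipe the md metadata on the former members (placeholder names)
root@esnode2:~# update-initramfs -u                                   # so the array is not reassembled at the next boot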
The cause of the problem was high write I/O pressure on esnode1 due to the index copy from esnode3.
Jan 13 2021
Interesting reading on how the cluster state is replicated/persisted: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state-publishing.html
It seems the shard reallocation puts a lot of pressure on the cluster. After the first timeout, esnode3 appears to be managing all the primary shards while esnode1 retries the recovery again and again until a new timeout occurs.
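If this happens again, the recovery pressure can be throttled through the cluster settings API; a sketch with standard settings and conservative example values (the hostname follows the naming used elsewhere in this log, the values are not the ones actually applied):
❯ export ES_NODE=esnode1.internal.softwareheritage.org:9200
❯ curl -s -H "Content-Type: application/json" -XPUT "http://${ES_NODE}/_cluster/settings?pretty" -d '{
  "transient" : {
    "cluster.routing.allocation.node_concurrent_recoveries" : 1,
    "indices.recovery.max_bytes_per_sec" : "40mb"
  }
}'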
Implemented in T2964
Add a comment to explain why the field type is optional
Remove unnecessary changes on tests
- Gently remove the node from the cluster:
❯ export ES_NODE=esnode3.internal.softwareheritage.org:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "192.168.100.62"
  }
}'
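Once the node reports zero shards in _cat/allocation and its maintenance is over, the transient exclusion has to be cleared again so shards can move back; a sketch of the follow-up call (setting the value to null removes the setting):
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : null
  }
}'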
I am closing this issue as there is no more action to perform at the moment.
Diagnosis and possible fixes will be followed up in dedicated issues.
After a week of observation, there are no visible differences in the system[1] and elasticsearch[2] monitoring.
Jan 12 2021
Rebase
Adapt according to review
The actions to replace the disk on esnode1 and stabilize the cluster are done, so the state of this task can be changed to resolved.
The other remaining tasks will be handled in dedicated ones.
Jan 11 2021
well well well