
Jan 18 2021

vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

esnode3 is ready to be migrated :

❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.61 192.168.100.61 esnode1
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.62 192.168.100.62 esnode2
     0           0b     5.9gb      5.4tb      5.4tb            0 192.168.100.63 192.168.100.63 esnode3
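
The readiness check above can be scripted. This is a minimal sketch, assuming the `_cat/allocation?v` column layout shown in this comment (shard count in the first column, node name in the last); `shards_on_node` is a hypothetical helper name, not part of any existing tooling:

```shell
# Hypothetical helper: print the shard count of one node, reading
# `_cat/allocation?v` output on stdin.
# Assumes shards is the first column and the node name the last one.
shards_on_node() {
  awk -v node="$1" 'NR > 1 && $NF == node { print $1 }'
}

# Example, using the host from the comment above:
# curl -s 'http://192.168.100.63:9200/_cat/allocation?v&s=node' | shards_on_node esnode3
```

A result of 0 matches the state shown above, where esnode3 no longer holds any shard.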
Jan 18 2021, 9:10 AM · System administration

Jan 15 2021

vsellier requested review of D4871: Backfiller: Add type to the origin_visit_status topic.
Jan 15 2021, 2:46 PM
vsellier added a revision to T2966: Backfill origin_visit_status **with** the `visit_type` field properly given: D4871: Backfiller: Add type to the origin_visit_status topic.
Jan 15 2021, 2:41 PM · Storage manager, Sprint 2021 01, Scheduling utilities
vsellier changed the status of T2966: Backfill origin_visit_status **with** the `visit_type` field properly given, a subtask of T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler, from Open to Work in Progress.
Jan 15 2021, 2:40 PM · Sprint 2021 01, Scheduling utilities
vsellier changed the status of T2966: Backfill origin_visit_status **with** the `visit_type` field properly given from Open to Work in Progress.
Jan 15 2021, 2:40 PM · Storage manager, Sprint 2021 01, Scheduling utilities
vsellier closed T2964: Adapt origin_visit_status_(get|add) api to deal with the visit_type, a subtask of T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler, as Resolved.
Jan 15 2021, 2:01 PM · Sprint 2021 01, Scheduling utilities
vsellier closed T2964: Adapt origin_visit_status_(get|add) api to deal with the visit_type as Resolved.
Jan 15 2021, 2:01 PM · Storage manager, Sprint 2021 01
vsellier closed D4858: Add persistence of the field OriginVisitStatus.type.
Jan 15 2021, 1:58 PM
vsellier committed rDSTOc24d35f86a06: Add persistence of the field OriginVisitStatus.type (authored by vsellier).
Jan 15 2021, 1:58 PM
vsellier updated the diff for D4858: Add persistence of the field OriginVisitStatus.type.

Remove type from the clustering key of OriginVisitStatus

Jan 15 2021, 12:41 PM
vsellier updated the diff for D4858: Add persistence of the field OriginVisitStatus.type.

rebase

Jan 15 2021, 12:37 PM
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

The cluster is stabilized :

❯ curl -s http://192.168.100.63:9200/_cat/health\?v
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1610689991 05:53:11  swh-logging-prod green           3         3   8758 4379    0    0        0             0                  -                100.0%
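
The same check can be reduced to a single word for use in scripts. This is a sketch that assumes the `_cat/health?v` layout shown above, where the status is the fourth column; `cluster_status` is a hypothetical helper name:

```shell
# Hypothetical helper: print the cluster status (green/yellow/red),
# reading `_cat/health?v` output on stdin.
# Assumes a header line followed by one data line, status in column 4.
cluster_status() {
  awk 'NR == 2 { print $4 }'
}

# Example, using the host from the comment above:
# curl -s 'http://192.168.100.63:9200/_cat/health?v' | cluster_status
#
# Elasticsearch can also block until a status is reached (60s is an arbitrary choice):
# curl -s 'http://192.168.100.63:9200/_cluster/health?wait_for_status=green&timeout=60s'
```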
Jan 15 2021, 6:55 AM · System administration

Jan 14 2021

vsellier requested review of D4858: Add persistence of the field OriginVisitStatus.type.
Jan 14 2021, 2:44 PM
vsellier closed D4857: Add new field OriginVisitStatus.type field on test data.
Jan 14 2021, 2:27 PM
vsellier committed rDJNLc451ecd54231: Add new field OriginVisitStatus.type field on test data (authored by vsellier).
Jan 14 2021, 2:27 PM
vsellier requested review of D4857: Add new field OriginVisitStatus.type field on test data.
Jan 14 2021, 2:24 PM
vsellier closed D4848: Add an optional type field on OriginVisitStatus object.
Jan 14 2021, 2:12 PM
vsellier committed rDMOD1ca92a5ce003: Add an optional type field on OriginVisitStatus object (authored by vsellier).
Jan 14 2021, 2:12 PM
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

After a reboot, the message "Failed to start Import ZFS pools by cache file" is displayed on the server console and the pool is not mounted. It may be caused by using /dev/sd* disk names directly, since those names are not guaranteed to be stable across reboots.
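
If the /dev/sd* names are indeed the cause, the persistent aliases under /dev/disk/by-id can be used instead. This is a sketch only: `persistent_names` is a hypothetical helper, and the pool name `data` in the comments is a placeholder that does not come from this task:

```shell
# Hypothetical helper: list the persistent aliases of a block device.
# $1: device node (e.g. /dev/sdb)
# $2: alias directory (defaults to /dev/disk/by-id)
persistent_names() {
  target=$(readlink -f "$1") || return 1
  for link in "${2:-/dev/disk/by-id}"/*; do
    [ -e "$link" ] || continue
    if [ "$(readlink -f "$link")" = "$target" ]; then
      printf '%s\n' "$link"
    fi
  done
}

# An existing pool is commonly re-imported with those names like this
# ("data" is a placeholder pool name):
#   zpool export data
#   zpool import -d /dev/disk/by-id data
```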

Jan 14 2021, 1:10 PM · System administration
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.
  • installation and configuration of zfs on esnode2
    • backport packages installed
    • kernel upgraded to 5.9
root@esnode2:~# apt update
root@esnode2:~# apt list --upgradable
Listing... Done
libnss-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libpam-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libsystemd0/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libudev1/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
linux-image-amd64/buster-backports 5.9.15-1~bpo10+1 amd64 [upgradable from: 4.19+105+deb10u8]
systemd-sysv/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd-timesyncd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
udev/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
root@esnode2:~# apt dist-upgrade
root@esnode2:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
root@esnode2:~# puppet agent --disable "zfs installation"
root@esnode2:~# shutdown -r now
  • zfs installation
root@esnode2:~# apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
  • kafka partition and old elasticsearch raid removed
root@esnode2:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab	2021-01-14 09:05:59.609906708 +0000
+++ /etc/fstab	2021-01-14 09:06:49.390527123 +0000
@@ -9,8 +9,5 @@
 UUID=3700082d-41e5-4c54-8667-46280f124b33 /               ext4    errors=remount-ro 0       1
 # /boot/efi was on /dev/sda1 during installation
 UUID=0228-9320  /boot/efi       vfat    umask=0077      0       1
-#/srv/kafka was on /dev/sda4 during installation
-#UUID=c97780cb-378c-4963-ac31-59281410b2f9 /srv/kafka      ext4    defaults        0       2
 # swap was on /dev/sda3 during installation
 UUID=3eea10c5-9913-44c1-aa85-a1e93ae12970 none            swap    sw              0       0
-/dev/md0	/srv/elasticsearch	xfs	defaults,noatime	0 0
  • Removing old raid :
root@esnode2:~# mdadm --detail /dev/md0  
/dev/md0:
           Version : 1.2
     Creation Time : Wed May 23 08:21:35 2018
        Raid Level : raid0
        Array Size : 5860150272 (5588.67 GiB 6000.79 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent
Jan 14 2021, 12:38 PM · System administration
vsellier updated the summary of D4849: Adapt cassandra storage to ignore the new OriginVisitStatus.type field.
Jan 14 2021, 11:05 AM
vsellier closed D4849: Adapt cassandra storage to ignore the new OriginVisitStatus.type field.

Closed by rDSTO0b44b37254974db02f87339448f4629fa9a91ded

Jan 14 2021, 11:03 AM
vsellier committed rDSTO0b44b3725497: Adapt cassandra storage to ignore the new OriginVisitStatus.type field (authored by vsellier).
Jan 14 2021, 11:00 AM
vsellier committed rSPSITE2c3462e16b1a: Prepare zfs packages installation on esnode[2-3] (authored by vsellier).
Jan 14 2021, 9:27 AM
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

The cause of the problem was a high write i/o pressure on esnode1 due to the index copy from esnode3.

Jan 14 2021, 9:18 AM · System administration

Jan 13 2021

vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

Interesting reading on how the cluster state is replicated/persisted : https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state-publishing.html
The shard reallocation seems to put a lot of pressure on the cluster. After the first timeout, esnode3 appears to be managing all the primary shards while esnode1 tries again and again to recover until a new timeout occurs.

Jan 13 2021, 5:31 PM · System administration
vsellier closed T2965: Adapt storage to actually write the visit_type in the origin_visit_status topic as Resolved.

Implemented in T2964

Jan 13 2021, 4:51 PM · Storage manager, Sprint 2021 01
vsellier closed T2965: Adapt storage to actually write the visit_type in the origin_visit_status topic, a subtask of T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler, as Resolved.
Jan 13 2021, 4:51 PM · Sprint 2021 01, Scheduling utilities
vsellier moved T2964: Adapt origin_visit_status_(get|add) api to deal with the visit_type from in-progress to code review on the Sprint 2021 01 board.
Jan 13 2021, 4:50 PM · Storage manager, Sprint 2021 01
vsellier added a revision to T2964: Adapt origin_visit_status_(get|add) api to deal with the visit_type: D4858: Add persistence of the field OriginVisitStatus.type.
Jan 13 2021, 4:47 PM · Storage manager, Sprint 2021 01
vsellier added a revision to T2965: Adapt storage to actually write the visit_type in the origin_visit_status topic: D4857: Add new field OriginVisitStatus.type field on test data.
Jan 13 2021, 4:22 PM · Storage manager, Sprint 2021 01
vsellier closed D4838: Add an new origin visit info model object and related backend api.
Jan 13 2021, 11:49 AM
vsellier committed rDSCHa62003397d6e: Add an new origin visit info model object and related backend api (authored by vsellier).
Jan 13 2021, 11:49 AM
vsellier moved T2964: Adapt origin_visit_status_(get|add) api to deal with the visit_type from in-progress to todo on the Sprint 2021 01 board.
Jan 13 2021, 11:39 AM · Storage manager, Sprint 2021 01
vsellier claimed T2965: Adapt storage to actually write the visit_type in the origin_visit_status topic.
Jan 13 2021, 11:34 AM · Storage manager, Sprint 2021 01
vsellier updated the diff for D4848: Add an optional type field on OriginVisitStatus object.

Add a comment to explain why the field type is optional

Jan 13 2021, 11:18 AM
vsellier requested review of D4849: Adapt cassandra storage to ignore the new OriginVisitStatus.type field.
Jan 13 2021, 11:10 AM
vsellier updated the diff for D4848: Add an optional type field on OriginVisitStatus object.

Remove unnecessary changes on tests

Jan 13 2021, 10:40 AM
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.
  • Gently remove the node from the cluster :
❯ export ES_NODE=esnode3.internal.softwareheritage.org:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : "192.168.100.62"
    }
}'
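
Once the node is back, the exclusion has to be lifted, which Elasticsearch does when the setting is reset to `null`. This is a sketch under that assumption; `exclude_ip_payload` is a hypothetical helper that only builds the JSON body:

```shell
# Hypothetical helper: build the settings payload that excludes an IP from
# shard allocation. Called without argument, it emits null to clear the exclusion.
exclude_ip_payload() {
  if [ -n "${1:-}" ]; then
    value="\"$1\""
  else
    value=null
  fi
  printf '{"transient":{"cluster.routing.allocation.exclude._ip":%s}}\n' "$value"
}

# Example, clearing the exclusion set above:
# curl -H "Content-Type: application/json" -XPUT "http://${ES_NODE}/_cluster/settings?pretty" \
#      -d "$(exclude_ip_payload)"
```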
Jan 13 2021, 9:31 AM · System administration
vsellier closed T2905: Deploy swh-search for production, a subtask of T2182: Switch production swh-web to use swh-search instead of postgresql search., as Resolved.
Jan 13 2021, 9:23 AM · System administration, Archive search, Storage manager
vsellier closed T2905: Deploy swh-search for production, a subtask of T2904: Create a new production webapp using the frozen index on the staging ES, as Resolved.
Jan 13 2021, 9:23 AM · System administrators, Journal, Archive search
vsellier closed T2905: Deploy swh-search for production as Resolved.

I am closing this issue as there is no more action to perform at the moment.
Diagnosis and any fixes will be tracked in dedicated issues.

Jan 13 2021, 9:23 AM · System administration, Journal, Archive search
vsellier moved T2939: Replace out of order disks on db1.staging and storage1.staging from in-progress to Weekly backlog on the System administration board.
Jan 13 2021, 9:22 AM · System administration
vsellier moved T2958: Use all the disks on esnode2 and esnode3 from Backlog to in-progress on the System administration board.
Jan 13 2021, 9:21 AM · System administration
vsellier changed the status of T2958: Use all the disks on esnode2 and esnode3 from Open to Work in Progress.
Jan 13 2021, 9:21 AM · System administration
vsellier changed the status of T2958: Use all the disks on esnode2 and esnode3, a subtask of T2888: Elasticsearch cluster failure during a rolling restart, from Open to Work in Progress.
Jan 13 2021, 9:21 AM · System administration
vsellier closed T2903: Test different disk configuration on esnode1 as Resolved.

After a week of observation, there are no visible differences in the system[1] and elasticsearch[2] monitoring.

Jan 13 2021, 9:20 AM · System administration
vsellier closed T2903: Test different disk configuration on esnode1, a subtask of T2888: Elasticsearch cluster failure during a rolling restart, as Resolved.
Jan 13 2021, 9:20 AM · System administration

Jan 12 2021

vsellier added a revision to T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler: D4849: Adapt cassandra storage to ignore the new OriginVisitStatus.type field.
Jan 12 2021, 6:29 PM · Sprint 2021 01, Scheduling utilities
vsellier requested review of D4848: Add an optional type field on OriginVisitStatus object.
Jan 12 2021, 6:12 PM
vsellier added a revision to T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler: D4848: Add an optional type field on OriginVisitStatus object.
Jan 12 2021, 6:11 PM · Sprint 2021 01, Scheduling utilities
vsellier updated the diff for D4838: Add an new origin visit info model object and related backend api.

Rebase

Jan 12 2021, 2:48 PM
vsellier updated the diff for D4838: Add an new origin visit info model object and related backend api.

Adapt according to review

Jan 12 2021, 2:18 PM
vsellier closed T2888: Elasticsearch cluster failure during a rolling restart, a subtask of T2852: Take back control on elasticsearch puppet manifests, as Resolved.
Jan 12 2021, 12:38 PM · System administration
vsellier closed T2888: Elasticsearch cluster failure during a rolling restart as Resolved.

The actions to replace the disk on esnode1 and stabilize the cluster are done, so this task can be marked as resolved.
The remaining work will be handled in dedicated tasks.

Jan 12 2021, 12:38 PM · System administration
vsellier requested review of D4838: Add an new origin visit info model object and related backend api.
Jan 12 2021, 12:35 PM
vsellier added a revision to T2443: Implement a bulk-queryable cache of latest visits for use by the recurrent visit scheduler: D4838: Add an new origin visit info model object and related backend api.
Jan 12 2021, 12:16 PM · Sprint 2021 01, Scheduling utilities
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Jan 12 2021, 10:46 AM · System administration
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Jan 12 2021, 10:46 AM · System administration
vsellier triaged T2960: Add disk health monitoring as Normal priority.
Jan 12 2021, 10:44 AM · System administration
vsellier triaged T2959: Move the system partition on a soft raid on esnode* as Normal priority.
Jan 12 2021, 10:20 AM · System administration
vsellier updated the task description for T2958: Use all the disks on esnode2 and esnode3.
Jan 12 2021, 10:11 AM · System administration
vsellier triaged T2958: Use all the disks on esnode2 and esnode3 as Normal priority.
Jan 12 2021, 10:11 AM · System administration

Jan 11 2021

vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

well well well

Jan 11 2021, 8:27 PM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

The model number to use on the request is: ST6000NM0115
There is an obscure message limiting the number of returns per country per year to 3 (!):

Jan 11 2021, 8:14 PM · System administration
vsellier updated the task description for T2939: Replace out of order disks on db1.staging and storage1.staging.
Jan 11 2021, 1:53 PM · System administration
vsellier accepted D4831: hedgedoc: Fix reverse proxy configuration.

LGTM

Jan 11 2021, 9:49 AM
vsellier updated the task description for T2939: Replace out of order disks on db1.staging and storage1.staging.
Jan 11 2021, 9:42 AM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

The test of /dev/sdb finally ended ... with an error:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       80%     26004         2559298584

So we have 2 disks to replace on each server. What's weird is that the 2 disks to replace are at the same position on each server...

Jan 11 2021, 9:39 AM · System administration

Jan 8 2021

vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

The test is still running on one disk on storage1 (sdb). No new errors were found on any of the other disks.

Jan 8 2021, 2:17 PM · System administration
vsellier accepted D4822: admin: Provision new rp0.internal.admin.swh.network.

LGTM

Jan 8 2021, 11:16 AM
vsellier accepted D4823: admin: Add rp0.internal.admin.swh.network reverse proxy for admin nodes.

LGTM

Jan 8 2021, 10:14 AM

Jan 7 2021

vsellier triaged T2944: Deploy swh-search v0.4.1 as Normal priority.
Jan 7 2021, 6:39 PM · System administration, Journal, Archive search
vsellier added a comment to T2936: Update the swh-search journal client to only set "has_visit" on "full" status of the visit.

version v0.4.1 created with the last commit (rDSEA47db624364d4e781f8fa157b2d72d0eb9929b7a0)

Jan 7 2021, 4:16 PM · Journal, Archive search
vsellier accepted D4818: Do not set 'has_visit' when receiving a visit from the journal.

LGTM
thanks for the query to fix the index

Jan 7 2021, 2:16 PM
vsellier updated the task description for T2939: Replace out of order disks on db1.staging and storage1.staging.
Jan 7 2021, 12:36 PM · System administration
vsellier updated the task description for T2939: Replace out of order disks on db1.staging and storage1.staging.
Jan 7 2021, 12:35 PM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

Tests launched :

root@db1:~# echo /dev/sd{a..n} | xargs -t -n1 smartctl -t long
smartctl -t long /dev/sda 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Jan 7 2021, 12:21 PM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

Complete disk statuses :

  • db1.staging:
root@db1:~# ls  /dev/sd{a..n} | xargs -t -n1 smartctl -a | grep -e "/dev/sd?" -e Reallocated_Sector_Ct -e "Model Family" -e "Serial Number" -e "Reported_Uncorrect" -e lifetime -e "Extended offline" -e "Offline_Uncorrectable"
smartctl -a /dev/sda 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27CCS
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdb 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27C4P
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
# 1  Extended offline    Completed: read failure       70%     25421         4131034152
smartctl -a /dev/sdc 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27DW0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdd 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27A44
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sde 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27BA5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdf 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27DCG
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdg 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD270KS
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdh 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27A4P
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdi 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD27E48
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25431         -
smartctl -a /dev/sdj 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD26YN2
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdk 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD279XY
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25432         -
smartctl -a /dev/sdl 
Model Family:     Seagate Enterprise Capacity 3.5 HDD
Serial Number:    ZAD279ZX
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
# 1  Extended offline    Completed without error       00%     25427         -
smartctl -a /dev/sdm 
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Serial Number:    PHDV71810017150MGN
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
# 1  Extended offline    Completed without error       00%     25415         -
smartctl -a /dev/sdn 
Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Serial Number:    PHDV718004DM150MGN
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
# 1  Extended offline    Completed without error       00%     25415         -
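
A report like this one is long to read by eye, so the suspicious disks can be extracted automatically. This is a sketch that assumes the exact report format pasted above, including the `smartctl -a /dev/sdX` lines echoed by `xargs -t`; `flag_disks` is a hypothetical helper and `report.txt` a placeholder file name:

```shell
# Hypothetical helper: read a filtered smartctl report on stdin and print
# one line per disk with a non-zero error counter or a failed self-test.
flag_disks() {
  awk '
    # Print the findings collected for the current device, if any.
    function emit() { if (dev != "" && msg != "") print dev ":" msg; msg = "" }
    # A new "smartctl -a /dev/sdX" line starts the next device.
    /^smartctl -a / { emit(); dev = $3 }
    # The raw value of a SMART attribute is the last field of its line.
    $2 ~ /^(Reallocated_Sector_Ct|Reported_Uncorrect|Offline_Uncorrectable)$/ && $NF + 0 > 0 {
      msg = msg " " $2 "=" $NF
    }
    /Extended offline/ && /read failure/ { msg = msg " self-test=read-failure" }
    END { emit() }
  '
}

# Example, with the report above saved to a file:
# flag_disks < report.txt
```

On the db1.staging report above, only /dev/sda (8 reallocated sectors) and /dev/sdb (8 offline uncorrectable sectors and a read failure) would be listed.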
Jan 7 2021, 12:18 PM · System administration
vsellier lowered the priority of T2888: Elasticsearch cluster failure during a rolling restart from High to Normal.

Reducing priority to normal as there is no more risk to the data.

Jan 7 2021, 12:04 PM · System administration
vsellier moved T2939: Replace out of order disks on db1.staging and storage1.staging from Backlog to in-progress on the System administration board.
Jan 7 2021, 12:03 PM · System administration
vsellier changed the status of T2939: Replace out of order disks on db1.staging and storage1.staging from Open to Work in Progress.
Jan 7 2021, 12:02 PM · System administration
vsellier added a comment to T2905: Deploy swh-search for production.

It depends on what will be implemented in T2936, but a new reindex will probably be needed to fix the search. It will be an opportunity to think about how to do it without taking down the whole search.

Jan 7 2021, 11:36 AM · System administration, Journal, Archive search
vsellier updated subscribers of T2905: Deploy swh-search for production.

@vlorentz I was checking some differences between swh-search and the current search. Does the journal client have to listen to the origin_visit topic? It seems that `origin_visit_status` should be enough to match the behavior of the search in the webapp.

Jan 7 2021, 10:14 AM · System administration, Journal, Archive search

Jan 6 2021

vsellier committed rSPRE556448d54882: align search-esnode* configuration with the real number (authored by vsellier).
Jan 6 2021, 4:05 PM
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

previous comment moved to T2903#56023

Jan 6 2021, 3:45 PM · System administration
vsellier added a comment to T2903: Test different disk configuration on esnode1.

The benchmark is done:

Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
esnode1-zfs-arc 63G  312k  99  478m  49  200m  43  640k  93  445m  53 400.8  31
Latency             31118us   58579us     748ms     231ms   78052us     275ms
Version  1.98       ------Sequential Create------ --------Random Create--------
esnode1-zfs-arc-lim -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16384  24 +++++ +++ 16384   7 16384  35 +++++ +++ 16384   6
Latency               145ms    2012us     826ms     105ms      21us     842ms
1.98,1.98,esnode1-zfs-arc-limited,1,1609729287,63G,,8192,5,312,99,489919,49,204649,43,640,93,455669,53,400.8,31,16,,,,,4059,24,+++++,+++,3023,7,11686,35,+++++,+++,2398,6,31118us,58579us,748ms,231ms,78052us,275ms,145ms,2012us,826ms,105ms,21us,842ms

(sorry for the formatting, I didn't find how to make it better)

Jan 6 2021, 3:43 PM · System administration
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Jan 6 2021, 3:43 PM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.
Jan 6 2021, 3:42 PM · System administration
vsellier updated the task description for T2905: Deploy swh-search for production.
Jan 6 2021, 11:06 AM · System administration, Journal, Archive search
vsellier added a comment to T2905: Deploy swh-search for production.

webapp1 is now plugged on the real live production index.
Let's monitor the behavior with real searches.
First observation: the search retrieves all the documents and is not as progressive as the random search script.
The response times are longer than expected:

Jan 06 09:59:46 search1 python3[813]: 2021-01-06 09:59:46 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:3.399s]
Jan 06 10:06:18 search1 python3[848]: 2021-01-06 10:06:18 [848] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:7.422s]
Jan 06 10:06:21 search1 python3[813]: 2021-01-06 10:06:21 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:5.077s]
Jan 06 10:07:32 search1 python3[813]: 2021-01-06 10:07:32 [813] elasticsearch:INFO GET http://search-esnode2.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:4.819s]
Jan 06 10:08:06 search1 python3[813]: 2021-01-06 10:08:06 [813] elasticsearch:INFO GET http://search-esnode1.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.700s]
Jan 06 10:08:15 search1 python3[813]: 2021-01-06 10:08:15 [813] elasticsearch:INFO GET http://search-esnode3.internal.softwareheritage.org:9200/origin/_search?size=100 [status:200 request:2.414s]
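
To follow these response times over a longer period, the durations can be summarized from the log. This is a sketch that assumes the `request:<seconds>s]` suffix seen in the lines above; `request_stats` is a hypothetical helper and `search.log` a placeholder file name:

```shell
# Hypothetical helper: read elasticsearch client log lines on stdin and
# print the number of requests, their mean duration and the slowest one.
request_stats() {
  # Keep only the duration found in "request:<seconds>s]".
  sed -n 's/.*request:\([0-9.]*\)s\].*/\1/p' |
    awk '
      { sum += $1; if ($1 > max) max = $1 }
      END { if (NR) printf "n=%d mean=%.3f max=%.3f\n", NR, sum / NR, max }
    '
}

# Example, with the log lines above saved to a file:
# request_stats < search.log
```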
Jan 6 2021, 11:01 AM · System administration, Journal, Archive search
vsellier closed D4809: Plug webapp1 on the swh-search with live production data.
Jan 6 2021, 10:28 AM
vsellier committed rSPSITE57b96d45f705: Plug webapp1 on the swh-search with live production data (authored by vsellier).
Jan 6 2021, 10:28 AM
vsellier added a comment to T2905: Deploy swh-search for production.

The performance looks acceptable as is for a small number of parallel searches (~10). Let's now try with real searches; it will also help to adapt the cluster configuration and validate the behavior.

Jan 6 2021, 9:59 AM · System administration, Journal, Archive search
vsellier updated the task description for T2905: Deploy swh-search for production.
Jan 6 2021, 9:56 AM · System administration, Journal, Archive search
vsellier requested review of D4809: Plug webapp1 on the swh-search with live production data.
Jan 6 2021, 9:55 AM
vsellier added a revision to T2905: Deploy swh-search for production: D4809: Plug webapp1 on the swh-search with live production data.
Jan 6 2021, 9:55 AM · System administration, Journal, Archive search
vsellier committed rSENV69103055ea85: Update octocatalog-diff facts (authored by vsellier).
Jan 6 2021, 9:53 AM

Jan 5 2021

vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

The new disks are OK according to the SMART test:

root@esnode1:~# echo /dev/sd{b,c} | xargs -n1 smartctl -a | grep -A2 "Self-test log"
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
--
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
Jan 5 2021, 7:52 PM · System administration
vsellier added a comment to T2888: Elasticsearch cluster failure during a rolling restart.

The 2 disks were replaced :

root@esnode1:~# smartctl -a /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.9.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Jan 5 2021, 3:50 PM · System administration