Software Heritage

System administration
Active · Public

Members

  • This project does not have any members.

Watchers

  • This project does not have any watchers.

Details

Description

General system administration tasks, not specific to any product.

Recent Activity

Yesterday

winkies added a comment to T2277: varnish: limit maximum size of incoming POST requests for Web API.

According to the Django documentation, the DATA_UPLOAD_MAX_MEMORY_SIZE setting should do the job.
By default, it is set to 2.5 MB. In my experience, 1 MB is sufficient (it is also the limit used by some JS libraries, for example).
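A minimal sketch of what such an override could look like in the swh-web Django settings (value and placement are illustrative, not the deployed configuration; Django's built-in default is 2621440 bytes, i.e. 2.5 MB):

# hypothetical settings override, lowering the limit to 1 MB as suggested above
DATA_UPLOAD_MAX_MEMORY_SIZE = 1 * 1024 * 1024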

Wed, Jan 20, 7:53 PM · System administration
moranegg added a project to T2920: Document staging infrastructure: Documentation.
Wed, Jan 20, 10:33 AM · Documentation, System administration, Staging environment
vsellier added a comment to T2976: Deposit tests end-to-end are failing in icinga.

It seems it is the scheduler runner that is taking time to schedule the deposit task:
08:37:53 -> the task is created
08:43:05 -> the runner schedules the task
08:43:24 -> the worker acknowledges the task

Wed, Jan 20, 9:50 AM · System administration, SWORD deposit

Tue, Jan 19

vsellier closed T2866: Integrate former Uffizi server to the proxmox cluster as Resolved.
Tue, Jan 19, 7:51 PM · System administration
vsellier closed T2866: Integrate former Uffizi server to the proxmox cluster, a subtask of T2865: Prepare an environment to test the ClearlyDefined integration, as Resolved.
Tue, Jan 19, 7:51 PM · System administration
vsellier moved T2866: Integrate former Uffizi server to the proxmox cluster from in-progress to deployed on the System administration board.
Tue, Jan 19, 7:50 PM · System administration
vsellier added a comment to T2866: Integrate former Uffizi server to the proxmox cluster.

The interfaces on VLAN1330 and VLAN440 were already configured.

Tue, Jan 19, 7:50 PM · System administration
vsellier added a comment to T2976: Deposit tests end-to-end are failing in icinga.

The package python3-swh.icingaplugins v0.4.3 has been released and deployed on pergamon.

Tue, Jan 19, 11:35 AM · System administration, SWORD deposit
vsellier closed T2958: Use all the disks on esnode2 and esnode3 as Resolved.

The shard reallocation is done:

❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  2935        2.9tb       3tb      3.7tb      6.7tb           44 192.168.100.61 192.168.100.61 esnode1
  2936        2.9tb       3tb      3.7tb      6.7tb           44 192.168.100.62 192.168.100.62 esnode2
  2935        2.9tb     2.9tb      3.8tb      6.7tb           43 192.168.100.63 192.168.100.63 esnode3
Tue, Jan 19, 8:54 AM · System administration
vsellier closed T2958: Use all the disks on esnode2 and esnode3, a subtask of T2888: Elasticsearch cluster failure during a rolling restart, as Resolved.
Tue, Jan 19, 8:54 AM · System administration
vsellier added a revision to T2976: Deposit tests end-to-end are failing in icinga: D4878: Remove the deprecated external_identifer from metadata.
Tue, Jan 19, 7:44 AM · System administration, SWORD deposit

Mon, Jan 18

vsellier triaged T2976: Deposit tests end-to-end are failing in icinga as Normal priority.
Mon, Jan 18, 7:57 PM · System administration, SWORD deposit
vsellier moved T2920: Document staging infrastructure from Backlog to Weekly backlog on the System administration board.
Mon, Jan 18, 7:13 PM · Documentation, System administration, Staging environment
vsellier added a project to T2920: Document staging infrastructure: System administration.
Mon, Jan 18, 7:13 PM · Documentation, System administration, Staging environment
vsellier moved T2975: Disk replacement on esnode1 from Backlog to Weekly backlog on the System administration board.
Mon, Jan 18, 7:03 PM · System administration
vsellier triaged T2975: Disk replacement on esnode1 as Normal priority.
Mon, Jan 18, 7:02 PM · System administration
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

esnode3 was configured with the same procedure as esnode2 (see the previous comments).

Mon, Jan 18, 9:39 AM · System administration
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

esnode3 is ready to be migrated:

❯ curl -s http://192.168.100.63:9200/_cat/allocation\?v\&s\=node
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.61 192.168.100.61 esnode1
  4397        4.4tb     4.4tb      2.3tb      6.7tb           65 192.168.100.62 192.168.100.62 esnode2
     0           0b     5.9gb      5.4tb      5.4tb            0 192.168.100.63 192.168.100.63 esnode3
Mon, Jan 18, 9:10 AM · System administration

Fri, Jan 15

vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

The cluster has stabilized:

❯ curl -s http://192.168.100.63:9200/_cat/health\?v
epoch      timestamp cluster          status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1610689991 05:53:11  swh-logging-prod green           3         3   8758 4379    0    0        0             0                  -                100.0%
Fri, Jan 15, 6:55 AM · System administration

Thu, Jan 14

ardumont added a comment to T2962: hedgedoc: Fix irrelevant access to hedgedoc instance.

Manually deployed:

root@rp1:/etc/varnish/includes# cat 90_vhost_forbidden_access_swh-rproxy3.inria.fr.vcl
# vhost_forbidden_access_swh-rproxy3.inria.fr.vcl
#
# Settings for swh-rproxy3.inria.fr vhost to refuse access
#
# File managed by puppet. All modifications will be lost.
Thu, Jan 14, 2:19 PM · System administration
ardumont updated the task description for T2962: hedgedoc: Fix irrelevant access to hedgedoc instance.
Thu, Jan 14, 2:18 PM · System administration
ardumont moved T2962: hedgedoc: Fix irrelevant access to hedgedoc instance from in-progress to code-review on the System administration board.
Thu, Jan 14, 2:17 PM · System administration
ardumont added a revision to T2962: hedgedoc: Fix irrelevant access to hedgedoc instance: D4862: varnish: Define vhost with forbidden access.
Thu, Jan 14, 2:17 PM · System administration
ardumont changed the status of T2962: hedgedoc: Fix irrelevant access to hedgedoc instance from Open to Work in Progress.
Thu, Jan 14, 2:07 PM · System administration
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

After a reboot, the message "Failed to start Import ZFS pools by cache file" is displayed on the server console and the pool is not mounted. It seems this can be caused by using /dev/sd* disk names directly, as they are not stable across reboots.
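A hedged sketch of the usual workaround (the pool name is illustrative): re-import the pool with stable /dev/disk/by-id device names instead of /dev/sd*, then refresh the cache file used by the import service at boot:

root@esnode2:~# zpool export elasticsearch-data
root@esnode2:~# zpool import -d /dev/disk/by-id elasticsearch-data
root@esnode2:~# zpool set cachefile=/etc/zfs/zpool.cache elasticsearch-data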

Thu, Jan 14, 1:10 PM · System administration
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.
  • installation and configuration of zfs on esnode2
    • backport packages installed
    • kernel upgraded to 5.9 (from buster-backports)
root@esnode2:~# apt update
root@esnode2:~# apt list --upgradable
Listing... Done
libnss-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libpam-systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libsystemd0/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
libudev1/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
linux-image-amd64/buster-backports 5.9.15-1~bpo10+1 amd64 [upgradable from: 4.19+105+deb10u8]
systemd-sysv/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd-timesyncd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
systemd/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
udev/buster-backports 247.2-4~bpo10+1 amd64 [upgradable from: 247.2-1~bpo10+1]
root@esnode2:~# apt dist-upgrade
root@esnode2:~# systemctl disable elasticsearch
Synchronizing state of elasticsearch.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable elasticsearch
Removed /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
root@esnode2:~# puppet agent --disable "zfs installation"
root@esnode2:~# shutdown -r now
  • zfs installation
root@esnode2:~#  apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
  • kafka partition and old elasticsearch raid removed
root@esnode2:~# diff -U3 /tmp/fstab /etc/fstab
--- /tmp/fstab	2021-01-14 09:05:59.609906708 +0000
+++ /etc/fstab	2021-01-14 09:06:49.390527123 +0000
@@ -9,8 +9,5 @@
 UUID=3700082d-41e5-4c54-8667-46280f124b33 /               ext4    errors=remount-ro 0       1
 # /boot/efi was on /dev/sda1 during installation
 UUID=0228-9320  /boot/efi       vfat    umask=0077      0       1
-#/srv/kafka was on /dev/sda4 during installation
-#UUID=c97780cb-378c-4963-ac31-59281410b2f9 /srv/kafka      ext4    defaults        0       2
 # swap was on /dev/sda3 during installation
 UUID=3eea10c5-9913-44c1-aa85-a1e93ae12970 none            swap    sw              0       0
-/dev/md0	/srv/elasticsearch	xfs	defaults,noatime	0 0
  • Removing the old raid:
root@esnode2:~# mdadm --detail /dev/md0  
/dev/md0:
           Version : 1.2
     Creation Time : Wed May 23 08:21:35 2018
        Raid Level : raid0
        Array Size : 5860150272 (5588.67 GiB 6000.79 GB)
      Raid Devices : 3
     Total Devices : 3
       Persistence : Superblock is persistent
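To finish freeing the disks, the next steps would look something like the following hedged sketch (device and pool names are illustrative, not taken from this log):

root@esnode2:~# mdadm --stop /dev/md0
root@esnode2:~# mdadm --zero-superblock /dev/sdb /dev/sdc /dev/sdd    # raid members, illustrative
root@esnode2:~# zpool create -o ashift=12 elasticsearch-data /dev/disk/by-id/<disk1> /dev/disk/by-id/<disk2> /dev/disk/by-id/<disk3>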
Thu, Jan 14, 12:38 PM · System administration
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

The cause of the problem was high write I/O pressure on esnode1 due to the index copy from esnode3.

Thu, Jan 14, 9:18 AM · System administration

Wed, Jan 13

vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.

Interesting reading on how the cluster state is replicated/persisted: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state-publishing.html
It seems the shard reallocation puts a lot of pressure on the cluster. After the first timeout, esnode3 appears to be handling all the primary shards while esnode1 tries again and again to recover until a new timeout occurs.
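One way to relieve that pressure would be to throttle the recovery traffic while the cluster catches up; a hedged sketch (values illustrative, not settings that were actually applied):

❯ curl -H "Content-Type: application/json" -XPUT http://192.168.100.63:9200/_cluster/settings\?pretty -d '{
    "transient" : {
        "indices.recovery.max_bytes_per_sec" : "20mb",
        "cluster.routing.allocation.node_concurrent_recoveries" : 1
    }
}'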

Wed, Jan 13, 5:31 PM · System administration
vsellier added a comment to T2958: Use all the disks on esnode2 and esnode3.
  • Gently remove the node from the cluster:
❯ export ES_NODE=esnode3.internal.softwareheritage.org:9200
❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : "192.168.100.62"
    }
}'
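Once the maintenance is over, the exclusion can be cleared with the symmetric call (a hedged sketch, not part of the original log; setting the value to null removes the transient setting):

❯ curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
    "transient" : {
        "cluster.routing.allocation.exclude._ip" : null
    }
}'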
Wed, Jan 13, 9:31 AM · System administration
vsellier closed T2905: Deploy swh-search for production as Resolved.

I am closing this issue as there is no more action to perform at the moment.
Diagnosis and any fixes will be tracked in dedicated issues.

Wed, Jan 13, 9:23 AM · System administration, Journal, Archive search
vsellier moved T2939: Replace out of order disks on db1.staging and storage1.staging from in-progress to Weekly backlog on the System administration board.
Wed, Jan 13, 9:22 AM · System administration
vsellier moved T2958: Use all the disks on esnode2 and esnode3 from Backlog to in-progress on the System administration board.
Wed, Jan 13, 9:21 AM · System administration
vsellier changed the status of T2958: Use all the disks on esnode2 and esnode3 from Open to Work in Progress.
Wed, Jan 13, 9:21 AM · System administration
vsellier changed the status of T2958: Use all the disks on esnode2 and esnode3, a subtask of T2888: Elasticsearch cluster failure during a rolling restart, from Open to Work in Progress.
Wed, Jan 13, 9:21 AM · System administration
vsellier closed T2903: Test different disk configuration on esnode1 as Resolved.

After a week of observation, there are no visible differences in the system[1] and Elasticsearch[2] monitoring.

Wed, Jan 13, 9:20 AM · System administration
vsellier closed T2903: Test different disk configuration on esnode1, a subtask of T2888: Elasticsearch cluster failure during a rolling restart, as Resolved.
Wed, Jan 13, 9:20 AM · System administration

Tue, Jan 12

ardumont added a comment to T2962: hedgedoc: Fix irrelevant access to hedgedoc instance.

Tentatively tried:

Tue, Jan 12, 2:43 PM · System administration
ardumont triaged T2962: hedgedoc: Fix irrelevant access to hedgedoc instance as Normal priority.
Tue, Jan 12, 2:41 PM · System administration
vsellier closed T2888: Elasticsearch cluster failure during a rolling restart, a subtask of T2852: Take back control on elasticsearch puppet manifests, as Resolved.
Tue, Jan 12, 12:38 PM · System administration
vsellier closed T2888: Elasticsearch cluster failure during a rolling restart as Resolved.

The actions to replace the disk on esnode1 and stabilize the cluster are done, so this task can be marked as resolved.
The remaining work will be handled in dedicated tasks.

Tue, Jan 12, 12:38 PM · System administration
vlorentz shifted T2939: Replace out of order disks on db1.staging and storage1.staging from the Restricted Space space to the S1 Public space.
Tue, Jan 12, 11:38 AM · System administration
vlorentz shifted T2939: Replace out of order disks on db1.staging and storage1.staging from the S1 Public space to the Restricted Space space.
Tue, Jan 12, 11:37 AM · System administration
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Tue, Jan 12, 10:46 AM · System administration
vsellier updated the task description for T2888: Elasticsearch cluster failure during a rolling restart.
Tue, Jan 12, 10:46 AM · System administration
vsellier triaged T2960: Add disk health monitoring as Normal priority.
Tue, Jan 12, 10:44 AM · System administration
vsellier triaged T2959: Move the system partition on a soft raid on esnode* as Normal priority.
Tue, Jan 12, 10:20 AM · System administration
vsellier updated the task description for T2958: Use all the disks on esnode2 and esnode3.
Tue, Jan 12, 10:11 AM · System administration
vsellier triaged T2958: Use all the disks on esnode2 and esnode3 as Normal priority.
Tue, Jan 12, 10:11 AM · System administration

Mon, Jan 11

vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

well well well

Mon, Jan 11, 8:27 PM · System administration
vsellier added a comment to T2939: Replace out of order disks on db1.staging and storage1.staging.

The model number to use in the return request is ST6000NM0115.
There is an obscure message limiting the number of returns per country and per year to 3 (!):

Mon, Jan 11, 8:14 PM · System administration