Jun 10 2021
- Configuration of the swh-search and journal client services deployed
- Old node decommissioning on the cluster:
export ES_NODE=192.168.100.86:9200
curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty \
  -d '{ "transient" : { "cluster.routing.allocation.exclude._ip" : "192.168.100.81,192.168.100.82,192.168.100.83" } }'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.81,192.168.100.82,192.168.100.83"
          }
        }
      }
    }
  }
}
The shards are starting to be gently moved off the old servers:
curl -s http://search-esnode4:9200/_cat/allocation\?s\=host\&v
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
    27       38.7gb    38.8gb    153.7gb    192.6gb           20 192.168.100.81 192.168.100.81 search-esnode1
    27       37.7gb    37.8gb    154.8gb    192.6gb           19 192.168.100.82 192.168.100.82 search-esnode2
    22       30.5gb    30.6gb      162gb    192.6gb           15 192.168.100.83 192.168.100.83 search-esnode3
    35         50gb    50.1gb      6.6tb      6.7tb            0 192.168.100.86 192.168.100.86 search-esnode4
    35         50gb    50.2gb      6.6tb      6.7tb            0 192.168.100.87 192.168.100.87 search-esnode5
    34       49.4gb    49.5gb      6.6tb      6.7tb            0 192.168.100.88 192.168.100.88 search-esnode6
Once there are no shards left on the old servers, we will be able to stop them and remove them from the Proxmox server.
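For reference, a quick way to follow up once the relocation finishes (a sketch assuming the same ES_NODE variable as above): verify that the old nodes hold zero shards, then clear the transient exclusion after they have been shut down.

# check that the three old nodes report 0 shards before stopping them
curl -s "http://${ES_NODE}/_cat/allocation?s=host&v"
# once they are gone, drop the transient allocation exclusion (setting it to
# null removes it from the cluster settings)
curl -H "Content-Type: application/json" -XPUT "http://${ES_NODE}/_cluster/settings?pretty" \
  -d '{ "transient" : { "cluster.routing.allocation.exclude._ip" : null } }'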
Jun 9 2021
And all the new nodes are now in the production cluster:
curl -s http://search-esnode4:9200/_cat/allocation\?s\=host\&v
shards disk.indices disk.used disk.avail disk.total disk.percent host           ip             node
    30       42.7gb    42.9gb    149.7gb    192.6gb           22 192.168.100.81 192.168.100.81 search-esnode1
    30       41.4gb    41.6gb      151gb    192.6gb           21 192.168.100.82 192.168.100.82 search-esnode2
    30       41.7gb    41.8gb    150.8gb    192.6gb           21 192.168.100.83 192.168.100.83 search-esnode3
    30       41.9gb      42gb      6.6tb      6.7tb            0 192.168.100.86 192.168.100.86 search-esnode4
    30       41.8gb    41.9gb      6.6tb      6.7tb            0 192.168.100.87 192.168.100.87 search-esnode5
    30       41.2gb    41.3gb      6.6tb      6.7tb            0 192.168.100.88 192.168.100.88 search-esnode6
The next step will be to switch the swh-search configurations to use the new nodes and progressively remove the old nodes from the cluster.
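As a sanity check before the switch, cluster membership and health can be confirmed with the same _cat APIs (plain curl, nothing assumed beyond the node name used above):

# list the cluster members and their roles
curl -s "http://search-esnode4:9200/_cat/nodes?v&h=ip,name,node.role"
# overall cluster health should stay green while the configuration is switched
curl -s "http://search-esnode4:9200/_cat/health?v"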
- ZFS installation:
root@search-esnode4:~# apt update && apt install linux-image-amd64 linux-headers-amd64
root@search-esnode4:~# shutdown -r now   # to apply the kernel
root@search-esnode4:~# apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
- Refresh with the latest packages installed from backports:
root@search-esnode4:~# apt dist-upgrade   # triggers a udev upgrade which leads to a network interface renaming
root@search-esnode4:~# sed -i 's/ens1/enp2s0/g' /etc/network/interfaces
- Pre-ZFS configuration actions:
root@search-esnode4:~# puppet agent --disable
root@search-esnode4:~# systemctl disable elasticsearch
root@search-esnode4:~# systemctl stop elasticsearch
root@search-esnode4:~# rm -rf /srv/elasticsearch/nodes
- To manage the disks via ZFS, the RAID card had to be configured in enhanced HBA mode in the iDRAC (a pool creation sketch follows at the end of this entry)
- After a reboot, the disks are correctly detected by the system:
root@search-esnode4:~# ls -al /dev/sd*
brw-rw---- 1 root disk 8,  0 Jun  9 04:54 /dev/sda
brw-rw---- 1 root disk 8, 16 Jun  9 04:54 /dev/sdb
brw-rw---- 1 root disk 8, 32 Jun  9 04:54 /dev/sdc
brw-rw---- 1 root disk 8, 48 Jun  9 04:54 /dev/sdd
brw-rw---- 1 root disk 8, 64 Jun  9 04:54 /dev/sde
brw-rw---- 1 root disk 8, 80 Jun  9 04:54 /dev/sdf
brw-rw---- 1 root disk 8, 96 Jun  9 04:54 /dev/sdg
brw-rw---- 1 root disk 8, 97 Jun  9 04:54 /dev/sdg1
brw-rw---- 1 root disk 8, 98 Jun  9 04:54 /dev/sdg2
brw-rw---- 1 root disk 8, 99 Jun  9 04:54 /dev/sdg3
root@search-esnode4:~# smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-16-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
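The pool for the elasticsearch data directory still has to be created. A minimal sketch of what that could look like, assuming a plain striped pool named elasticsearch-data over the six data disks and mounted on /srv/elasticsearch (pool name, layout and mountpoint are assumptions, not the final setup):

# create a striped pool over the six data disks, 4K sectors, mounted where
# elasticsearch expects its data directory
zpool create -o ashift=12 -O mountpoint=/srv/elasticsearch elasticsearch-data \
    sda sdb sdc sdd sde sdf
# verify the pool and datasets
zpool status elasticsearch-data
zfs list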
Jun 8 2021
Jun 4 2021
Jun 3 2021
I experimented with Grid5000 to see how the jobs work and how to initialize the reserved nodes.
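For the record, the kind of commands involved (a hedged sketch from memory of the Grid5000/OAR tooling; the resource spec, walltime and environment name are assumptions to be adapted to the actual experiment):

# reserve two nodes for two hours, in deploy mode, interactively
oarsub -I -t deploy -l nodes=2,walltime=2:00:00
# from inside the job, reinstall the reserved nodes with a base Debian
# environment and push an SSH key to reach them as root
kadeploy3 -f $OAR_NODE_FILE -e debian10-x64-base -k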
lgtm
Jun 2 2021
- The fix was deployed on webapp1 and moma
- The refresh script was manually launched:
root@webapp1:~# /usr/local/bin/refresh-savecodenow-statuses
Successfully updated 140 save request(s).
The previous requests were correctly refreshed and are now displaying the right status.
Will be deployed with version v0.0.310 of the webapp (build in progress)
fix typo in commit message
Jun 1 2021
May 28 2021
The OPNsense firewall configuration was finalized, based on the initial configuration olasd had previously done on the firewalls.
May 27 2021
The save code now queue statistics are now displayed on the status.io page[1]. The data is refreshed every 5 minutes.
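The 5-minute refresh could be handled by a simple cron entry; a hypothetical sketch (script path, user and log file are placeholders, not the actual setup):

# /etc/cron.d/swh-status-metrics (hypothetical): push the queue statistics
# to status.io every 5 minutes
*/5 * * * * root /usr/local/bin/update_metrics.py >> /var/log/swh-status-metrics.log 2>&1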
May 26 2021
update python script:
- remove some prints
- add missing types
- use dict access instead of get
May 25 2021
The servers should be installed in the rack on May 26th. The network configuration will follow the same day or the next day.
They will be installed as-is by the "DSI", so we will have to install the system via the iDRAC once they are reachable.
With a master declared in the DNS, everything seems to work well.
When the docker command is launched on a node, its status is correctly detected and the node is configured after a couple of minutes.
The cluster explorer is also working now.
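A quick way to double-check a node registration (nothing assumed beyond standard kubectl and docker, run respectively from a machine with the cluster kubeconfig and on the node itself):

# the freshly registered node should appear and reach the Ready state
kubectl get nodes -o wide
# on the node itself, the agent and kubernetes containers started by Rancher
# should be running
docker ps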
Metrics can easily be pushed to the status page.
The simple PoC for the save code now requests is available here: https://forge.softwareheritage.org/source/snippets/browse/master/sysadmin/status.io/update_metrics.py
May 20 2021
The basic installation with Helm is simple for a single-server installation: https://rancher.com/docs/rancher/v2.5/en/installation/install-rancher-on-k8s/#install-the-rancher-helm-chart
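The core of it boils down to a few commands taken from the linked documentation (the hostname is a placeholder; cert-manager also has to be installed beforehand unless certificates are provided separately):

# add the rancher chart repository and install the server in its own namespace
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
kubectl create namespace cattle-system
helm install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.org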
From the status.swh.org point of view, status.io provides an API endpoint to push metrics. It should be possible to add some metrics (up to 10 with our plan) to expose the behavior of the platform (daily, weekly and monthly statistics).
As a first step, we could expose the number of pending save code now requests and the number of origin visits to have some live data. An example of a status page with metrics: https://status.docker.com/
I'm working on a code snippet to test the integration feasibility/complexity.
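To give an idea of the shape of the call being tested, a rough sketch based on my understanding of the status.io metric update API; the endpoint, header names and payload fields are assumptions to be checked against their documentation, and all IDs and values are placeholders:

# push a single daily data point for one metric (the API also expects the
# weekly/monthly arrays to be provided in the same way)
curl -X POST "https://api.status.io/v2/status/metric/update" \
  -H "x-api-id: ${STATUSIO_API_ID}" \
  -H "x-api-key: ${STATUSIO_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "statuspage_id": "<statuspage id>",
        "metric_id": "<metric id>",
        "day_avg": 42,
        "day_start": "2021-05-20T00:00:00+00:00",
        "day_dates": ["2021-05-20T12:00:00+00:00"],
        "day_values": [42]
      }'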
great simplification! thanks
LGTM
May 19 2021
After some hard times with Vagrant internals and the pergamon configuration, we finally have a working puppet master.
The collected resources are correctly detected and applied, as shown here with logstash0's icinga resources:
Notice: /Stage[main]/Profile::Icinga2::Master/Icinga2::Object::Host[pergamon.softwareheritage.org]/Icinga2::Object[icinga2::object::Host::pergamon.softwareheritage.org]/Concat[/etc/icinga2/zones.d/master/pergamon.softwareheritage.org.conf]/File[/etc/icinga2/zones.d/master/pergamon.softwareheritage.org.conf]/ensure: defined content as '{md5}e98c7cafc5300df8101f591d1c7a708b'
Info: Concat[/etc/icinga2/zones.d/master/pergamon.softwareheritage.org.conf]: Scheduling refresh of Class[Icinga2::Service]
Notice: /Stage[main]/Profile::Grafana::Vhost/Icinga2::Object::Service[grafana http redirect on pergamon.softwareheritage.org]/Icinga2::Object[icinga2::object::Service::grafana http redirect on pergamon.softwareheritage.org]/Concat[/etc/icinga2/zones.d/master/exported-checks.conf]/File[/etc/icinga2/zones.d/master/exported-checks.conf]/content:
May 18 2021
and the build is green ;)
thanks for having investigated that
May 17 2021
May 12 2021
May 11 2021
May 10 2021
These errors will be caught by the alert created in T3222.