Jun 16 2021
Some notes on how to perform common actions with Cassandra: https://hedgedoc.softwareheritage.org/m2MBUViUQl2r9dwcq3-_Nw
Thanks, the diff is updated according to your reviews:
- rename the test to something more generic
- add searches on the boolean field
Jun 15 2021
LGTM
rebase
The environment can be stopped and rebuilt as long as the disks remain reserved on the servers.
Jun 14 2021
nice, thanks!
Jun 11 2021
These are the mappings of the staging and production environments.
As expected, the production mapping has more fields than the staging one.
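For reference, the mappings can be dumped and compared directly from Elasticsearch; a minimal sketch, assuming the index is named origin and using placeholder node addresses:
# Placeholder node addresses, to be replaced with one staging and one production ES node
export STAGING_ES_NODE=<staging-es-host>:9200
export PROD_ES_NODE=<production-es-host>:9200
# Dump the field mappings of the (assumed) origin index and diff them
diff <(curl -s http://${STAGING_ES_NODE}/origin/_mapping | jq .) \
     <(curl -s http://${PROD_ES_NODE}/origin/_mapping | jq .)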
- The old servers are stopped and the VMs removed from Proxmox
- Nodes removed from the Puppet inventory
- Inventory site is updated
- Journal client for indexed metadata deployed; the backfill completed in a couple of hours: https://grafana.softwareheritage.org/goto/ndjfw66Gz
- The metadata search via ES is activated on https://webapp1.internal.softwareheritage.org/
Jun 10 2021
Some status updates on the automation:
- Cassandra nodes are OK (OS installation, ZFS configuration according to the defined environment, startup, cluster configuration), except for a problem during the first initialization with new disks
- swh-storage node is OK (OS installation, gunicorn/swh-storage installation and startup)
- Cassandra database initialization:
root@parasilo-3:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.97.3  78.85 KiB   256     31.6%             49d46dd8-4640-45eb-9d4c-b6b16fc954ab  rack1
UN  172.16.97.5  105.45 KiB  256     26.0%             47e99bb4-4846-4e03-a06c-53ea2862172d  rack1
UN  172.16.97.4  98.35 KiB   256     18.1%             e2aeff29-c89a-4c7a-9352-77aaf78e91b3  rack1
UN  172.16.97.2  78.85 KiB   256     24.3%             edd1b72b-4c35-44bd-b7e5-316f41a156c4  rack1
root@parasilo-3:~# cqlsh 172.16.97.3
Connected to swh-storage at 172.16.97.3:9042
[cqlsh 6.0.0 | Cassandra 4.0 | CQL spec 3.4.5 | Native protocol v5]
cqlsh> desc KEYSPACES
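For completeness, a minimal sketch of how a keyspace could be created by hand at this point (the keyspace name swh and the replication factor are assumptions; in practice swh-storage initializes its own schema):
root@parasilo-3:~# cqlsh 172.16.97.3 -e "CREATE KEYSPACE IF NOT EXISTS swh WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};"
root@parasilo-3:~# cqlsh 172.16.97.3 -e "DESC KEYSPACES"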
LGTM
LGTM, still not a big fan of the usage of random in the tests ;), but otherwise it matches what you explained to me this morning
- Configuration of the swh-search and journal client services deployed
- Old node decommissioning on the cluster:
export ES_NODE=192.168.100.86:9200
curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "192.168.100.81,192.168.100.82,192.168.100.83"
  }
}'
{
  "acknowledged" : true,
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_ip" : "192.168.100.81,192.168.100.82,192.168.100.83"
          }
        }
      }
    }
  }
}
The shards are starting to be gently moved off the old servers:
curl -s http://search-esnode4:9200/_cat/allocation\?s\=host\&v
shards disk.indices disk.used disk.avail disk.total disk.percent host            ip              node
    27       38.7gb    38.8gb    153.7gb    192.6gb           20 192.168.100.81  192.168.100.81  search-esnode1
    27       37.7gb    37.8gb    154.8gb    192.6gb           19 192.168.100.82  192.168.100.82  search-esnode2
    22       30.5gb    30.6gb      162gb    192.6gb           15 192.168.100.83  192.168.100.83  search-esnode3
    35         50gb    50.1gb      6.6tb      6.7tb            0 192.168.100.86  192.168.100.86  search-esnode4
    35         50gb    50.2gb      6.6tb      6.7tb            0 192.168.100.87  192.168.100.87  search-esnode5
    34       49.4gb    49.5gb      6.6tb      6.7tb            0 192.168.100.88  192.168.100.88  search-esnode6
When there are no shards left on the old servers, we will be able to stop them and remove them from the Proxmox server.
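Once the old nodes hold no shards anymore, the transient exclusion can be reset; a small sketch, reusing the ES_NODE variable defined above (setting the value to null removes the transient setting):
curl -s http://${ES_NODE}/_cat/allocation\?s\=host\&v   # confirm the old nodes are empty
curl -H "Content-Type: application/json" -XPUT http://${ES_NODE}/_cluster/settings\?pretty \
  -d '{ "transient" : { "cluster.routing.allocation.exclude._ip" : null } }'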
Jun 9 2021
And all the new nodes are now in the production cluster:
curl -s http://search-esnode4:9200/_cat/allocation\?s\=host\&v
shards disk.indices disk.used disk.avail disk.total disk.percent host            ip              node
    30       42.7gb    42.9gb    149.7gb    192.6gb           22 192.168.100.81  192.168.100.81  search-esnode1
    30       41.4gb    41.6gb      151gb    192.6gb           21 192.168.100.82  192.168.100.82  search-esnode2
    30       41.7gb    41.8gb    150.8gb    192.6gb           21 192.168.100.83  192.168.100.83  search-esnode3
    30       41.9gb      42gb      6.6tb      6.7tb            0 192.168.100.86  192.168.100.86  search-esnode4
    30       41.8gb    41.9gb      6.6tb      6.7tb            0 192.168.100.87  192.168.100.87  search-esnode5
    30       41.2gb    41.3gb      6.6tb      6.7tb            0 192.168.100.88  192.168.100.88  search-esnode6
The next step will be to switch the swh-search configurations to use the new nodes and progressively remove the old nodes from the cluster.
- zfs installation:
root@search-esnode4:~# apt update && apt install linux-image-amd64 linux-headers-amd64
root@search-esnode4:~# shutdown -r now  # to apply the kernel
root@search-esnode4:~# apt install libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfsutils-linux zfs-zed
- Refresh with the latest packages installed from backports:
root@search-esnode4:~# apt dist-upgrade  # triggers a udev upgrade which leads to a network interface renaming
root@search-esnode4:~# sed -i 's/ens1/enp2s0/g' /etc/network/interfaces
- Pre-zfs configuration actions:
root@search-esnode4:~# puppet agent --disable
root@search-esnode4:~# systemctl disable elasticsearch
root@search-esnode4:~# systemctl stop elasticsearch
root@search-esnode4:~# rm -rf /srv/elasticsearch/nodes
- To manage the disks via zfs, the RAID card needed to be configured in enhanced HBA mode in the iDRAC
- After a reboot, the disks are correctly detected by the system:
root@search-esnode4:~# ls -al /dev/sd*
brw-rw---- 1 root disk 8,  0 Jun 9 04:54 /dev/sda
brw-rw---- 1 root disk 8, 16 Jun 9 04:54 /dev/sdb
brw-rw---- 1 root disk 8, 32 Jun 9 04:54 /dev/sdc
brw-rw---- 1 root disk 8, 48 Jun 9 04:54 /dev/sdd
brw-rw---- 1 root disk 8, 64 Jun 9 04:54 /dev/sde
brw-rw---- 1 root disk 8, 80 Jun 9 04:54 /dev/sdf
brw-rw---- 1 root disk 8, 96 Jun 9 04:54 /dev/sdg
brw-rw---- 1 root disk 8, 97 Jun 9 04:54 /dev/sdg1
brw-rw---- 1 root disk 8, 98 Jun 9 04:54 /dev/sdg2
brw-rw---- 1 root disk 8, 99 Jun 9 04:54 /dev/sdg3
root@search-esnode4:~# smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-16-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
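With the disks visible, the pool can be created on top of them; a minimal sketch, assuming a simple striped pool named elasticsearch-data mounted under /srv/elasticsearch (pool name, disk layout and properties are illustrative, not the exact production settings):
root@search-esnode4:~# zpool create -o ashift=12 \
    -O mountpoint=/srv/elasticsearch -O compression=lz4 -O atime=off \
    elasticsearch-data sda sdb sdc sdd sde sdf
root@search-esnode4:~# zpool status elasticsearch-data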
Jun 8 2021
Jun 4 2021
Jun 3 2021
I played with Grid5000 to experiment with how the jobs work and how to initialize the reserved nodes.
lgtm
Jun 2 2021
- The fix was deployed on webapp1 and moma
- The refresh script was manually launched:
root@webapp1:~# /usr/local/bin/refresh-savecodenow-statuses
Successfully updated 140 save request(s).
The previous requests were correctly refreshed and are now displaying the right status.
Will be deployed with version v0.0.310 of the webapp (build in progress)
fix typo in commit message
Jun 1 2021
May 28 2021
The OPNsense firewall configuration was finalized, based on the initial configuration olasd had previously done on the firewalls.
May 27 2021
The save code now queue statistics are now displayed on the status.io page[1] as a first example. The data is refreshed every 5 minutes.
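For reference, the 5-minute refresh can be driven by a simple cron entry invoking the update script; a sketch, assuming the script is deployed as /usr/local/bin/update_metrics.py (file path, cron location and user are assumptions):
# /etc/cron.d/statusio-metrics (hypothetical): push the queue statistics every 5 minutes
*/5 * * * * root /usr/local/bin/update_metrics.py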
May 26 2021
update python script:
- remove some prints
- add missing types
- use dict access instead of get
May 25 2021
The servers should be installed in the rack on May 26th. The network configuration will follow the same day or the next day.
They will be installed as-is by the "DSI", so we will have to install the system via the iDRAC once they are reachable.
With a master declared in the DNS, everything seems to work well.
When the docker command is launched on a node, its status is correctly detected and the node is correctly configured after a couple of minutes.
The cluster explorer is also working now.
Metrics can easily be pushed to the status page.
The simple PoC for the save code now request is available here: https://forge.softwareheritage.org/source/snippets/browse/master/sysadmin/status.io/update_metrics.py
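For the record, a rough sketch of the kind of call such a script ends up making against the status.io metric API; the endpoint, headers and payload fields below are assumptions based on the public status.io documentation, the payload is abbreviated, and the linked script remains the authoritative version:
# Hypothetical credentials and IDs, normally read from a configuration file
export STATUSIO_API_ID=<api-id>
export STATUSIO_API_KEY=<api-key>
curl -X POST https://api.status.io/v2/status/metric/update \
  -H "x-api-id: ${STATUSIO_API_ID}" \
  -H "x-api-key: ${STATUSIO_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{ "statuspage_id": "<page-id>", "metric_id": "<metric-id>", "day_avg": 42, "day_start": 1621900800, "day_dates": [], "day_values": [] }'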