
Installation of the new provenance server
Started, Work in Progress, Normal, Public

Description

  • [in progress] Provision the server in the inventory (rack position, iDRAC IPs, IPs, ...)
  • [Done] Notify the DSI of the delivery and give them the information needed for the installation
  • define the role in puppet
  • Install the server

List of packages amended along the way:

  • python3-virtualenvwrapper
  • libpq-dev
  • arcanist

Some more configuration:

  • install scripts ./create-db.sh and ./drop-db.sh to ease db maintenance (dropping/creating dbs) for @aeviso
ardumont@met:~% cat drop-db.sh
#!/usr/bin/env bash

DBNAME=$1

sudo -i -u postgres dropdb -p 5433 "$DBNAME"
ardumont@met:~% cat create-db.sh
#!/usr/bin/env bash

DBNAME=$1

sudo -i -u postgres createdb -p 5433 --lc-ctype=C.UTF-8 -T template0 -O swh-provenance "$DBNAME"
  • Provision 10 dbs through puppet (so pgbouncer is configured as well) (D6378)
  • at some point, dedicate some vms to andres so he can experiment from those, passing through the internal network (without the vpn)
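The "provision 10 dbs" step could also be scripted around the create-db.sh helper above; a minimal sketch (the database naming scheme is illustrative, not from this task — the real databases were provisioned through puppet in D6378):

```shell
# Hypothetical loop around ./create-db.sh to provision N databases.
# DRY_RUN=1 only prints the commands instead of running them.
provision_dbs() {
    local count=$1
    local i
    for i in $(seq 1 "$count"); do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "./create-db.sh provenance$i"
        else
            ./create-db.sh "provenance$i"
        fi
    done
}
```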

Event Timeline

vsellier changed the task status from Open to Work in Progress.Aug 18 2021, 9:45 AM
vsellier triaged this task as Normal priority.
vsellier created this task.
vsellier updated the task description. (Show Details)

@jayeshv @aeviso @douardda @olasd do you have an idea of what should be installed on this server, and who will operate what runs on it?

It's not completely clear to me whether this server will be a sandbox/staging or a production server.

@vsellier I am not sure about this.
The idea is to use this machine as the production server. (I guess this will host either postgres or mongodb after we decide on a preferred backend. But that is going to take some time)
@olasd or @aeviso will know better.

Yes, the idea is to have a beefy enough machine to perform full-size experiments on, which can then become (part of) the production infrastructure dedicated to the provenance index.

As discussed with @aeviso, we will install the following components on the server (the OS will be Debian 11):

  • rabbitmq
  • postgresql:13
    • a default swh-storage database will be managed by puppet
    • 1000 parallel connections allowed
    • shared_buffers: 50GB
  • docker
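The postgres settings above map onto configuration parameters roughly like this (a sketch; the actual file is managed by puppet):

```ini
# Hypothetical postgresql.conf fragment matching the plan above
max_connections = 1000
shared_buffers = 50GB
```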

*- WIP -*
Additional standard packages:

  • zfs; the datasets will be configured by the sysadmins
  • default statsd/prometheus exporters plugged into the main prometheus
  • postgresql:13
    • 1000 parallel connections allowed

wouldn't it be better to use pgbouncer (or similar)?

Yes, pgbouncer will be used; it's configured by default to 2000 parallel connections.
I don't know what kind of load the provenance client will generate, but the default 100 connections allowed by postgres will probably be too low and will need to be increased too.
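For reference, the two numbers discussed here correspond roughly to the following settings (a sketch, not the actual puppet-managed configuration):

```ini
; pgbouncer.ini -- client side: up to 2000 parallel client connections
[pgbouncer]
max_client_conn = 2000

; postgresql.conf -- server side: raise the default limit of 100
; max_connections = 1000
```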

The server is installed. A few tasks remain to be performed manually:

  • configure the zfs datasets (I will configure 2 mirrored pools for ~12 TB available; tell me if that's not what's expected)
  • build a few missing packages for bullseye (related to monitoring: prometheus-rabbitmq-exporter, prometheus-statsd-exporter, journalbeat)
  • configure a rabbitmq admin user
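The rabbitmq admin user step could look like the following sketch (the user name and password handling are assumptions, not from this task; with RUN=echo, the default here, the commands are only printed):

```shell
# Hypothetical sketch for creating a rabbitmq admin user.
# RUN=echo (the default) makes this a dry run; set RUN= to execute.
create_rabbitmq_admin() {
    local RUN=${RUN:-echo}
    local user=$1 password=$2
    $RUN rabbitmqctl add_user "$user" "$password"
    $RUN rabbitmqctl set_user_tags "$user" administrator
    $RUN rabbitmqctl set_permissions -p / "$user" ".*" ".*" ".*"
}
```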

@aeviso you should be able to connect to the server.

The hostname is met.internal.softwareheritage.org

I forgot to mention there is a gift from Dell in the server: an additional 600 GB 10k rpm disk

The zfs pool and dataset are configured:

  • pool configuration
## nvme drives pool
#zpool create data mirror nvme-eui.36315030525005540025384500000003 nvme-eui.36315030525005800025384500000003 mirror nvme-eui.36315030525005620025384500000003 nvme-eui.36315030525005890025384500000003

## bonus pool
# zpool create data-hdd wwn-0x5000c500dea6c533
  • postgresql dataset

move the current postgresql content away, then copy it back into the new directory afterwards

# zfs create -o mountpoint=/srv/softwareheritage/postgres/13/main -o atime=off -o relatime=on data/postgresql
  • status
root@met:~# zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data      11.6T  52.0M  11.6T        -         -     0%     0%  1.00x    ONLINE  -
data-hdd   556G   114K   556G        -         -     0%     0%  1.00x    ONLINE  -
root@met:~# zpool list -v
NAME                                            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
data                                           11.6T  52.4M  11.6T        -         -     0%     0%  1.00x    ONLINE  -
  mirror                                       5.81T  26.5M  5.81T        -         -     0%  0.00%      -  ONLINE  
    nvme-eui.36315030525005540025384500000003      -      -      -        -         -      -      -      -  ONLINE  
    nvme-eui.36315030525005800025384500000003      -      -      -        -         -      -      -      -  ONLINE  
  mirror                                       5.81T  25.9M  5.81T        -         -     0%  0.00%      -  ONLINE  
    nvme-eui.36315030525005620025384500000003      -      -      -        -         -      -      -      -  ONLINE  
    nvme-eui.36315030525005890025384500000003      -      -      -        -         -      -      -      -  ONLINE  
data-hdd                                        556G   114K   556G        -         -     0%     0%  1.00x    ONLINE  -
  wwn-0x5000c500dea6c533                        556G   114K   556G        -         -     0%  0.00%      -  ONLINE
root@met:~# zfs list
NAME              USED  AVAIL     REFER  MOUNTPOINT
data             51.8M  11.3T       24K  /data
data-hdd          114K   539G       24K  /data-hdd
data/postgresql  51.6M  11.3T     51.6M  /srv/softwareheritage/postgres/13/main
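The "move the postgresql content away" step above could be sketched as follows (the service name and rsync-based copy are assumptions; with RUN=echo, the default here, the commands are only printed):

```shell
# Hedged sketch: swap the postgres data directory onto the new zfs
# dataset. RUN=echo (the default) makes this a dry run.
migrate_pgdata() {
    local RUN=${RUN:-echo}
    local pgdata=/srv/softwareheritage/postgres/13/main
    $RUN systemctl stop postgresql@13-main
    $RUN mv "$pgdata" "$pgdata.orig"
    $RUN zfs create -o mountpoint="$pgdata" -o atime=off -o relatime=on data/postgresql
    $RUN rsync -a "$pgdata.orig/" "$pgdata/"
    $RUN chown -R postgres:postgres "$pgdata"
    $RUN systemctl start postgresql@13-main
}
```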

rSPSITE6a233452cd48 fixed the prometheus node exporter.

I've cheated to pull journalbeat and prometheus-statsd-exporter into the bullseye repo: they're both statically linked go packages, so I just had reprepro copy the buster binaries:

swhdebianrepo@pergamon:~$ reprepro -vb /srv/softwareheritage/repository copy bullseye-swh buster-swh journalbeat prometheus-statsd-exporter
Adding 'prometheus-statsd-exporter' '0.8.1-1~swh1~bpo10+1' to 'bullseye-swh|main|amd64'.
Adding 'journalbeat' '5.5.0+git20170727.1-1~swh+1~bpo10+1' to 'bullseye-swh|main|amd64'.
Adding 'prometheus-statsd-exporter' '0.8.1-1~swh1~bpo10+1' to 'bullseye-swh|main|source'.
Adding 'journalbeat' '5.5.0+git20170727.1-1~swh+1~bpo10+1' to 'bullseye-swh|main|source'.
Exporting indices...

Once we upgrade journalbeat to the upstream version, this can go away. Same for the statsd exporter. But it's GoodEnough™ for now.

*old comment not submitted*

17:19:13     +olasd ╡ the postgresql tuning hasn't happened yet, afaict? effective_cache_size isn't set, and shared_buffers is tiny
17:19:46          ⤷ ╡ I'd bump shared_buffers to 128 GB and effective_cache_size to 256 GB, see where that gets you
17:20:19          ⤷ ╡ and probably maintenance_work_mem to something like 16 or 32 GB
17:20:54          ⤷ ╡ as well as random_page_cost to something lower like 1.5

The log is flooded with

2021-10-14 15:24:54.422 UTC [3951720] LOG:  checkpoints are occurring too frequently (28 seconds apart)
2021-10-14 15:24:54.422 UTC [3951720] HINT:  Consider increasing the configuration parameter "max_wal_size".

max_wal_size should be bumped to something more sensible like 32GB (a pg reload is enough; no restart needed)
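The tuning changes could be applied via ALTER SYSTEM so they land in postgresql.auto.conf; a sketch (with RUN=echo, the default here, the psql invocations are only printed):

```shell
# Hedged sketch of the ALTER SYSTEM tuning discussed above.
# RUN=echo (the default) makes this a dry run.
RUN=${RUN:-echo}
tune_pg() {
    # writes the setting to postgresql.auto.conf; most of these take
    # effect on reload, shared_buffers only after a full restart
    $RUN sudo -i -u postgres psql -c "ALTER SYSTEM SET $1 = '$2'"
}
tune_pg max_wal_size 32GB
tune_pg effective_cache_size 256GB
tune_pg maintenance_work_mem 32GB
tune_pg random_page_cost 1.5
$RUN sudo -i -u postgres psql -c "SELECT pg_reload_config()"
```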

I've run ALTER SYSTEM commands to set these configuration variables in $DATADIR/postgresql.auto.conf, then ran pg_reload_config():

2021-10-14 15:31:53.579 UTC [3951717] LOG:  received SIGHUP, reloading configuration files
2021-10-14 15:31:53.580 UTC [3951717] LOG:  parameter "max_wal_size" changed to "32GB"
2021-10-14 15:31:53.580 UTC [3951717] LOG:  parameter "effective_cache_size" changed to "256GB"
2021-10-14 15:31:53.580 UTC [3951717] LOG:  parameter "maintenance_work_mem" changed to "32GB"
2021-10-14 15:31:53.580 UTC [3951717] LOG:  parameter "shared_buffers" cannot be changed without restarting the server
2021-10-14 15:31:53.580 UTC [3951717] LOG:  parameter "random_page_cost" changed to "1.5"
2021-10-14 15:31:53.580 UTC [3951717] LOG:  configuration file "/srv/softwareheritage/postgres/13/main/postgresql.auto.conf" contains errors; unaffected changes were applied