
Add proxmox / ceph monitoring
Closed, Migrated

Description

The ceph and proxmox clusters are not really monitored at the moment. (In the current
configuration, the proxmox cluster uses ceph storage to store the VM disks.)

We should add some probes to raise alerts in icinga in case of misbehavior in either
stack:

  • ceph [1]
  • proxmox [2]

And, if possible, track some metrics in prometheus/grafana.

[1] https://docs.ceph.com/en/nautilus/mgr/prometheus/

[2] https://blog.zwindler.fr/2020/01/06/proxmox-ve-prometheus/

Event Timeline

vsellier triaged this task as High priority. Aug 4 2021, 6:42 PM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Aug 10 2021, 5:16 PM
vsellier moved this task from Backlog to in-progress on the System administration board.
vsellier added a subscriber: ardumont.

Following the documentation, activating the prometheus module of the ceph manager:

root@branly:~# ceph mgr module enable prometheus
root@branly:~# ss -tan | grep 9283
LISTEN    0      5                             *:9283                         *:*

This activates the prometheus exporter on all hypervisors that are members of the ceph cluster:

root@pergamon:~# clush -w @hypervisors "ss -tanp | grep 9283"
clush: pompidou: exited with exit code 1
clush: uffizi: exited with exit code 1
branly:      LISTEN  0  5  *:9283  *:*  users:(("ceph-mgr",pid=3564074,fd=30))
beaubourg:   LISTEN  0  5  *:9283  *:*  users:(("ceph-mgr",pid=2291772,fd=25))
hypervisor3: LISTEN  0  5  *:9283  *:*  users:(("ceph-mgr",pid=1955847,fd=25))

But only one exporter is actually responding with data [1]; it seems to be the active
ceph manager, which is currently branly [2].

That means we need to scrape all the managers, since there is no guarantee the active
one will stay the same over time (see the configuration sketch after the transcripts below).

[1]

root@pergamon:~# curl http://beaubourg.internal.softwareheritage.org:9283/metrics
root@pergamon:~# curl http://hypervisor3.internal.softwareheritage.org:9283/metrics
root@pergamon:~# curl -s http://branly.internal.softwareheritage.org:9283/metrics | head -10

# HELP ceph_mds_mem_dir_minus Directories closed
# TYPE ceph_mds_mem_dir_minus counter
ceph_mds_mem_dir_minus{ceph_daemon="mds.beaubourg"} 0.0
# HELP ceph_paxos_commit_latency_count Commit latency Count
# TYPE ceph_paxos_commit_latency_count counter
ceph_paxos_commit_latency_count{ceph_daemon="mon.branly"} 0.0
ceph_paxos_commit_latency_count{ceph_daemon="mon.hypervisor3"} 1099.0
ceph_paxos_commit_latency_count{ceph_daemon="mon.beaubourg"} 187080.0
# HELP ceph_mds_mem_dir_plus Directories opened

[2]

root@branly:~# ceph --status | grep mgr
    mgr: branly(active, since 14m), standbys: beaubourg, hypervisor3

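To cope with manager failover, the prometheus configuration should therefore scrape all
three managers; the standbys simply answer with empty responses. A minimal sketch of
the scrape stanza (the job name is illustrative, and the actual configuration is meant
to be handled through puppet):

scrape_configs:
  - job_name: ceph
    static_configs:
      - targets:
          - branly.internal.softwareheritage.org:9283
          - beaubourg.internal.softwareheritage.org:9283
          - hypervisor3.internal.softwareheritage.org:9283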

D6075 for your reviewing pleasure ;)

The dashboard was improved to:

  • add the usual maintenance and deployment tags
  • filter some panels to show only the relevant hypervisors

The following was tested on uffizi:

Create a dedicated proxmox user:

pveum groupadd monitoring -comment 'Monitoring group'
pveum aclmod / -group monitoring -role PVEAuditor
pveum useradd pve_exporter@pve
pveum usermod pve_exporter@pve -group monitoring
pveum passwd pve_exporter@pve

Trying out prometheus-pve-exporter in a virtualenv to ensure it works as expected:

python3 -m venv venv
. ./venv/bin/activate
pip install wheel
pip install prometheus-pve-exporter

Then install the necessary configuration:

mkdir -p /usr/local/share/pve_exporter/
cat > /usr/local/share/pve_exporter/pve_exporter.yml << EOF
default:
    user: pve_exporter@pve
    password: <redacted>
    verify_ssl: false
EOF

All this in order to check, from the local node and from pergamon, what the exporter
(listening on its default port, 9221) returns when queried:

root@uffizi:~# curl http://uffizi.internal.softwareheritage.org:9221/pve?target=127.0.0.1
# HELP pve_up Node/VM/CT-Status is online/running
# TYPE pve_up gauge
pve_up{id="cluster/swh-inria-rocq"} 1.0
pve_up{id="node/hypervisor3"} 1.0
pve_up{id="node/uffizi"} 1.0
pve_up{id="node/branly"} 1.0
pve_up{id="node/pompidou"} 1.0
pve_up{id="node/beaubourg"} 1.0
...
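
On the prometheus side, the exporter is queried through its /pve endpoint with the node
to inspect passed as the target URL parameter, as in the curl above. A minimal sketch
of the matching scrape stanza, assuming one exporter per hypervisor asked about its
local API (the job name and target list are illustrative; the actual configuration is
meant to be deployed through puppet):

scrape_configs:
  - job_name: pve
    metrics_path: /pve
    params:
      target: [127.0.0.1]
    static_configs:
      - targets:
          - uffizi.internal.softwareheritage.org:9221

The other hypervisors would join the target list once the puppet deployment lands.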

So we went on to debian-package prometheus-pve-exporter, as it's not packaged yet [1].
It's now built by our jenkins and pushed to our debian repository.

Next step is to actually deploy it with puppet.

[1] https://forge.softwareheritage.org/source/python3-prometheus-pve-exporter/manage/

Grafana dashboard for proxmox installed [1], but it needs some troubleshooting for now.

[1] https://grafana.softwareheritage.org/goto/lEhp26Gnz?orgId=1


D6082 fixes the deployment and the communication.
Some improvements were also made to the dashboard.
And now we have some data! [1]

[1] https://grafana.softwareheritage.org/goto/iK3JiRn7z?orgId=1

We should add some probes to raise alerts in icinga in case of misbehavior in either
stack

We don't really know which alerts to add right now, so it's kept out of scope for this
task.

We do now have two grafana dashboards, one for ceph and one for proxmox.

ardumont claimed this task.
ardumont moved this task from in-progress to done on the System administration board.