
Add proxmox / ceph monitoring
Closed, Migrated

Description

The ceph and proxmox clusters are not really monitored at the moment. (In the current
configuration, the proxmox cluster uses ceph storage to store the VM disks.)

We should add some probes to raise alerts in icinga in case of misbehavior in either
stack:

  • ceph [1]
  • proxmox [2]

And, if possible, track some metrics in prometheus/grafana.

[1] https://docs.ceph.com/en/nautilus/mgr/prometheus/

[2] https://blog.zwindler.fr/2020/01/06/proxmox-ve-prometheus/

Event Timeline

vsellier triaged this task as High priority. Aug 4 2021, 6:42 PM
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. Aug 10 2021, 5:16 PM
vsellier moved this task from Backlog to in-progress on the System administration board.
vsellier added a subscriber: ardumont.

Following the documentation, activating the prometheus module of the ceph manager:

root@branly:~# ceph mgr module enable prometheus
root@branly:~# ss -tan | grep 9283
LISTEN    0      5                             *:9283                         *:*

This activates the prometheus exporter on all hypervisors that are members of the ceph cluster:

root@pergamon:~# clush -w @hypervisors "ss -tanp | grep 9283"
clush: pompidou: exited with exit code 1
clush: uffizi: exited with exit code 1
branly:      LISTEN  0  5  *:9283  *:*  users:(("ceph-mgr",pid=3564074,fd=30))
beaubourg:   LISTEN  0  5  *:9283  *:*  users:(("ceph-mgr",pid=2291772,fd=25))
hypervisor3: LISTEN  0  5  *:9283  *:*  users:(("ceph-mgr",pid=1955847,fd=25))

But only one exporter is actually responding with data [1]; it seems to be the active
ceph manager, which is currently branly [2].

That means we need to scrape all the managers, since there is no guarantee the active
one will stay the same over time (see the configuration sketch after the transcripts below).

[1]

root@pergamon:~# curl http://beaubourg.internal.softwareheritage.org:9283/metrics
root@pergamon:~# curl http://hypervisor3.internal.softwareheritage.org:9283/metrics
root@pergamon:~# curl -s http://branly.internal.softwareheritage.org:9283/metrics | head -10

# HELP ceph_mds_mem_dir_minus Directories closed
# TYPE ceph_mds_mem_dir_minus counter
ceph_mds_mem_dir_minus{ceph_daemon="mds.beaubourg"} 0.0
# HELP ceph_paxos_commit_latency_count Commit latency Count
# TYPE ceph_paxos_commit_latency_count counter
ceph_paxos_commit_latency_count{ceph_daemon="mon.branly"} 0.0
ceph_paxos_commit_latency_count{ceph_daemon="mon.hypervisor3"} 1099.0
ceph_paxos_commit_latency_count{ceph_daemon="mon.beaubourg"} 187080.0
# HELP ceph_mds_mem_dir_plus Directories opened

[2]

root@branly:~# ceph --status | grep mgr
    mgr: branly(active, since 14m), standbys: beaubourg, hypervisor3

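To cope with manager failover, the prometheus configuration should therefore scrape all
three managers; the standbys simply answer with empty responses. A minimal sketch of
the scrape stanza (the job name is illustrative, and the actual configuration is meant
to be handled through puppet):

scrape_configs:
  - job_name: ceph
    static_configs:
      - targets:
          - branly.internal.softwareheritage.org:9283
          - beaubourg.internal.softwareheritage.org:9283
          - hypervisor3.internal.softwareheritage.org:9283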

D6075 for your reviewing pleasure ;)

The dashboard was improved to:

  • add the usual maintenance and deployment tags
  • filter some panels to show only the relevant hypervisors

The following was tested on uffizi:

Create a dedicated proxmox user:

pveum groupadd monitoring -comment 'Monitoring group'
pveum aclmod / -group monitoring -role PVEAuditor
pveum useradd pve_exporter@pve
pveum usermod pve_exporter@pve -group monitoring
pveum passwd pve_exporter@pve

Trying out prometheus-pve-exporter in a virtualenv to ensure it works as expected:

python3 -m venv venv
. ./venv/bin/activate
pip install wheel
pip install prometheus-pve-exporter

Then install the necessary configuration:

mkdir -p /usr/local/share/pve_exporter/
cat > /usr/local/share/pve_exporter/pve_exporter.yml << EOF
default:
    user: pve_exporter@pve
    password: <redacted>
    verify_ssl: false
EOF

All this in order to check, from the local node and from pergamon, what the exporter
(listening on its default port, 9221) returns when queried:

root@uffizi:~# curl http://uffizi.internal.softwareheritage.org:9221/pve?target=127.0.0.1
# HELP pve_up Node/VM/CT-Status is online/running
# TYPE pve_up gauge
pve_up{id="cluster/swh-inria-rocq"} 1.0
pve_up{id="node/hypervisor3"} 1.0
pve_up{id="node/uffizi"} 1.0
pve_up{id="node/branly"} 1.0
pve_up{id="node/pompidou"} 1.0
pve_up{id="node/beaubourg"} 1.0
...
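
On the prometheus side, the exporter is queried through its /pve endpoint with the node
to inspect passed as the target URL parameter, as in the curl above. A minimal sketch
of the matching scrape stanza, assuming one exporter per hypervisor asked about its
local API (the job name and target list are illustrative; the actual configuration is
meant to be deployed through puppet):

scrape_configs:
  - job_name: pve
    metrics_path: /pve
    params:
      target: [127.0.0.1]
    static_configs:
      - targets:
          - uffizi.internal.softwareheritage.org:9221

The other hypervisors would join the target list once the puppet deployment lands.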

So we went on to debian-package prometheus-pve-exporter, as it's not packaged yet [1].
It's now built by our jenkins and pushed to our debian repository.

Next step is to actually deploy it with puppet.

[1] https://forge.softwareheritage.org/source/python3-prometheus-pve-exporter/manage/

Grafana dashboard for proxmox installed [1], but it needs some troubleshooting for now.

[1] https://grafana.softwareheritage.org/goto/lEhp26Gnz?orgId=1


D6082 fixes the deployment and the communication.
Some improvements were also made to the dashboard.
And now we have some data! [1]

[1] https://grafana.softwareheritage.org/goto/iK3JiRn7z?orgId=1

We should add some probes to raise alerts in icinga in case of misbehavior in either
stack

We don't really know which alerts to add right now, so it's kept out of scope for this
task.

We do now have two grafana dashboards, one for ceph and one for proxmox.

ardumont claimed this task.
ardumont moved this task from in-progress to done on the System administration board.