ftigeot (François Tigeot)
User

Projects

User Details

User Since
Sep 6 2017, 1:06 PM (84 w, 4 d)

Recent Activity

Fri, Apr 19

ftigeot committed rDSNIPe51ca4c1ed77: Grafanalib dashboards: Add disk statistics (authored by ftigeot).
Grafanalib dashboards: Add disk statistics
Fri, Apr 19, 3:33 PM

Thu, Apr 18

ftigeot added a comment to T1442: Replace Munin graphs with Grafana/Prometheus dashboards.

If we remove Munin before implementing the missing graph replacements, we will lack a comparison baseline and may fail to notice bogus data.
For example, the disk throughput and IOPS values currently reported by Prometheus are suspiciously low.

Thu, Apr 18, 2:44 PM · Sprint 2018 12, System administration

Tue, Apr 16

ftigeot added a comment to T1442: Replace Munin graphs with Grafana/Prometheus dashboards.

Even though most/all of the Munin metrics are provided by Prometheus, Munin also provides graphs.
It is these graphs we are still missing.

Tue, Apr 16, 5:09 PM · Sprint 2018 12, System administration
ftigeot triaged T1653: Prometheus rate functions considered unreliable as Normal priority.
Tue, Apr 16, 4:55 PM · Sprint 2018 12, System administration
ftigeot added a comment to T1442: Replace Munin graphs with Grafana/Prometheus dashboards.

Wasn't that what T1428 was about?
Apart from the list of pending packages, all commonly used Munin metrics should already have Prometheus equivalents.

Tue, Apr 16, 2:05 PM · Sprint 2018 12, System administration
ftigeot added a comment to T1442: Replace Munin graphs with Grafana/Prometheus dashboards.

When I asked where to put such work-in-progress, you suggested the snippets repository.

Tue, Apr 16, 11:00 AM · Sprint 2018 12, System administration

Mon, Apr 15

ftigeot added a comment to T1442: Replace Munin graphs with Grafana/Prometheus dashboards.

Work-in-progress Grafanalib dashboards have been added to the https://forge.softwareheritage.org/source/snippets/ repository.

Mon, Apr 15, 4:53 PM · Sprint 2018 12, System administration
ftigeot closed T1392: Add a new hypervisor, a subtask of T1173: Huge slowdowns on louvre since 2018-08-20, as Resolved.
Mon, Apr 15, 4:51 PM · System administration
ftigeot closed T1392: Add a new hypervisor as Resolved.

The new hypervisor has been working without any particular issue since its installation.

Mon, Apr 15, 4:51 PM · System administration
ftigeot committed rDSNIPcde445d33069: Grafanalib dashboards: Add swap and network traffic graphs (authored by ftigeot).
Grafanalib dashboards: Add swap and network traffic graphs
Mon, Apr 15, 11:18 AM

Wed, Apr 10

ftigeot committed rDSNIPce7403f6482c: Snippets: add Grafanalib dashboards (authored by ftigeot).
Snippets: add Grafanalib dashboards
Wed, Apr 10, 4:26 PM

Tue, Mar 26

ftigeot added a comment to T1372: Compare Rsnapshot / BorgBackup / Backuppc.

BorgBackup added to the comparison.

Tue, Mar 26, 2:21 PM · System administration
ftigeot closed T1520: Numerous dm device failures on louvre as Resolved.

Resolved on 2019-02-07.

Tue, Mar 26, 11:45 AM · System administration
ftigeot closed T1486: I/O error on worker06.internal as Resolved.

Resolved on 2019-02-07.

Tue, Mar 26, 11:44 AM · System administration
ftigeot closed T1486: I/O error on worker06.internal, a subtask of T1520: Numerous dm device failures on louvre, as Resolved.
Tue, Mar 26, 11:44 AM · System administration

Mar 20 2019

ftigeot closed T1428: Create an inventory of useful Munin metrics, a subtask of T1408: More/better Metrics, as Resolved.
Mar 20 2019, 11:43 AM · Metrics/monitoring, Sprint 2018 12
ftigeot closed T1428: Create an inventory of useful Munin metrics as Resolved.

Already marked as done on 2018-12-19.

Mar 20 2019, 11:43 AM · Metrics/monitoring, Sprint 2018 12
ftigeot closed T1428: Create an inventory of useful Munin metrics, a subtask of T1356: Kill munin, as Resolved.
Mar 20 2019, 11:43 AM · Sprint 2018 12, System administration

Mar 15 2019

ftigeot accepted D1252: Remove info-level logs about already acknowledged messages.

Looks good to me.

Mar 15 2019, 3:57 PM

Mar 5 2019

ftigeot added a comment to T1372: Compare Rsnapshot / BorgBackup / Backuppc.

Attachment: comparison between Backuppc and Rsnapshot

Mar 5 2019, 4:59 PM · System administration

Mar 4 2019

ftigeot committed rDDOCb4f7c0167ff7: docs: Add an elasticsearch diagram with index types and data sources (authored by ftigeot).
docs: Add an elasticsearch diagram with index types and data sources
Mar 4 2019, 2:45 PM

Feb 27 2019

ftigeot committed rDDOCaf94211f8634: docs: Add a general infrastructure diagram with services and databases relations (authored by ftigeot).
docs: Add a general infrastructure diagram with services and databases relations
Feb 27 2019, 3:29 PM

Feb 26 2019

ftigeot added a comment to T1372: Compare Rsnapshot / BorgBackup / Backuppc.

Comparison between Backuppc and Rsnapshot done, now adding Restic - https://restic.net/ - to the mix.
Borgbackup not tested yet.

Feb 26 2019, 10:55 AM · System administration

Feb 15 2019

ftigeot committed rSPSITE21e8c7f6ce74: facter: Ignore bpf and cgroup2 mount points (authored by ftigeot).
facter: Ignore bpf and cgroup2 mount points
Feb 15 2019, 5:00 PM
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Network limitation removed via a hotfix (manual route deletion).
Some network downtime will be required in the future to ensure the new /etc network configuration works as expected.

Feb 15 2019, 11:04 AM · System administration

Feb 13 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Actual content of the vmbr0 interface configuration in beaubourg:/etc/network/interfaces:

auto vmbr0
iface vmbr0 inet static
        bridge_ports vlan440
        address 192.168.100.32
        netmask 255.255.255.0
        up ip route add 192.168.101.0/24 via 192.168.100.1
        up ip route add 192.168.200.0/21 via 192.168.100.1
        up ip rule add from 192.168.100.32 table private
        up ip route add default via 192.168.100.1 dev vmbr0 table private
        up ip route flush cache
        down ip route del default via 192.168.100.1 dev vmbr0 table private
        down ip rule del from 192.168.100.32 table private
        down ip route del 192.168.200.0/21 via 192.168.100.1
        down ip route del 192.168.101.0/24 via 192.168.100.1
        down ip route flush cache
Feb 13 2019, 4:49 PM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Outgoing network traffic from beaubourg to the local private network 192.168.100.0/24 transits via louvre.
Louvre re-emits network packets and sends them to the destination host.

Feb 13 2019, 1:28 PM · System administration

Feb 12 2019

ftigeot added a comment to T1526: Install a new VPN endpoint at Rocquencourt.

Louvre had previously gone down more than once. Some of these events are documented in T1173.

Feb 12 2019, 1:20 PM · System administration
ftigeot triaged T1526: Install a new VPN endpoint at Rocquencourt as Normal priority.
Feb 12 2019, 1:12 PM · System administration

Feb 7 2019

ftigeot added a subtask for T1520: Numerous dm device failures on louvre: T1486: I/O error on worker06.internal.
Feb 7 2019, 4:55 PM · System administration
ftigeot added a parent task for T1486: I/O error on worker06.internal: T1520: Numerous dm device failures on louvre.
Feb 7 2019, 4:55 PM · System administration
ftigeot removed a parent task for T1520: Numerous dm device failures on louvre: T1486: I/O error on worker06.internal.
Feb 7 2019, 4:55 PM · System administration
ftigeot removed a subtask for T1486: I/O error on worker06.internal: T1520: Numerous dm device failures on louvre.
Feb 7 2019, 4:55 PM · System administration
ftigeot added a comment to T1520: Numerous dm device failures on louvre.

After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:

[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
Feb 7 2019, 4:04 PM · System administration
ftigeot added a comment to T1520: Numerous dm device failures on louvre.

The kind of error that was suddenly reported in large numbers when louvre stopped operating properly:

Buffer I/O error on device dm-41, logical block 10474329
Feb 7 2019, 3:51 PM · System administration
ftigeot updated the task description for T1520: Numerous dm device failures on louvre.
Feb 7 2019, 3:48 PM · System administration
ftigeot added a comment to T1520: Numerous dm device failures on louvre.

Related-to: T1486, T1518

Feb 7 2019, 3:46 PM · System administration
ftigeot changed the status of T1520: Numerous dm device failures on louvre from Open to Work in Progress.
Feb 7 2019, 3:45 PM · System administration
ftigeot added a comment to T1486: I/O error on worker06.internal.

A brand new virtual disk was created, skipping bad data blocks:

Feb 7 2019, 3:35 PM · System administration

Feb 6 2019

ftigeot closed T1518: I/O error on louvre:/dev/md3 as Resolved.

RAID volume was successfully rebuilt, closing even though the root cause of the initial error was not found.

Feb 6 2019, 10:43 AM · System administration
ftigeot closed T1518: I/O error on louvre:/dev/md3, a subtask of T1486: I/O error on worker06.internal, as Resolved.
Feb 6 2019, 10:43 AM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

dm-11 is a device present on top of dm-10, itself backed by /dev/sda:

Feb 6 2019, 10:42 AM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

More complete list of I/O errors as reported by dmesg(1):

[Tue Feb  5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Tue Feb  5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[Tue Feb  5 09:38:53 2019] print_req_error: 140 callbacks suppressed
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev sda, sector 864969704
[Tue Feb  5 09:38:53 2019] device-mapper: multipath: Failing path 8:0.
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 864969704
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 865388544
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394832
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742368
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742224
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422416
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422368
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422032
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394864
[Tue Feb  5 09:38:53 2019] Buffer I/O error on dev dm-10, logical block 937684544, async page read
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516384
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516416
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516472
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516488
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516512
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516544
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516584
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516600
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 4518605200
[Tue Feb  5 09:38:54 2019] md/raid10:md3: dm-11: rescheduling sector 2565880800
...
[Tue Feb  5 09:39:51 2019] md: super_written gets error=10
[Tue Feb  5 09:39:51 2019] md/raid10:md3: Disk failure on dm-11, disabling device.
                           md/raid10:md3: Operation continuing on 1 devices.
[Tue Feb  5 09:39:51 2019] md/raid10:md3: dm-13: redirecting sector 1387516384 to another mirror
Feb 6 2019, 10:36 AM · System administration
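A per-device tally can make a log like the one above easier to read. The following is only a sketch: it assumes the kernel log is available as text on standard input, and count_io_errors is a hypothetical helper name, not an existing tool. It counts the "I/O error, dev ..." request errors and ignores the "Buffer I/O error" lines.

```shell
# Count "I/O error, dev <name>" lines per device in kernel log text read
# from stdin. count_io_errors is a hypothetical helper name.
count_io_errors() {
    awk '
        /I\/O error, dev / {
            # The device name follows "dev " and ends at the next comma.
            dev = $0
            sub(/.*I\/O error, dev /, "", dev)
            sub(/,.*/, "", dev)
            count[dev]++
        }
        END {
            for (dev in count) print dev, count[dev]
        }
    ' | sort
}
```

It could be fed with something like `dmesg -T | count_io_errors`, which prints one "device count" pair per line.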

Feb 5 2019

ftigeot renamed T1467: Slow network transfers from beaubourg from Network timeout issues in the Proxmox cluster to Slow network transfers from beaubourg.
Feb 5 2019, 5:18 PM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

None of the previous timeout issues are visible anymore on the Proxmox web interface.
They were possibly related to bad network quality on the web browser side (INRIA guest wifi).

Feb 5 2019, 5:18 PM · System administration
ftigeot changed the status of T1518: I/O error on louvre:/dev/md3 from Open to Work in Progress.

Forcing a rebuild by removing and re-adding the faulty device:

mdadm --manage /dev/md3 -r /dev/dm-11
mdadm --manage /dev/md3 -a /dev/dm-11
Feb 5 2019, 4:18 PM · System administration
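Once the device has been re-added, the rebuild can be followed in /proc/mdstat. The sketch below assumes the usual /proc/mdstat layout, in which a "recovery = N%" or "resync = N%" line follows the array header; md_rebuild_progress is a hypothetical helper name.

```shell
# Print the rebuild progress (for example "12.6%") of one md array, reading
# /proc/mdstat-style text from stdin. Prints nothing when the array is not
# rebuilding. md_rebuild_progress is a hypothetical helper name.
md_rebuild_progress() {
    awk -v array="$1" '
        # Array sections start with "<name> : ..." and end at a blank line.
        $1 == array && $2 == ":" { in_array = 1; next }
        in_array && NF == 0      { in_array = 0 }
        in_array && /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { print $i; exit }
        }
    '
}
```

For the array discussed here, it could be called as `md_rebuild_progress md3 < /proc/mdstat`.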
ftigeot changed the status of T1518: I/O error on louvre:/dev/md3, a subtask of T1486: I/O error on worker06.internal, from Open to Work in Progress.
Feb 5 2019, 4:18 PM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

As in T1486, we have a dm device reporting I/O errors but no visible errors on the underlying physical device.

Feb 5 2019, 4:08 PM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

A full read check of /dev/sda did not return any error:

# dd if=/dev/sda of=/dev/null bs=1M
Feb 5 2019, 3:53 PM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

As shown by lsblk, dm-11 and dm-13 are partitions on multipath devices.
These multipath devices are in turn backed by the physical /dev/sda and /dev/sdb SSDs, respectively:

Feb 5 2019, 3:50 PM · System administration
ftigeot triaged T1518: I/O error on louvre:/dev/md3 as High priority.
Feb 5 2019, 3:36 PM · System administration

Feb 1 2019

ftigeot renamed T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device from Phantom device mapper volume usage in Proxmox to Phantom device mapper volume usage in Proxmox: logical volume is used by another device.
Feb 1 2019, 11:20 AM · System administration
ftigeot closed T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node as Resolved.

Removing the previously used volume allowed VM migration to complete.

Feb 1 2019, 11:06 AM · System administration
ftigeot closed T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node, a subtask of T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device, as Resolved.
Feb 1 2019, 11:06 AM · System administration
ftigeot added a comment to T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node.

Since the previously used drive is not used anymore, I decided to remove it:

# lvchange -a y ssd/vm-102-disk-0
# lvremove /dev/ssd/vm-102-disk-0
Feb 1 2019, 11:01 AM · System administration
ftigeot changed the status of T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node, a subtask of T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device, from Open to Work in Progress.
Feb 1 2019, 10:55 AM · System administration
ftigeot changed the status of T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node from Open to Work in Progress.

The previous drive is neither active nor opened:

Feb 1 2019, 10:55 AM · System administration
ftigeot added a comment to T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node.

The main VM disk is stored on a "vm-102-disk-1" volume (on Ceph).
There is an inactive LVM volume on "beaubourg-ssd" formerly associated with this VM. It was used as the virtual disk backend before the virtual disk device was migrated to Ceph.

Feb 1 2019, 10:53 AM · System administration
ftigeot added a comment to T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node.

No mention of "beaubourg-ssd" is visible in the Proxmox virtual machine management interface.
All virtual disk backends are stored on Ceph.

Feb 1 2019, 10:46 AM · System administration
ftigeot triaged T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node as High priority.
Feb 1 2019, 10:40 AM · System administration

Jan 31 2019

ftigeot added a comment to T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device.

Trying to manually deactivate the logical volume in question fails with the same error message:

lvchange -a n /dev/ssd/vm-107-disk-0
Logical volume ssd/vm-107-disk-0 is used by another device.
Jan 31 2019, 5:31 PM · System administration
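When a logical volume is reported as used by another device, sysfs records which block devices are stacked on top of it. The sketch below lists them; list_holders and its optional second argument are hypothetical, and the function expects the kernel device name (dm-N), not the /dev/ssd/... path.

```shell
# List the devices stacked on top of a block device, as recorded in sysfs.
# Usage: list_holders <kernel-device-name> [sysfs-root]
# list_holders is a hypothetical helper; the sysfs root defaults to /sys
# and is only a parameter so the function can be tried on a test directory.
list_holders() {
    holders_dir="${2:-/sys}/block/$1/holders"
    # Fail when the device does not exist.
    [ -d "$holders_dir" ] || return 1
    ls "$holders_dir"
}
```

Assuming the LVM path is a symlink to the dm device node, it could be called as `list_holders "$(basename "$(readlink -f /dev/ssd/vm-107-disk-0)")"`. An empty result would mean that no other device-mapper or md device sits on top of the volume.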
ftigeot updated the task description for T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device.
Jan 31 2019, 2:55 PM · System administration

Jan 30 2019

ftigeot closed T1502: Too many postgresql logs on dbreplica0.euwest.azure.internal.softwareheritage.org as Resolved.

Only keep 24 hours of logs, and keep rotating on the same file names:

Jan 30 2019, 3:09 PM · System administration
ftigeot added a comment to T1502: Too many postgresql logs on dbreplica0.euwest.azure.internal.softwareheritage.org.

There is no need to log all production queries on this server.
Reducing logged contents to queries taking more than one millisecond to execute:

Jan 30 2019, 2:23 PM · System administration
ftigeot triaged T1503: Rename hypervisor3 to a museum name as Normal priority.
Jan 30 2019, 11:58 AM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

It turns out hypervisor3 is not the culprit we thought it was.
Removing T1392 from parent task list.

Jan 30 2019, 11:54 AM · System administration
ftigeot removed a parent task for T1467: Slow network transfers from beaubourg: T1392: Add a new hypervisor.
Jan 30 2019, 11:53 AM · System administration
ftigeot removed a subtask for T1392: Add a new hypervisor: T1467: Slow network transfers from beaubourg.
Jan 30 2019, 11:53 AM · System administration
ftigeot changed the status of T1502: Too many postgresql logs on dbreplica0.euwest.azure.internal.softwareheritage.org from Open to Work in Progress.
Jan 30 2019, 10:35 AM · System administration

Jan 29 2019

ftigeot changed the status of T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device from Open to Work in Progress.
Jan 29 2019, 2:08 PM · System administration

Jan 25 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

After running some additional tcp iperf tests, it is obvious beaubourg is the outlier.
Measured bandwidth:

  • from any 10G machine to any 10G machine (except beaubourg): > 9 Gb/s
  • from any 10G machine to beaubourg: > 9 Gb/s
  • from beaubourg to ceph-osd1, ceph-osd2 and hypervisor3: 600-800 Mb/s
  • from beaubourg to ceph-mon1: 230 Kb/s
Jan 25 2019, 3:56 PM · System administration

Jan 22 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Since all these machines are connected to the same pair of switches and these switches are managed by INRIA DSI-SESI, I have asked for their assistance in this ticket:
https://support.inria.fr/Ticket/Display.html?id=127011

Jan 22 2019, 3:18 PM · System administration
ftigeot added a comment to T1486: I/O error on worker06.internal.

The /dev/md3 check completed successfully and did not report any error.

Jan 22 2019, 8:41 AM · System administration
ftigeot claimed T1486: I/O error on worker06.internal.
Jan 22 2019, 8:41 AM · System administration

Jan 21 2019

ftigeot added a comment to T1486: I/O error on worker06.internal.

worker06.internal.softwareheritage.org is a VM running on louvre. Its virtual disk is backed by /dev/dm-36 on the host.

Jan 21 2019, 2:42 PM · System administration
ftigeot changed the status of T1486: I/O error on worker06.internal from Open to Work in Progress.
Jan 21 2019, 2:38 PM · System administration

Jan 16 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

For the previous iperf TCP test and without tuning, we also have:

  • an average transfer speed of 9,388 Mb/s between hypervisor3 and one of the 10G Ceph nodes, ceph-osd1.
  • an average transfer speed of 8,364 Mb/s between beaubourg and ceph-osd1.
Jan 16 2019, 11:43 AM · System administration

Jan 15 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Both beaubourg and hypervisor3 network interfaces have a 10Gb/s link layer connection.
Aggregated traffic from multiple iperf streams nevertheless never exceeds approximately 90% of a 1Gb/s transfer speed.

Jan 15 2019, 5:05 PM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Another thing worth noting is that the vmbr0 interface, which carries the primary IP address, has an MTU of only 1500 bytes.
The network interfaces it is built on have a 9000-byte MTU.

Jan 15 2019, 4:28 PM · System administration
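The MTU of a bridge and of the interfaces underneath it can be compared side by side by reading sysfs. This is a minimal sketch: show_mtu and the SYSFS_ROOT variable are hypothetical names introduced for illustration.

```shell
# Print "<interface> <mtu>" for each interface name given as an argument.
# show_mtu is a hypothetical helper; SYSFS_ROOT exists only so the function
# can be pointed at a test directory instead of the real /sys.
show_mtu() {
    for ifname in "$@"; do
        mtu_file="${SYSFS_ROOT:-/sys}/class/net/$ifname/mtu"
        if [ -r "$mtu_file" ]; then
            printf '%s %s\n' "$ifname" "$(cat "$mtu_file")"
        else
            printf '%s: no such interface\n' "$ifname" >&2
        fi
    done
}
```

On beaubourg this could be run as `show_mtu vmbr0 vlan440`, with vlan440 being the bridge port named in the vmbr0 configuration.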
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

iperf tests show:

  • network speed never reaches 1Gbps, even between hosts which have 10Gb/s network interfaces and are connected to the same switches
  • 19% of UDP packets get lost at 1Gb/s (less than 0.5% at 100Mb/s)
Jan 15 2019, 2:59 PM · System administration

Jan 14 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Corosync warnings also routinely appear in the logs:

Jan 14 11:56:13 hypervisor3 corosync[5622]: notice  [TOTEM ] Retransmit List: 282eb9
Jan 14 11:56:13 hypervisor3 corosync[5622]:  [TOTEM ] Retransmit List: 282eb9
Jan 14 11:56:13 hypervisor3 corosync[5622]:  [TOTEM ] Retransmit List: 282eba
Jan 14 2019, 1:26 PM · System administration
ftigeot changed the status of T1467: Slow network transfers from beaubourg, a subtask of T1392: Add a new hypervisor, from Open to Work in Progress.
Jan 14 2019, 1:23 PM · System administration
ftigeot changed the status of T1467: Slow network transfers from beaubourg from Open to Work in Progress.

The network interface hardware on hypervisor3 is relatively new:

i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.1.14-k
Jan 14 2019, 1:23 PM · System administration

Jan 11 2019

ftigeot triaged T1467: Slow network transfers from beaubourg as Normal priority.
Jan 11 2019, 4:24 PM · System administration

Dec 21 2018

ftigeot closed T1325: Add SSDs to banco as Resolved.

Two 4TB SSDs added to banco yesterday, exported to Linux as JBODs.

Dec 21 2018, 4:10 PM · System administration
ftigeot added a comment to T1392: Add a new hypervisor.

Proxmox now installed on the machine, hypervisor3.softwareheritage.org.

Dec 21 2018, 4:08 PM · System administration
ftigeot committed rSPSITE59ee68802a76: manifests/site: add a new hypervisor, hypervisor3 (authored by ftigeot).
manifests/site: add a new hypervisor, hypervisor3
Dec 21 2018, 2:14 PM

Dec 13 2018

ftigeot moved T1442: Replace Munin graphs with Grafana/Prometheus dashboards from Backlog to in progress on the Sprint 2018 12 board.
Dec 13 2018, 4:22 PM · Sprint 2018 12, System administration
ftigeot changed the status of T1442: Replace Munin graphs with Grafana/Prometheus dashboards, a subtask of T1356: Kill munin, from Open to Work in Progress.
Dec 13 2018, 4:21 PM · Sprint 2018 12, System administration
ftigeot changed the status of T1442: Replace Munin graphs with Grafana/Prometheus dashboards from Open to Work in Progress.
Dec 13 2018, 4:21 PM · Sprint 2018 12, System administration
ftigeot triaged T1442: Replace Munin graphs with Grafana/Prometheus dashboards as High priority.
Dec 13 2018, 4:19 PM · Sprint 2018 12, System administration
ftigeot added a parent task for T1428: Create an inventory of useful Munin metrics: T1356: Kill munin.
Dec 13 2018, 4:14 PM · Metrics/monitoring, Sprint 2018 12
ftigeot added a subtask for T1356: Kill munin: T1428: Create an inventory of useful Munin metrics.
Dec 13 2018, 4:14 PM · Sprint 2018 12, System administration

Dec 11 2018

ftigeot changed the status of T1338: Change BBUs on orsay from Open to Work in Progress.

Another Perc H700 battery replacement product: http://www.hardware-attitude.com/fiche-1114-batterie-raid-pour-perc5-i-perc6-i---nu209.html
We should buy this one if possible ASAP IMHO.

Dec 11 2018, 4:55 PM · System administration

Dec 7 2018

ftigeot added a comment to T1372: Compare Rsnapshot / BorgBackup / Backuppc.

"Borgbackup is unable to pull data from remote hosts to a central location."

I do not understand this assertion.

Dec 7 2018, 10:50 AM · System administration

Dec 4 2018

ftigeot changed the status of T1428: Create an inventory of useful Munin metrics from Open to Work in Progress.

Disk

  • I/Os per device
  • Disk usage in percent
  • Utilization per device: is this real? It could be useful to see if a storage subsystem is overloaded.
  • Disk usage in absolute, human-readable values. Percentages are meaningless if we resize filesystems.
Dec 4 2018, 4:11 PM · Metrics/monitoring, Sprint 2018 12
ftigeot changed the status of T1428: Create an inventory of useful Munin metrics, a subtask of T1408: More/better Metrics, from Open to Work in Progress.
Dec 4 2018, 4:11 PM · Metrics/monitoring, Sprint 2018 12
ftigeot updated subscribers of T1428: Create an inventory of useful Munin metrics.
Dec 4 2018, 2:46 PM · Metrics/monitoring, Sprint 2018 12
ftigeot triaged T1428: Create an inventory of useful Munin metrics as Normal priority.
Dec 4 2018, 2:45 PM · Metrics/monitoring, Sprint 2018 12
ftigeot changed the status of T1372: Compare Rsnapshot / BorgBackup / Backuppc, a subtask of T1282: Revisit backups, from Open to Work in Progress.
Dec 4 2018, 2:41 PM · System administration