Jun 28 2019
Jun 27 2019
Jun 25 2019
Jun 24 2019
The solution to this problem is to first identify the partition devices and then remove them:
dmsetup remove ssd-vm--100--disk--2p1
This behavior appears to be caused by partitions present on top of device-mapper devices.
These partitions are in turn used to create other dm devices, and these latter devices keep an open reference to the base one.
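To see this stacking and spot the partition devices holding the reference, something like the following can be used (a sketch, reusing the volume name from the command above):
dmsetup ls --tree
lsblk /dev/mapper/ssd-vm--100--disk--2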
Jun 13 2019
Jun 11 2019
Jun 6 2019
The reason for this behavior is that Debian uses dynamic UIDs for most of its system users.
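Debian allocates these UIDs dynamically from the system range declared in /etc/adduser.conf, which means the same system user can end up with a different UID on different hosts; a quick way to check (the user name here is purely illustrative):
grep SYSTEM_UID /etc/adduser.conf
id -u postgres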
Looks good for a first draft.
May 29 2019
May 28 2019
Looks good to me.
Always using the fqdn belvedere.internal.softwareheritage.org would be more consistent though ;-)
May 22 2019
May 16 2019
May 14 2019
We will use VMs running on the orsay.internal.softwareheritage.org hypervisor for now.
May 13 2019
Apr 30 2019
Grafanalib dashboards added to https://grafana.softwareheritage.org/ via the new provisioning mechanism of Grafana 5.x.
Fully automated provisioning is still a work-in-progress.
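For reference, provisioning in Grafana 5.x is driven by YAML files dropped under /etc/grafana/provisioning/; a minimal dashboard provider sketch (provider name and dashboard path are assumptions) looks like:
apiVersion: 1
providers:
  - name: 'generated-dashboards'
    type: file
    options:
      path: /var/lib/grafana/dashboards
Grafana then loads every JSON dashboard found under that path.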
Prometheus does not provide storage device statistics for Proxmox container-based hosts.
The data can be read from their parent machine dashboards though.
Some disk space usage statistics with ~= one month of snapshots
Apr 25 2019
Grafanalib-based dashboards do not require special handling; the NFS filesystem on orangerie:/srv/softwareheritage is shown by default, for example.
Apr 19 2019
Apr 18 2019
If we remove Munin before implementing missing graph replacements, we will lack a comparison base and possibly fail to discover bogus data.
Right now, the Prometheus disk throughput and IOPS values are suspiciously low, for example.
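One way to cross-check the raw counters behind those panels is to query the node_exporter metrics directly (host and port are placeholders; the metric names are those of node_exporter 0.16+):
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(node_disk_read_bytes_total[5m])'
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(node_disk_reads_completed_total[5m])'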
Apr 16 2019
Even though most/all of the Munin metrics are provided by Prometheus, Munin also provides graphs.
It is these graphs we are still missing.
Wasn't that what T1428 was about?
Apart from the list of pending packages, all commonly used Munin metrics should already have Prometheus equivalents.
When I asked where to put such work-in-progress, you suggested the snippets repository.
Apr 15 2019
Work-in-progress Grafanalib dashboards have been added to the https://forge.softwareheritage.org/source/snippets/ repository.
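For the record, each Grafanalib dashboard is a small Python module exposing a dashboard object, which is rendered to Grafana JSON with the generate-dashboard helper shipped with the library (file names here are hypothetical):
generate-dashboard -o node-overview.json node-overview.dashboard.py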
The new hypervisor has been working without any particular issue since its installation.
Apr 10 2019
Mar 26 2019
BorgBackup added to the comparison.
Resolved on 2019-02-07.
Resolved on 2019-02-07.
Mar 20 2019
Already marked as done on 2018-12-19.
Mar 15 2019
Looks good to me.
Mar 5 2019
Attachment: comparison between Backuppc and Rsnapshot
Mar 4 2019
Feb 27 2019
Feb 26 2019
Comparison between Backuppc and Rsnapshot done, now adding Restic - https://restic.net/ - to the mix.
Borgbackup not tested yet.
Feb 15 2019
Network limitation removed via a hotfix (manual route deletion).
Some network downtime will be required in the future to ensure the new /etc/network configuration works as expected.
Feb 13 2019
Actual content of the vmbr0 interface configuration in beaubourg:/etc/network/interfaces:
auto vmbr0
iface vmbr0 inet static
    bridge_ports vlan440
    address 192.168.100.32
    netmask 255.255.255.0
    up ip route add 192.168.101.0/24 via 192.168.100.1
    up ip route add 192.168.200.0/21 via 192.168.100.1
    up ip rule add from 192.168.100.32 table private
    up ip route add default via 192.168.100.1 dev vmbr0 table private
    up ip route flush cache
    down ip route del default via 192.168.100.1 dev vmbr0 table private
    down ip rule del from 192.168.100.32 table private
    down ip route del 192.168.200.0/21 via 192.168.100.1
    down ip route del 192.168.101.0/24 via 192.168.100.1
    down ip route flush cache
Outgoing network traffic from beaubourg to the local private network 192.168.100.0/24 transits via louvre.
Louvre re-emits network packets and sends them to the destination host.
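A quick way to confirm which path the kernel chooses for that traffic is to ask it directly (the source address is taken from the vmbr0 configuration above; 192.168.100.42 stands for any host on the private network):
ip route get 192.168.100.42 from 192.168.100.32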
Feb 12 2019
Louvre had previously gone down more than once. Some of these events are documented in T1173.
Feb 7 2019
After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:
[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
The kind of error that was suddenly reported in large numbers when louvre stopped operating properly:
Buffer I/O error on device dm-41, logical block 10474329
A brand new virtual disk was created, skipping bad data blocks:
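The exact command is not preserved in this log; a generic way to copy a failing volume while skipping unreadable blocks looks like this (device names are placeholders):
dd if=/dev/mapper/failing-disk of=/dev/mapper/new-disk bs=1M conv=noerror,sync
conv=noerror keeps reading after errors, and sync pads the unreadable blocks with zeroes so offsets stay aligned.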
Feb 6 2019
The RAID volume was successfully rebuilt; closing even though the root cause of the initial error was not found.
dm-11 is a device present on top of dm-10, itself backed by /dev/sda.
More complete list of I/O errors as reported by dmesg(1):
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Tue Feb 5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[Tue Feb 5 09:38:53 2019] print_req_error: 140 callbacks suppressed
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev sda, sector 864969704
[Tue Feb 5 09:38:53 2019] device-mapper: multipath: Failing path 8:0.
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 864969704
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 865388544
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394832
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742224
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422416
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422368
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422032
[Tue Feb 5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394864
[Tue Feb 5 09:38:53 2019] Buffer I/O error on dev dm-10, logical block 937684544, async page read
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516384
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516416
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516472
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516488
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516512
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516544
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516584
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516600
[Tue Feb 5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 4518605200
[Tue Feb 5 09:38:54 2019] md/raid10:md3: dm-11: rescheduling sector 2565880800
...
[Tue Feb 5 09:39:51 2019] md: super_written gets error=10
[Tue Feb 5 09:39:51 2019] md/raid10:md3: Disk failure on dm-11, disabling device.
md/raid10:md3: Operation continuing on 1 devices.
[Tue Feb 5 09:39:51 2019] md/raid10:md3: dm-13: redirecting sector 1387516384 to another mirror
Feb 5 2019
None of the previous timeout issues are visible anymore on the Proxmox web interface.
They were possibly related to bad network quality on the web browser side (INRIA guest wifi).
Forcing a rebuild by removing and re-adding the faulty device:
mdadm --manage /dev/md3 -r /dev/dm-11
mdadm --manage /dev/md3 -a /dev/dm-11
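The rebuild can then be followed with the usual md tooling:
cat /proc/mdstat
mdadm --detail /dev/md3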
Like in T1486, we have a dm device reporting I/O errors but no visible errors on the underlying physical device.
A full read check of /dev/sda did not return any error:
# dd if=/dev/sda of=/dev/null bs=1M
As shown by lsblk, dm-11 and dm-13 are partitions on multipath devices.
These devices are themselves respectively handled by the physical /dev/sda and /dev/sdb SSDs:
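For reference, this stacking can also be displayed from the physical device side (output omitted here):
lsblk /dev/sda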
Feb 1 2019
Removing the previously used volume allowed VM migration to complete.
Since the previous drive is no longer used, I decided to remove it:
# lvchange -a y ssd/vm-102-disk-0
# lvremove /dev/ssd/vm-102-disk-0
The previous drive is neither active nor opened:
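The check itself is not reproduced here; with standard LVM tooling it boils down to reading the attribute string, where the 'a' and 'o' flags mark an active and an open volume respectively:
lvs -o lv_name,lv_attr ssd/vm-102-disk-0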
The main VM disk is stored on a "vm-102-disk-1" volume (on Ceph).
There is an inactive LVM volume on "beaubourg-ssd" formerly associated with this VM; it was used as the virtual disk backend before the virtual disk device was migrated to Ceph.
No mention of "beaubourg-ssd" is visible in the Proxmox virtual machine management interface.
All virtual disk backends are stored on Ceph.