
System administration (Folder)
Active · Public

Members

  • This project does not have any members.

Watchers

  • This project does not have any watchers.

Details

Description

general system administration tasks, not specific to any product

Recent Activity

Yesterday

olasd added a comment to T1442: Replace Munin graphs with Grafana/Prometheus dashboards.

Grafana postgresql overview: https://grafana.softwareheritage.org/d/PEKz-Ygiz/postgresql-server-overview

Thu, Feb 21, 11:29 AM · Sprint 2018 12, System administration
douardda added a comment to T1442: Replace Munin graphs with Grafana/Prometheus dashboards.

Since the current state of munin's pg monitoring is inconsistent (e.g. azure's dbreplica0 has no pg curve), let's get rid of all of munin's pg monitors so there is less confusion; this is also one step towards T1356.

Thu, Feb 21, 11:06 AM · Sprint 2018 12, System administration

Wed, Feb 20

olasd updated subscribers of T1534: PostgreSQL replication issues between prado and somerset.

After some more stewing and discussion with @zack, we'll be going for the "upgrade to pg 11 and restart replication from scratch" route;

Wed, Feb 20, 2:32 PM · System administration, Archive content

Tue, Feb 19

olasd updated subscribers of T1534: PostgreSQL replication issues between prado and somerset.

After reading some mailing list posts discussing the error message, and discussion with @ftigeot:

Tue, Feb 19, 6:22 PM · System administration, Archive content
olasd added a comment to T1534: PostgreSQL replication issues between prado and somerset.

Logs on primary:

2019-02-19 14:27:44 UTC [15973]: [1-1] user=postgres,db=softwareheritage LOG:  starting logical decoding for slot "pgl_softwareheritage_prado_somerset"
2019-02-19 14:27:44 UTC [15973]: [2-1] user=postgres,db=softwareheritage DETAIL:  streaming transactions committing after 18607/9189A578, reading WAL from 18607/9189A578
2019-02-19 14:27:44 UTC [15973]: [3-1] user=postgres,db=softwareheritage ERROR:  record with incorrect prev-link 5403A/2E2F1829 at 18607/9189A578 
2019-02-19 14:27:44 UTC [15973]: [4-1] user=postgres,db=softwareheritage LOG:  could not receive data from client: Connection reset by peer
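
To read the two WAL positions in the error above, it can help to convert them to plain integers. The sketch below is only an illustration, not part of the original comment: it assumes the usual PostgreSQL LSN notation (two hexadecimal halves, the high and low 32 bits of a 64-bit byte position), and `parse_lsn` is a hypothetical helper name. A record's prev-link is expected to point at an earlier record, which is not the case here.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN such as '18607/9189A578' to an integer byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Values copied from the ERROR line in the primary's log.
record_position = parse_lsn("18607/9189A578")
prev_link = parse_lsn("5403A/2E2F1829")

# A prev-link should point backwards in the WAL; here it points far ahead.
print("prev-link precedes record:", prev_link < record_position)
print("distance in bytes:", prev_link - record_position)
```

The comparison only shows that the prev-link stored in the record cannot be valid; it says nothing about how the record came to be in that state.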
Tue, Feb 19, 3:29 PM · System administration, Archive content
olasd renamed T1534: PostgreSQL replication issues between prado and somerset from PostgreSQL replication issues between prado and beaubourg to PostgreSQL replication issues between prado and somerset.
Tue, Feb 19, 3:22 PM · System administration, Archive content
olasd added a comment to T1526: Install a new VPN endpoint at Rocquencourt.

I'm slightly reordering what you wrote here, sorry!

Tue, Feb 19, 2:58 PM · System administration
olasd updated subscribers of T1534: PostgreSQL replication issues between prado and somerset.
Tue, Feb 19, 2:26 PM · System administration, Archive content
olasd added subtasks for T1535: Deploy prometheus-statsd-exporter to gather per-worker metrics: T1460: Add task related metrics to swh-scheduler, T1461: Add loader-related metrics to swh-loader-core.
Tue, Feb 19, 2:25 PM · System administration, Metrics/monitoring
olasd triaged T1535: Deploy prometheus-statsd-exporter to gather per-worker metrics as High priority.
Tue, Feb 19, 2:25 PM · System administration, Metrics/monitoring
olasd triaged T1534: PostgreSQL replication issues between prado and somerset as High priority.
Tue, Feb 19, 2:10 PM · System administration, Archive content
douardda added a comment to T1526: Install a new VPN endpoint at Rocquencourt.

Thanks @olasd for this piece of information.

Tue, Feb 19, 12:17 PM · System administration

Mon, Feb 18

olasd added a comment to T1526: Install a new VPN endpoint at Rocquencourt.

Thanks for recording this task; I'll use the opportunity to document the reasoning behind the current internal networking setup, to try and make sure nothing is forgotten before migrating it.

Mon, Feb 18, 7:30 PM · System administration

Sat, Feb 16

ardumont added a comment to T906: mercurial loader: Debian package.

Heads up.

Sat, Feb 16, 9:57 AM · System administration, Mercurial loader

Fri, Feb 15

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Network limitation removed via a hotfix (manual route deletion).
Some network downtime will be required in the future to ensure the new /etc network configuration works as expected.

Fri, Feb 15, 11:04 AM · System administration

Wed, Feb 13

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Actual content of the vmbr0 interface configuration in beaubourg:/etc/network/interfaces:

auto vmbr0
iface vmbr0 inet static
        bridge_ports vlan440
        address 192.168.100.32
        netmask 255.255.255.0
        up ip route add 192.168.101.0/24 via 192.168.100.1
        up ip route add 192.168.200.0/21 via 192.168.100.1
        up ip rule add from 192.168.100.32 table private
        up ip route add default via 192.168.100.1 dev vmbr0 table private
        up ip route flush cache
        down ip route del default via 192.168.100.1 dev vmbr0 table private
        down ip rule del from 192.168.100.32 table private
        down ip route del 192.168.200.0/21 via 192.168.100.1
        down ip route del 192.168.101.0/24 via 192.168.100.1
        down ip route flush cache
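
One way to reason about this configuration is to model the two routing tables it produces. The sketch below is an illustration under stated assumptions, not a description of the kernel's implementation: it assumes the `private` table contains only the default route shown above, that the policy rule is consulted before the main table, and that the packet's source address is already 192.168.100.32. The destination 192.168.100.50 is a made-up host on the local network.

```python
import ipaddress


def net(cidr):
    return ipaddress.ip_network(cidr)


# Main table: the connected route implied by address/netmask, plus the
# two static routes added by the "up ip route add" lines.
MAIN = [
    (net("192.168.100.0/24"), None),  # on-link, no gateway
    (net("192.168.101.0/24"), "192.168.100.1"),
    (net("192.168.200.0/21"), "192.168.100.1"),
]

# Assumption: the "private" table holds nothing but the default route.
PRIVATE = [(net("0.0.0.0/0"), "192.168.100.1")]


def lookup(table, destination):
    """Longest-prefix match of destination within a single table."""
    dst = ipaddress.ip_address(destination)
    candidates = [(n, via) for n, via in table if dst in n]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry[0].prefixlen)


def route_for(source, destination):
    """Apply the 'from 192.168.100.32 table private' rule, then fall back to main."""
    if source == "192.168.100.32":
        hit = lookup(PRIVATE, destination)
        if hit:
            return ("private",) + hit
    hit = lookup(MAIN, destination)
    return ("main",) + hit if hit else None


print(route_for("192.168.100.32", "192.168.100.50"))
print(route_for(None, "192.168.100.50"))
```

Under these assumptions, a packet matching the rule is sent to the gateway even when the destination is on the local /24, because the `private` table has no on-link route for it.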
Wed, Feb 13, 4:49 PM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Outgoing network traffic from beaubourg to the local private network 192.168.100.0/24 transits via louvre.
Louvre re-emits network packets and sends them to the destination host.

Wed, Feb 13, 1:28 PM · System administration

Tue, Feb 12

ftigeot added a comment to T1526: Install a new VPN endpoint at Rocquencourt.

Louvre has gone down more than once before. Some of these events are documented in T1173.

Tue, Feb 12, 1:20 PM · System administration
ftigeot triaged T1526: Install a new VPN endpoint at Rocquencourt as Normal priority.
Tue, Feb 12, 1:12 PM · System administration

Mon, Feb 11

vlorentz renamed T1366: Generic service efficiency metrics in prometheus from Generic service efficiency metrics in promehteus to Generic service efficiency metrics in prometheus.
Mon, Feb 11, 2:14 PM · Metrics/monitoring, Restricted Project, System administration

Thu, Feb 7

olasd added a comment to T1520: Numerous dm device failures on louvre.

During the pvmove off of /dev/md3, the root filesystem for uffizi ended up being remounted r/o. I've shut it down, fsck'd it, and booted it back up.

Thu, Feb 7, 9:38 PM · System administration
ftigeot added a subtask for T1520: Numerous dm device failures on louvre: T1486: I/O error on worker06.internal.
Thu, Feb 7, 4:55 PM · System administration
ftigeot added a parent task for T1486: I/O error on worker06.internal: T1520: Numerous dm device failures on louvre.
Thu, Feb 7, 4:55 PM · System administration
ftigeot removed a parent task for T1520: Numerous dm device failures on louvre: T1486: I/O error on worker06.internal.
Thu, Feb 7, 4:55 PM · System administration
ftigeot removed a subtask for T1486: I/O error on worker06.internal: T1520: Numerous dm device failures on louvre.
Thu, Feb 7, 4:55 PM · System administration
ftigeot added a comment to T1520: Numerous dm device failures on louvre.

After the reboot, existing dm volumes on top of /dev/md3 still reported I/O errors:

[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read
[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read
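
When many such lines show up, counting them per device makes it easier to see which dm volumes are affected. The following is a minimal sketch, not a tool used in this task; `count_buffer_errors` is a hypothetical helper, and the pattern is written to accept both the "on dev" and "on device" spellings that appear in the comments of this task.

```python
import re
from collections import Counter

# Matches "Buffer I/O error on dev dm-33, logical block 6999130" as well as
# the older "Buffer I/O error on device dm-41, logical block 10474329" form.
PATTERN = re.compile(
    r"Buffer I/O error on dev(?:ice)? (?P<dev>[\w-]+), logical block (?P<block>\d+)"
)


def count_buffer_errors(lines):
    """Return a Counter mapping device name to number of buffer I/O errors."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


sample = [
    "[ 5200.552667] Buffer I/O error on dev dm-33, logical block 6999130, async page read",
    "[ 5506.537868] Buffer I/O error on dev dm-35, logical block 2251864, async page read",
    "Buffer I/O error on device dm-41, logical block 10474329",
]
print(count_buffer_errors(sample))
```

On a live host the same function could be fed the output of dmesg line by line.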
Thu, Feb 7, 4:04 PM · System administration
ftigeot added a comment to T1520: Numerous dm device failures on louvre.

The kind of error that was reported suddenly and in large numbers when louvre stopped operating properly:

Buffer I/O error on device dm-41, logical block 10474329
Thu, Feb 7, 3:51 PM · System administration
ftigeot updated the task description for T1520: Numerous dm device failures on louvre.
Thu, Feb 7, 3:48 PM · System administration
ftigeot added a comment to T1520: Numerous dm device failures on louvre.

Related-to: T1486, T1518

Thu, Feb 7, 3:46 PM · System administration
ftigeot changed the status of T1520: Numerous dm device failures on louvre from Open to Work in Progress.
Thu, Feb 7, 3:45 PM · System administration
ftigeot added a comment to T1486: I/O error on worker06.internal.

A brand new virtual disk was created, skipping bad data blocks:

Thu, Feb 7, 3:35 PM · System administration

Wed, Feb 6

ftigeot closed T1518: I/O error on louvre:/dev/md3 as Resolved.

The RAID volume was successfully rebuilt. Closing, even though the root cause of the initial error was not found.

Wed, Feb 6, 10:43 AM · System administration
ftigeot closed T1518: I/O error on louvre:/dev/md3, a subtask of T1486: I/O error on worker06.internal, as Resolved.
Wed, Feb 6, 10:43 AM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

dm-11 is a device present on top of dm-10, itself backed by /dev/sda:

Wed, Feb 6, 10:42 AM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

More complete list of I/O errors as reported by dmesg(1):

[Tue Feb  5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Tue Feb  5 09:38:53 2019] sd 0:0:2:0: [sda] tag#9 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
[Tue Feb  5 09:38:53 2019] print_req_error: 140 callbacks suppressed
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev sda, sector 864969704
[Tue Feb  5 09:38:53 2019] device-mapper: multipath: Failing path 8:0.
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 864969704
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 865388544
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394832
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742368
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 180742224
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422416
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422368
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 212422032
[Tue Feb  5 09:38:53 2019] print_req_error: I/O error, dev dm-10, sector 149394864
[Tue Feb  5 09:38:53 2019] Buffer I/O error on dev dm-10, logical block 937684544, async page read
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516384
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516416
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516472
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516488
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516512
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516544
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516584
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 1387516600
[Tue Feb  5 09:38:53 2019] md/raid10:md3: dm-11: rescheduling sector 4518605200
[Tue Feb  5 09:38:54 2019] md/raid10:md3: dm-11: rescheduling sector 2565880800
...
[Tue Feb  5 09:39:51 2019] md: super_written gets error=10
[Tue Feb  5 09:39:51 2019] md/raid10:md3: Disk failure on dm-11, disabling device.
                           md/raid10:md3: Operation continuing on 1 devices.
[Tue Feb  5 09:39:51 2019] md/raid10:md3: dm-13: redirecting sector 1387516384 to another mirror
Wed, Feb 6, 10:36 AM · System administration

Tue, Feb 5

ftigeot renamed T1467: Slow network transfers from beaubourg from Network timeout issues in the Proxmox cluster to Slow network transfers from beaubourg.
Tue, Feb 5, 5:18 PM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

None of the previous timeout issues are visible anymore on the Proxmox web interface.
They were possibly related to bad network quality on the web browser side (INRIA guest wifi).

Tue, Feb 5, 5:18 PM · System administration
ftigeot changed the status of T1518: I/O error on louvre:/dev/md3 from Open to Work in Progress.

Forcing a rebuild by removing and re-adding the faulty device:

mdadm --manage /dev/md3 -r /dev/dm-11
mdadm --manage /dev/md3 -a /dev/dm-11
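
After a device is re-added, the rebuild can be followed in /proc/mdstat. The sketch below shows one way to extract the progress figure; it is an illustration only. The sample text is invented to resemble the usual /proc/mdstat layout and is not output captured from louvre, and `rebuild_progress` is a hypothetical helper name.

```python
import re

ARRAY_RE = re.compile(r"^(md\w+)\s*:")
PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%")

# Invented sample in the usual /proc/mdstat layout (not real output).
SAMPLE = """\
Personalities : [raid10]
md3 : active raid10 dm-11[2] dm-13[1]
      1953382400 blocks super 1.2 2 near-copies [2/1] [_U]
      [=>...................]  recovery =  8.5% (166031232/1953382400) finish=148.7min speed=200271K/sec

unused devices: <none>
"""


def rebuild_progress(mdstat_text):
    """Map each array with a running rebuild to (action, percent done)."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        step = PROGRESS_RE.search(line)
        if step and current:
            progress[current] = (step.group(1), float(step.group(2)))
    return progress


print(rebuild_progress(SAMPLE))
```

On a real system the function would be called with the contents of /proc/mdstat instead of the sample string.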
Tue, Feb 5, 4:18 PM · System administration
ftigeot changed the status of T1518: I/O error on louvre:/dev/md3, a subtask of T1486: I/O error on worker06.internal, from Open to Work in Progress.
Tue, Feb 5, 4:18 PM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

As in T1486, we have a dm device reporting I/O errors but no visible errors on the underlying physical device.

Tue, Feb 5, 4:08 PM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

A full read check of /dev/sda did not return any error:

# dd if=/dev/sda of=/dev/null bs=1M
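
A full read takes a while on a large disk. As a complement, and not something done in this task, a read limited to the area around a sector reported by the kernel can be built as follows. The sketch assumes that the sector numbers in the kernel messages are in 512-byte units and that GNU dd is available for `iflag=direct`; `targeted_read_command` and the `window` parameter are hypothetical names.

```python
SECTOR_SIZE = 512  # kernel log messages count sectors in 512-byte units


def targeted_read_command(device, sector, window=8):
    """Build a dd command reading `window` sectors on each side of `sector`."""
    start = max(sector - window, 0)
    count = sector + window - start + 1
    return (
        f"dd if={device} of=/dev/null bs={SECTOR_SIZE} "
        f"skip={start} count={count} iflag=direct"
    )


# Sector taken from the "I/O error, dev sda" line in the dmesg output of this task.
print(targeted_read_command("/dev/sda", 864969704))
```

The function only prints the command; running it is left to the operator, since reading with direct I/O bypasses the page cache and therefore goes to the device.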
Tue, Feb 5, 3:53 PM · System administration
ftigeot added a comment to T1518: I/O error on louvre:/dev/md3.

As shown by lsblk, dm-11 and dm-13 are partitions on multipath devices.
These devices are in turn backed by the physical /dev/sda and /dev/sdb SSDs, respectively:

Tue, Feb 5, 3:50 PM · System administration
ftigeot triaged T1518: I/O error on louvre:/dev/md3 as High priority.
Tue, Feb 5, 3:36 PM · System administration
vlorentz updated subscribers of T1516: High disk utilization on dbreplica0.euwest.azure.
Tue, Feb 5, 2:49 PM · System administration
vlorentz triaged T1516: High disk utilization on dbreplica0.euwest.azure as Normal priority.
Tue, Feb 5, 2:48 PM · System administration

Fri, Feb 1

ftigeot renamed T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device from Phantom device mapper volume usage in Proxmox to Phantom device mapper volume usage in Proxmox: logical volume is used by another device.
Fri, Feb 1, 11:20 AM · System administration
ftigeot closed T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node as Resolved.

Removing the previously used volume allowed VM migration to complete.

Fri, Feb 1, 11:06 AM · System administration
ftigeot closed T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node, a subtask of T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device, as Resolved.
Fri, Feb 1, 11:06 AM · System administration
ftigeot added a comment to T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node.

Since the previously used drive is not used anymore, I decided to remove it:

# lvchange -a y ssd/vm-102-disk-0
# lvremove /dev/ssd/vm-102-disk-0
Fri, Feb 1, 11:01 AM · System administration
ftigeot changed the status of T1509: Phantom device mapper volume usage in Proxmox: local storage is not available on target node, a subtask of T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device, from Open to Work in Progress.
Fri, Feb 1, 10:55 AM · System administration