Jan 31 2019

ftigeot added a comment to T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device.

Trying to manually disable the logical volume in question fails with the same error message:

lvchange -a n /dev/ssd/vm-107-disk-0
Logical volume ssd/vm-107-disk-0 is used by another device.
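
One way to look for the holder, sketched under assumptions rather than taken from the task: on Linux, each block device exposes a holders directory in sysfs listing the devices stacked on top of it. The helper names below are hypothetical, and the sys_block parameter exists only so the functions can be pointed at a test directory.

```python
from pathlib import Path


def dm_holders(sys_block="/sys/block"):
    """Map each device-mapper volume name to the block devices holding it.

    Assumes the usual Linux sysfs layout: <sys_block>/dm-N/dm/name contains
    the mapper name and <sys_block>/dm-N/holders/ has one entry per device
    stacked on top of dm-N.
    """
    result = {}
    for dev in sorted(Path(sys_block).glob("dm-*")):
        name_file = dev / "dm" / "name"
        holders_dir = dev / "holders"
        if not name_file.is_file():
            continue
        name = name_file.read_text().strip()
        if holders_dir.is_dir():
            holders = sorted(p.name for p in holders_dir.iterdir())
        else:
            holders = []
        result[name] = holders
    return result


def held_volumes(sys_block="/sys/block"):
    """Return only the volumes that have at least one holder."""
    return {name: h for name, h in dm_holders(sys_block).items() if h}
```

LVM doubles the hyphens inside volume group and logical volume names when it builds the mapper name, so ssd/vm-107-disk-0 would be looked up as held_volumes().get("ssd-vm--107--disk--0").
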
Jan 31 2019, 5:31 PM · System administration
ftigeot updated the task description for T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device.
Jan 31 2019, 2:55 PM · System administration

Jan 30 2019

ftigeot closed T1502: Too many postgresql logs on dbreplica0.euwest.azure.internal.softwareheritage.org as Resolved.

Only keep 24 hours of log, and keep rotating on the same file names:
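
The configuration itself is not shown in this feed. As a general illustration only, assuming PostgreSQL's log_filename, log_rotation_age and log_truncate_on_rotation settings are the mechanism (an assumption, not confirmed by the task), an hour-of-day file name pattern yields exactly 24 names that are reused in turn. The pattern postgresql-%H.log below is hypothetical.

```python
from datetime import datetime, timedelta


def rotated_names(pattern, start, hours):
    """Return the log file name in use at each hourly rotation."""
    return [(start + timedelta(hours=h)).strftime(pattern) for h in range(hours)]


# Two days of hourly rotations with an hour-of-day pattern.
names = rotated_names("postgresql-%H.log", datetime(2019, 1, 30), 48)

# Only 24 distinct names exist, so with truncation on rotation each file
# is overwritten once it is 24 hours old.
distinct = sorted(set(names))
print(len(distinct), distinct[0], distinct[-1])
```
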

Jan 30 2019, 3:09 PM · System administration
ftigeot added a comment to T1502: Too many postgresql logs on dbreplica0.euwest.azure.internal.softwareheritage.org.

There is no need to log all production queries on this server.
Reducing logged contents to queries taking more than one millisecond to execute:

Jan 30 2019, 2:23 PM · System administration
ftigeot triaged T1503: Rename hypervisor3 to a museum name as Normal priority.
Jan 30 2019, 11:58 AM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

It turns out hypervisor3 is not the culprit we thought it was.
Removing T1392 from parent task list.

Jan 30 2019, 11:54 AM · System administration
ftigeot removed a parent task for T1467: Slow network transfers from beaubourg: T1392: Add a new hypervisor.
Jan 30 2019, 11:53 AM · System administration
ftigeot removed a subtask for T1392: Add a new hypervisor: T1467: Slow network transfers from beaubourg.
Jan 30 2019, 11:53 AM · System administration
ftigeot changed the status of T1502: Too many postgresql logs on dbreplica0.euwest.azure.internal.softwareheritage.org from Open to Work in Progress.
Jan 30 2019, 10:35 AM · System administration

Jan 29 2019

ftigeot changed the status of T1501: Phantom device mapper volume usage in Proxmox: logical volume is used by another device from Open to Work in Progress.
Jan 29 2019, 2:08 PM · System administration

Jan 25 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

After running some additional tcp iperf tests, it is obvious beaubourg is the outlier.
Measured bandwidth:

  • from any 10G machine to any 10G machine (except beaubourg): > 9 Gb/s
  • from any 10G machine to beaubourg: > 9 Gb/s
  • from beaubourg to ceph-osd1, ceph-osd2 and hypervisor3: 600-800 Mb/s
  • from beaubourg to ceph-mon1: 230 Kb/s
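
To put these figures on one scale, here is a small conversion sketch. It assumes decimal network units (1 Gb/s = 1000 Mb/s = 1,000,000 Kb/s) and uses 9 Gb/s, the lower bound reported for the healthy paths, as the reference. The helper name and dictionary labels are illustrative.

```python
UNITS = {"Kb/s": 1e3, "Mb/s": 1e6, "Gb/s": 1e9}


def to_bits_per_second(value, unit):
    """Convert a bandwidth figure to bits per second (decimal units)."""
    return value * UNITS[unit]


REFERENCE = to_bits_per_second(9, "Gb/s")

measurements = {
    "beaubourg to ceph-osd1, ceph-osd2, hypervisor3 (low end)": (600, "Mb/s"),
    "beaubourg to ceph-osd1, ceph-osd2, hypervisor3 (high end)": (800, "Mb/s"),
    "beaubourg to ceph-mon1": (230, "Kb/s"),
}

# Express each outbound measurement as a share of the 9 Gb/s reference.
shares = {
    path: to_bits_per_second(value, unit) / REFERENCE
    for path, (value, unit) in measurements.items()
}

for path, share in shares.items():
    print(f"{path}: {share:.4%} of 9 Gb/s")
```

Under these assumptions the outbound paths reach roughly 7 to 9 percent of the reference, and the ceph-mon1 path is several orders of magnitude lower still.
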
Jan 25 2019, 3:56 PM · System administration

Jan 22 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Since all these machines are connected to the same pair of switches and these switches are managed by INRIA DSI-SESI, I have asked for their assistance in this ticket:
https://support.inria.fr/Ticket/Display.html?id=127011

Jan 22 2019, 3:18 PM · System administration
ftigeot added a comment to T1486: I/O error on worker06.internal.

The /dev/md3 check completed successfully and did not report any error.

Jan 22 2019, 8:41 AM · System administration
ftigeot claimed T1486: I/O error on worker06.internal.
Jan 22 2019, 8:41 AM · System administration

Jan 21 2019

ftigeot added a comment to T1486: I/O error on worker06.internal.

worker06.internal.softwareheritage.org is a VM running on louvre. Its virtual disk is backed by /dev/dm-36 on the host.

Jan 21 2019, 2:42 PM · System administration
ftigeot changed the status of T1486: I/O error on worker06.internal from Open to Work in Progress.
Jan 21 2019, 2:38 PM · System administration

Jan 16 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

For the previous iperf TCP test and without tuning, we also have:

  • an average transfer speed of 9,388 Mb/s between hypervisor3 and one of the 10G Ceph nodes, ceph-osd1.
  • an average transfer speed of 8,364 Mb/s between beaubourg and ceph-osd1.
Jan 16 2019, 11:43 AM · System administration

Jan 15 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Both beaubourg and hypervisor3 network interfaces have a 10Gb/s link layer connection.
Aggregated traffic from multiple iperf streams nevertheless never exceeds about 90% of a 1Gb/s transfer speed.

Jan 15 2019, 5:05 PM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Another thing worth noting: the vmbr0 interface, which carries the primary IP address, has an MTU of only 1500 bytes.
The network interfaces it is built on have a 9000-byte MTU.
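
A mismatch like this can be spotted programmatically. The sketch below is hypothetical and assumes the Linux sysfs layout, where each interface exposes its MTU in an mtu file and a bridge lists its ports under brif. The sys_net parameter is there only to allow testing against a fake directory tree.

```python
from pathlib import Path


def read_mtu(iface, sys_net="/sys/class/net"):
    """Read an interface's MTU from sysfs."""
    return int((Path(sys_net) / iface / "mtu").read_text().strip())


def bridge_mtu_report(bridge, sys_net="/sys/class/net"):
    """Compare a bridge's MTU with the MTU of each of its member ports.

    Assumes <sys_net>/<iface>/mtu holds the MTU and
    <sys_net>/<bridge>/brif/ has one entry per bridge port.
    """
    bridge_mtu = read_mtu(bridge, sys_net)
    ports = sorted(p.name for p in (Path(sys_net) / bridge / "brif").iterdir())
    port_mtus = {port: read_mtu(port, sys_net) for port in ports}
    return {
        "bridge": bridge,
        "bridge_mtu": bridge_mtu,
        "mismatched_ports": {
            port: mtu for port, mtu in port_mtus.items() if mtu != bridge_mtu
        },
    }
```

Calling bridge_mtu_report("vmbr0") on the host would list every port whose MTU differs from the bridge's own value.
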

Jan 15 2019, 4:28 PM · System administration
ftigeot added a comment to T1467: Slow network transfers from beaubourg.

iperf tests show:

  • network speed never reaches 1Gb/s, even between hosts which have 10Gb/s network interfaces and are connected to the same switches
  • 19% of UDP packets get lost at 1Gb/s (less than 0.5% at 100Mb/s)
Jan 15 2019, 2:59 PM · System administration

Jan 14 2019

ftigeot added a comment to T1467: Slow network transfers from beaubourg.

Corosync warnings also routinely appear in the logs:

Jan 14 11:56:13 hypervisor3 corosync[5622]: notice  [TOTEM ] Retransmit List: 282eb9
Jan 14 11:56:13 hypervisor3 corosync[5622]:  [TOTEM ] Retransmit List: 282eb9
Jan 14 11:56:13 hypervisor3 corosync[5622]:  [TOTEM ] Retransmit List: 282eba
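
To judge how routine these warnings are, the messages can be counted per host and per minute. The following is a minimal sketch, not part of the task: the regular expression is written against the three sample lines above and may need adjusting for other syslog formats.

```python
import re
from collections import Counter

# Matches syslog lines such as the corosync samples quoted above.
RETRANSMIT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\scorosync\[\d+\]:"
    r".*\[TOTEM \] Retransmit List:\s(?P<seqs>.+)$"
)


def count_retransmits(lines):
    """Count Retransmit List messages per (host, minute)."""
    counts = Counter()
    for line in lines:
        match = RETRANSMIT.match(line)
        if match:
            minute = match.group("stamp")[:-3]  # drop the seconds
            counts[(match.group("host"), minute)] += 1
    return counts


sample = [
    "Jan 14 11:56:13 hypervisor3 corosync[5622]: notice  [TOTEM ] Retransmit List: 282eb9",
    "Jan 14 11:56:13 hypervisor3 corosync[5622]:  [TOTEM ] Retransmit List: 282eb9",
    "Jan 14 11:56:13 hypervisor3 corosync[5622]:  [TOTEM ] Retransmit List: 282eba",
]
print(count_retransmits(sample))
```
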
Jan 14 2019, 1:26 PM · System administration
ftigeot changed the status of T1467: Slow network transfers from beaubourg, a subtask of T1392: Add a new hypervisor, from Open to Work in Progress.
Jan 14 2019, 1:23 PM · System administration
ftigeot changed the status of T1467: Slow network transfers from beaubourg from Open to Work in Progress.

The network interface hardware on hypervisor3 is relatively new:

i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.1.14-k
Jan 14 2019, 1:23 PM · System administration

Jan 11 2019

ftigeot triaged T1467: Slow network transfers from beaubourg as Normal priority.
Jan 11 2019, 4:24 PM · System administration

Dec 21 2018

ftigeot closed T1325: Add SSDs to banco as Resolved.

Two 4TB SSDs added to banco yesterday, exported to Linux as JBODs.

Dec 21 2018, 4:10 PM · System administration
ftigeot added a comment to T1392: Add a new hypervisor.

Proxmox now installed on the machine, hypervisor3.softwareheritage.org.

Dec 21 2018, 4:08 PM · System administration
ftigeot committed rSPSITE59ee68802a76: manifests/site: add a new hypervisor, hypervisor3 (authored by ftigeot).
manifests/site: add a new hypervisor, hypervisor3
Dec 21 2018, 2:14 PM

Dec 13 2018

ftigeot moved T1442: Replace Munin graphs with Grafana/Prometheus dashboards from Backlog to in progress on the Sprint 2018 12 board.
Dec 13 2018, 4:22 PM · Sprint 2018 12, System administration
ftigeot changed the status of T1442: Replace Munin graphs with Grafana/Prometheus dashboards, a subtask of T1356: Kill munin, from Open to Work in Progress.
Dec 13 2018, 4:21 PM · Sprint 2018 12, System administration
ftigeot changed the status of T1442: Replace Munin graphs with Grafana/Prometheus dashboards from Open to Work in Progress.
Dec 13 2018, 4:21 PM · Sprint 2018 12, System administration
ftigeot triaged T1442: Replace Munin graphs with Grafana/Prometheus dashboards as High priority.
Dec 13 2018, 4:19 PM · Sprint 2018 12, System administration
ftigeot added a parent task for T1428: Create an inventory of useful Munin metrics: T1356: Kill munin.
Dec 13 2018, 4:14 PM · Metrics/monitoring, Sprint 2018 12
ftigeot added a subtask for T1356: Kill munin: T1428: Create an inventory of useful Munin metrics.
Dec 13 2018, 4:14 PM · Sprint 2018 12, System administration

Dec 11 2018

ftigeot changed the status of T1338: Change BBUs on orsay from Open to Work in Progress.

Another Perc H700 battery replacement product: http://www.hardware-attitude.com/fiche-1114-batterie-raid-pour-perc5-i-perc6-i---nu209.html
We should buy this one if possible ASAP IMHO.

Dec 11 2018, 4:55 PM · System administration

Dec 7 2018

ftigeot added a comment to T1372: Compare Rsnapshot / BorgBackup / Backuppc.

Borgbackup is unable to pull data from remote hosts to a central location.

I do not understand this assertion.

Dec 7 2018, 10:50 AM · System administration

Dec 4 2018

ftigeot changed the status of T1428: Create an inventory of useful Munin metrics from Open to Work in Progress.

Disk

  • I/Os per device
  • Disk usage in percent
  • Utilization per device: is this real? It could be useful to see whether a storage subsystem is overloaded
  • Disk usage in absolute, human-readable values: percentages are meaningless if we resize filesystems
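
The point about percentages can be illustrated with a short sketch. The sizes below are made-up figures, not measurements from any Software Heritage host, and usage_summary is a hypothetical helper built on the standard shutil.disk_usage call.

```python
import shutil


def usage_summary(path="/"):
    """Report used space both as a percentage and in absolute units."""
    total, used, free = shutil.disk_usage(path)
    return {
        "used_percent": 100 * used / total,
        "used_gib": used / 2**30,
        "free_gib": free / 2**30,
    }


def percent_used(used_bytes, total_bytes):
    """Used space as a percentage of the filesystem size."""
    return 100 * used_bytes / total_bytes


# The same 400 GiB of data reads as 80% on a 500 GiB filesystem
# and as 40% once the filesystem is grown to 1000 GiB.
before = percent_used(400 * 2**30, 500 * 2**30)
after = percent_used(400 * 2**30, 1000 * 2**30)
print(before, after)
```

The absolute figure stays at 400 GiB across the resize, which is why it remains comparable over time while the percentage does not.
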
Dec 4 2018, 4:11 PM · Metrics/monitoring, Sprint 2018 12
ftigeot changed the status of T1428: Create an inventory of useful Munin metrics, a subtask of T1408: More/better Metrics, from Open to Work in Progress.
Dec 4 2018, 4:11 PM · Metrics/monitoring, Sprint 2018 12
ftigeot updated subscribers of T1428: Create an inventory of useful Munin metrics.
Dec 4 2018, 2:46 PM · Metrics/monitoring, Sprint 2018 12
ftigeot triaged T1428: Create an inventory of useful Munin metrics as Normal priority.
Dec 4 2018, 2:45 PM · Metrics/monitoring, Sprint 2018 12
ftigeot changed the status of T1372: Compare Rsnapshot / BorgBackup / Backuppc, a subtask of T1282: Revisit backups, from Open to Work in Progress.
Dec 4 2018, 2:41 PM · System administration
ftigeot changed the status of T1372: Compare Rsnapshot / BorgBackup / Backuppc from Open to Work in Progress.

There is a huge difference between Borgbackup and Rsnapshot + Backuppc: Borgbackup is unable to pull data from remote hosts to a central location.
Its working model is based on Borgbackup running locally and storing data to a local filesystem.

Dec 4 2018, 2:41 PM · System administration
ftigeot added a comment to T1392: Add a new hypervisor.

New hypervisor hardware has been racked in our bay at Rocquencourt.
The machine's iDrac management interface is accessible on the management network, under the name swh7-adm.inria.fr (details on the wiki).

Dec 4 2018, 11:56 AM · System administration
ftigeot closed T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres as Resolved.

Service postgresql@10-indexer.service has been restarted on somerset and database replication is once again operating normally.
Postgres WAL files are being removed as expected on the master, slowly freeing disk space.

Dec 4 2018, 11:31 AM · System administration

Dec 3 2018

ftigeot added a comment to T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres.

Some dump files that were no longer useful were removed by seirl@, freeing some space on somerset:/srv/softwareheritage/postgres.

Dec 3 2018, 3:19 PM · System administration
ftigeot added a comment to T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres.

somerset:softwareheritage-indexer is the master database for dbreplica1:softwareheritage-indexer.

Dec 3 2018, 3:17 PM · System administration
ftigeot added a parent task for T1395: Enlarge disk on dbreplica1: T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres.
Dec 3 2018, 3:11 PM · System administration
ftigeot added a subtask for T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres: T1395: Enlarge disk on dbreplica1.
Dec 3 2018, 3:11 PM · System administration
ftigeot changed the status of T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres from Open to Work in Progress.
Dec 3 2018, 3:10 PM · System administration
ftigeot closed T1395: Enlarge disk on dbreplica1 as Resolved.

The pvmove command was run this morning.

Dec 3 2018, 3:07 PM · System administration

Nov 27 2018

ftigeot added a parent task for T1372: Compare Rsnapshot / BorgBackup / Backuppc: T1282: Revisit backups.
Nov 27 2018, 4:45 PM · System administration
ftigeot added a subtask for T1282: Revisit backups: T1372: Compare Rsnapshot / BorgBackup / Backuppc.
Nov 27 2018, 4:45 PM · System administration
ftigeot changed the status of T1392: Add a new hypervisor from Open to Work in Progress.
Nov 27 2018, 4:42 PM · System administration

Nov 23 2018

ftigeot added a comment to T1338: Change BBUs on orsay.

At least some of the batteries for PERC H800 adapters use part number KR174 and/or M164C.
Some information leads me to believe they could also be used with PERC H700 adapters.

Nov 23 2018, 3:20 PM · System administration
ftigeot lowered the priority of T979: Migrate TLS certificates away from the *.softwareheritage.org wildcards from High to Wishlist.

I did some experiments with Let's Encrypt, but other things were more urgent during September-October 2018, and in the end a wildcard Digicert certificate was used again.

Nov 23 2018, 3:04 PM · System administration
ftigeot committed rSPSITE33fdc25ae44e: Rsnapshot master role: Exclude file patterns from backups (authored by ftigeot).
Rsnapshot master role: Exclude file patterns from backups
Nov 23 2018, 2:06 PM

Nov 22 2018

ftigeot committed rSPSITE57ad56cde817: data/default: Export root@banco's public ssh key (authored by ftigeot).
data/default: Export root@banco's public ssh key
Nov 22 2018, 3:03 PM

Nov 20 2018

ftigeot triaged T1372: Compare Rsnapshot / BorgBackup / Backuppc as Normal priority.
Nov 20 2018, 4:36 PM · System administration
ftigeot committed rSPSITEf5e70254d953: Rsnapshot master role: Do not run rsnapshot hourly every minute (authored by ftigeot).
Rsnapshot master role: Do not run rsnapshot hourly every minute
Nov 20 2018, 4:09 PM

Nov 16 2018

ftigeot added a comment to T1338: Change BBUs on orsay.

Batteries for PERC H700 adapters have the part number U8735 and/or NU209.

Nov 16 2018, 3:55 PM · System administration
ftigeot committed rSPSITEe5b5d5b49b94: Rsnapshot master role: last minute fixes (authored by ftigeot).
Rsnapshot master role: last minute fixes
Nov 16 2018, 2:54 PM
ftigeot committed rSPSITEe740b250680e: Add a new rsnapshot::master role (authored by ftigeot).
Add a new rsnapshot::master role
Nov 16 2018, 2:22 PM

Nov 15 2018

ftigeot added a comment to T1338: Change BBUs on orsay.

Orsay contains two LSI SAS 2108-based RAID adapters:

05:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
22:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
Nov 15 2018, 12:27 PM · System administration
ftigeot added a comment to T1325: Add SSDs to banco.

Since the SSDs we have are 2.5", we need a special adapter disk tray, which Dell refuses to sell us.

Nov 15 2018, 12:01 PM · System administration

Nov 14 2018

ftigeot triaged T1340: Automate storage BBUs monitoring as Normal priority.
Nov 14 2018, 11:59 AM · System administration
ftigeot added a comment to T1338: Change BBUs on orsay.

Related: T1323

Nov 14 2018, 11:57 AM · System administration
ftigeot triaged T1338: Change BBUs on orsay as Normal priority.
Nov 14 2018, 11:56 AM · System administration

Nov 13 2018

ftigeot closed T1323: Check battery status on storage adapters as Resolved.

In summary, only orsay has a failed BBU.
Since it contains two identical RAID adapters with similarly aged BBUs, it could be useful to replace both at once.

Nov 13 2018, 2:56 PM · System administration
ftigeot added a comment to T1323: Check battery status on storage adapters.

List of physical machines at Rocquencourt: louvre beaubourg orsay banco

Nov 13 2018, 2:53 PM · System administration
ftigeot added a project to T1323: Check battery status on storage adapters: System administration.
Nov 13 2018, 12:30 PM · System administration
ftigeot added a project to T1325: Add SSDs to banco: System administration.
Nov 13 2018, 12:30 PM · System administration
ftigeot triaged T1325: Add SSDs to banco as Normal priority.
Nov 13 2018, 12:27 PM · System administration
ftigeot triaged T1323: Check battery status on storage adapters as High priority.
Nov 13 2018, 12:16 PM · System administration

Nov 7 2018

ftigeot committed rSENV70336dcb76a7: .mrconfig: Fix a syntax error introduced in 07648123 (authored by ftigeot).
.mrconfig: Fix a syntax error introduced in 07648123
Nov 7 2018, 4:14 PM

Oct 23 2018

ftigeot committed rSPSITE3e371b4d7859: data/banco: exclude new backup tests from dar backups (authored by ftigeot).
data/banco: exclude new backup tests from dar backups
Oct 23 2018, 4:16 PM

Oct 22 2018

ftigeot added a comment to T1282: Revisit backups.

The existing dar(1) based system is not reliable.

Oct 22 2018, 2:36 PM · System administration
ftigeot added a parent task for T1164: Dar backups fill up disk space on client machines: T1282: Revisit backups.
Oct 22 2018, 2:21 PM · System administration
ftigeot added a subtask for T1282: Revisit backups: T1164: Dar backups fill up disk space on client machines.
Oct 22 2018, 2:21 PM · System administration
ftigeot created T1282: Revisit backups.
Oct 22 2018, 2:09 PM · System administration

Oct 19 2018

ftigeot closed T1201: monitor DNS zones on primary/replica to ensure they stay in sync as Resolved.

Icinga2 service monitoring changes pushed in commit rSPSITE76d7d90c51e0, based on the initial script linked by olasd@.
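
The actual check lives in the commit above and is not reproduced here. As a hypothetical sketch of the general idea, the SOA serial reported by each server can be compared. It assumes the dig binary is installed and relies on the short SOA answer format (mname, rname, serial, refresh, retry, expire, minimum). Function names are illustrative.

```python
import subprocess


def soa_serial(dig_short_output):
    """Extract the serial from one line of `dig +short SOA` output.

    The short form is: mname rname serial refresh retry expire minimum.
    """
    fields = dig_short_output.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected SOA answer: {dig_short_output!r}")
    return int(fields[2])


def query_soa(zone, server):
    """Ask one server for the zone's SOA serial (requires the dig binary)."""
    out = subprocess.run(
        ["dig", "+short", "SOA", zone, f"@{server}"],
        capture_output=True, text=True, check=True, timeout=10,
    ).stdout.strip()
    return soa_serial(out)


def serials_in_sync(serials):
    """True when every server reported the same serial."""
    return len(set(serials.values())) == 1
```

A monitoring wrapper would call query_soa once per name server, pass the resulting dictionary to serials_in_sync, and raise an alert when it returns False.
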

Oct 19 2018, 2:38 PM · System administration
ftigeot closed T1201: monitor DNS zones on primary/replica to ensure they stay in sync, a subtask of T1179: Create an independent DNS resolver on Azure, as Resolved.
Oct 19 2018, 2:38 PM · System administration
ftigeot committed rSPSITE76d7d90c51e0: icinga2: Check the SOA field on i.s.o (authored by ftigeot).
icinga2: Check the SOA field on i.s.o
Oct 19 2018, 2:32 PM
ftigeot committed rSPSITE21df6c9533bb: ELK stack: Use a single version constant for all packages (authored by ftigeot).
ELK stack: Use a single version constant for all packages
Oct 19 2018, 10:40 AM
ftigeot closed D548: ELK stack: Use a single version constant for all packages.
Oct 19 2018, 10:40 AM

Oct 18 2018

Herald added a reviewer for D548: ELK stack: Use a single version constant for all packages: Reviewers.
Oct 18 2018, 3:07 PM
ftigeot added a comment to T1273: elasticsearch: about the elk stack policy upgrade?.

Elasticsearch, Logstash and Kibana are now released together, and similar versions are sure to be compatible. It makes sense to have a global Puppet constant defining which ELK stack version to use for packages.

Oct 18 2018, 12:18 PM · System administration
ftigeot added a comment to T1273: elasticsearch: about the elk stack policy upgrade?.

A quick analysis of the 6.4.x versions shows they bring significant bug fixes.
One particularly interesting aspect is the general improvement in cluster reliability when nodes leave or rejoin the cluster.

Oct 18 2018, 11:40 AM · System administration

Oct 17 2018

ftigeot added a comment to T1273: elasticsearch: about the elk stack policy upgrade?.

Upgrading the Elasticsearch cluster is a somewhat delicate operation, since nodes running old Elasticsearch versions can no longer store new data, but it is not really difficult to handle properly.
The biggest issue could be with Kibana / Elasticsearch interactions: some old Kibana versions are known to stop displaying dashboards when talking to newer Elasticsearch servers.

Oct 17 2018, 3:49 PM · System administration
ftigeot accepted D544: logstash: Pin version to current 6.4.2.

Looks good.

Oct 17 2018, 2:50 PM

Oct 9 2018

ftigeot triaged T1253: Generate correct SOA records for internal.softwareheritage.org. as Normal priority.
Oct 9 2018, 11:36 AM · System administration

Oct 8 2018

ftigeot added a comment to T1201: monitor DNS zones on primary/replica to ensure they stay in sync.

Correct me if I am wrong, but I do not believe the current Puppet code has the ability to handle more than one NS record per zone.
At the very least, I couldn't find an obvious way to add such a record.

Oct 8 2018, 4:29 PM · System administration
ftigeot closed T1175: renews SSL certificats for {www,}softwareheritage.org as Resolved.

All known SSL services now use updated certificates. Closing.

Oct 8 2018, 12:22 PM · System administration

Oct 3 2018

ftigeot added a comment to T1175: renews SSL certificats for {www,}softwareheritage.org.

www and www-dev.softwareheritage.org now use auto-generated Gandi certificates.

Oct 3 2018, 5:07 PM · System administration
ftigeot added a comment to T1175: renews SSL certificats for {www,}softwareheritage.org.

Updated certificate uploaded to the Puppet repository and internal hosts updated.

Oct 3 2018, 2:43 PM · System administration
ftigeot committed rSPSITEbf45407f6863: data: Update star_softwareheritage_org certificate (authored by ftigeot).
data: Update star_softwareheritage_org certificate
Oct 3 2018, 1:45 PM

Oct 2 2018

ftigeot committed rSPSITE117966345b61: Really pin Elasticsearch packages to 6.3.2 (authored by ftigeot).
Really pin Elasticsearch packages to 6.3.2
Oct 2 2018, 3:21 PM
ftigeot committed rSPSITEaebb91b39d1e: kibana role: Pin version to 5.6.12 (authored by ftigeot).
kibana role: Pin version to 5.6.12
Oct 2 2018, 1:47 PM
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

PCID option removed on some VMs in order to migrate them to orsay.
The current plan is to completely replace louvre with a more recent and reliable machine for the hypervisor functions.

Oct 2 2018, 10:30 AM · System administration

Sep 25 2018

ftigeot changed the status of T1175: renews SSL certificats for {www,}softwareheritage.org from Open to Work in Progress.

Existing CSR data submitted again today to the secret INRIA/Digicert URL.

Sep 25 2018, 4:21 PM · System administration

Sep 21 2018

ftigeot changed the status of T1201: monitor DNS zones on primary/replica to ensure they stay in sync, a subtask of T1179: Create an independent DNS resolver on Azure, from Open to Work in Progress.
Sep 21 2018, 11:22 AM · System administration
ftigeot changed the status of T1201: monitor DNS zones on primary/replica to ensure they stay in sync from Open to Work in Progress.

Right now, the internal.softwareheritage.org zone contains only a single NS record. This is most likely also the case for the various reverse zones.
There is no explicit notification directive in the master server configuration either.

Sep 21 2018, 11:22 AM · System administration