ftigeot (François Tigeot)
User

Projects

User Details

User Since
Sep 6 2017, 1:06 PM (65 w, 5 d)

Recent Activity

Fri, Dec 7

ftigeot added a comment to T1372: Compare Rsnapshot / BorgBackup / Backuppc.

Borgbackup is unable to pull data from remote hosts to a central location.

I do not understand this assertion.

Fri, Dec 7, 10:50 AM · System administration

Tue, Dec 4

ftigeot changed the status of T1428: Create an inventory of useful Munin metrics from Open to Work in Progress.

Disk

  • I/Os per device
  • Disk usage in percent
  • Utilization per device (is this real? It could be useful to see whether a storage subsystem is overloaded)
  • Disk usage in absolute, human-readable values; percentages are meaningless if we resize filesystems
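
The point about percentages can be illustrated with a small sketch. The figures are hypothetical and the helper names `human_size` and `usage_report` are made up for this example; they are not part of Munin. The same amount of data gives a very different percentage once the filesystem is enlarged, while the absolute value stays comparable.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB"):
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == "PiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def usage_report(used_bytes: int, total_bytes: int) -> str:
    """Report usage both in absolute terms and as a percentage."""
    percent = 100 * used_bytes / total_bytes
    return (
        f"{human_size(used_bytes)} used of "
        f"{human_size(total_bytes)} ({percent:.0f}%)"
    )


GIB = 1024 ** 3

# Hypothetical figures: 400 GiB of data, before and after growing
# the filesystem from 500 GiB to 800 GiB.
before = usage_report(400 * GIB, 500 * GIB)
after = usage_report(400 * GIB, 800 * GIB)

print(before)
print(after)
```

The used space is identical in both reports; only the percentage moves, which is why a percentage graph alone can be misleading across a resize.
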
Tue, Dec 4, 4:11 PM · Sprint 2018 12
ftigeot changed the status of T1428: Create an inventory of useful Munin metrics, a subtask of T1408: More/better Metrics, from Open to Work in Progress.
Tue, Dec 4, 4:11 PM · Sprint 2018 12
ftigeot updated subscribers of T1428: Create an inventory of useful Munin metrics.
Tue, Dec 4, 2:46 PM · Sprint 2018 12
ftigeot triaged T1428: Create an inventory of useful Munin metrics as Normal priority.
Tue, Dec 4, 2:45 PM · Sprint 2018 12
ftigeot changed the status of T1372: Compare Rsnapshot / BorgBackup / Backuppc, a subtask of T1282: Revisit backups, from Open to Work in Progress.
Tue, Dec 4, 2:41 PM · System administration
ftigeot changed the status of T1372: Compare Rsnapshot / BorgBackup / Backuppc from Open to Work in Progress.

There is a huge difference between Borgbackup and Rsnapshot + Backuppc: Borgbackup is unable to pull data from remote hosts to a central location.
Its working model is based on Borgbackup running locally and storing data to a local filesystem.

Tue, Dec 4, 2:41 PM · System administration
ftigeot added a comment to T1392: Add a new hypervisor.

New hypervisor hardware has been racked in our bay at Rocquencourt.
The machine's iDrac management interface is accessible on the management network, under the name swh7-adm.inria.fr (details on the wiki).

Tue, Dec 4, 11:56 AM · System administration
ftigeot closed T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres as Resolved.

Service postgresql@10-indexer.service has been restarted on somerset and database replication is once again operating normally.
Postgres wal files are being removed as expected on the master, slowly freeing disk space.

Tue, Dec 4, 11:31 AM · System administration

Mon, Dec 3

ftigeot added a comment to T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres.

Some dump files that were no longer useful were removed by seirl@, freeing some space on somerset:/srv/softwareheritage/postgres.

Mon, Dec 3, 3:19 PM · System administration
ftigeot added a comment to T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres.

somerset:softwareheritage-indexer is the master database for dbreplica1:softwareheritage-indexer.

Mon, Dec 3, 3:17 PM · System administration
ftigeot added a parent task for T1395: Enlarge disk on dbreplica1: T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres.
Mon, Dec 3, 3:11 PM · System administration
ftigeot added a subtask for T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres: T1395: Enlarge disk on dbreplica1.
Mon, Dec 3, 3:11 PM · System administration
ftigeot changed the status of T1404: Resolve disk full issue on somerset:/srv/softwareheritage/postgres from Open to Work in Progress.
Mon, Dec 3, 3:10 PM · System administration
ftigeot closed T1395: Enlarge disk on dbreplica1 as Resolved.

The pvmove command was run this morning.

Mon, Dec 3, 3:07 PM · System administration

Tue, Nov 27

ftigeot added a parent task for T1372: Compare Rsnapshot / BorgBackup / Backuppc: T1282: Revisit backups.
Tue, Nov 27, 4:45 PM · System administration
ftigeot added a subtask for T1282: Revisit backups: T1372: Compare Rsnapshot / BorgBackup / Backuppc.
Tue, Nov 27, 4:45 PM · System administration
ftigeot changed the status of T1392: Add a new hypervisor from Open to Work in Progress.
Tue, Nov 27, 4:42 PM · System administration

Fri, Nov 23

ftigeot added a comment to T1338: Change BBUs on orsay.

At least some of the batteries for PERC H800 adapters use part number KR174 and/or M164C.
Some information leads me to believe they could also be used with PERC H700 adapters.

Fri, Nov 23, 3:20 PM · System administration
ftigeot lowered the priority of T979: Migrate TLS certificates away from the *.softwareheritage.org wildcards from High to Wishlist.

I did some experiments with Letsencrypt, but other things were more urgent during the September-October 2018 period, and in the end a wildcard Digicert certificate was used again instead.

Fri, Nov 23, 3:04 PM · System administration
ftigeot committed rSPSITE33fdc25ae44e: Rsnapshot master role: Exclude file patterns from backups (authored by ftigeot).
Rsnapshot master role: Exclude file patterns from backups
Fri, Nov 23, 2:06 PM

Thu, Nov 22

ftigeot committed rSPSITE57ad56cde817: data/default: Export root@banco's public ssh key (authored by ftigeot).
data/default: Export root@banco's public ssh key
Thu, Nov 22, 3:03 PM

Tue, Nov 20

ftigeot triaged T1372: Compare Rsnapshot / BorgBackup / Backuppc as Normal priority.
Tue, Nov 20, 4:36 PM · System administration
ftigeot committed rSPSITEf5e70254d953: Rsnapshot master role: Do not run rsnapshot hourly every minute (authored by ftigeot).
Rsnapshot master role: Do not run rsnapshot hourly every minute
Tue, Nov 20, 4:09 PM

Fri, Nov 16

ftigeot added a comment to T1338: Change BBUs on orsay.

Batteries for PERC H700 adapters have the part number U8735.

Fri, Nov 16, 3:55 PM · System administration
ftigeot committed rSPSITEe5b5d5b49b94: Rsnapshot master role: last minute fixes (authored by ftigeot).
Rsnapshot master role: last minute fixes
Fri, Nov 16, 2:54 PM
ftigeot committed rSPSITEe740b250680e: Add a new rsnapshot::master role (authored by ftigeot).
Add a new rsnapshot::master role
Fri, Nov 16, 2:22 PM

Thu, Nov 15

ftigeot added a comment to T1338: Change BBUs on orsay.

Orsay contains two LSI SAS 2108-based RAID adapters:

05:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
22:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
Thu, Nov 15, 12:27 PM · System administration
ftigeot added a comment to T1325: Add SSDs to banco.

Since the SSDs we have are 2.5", we need a special adapter disk tray, which Dell refuses to sell us.

Thu, Nov 15, 12:01 PM · System administration

Wed, Nov 14

ftigeot triaged T1340: Automate storage BBUs monitoring as Normal priority.
Wed, Nov 14, 11:59 AM · System administration
ftigeot added a comment to T1338: Change BBUs on orsay.

Related: T1323

Wed, Nov 14, 11:57 AM · System administration
ftigeot triaged T1338: Change BBUs on orsay as Normal priority.
Wed, Nov 14, 11:56 AM · System administration

Tue, Nov 13

ftigeot closed T1323: Check battery status on storage adapters as Resolved.

In summary, only orsay has a failed BBU.
Since it contains two identical RAID adapters with similar, aging BBUs, it could be useful to change both at once.

Tue, Nov 13, 2:56 PM · System administration
ftigeot added a comment to T1323: Check battery status on storage adapters.

List of physical machines at Rocquencourt: louvre beaubourg orsay banco

Tue, Nov 13, 2:53 PM · System administration
ftigeot added a project to T1323: Check battery status on storage adapters: System administration.
Tue, Nov 13, 12:30 PM · System administration
ftigeot added a project to T1325: Add SSDs to banco: System administration.
Tue, Nov 13, 12:30 PM · System administration
ftigeot triaged T1325: Add SSDs to banco as Normal priority.
Tue, Nov 13, 12:27 PM · System administration
ftigeot triaged T1323: Check battery status on storage adapters as High priority.
Tue, Nov 13, 12:16 PM · System administration

Nov 7 2018

ftigeot committed rSENV70336dcb76a7: .mrconfig: Fix a syntax error introduced in 07648123 (authored by ftigeot).
.mrconfig: Fix a syntax error introduced in 07648123
Nov 7 2018, 4:14 PM

Oct 23 2018

ftigeot committed rSPSITE3e371b4d7859: data/banco: exclude new backup tests from dar backups (authored by ftigeot).
data/banco: exclude new backup tests from dar backups
Oct 23 2018, 4:16 PM

Oct 22 2018

ftigeot added a comment to T1282: Revisit backups.

The existing dar(1) based system is not reliable.

Oct 22 2018, 2:36 PM · System administration
ftigeot added a parent task for T1164: Dar backups fill up disk space on client machines: T1282: Revisit backups.
Oct 22 2018, 2:21 PM · System administration
ftigeot added a subtask for T1282: Revisit backups: T1164: Dar backups fill up disk space on client machines.
Oct 22 2018, 2:21 PM · System administration
ftigeot created T1282: Revisit backups.
Oct 22 2018, 2:09 PM · System administration

Oct 19 2018

ftigeot closed T1201: monitor DNS zones on primary/replica to ensure they stay in sync as Resolved.

Icinga2 service monitoring changes pushed in commit rSPSITE76d7d90c51e0, based on the initial script linked by olasd@.
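
As an illustration only, and not the script from the commit above, a primary/replica sync check can be sketched as a comparison of SOA serial numbers. The sketch assumes the SOA rdata for each server has already been fetched as text (for example with a DNS lookup tool). The function names and the `example.org` records are hypothetical. The return codes follow the usual Nagios-style plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
def soa_serial(rdata: str) -> int:
    """Extract the serial from SOA rdata.

    Expected field order: mname rname serial refresh retry expire minimum.
    """
    fields = rdata.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected SOA rdata: {rdata!r}")
    return int(fields[2])


def check_sync(primary_rdata: str, replica_rdata: str):
    """Return (exit_code, message) in Nagios plugin style."""
    try:
        primary = soa_serial(primary_rdata)
        replica = soa_serial(replica_rdata)
    except ValueError as exc:
        return 3, f"UNKNOWN: {exc}"
    if primary == replica:
        return 0, f"OK: serial {primary} on both servers"
    return 2, f"CRITICAL: primary serial {primary}, replica serial {replica}"


# Hypothetical records: the replica is one zone update behind.
PRIMARY = "ns0.example.org. hostmaster.example.org. 2018101902 3600 900 604800 86400"
REPLICA = "ns0.example.org. hostmaster.example.org. 2018101901 3600 900 604800 86400"

code, message = check_sync(PRIMARY, REPLICA)
print(code, message)
```

A real check would probably tolerate a short delay after a zone update before raising an alert; that is left out here.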

Oct 19 2018, 2:38 PM · System administration
ftigeot closed T1201: monitor DNS zones on primary/replica to ensure they stay in sync, a subtask of T1179: Create an independent DNS resolver on Azure, as Resolved.
Oct 19 2018, 2:38 PM · System administration
ftigeot committed rSPSITE76d7d90c51e0: icinga2: Check the SOA field on i.s.o (authored by ftigeot).
icinga2: Check the SOA field on i.s.o
Oct 19 2018, 2:32 PM
ftigeot committed rSPSITE21df6c9533bb: ELK stack: Use a single version constant for all packages (authored by ftigeot).
ELK stack: Use a single version constant for all packages
Oct 19 2018, 10:40 AM
ftigeot closed D548: ELK stack: Use a single version constant for all packages.
Oct 19 2018, 10:40 AM

Oct 18 2018

Herald added a reviewer for D548: ELK stack: Use a single version constant for all packages: Reviewers.
Oct 18 2018, 3:07 PM
ftigeot added a comment to T1273: elasticsearch: about the elk stack policy upgrade?.

Elasticsearch, Logstash and Kibana are now released together, and similar versions are sure to be compatible. It makes sense to have a global Puppet constant defining which general ELK stack version to use for packages.

Oct 18 2018, 12:18 PM · System administration
ftigeot added a comment to T1273: elasticsearch: about the elk stack policy upgrade?.

A quick analysis of the 6.4.x family of versions shows they bring significant bug fixes to the table.
One particularly interesting aspect is the general cluster reliability improvements when nodes leave or come back to the cluster.

Oct 18 2018, 11:40 AM · System administration

Oct 17 2018

ftigeot added a comment to T1273: elasticsearch: about the elk stack policy upgrade?.

Upgrading the Elasticsearch cluster is a somewhat delicate operation, since nodes running old Elasticsearch versions can no longer store new data, but it is not really difficult to handle properly.
The biggest issue could be with Kibana / Elasticsearch interactions: some old Kibana versions are known to stop displaying dashboards when talking to newer Elasticsearch servers.

Oct 17 2018, 3:49 PM · System administration
ftigeot accepted D544: logstash: Pin version to current 6.4.2.

Looks good.

Oct 17 2018, 2:50 PM

Oct 9 2018

ftigeot triaged T1253: Generate correct SOA records for internal.softwareheritage.org. as Normal priority.
Oct 9 2018, 11:36 AM · System administration

Oct 8 2018

ftigeot added a comment to T1201: monitor DNS zones on primary/replica to ensure they stay in sync.

Correct me if I am wrong, but I do not believe the current Puppet code has the ability to handle more than one NS record per zone.
At the very least, I couldn't find an obvious way to add such a record.

Oct 8 2018, 4:29 PM · System administration
ftigeot closed T1175: renews SSL certificats for {www,}softwareheritage.org as Resolved.

All known SSL services now use updated certificates. Closing.

Oct 8 2018, 12:22 PM · System administration

Oct 3 2018

ftigeot added a comment to T1175: renews SSL certificats for {www,}softwareheritage.org.

www and www-dev.softwareheritage.org now use auto-generated Gandi certificates.

Oct 3 2018, 5:07 PM · System administration
ftigeot added a comment to T1175: renews SSL certificats for {www,}softwareheritage.org.

Updated certificate uploaded to the Puppet repository and internal hosts updated.

Oct 3 2018, 2:43 PM · System administration
ftigeot committed rSPSITEbf45407f6863: data: Update star_softwareheritage_org certificate (authored by ftigeot).
data: Update star_softwareheritage_org certificate
Oct 3 2018, 1:45 PM

Oct 2 2018

ftigeot committed rSPSITE117966345b61: Really pin Elasticsearch packages to 6.3.2 (authored by ftigeot).
Really pin Elasticsearch packages to 6.3.2
Oct 2 2018, 3:21 PM
ftigeot committed rSPSITEaebb91b39d1e: kibana role: Pin version to 5.6.12 (authored by ftigeot).
kibana role: Pin version to 5.6.12
Oct 2 2018, 1:47 PM
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

PCID option removed on some VMs in order to migrate them to orsay.
The current plan is to completely replace louvre by a more recent and reliable machine for the hypervisor functions.

Oct 2 2018, 10:30 AM · System administration

Sep 25 2018

ftigeot changed the status of T1175: renews SSL certificats for {www,}softwareheritage.org from Open to Work in Progress.

Existing CSR data submitted again today to the secret INRIA/Digicert URL.

Sep 25 2018, 4:21 PM · System administration

Sep 21 2018

ftigeot changed the status of T1201: monitor DNS zones on primary/replica to ensure they stay in sync, a subtask of T1179: Create an independent DNS resolver on Azure, from Open to Work in Progress.
Sep 21 2018, 11:22 AM · System administration
ftigeot changed the status of T1201: monitor DNS zones on primary/replica to ensure they stay in sync from Open to Work in Progress.

Right now, the internal.softwareheritage.org zone contains only a single NS record. This is most likely also the case for the various reverse zones.
There is no explicit notification directive in the master server configuration either.

Sep 21 2018, 11:22 AM · System administration

Sep 20 2018

ftigeot closed T1200: point azure hosts to DNS running on azure, a subtask of T1179: Create an independent DNS resolver on Azure, as Resolved.
Sep 20 2018, 4:18 PM · System administration
ftigeot closed T1200: point azure hosts to DNS running on azure as Resolved.

Puppet configuration changed in rSPSITE62784f5462586adb44541b6382b41c1863f8938c.
Changes applied to Azure hosts.

Sep 20 2018, 4:18 PM · System administration
ftigeot committed rSPSITE62784f546258: DNS resolvers: Make forward_zones location-specific (authored by ftigeot).
DNS resolvers: Make forward_zones location-specific
Sep 20 2018, 2:36 PM

Sep 18 2018

ftigeot closed T1179: Create an independent DNS resolver on Azure as Resolved.

Task finished by @olasd.

Sep 18 2018, 11:08 AM · System administration
ftigeot closed T1179: Create an independent DNS resolver on Azure, a subtask of T1178: Make Azure infrastructure independent from Rocquencourt, as Resolved.
Sep 18 2018, 11:08 AM · System administration

Sep 17 2018

ftigeot committed rSPSITEbc559a13ddd9: Nameserver: allow zone transfers from 192.168.101.0/24 (authored by ftigeot).
Nameserver: allow zone transfers from 192.168.101.0/24
Sep 17 2018, 11:14 AM
ftigeot committed rSPSITE032132f6bb0f: Nameservers: allow zone transfers from 192.168.100.0/24 (authored by ftigeot).
Nameservers: allow zone transfers from 192.168.100.0/24
Sep 17 2018, 10:59 AM

Sep 13 2018

ftigeot committed rSPSITE3e6856c59c89: moma storage db: Temporarily use prado (authored by ftigeot).
moma storage db: Temporarily use prado
Sep 13 2018, 11:38 AM

Sep 12 2018

ftigeot added a comment to T1166: Split up pergamon to smaller VMs.

An Apache instance on pergamon is providing http and/or https services for the following hosts:

  • annex.softwareheritage.org_non-ssl
  • debian.softwareheritage.org
  • docs.softwareheritage.org
  • grafana.softwareheritage.org
  • icinga.softwareheritage.org
  • pergamon:8140 (puppet)
Sep 12 2018, 4:01 PM · System administration
ftigeot claimed T979: Migrate TLS certificates away from the *.softwareheritage.org wildcards.
Sep 12 2018, 3:39 PM · System administration
ftigeot added a subtask for T1175: renews SSL certificats for {www,}softwareheritage.org: T979: Migrate TLS certificates away from the *.softwareheritage.org wildcards.
Sep 12 2018, 3:39 PM · System administration
ftigeot added a parent task for T979: Migrate TLS certificates away from the *.softwareheritage.org wildcards: T1175: renews SSL certificats for {www,}softwareheritage.org.
Sep 12 2018, 3:39 PM · System administration

Sep 10 2018

ftigeot closed T1176: Enable NUMA and PCID options on all VMs, a subtask of T1173: Huge slowdowns on louvre since 2018-08-20, as Resolved.
Sep 10 2018, 4:33 PM · System administration
ftigeot closed T1176: Enable NUMA and PCID options on all VMs as Resolved.

All VMs restarted with PCID and NUMA flags.

Sep 10 2018, 4:33 PM · System administration

Sep 6 2018

ftigeot changed the status of T1179: Create an independent DNS resolver on Azure, a subtask of T1178: Make Azure infrastructure independent from Rocquencourt, from Open to Work in Progress.
Sep 6 2018, 12:33 PM · System administration
ftigeot changed the status of T1179: Create an independent DNS resolver on Azure from Open to Work in Progress.

ns0 VM created on Azure.

Sep 6 2018, 12:33 PM · System administration
ftigeot triaged T1179: Create an independent DNS resolver on Azure as High priority.
Sep 6 2018, 11:17 AM · System administration
ftigeot triaged T1178: Make Azure infrastructure independent from Rocquencourt as Normal priority.
Sep 6 2018, 11:09 AM · System administration

Sep 5 2018

ftigeot changed the status of T1176: Enable NUMA and PCID options on all VMs from Open to Work in Progress.

All worker VMs on beaubourg restarted with the same settings.

Sep 5 2018, 5:02 PM · System administration
ftigeot changed the status of T1176: Enable NUMA and PCID options on all VMs, a subtask of T1173: Huge slowdowns on louvre since 2018-08-20, from Open to Work in Progress.
Sep 5 2018, 5:02 PM · System administration
ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

All worker VMs on louvre restarted with NUMA and PCID flags.
They were resized from 16 to 12 GB of RAM and from 4 to 3 CPU cores in order to waste fewer hypervisor resources.

Sep 5 2018, 4:14 PM · System administration
ftigeot renamed T1176: Enable NUMA and PCID options on all VMs from Enable NUMA option on all VMs to Enable NUMA and PCID options on all VMs.
Sep 5 2018, 2:39 PM · System administration
ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on orsay, for reference:

                           node0           node1           node2           node3
numa_hit               154258622       106196783       173789251       218914560
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              6864            6821            6872            6826
local_node             154248017       106178817       173773345       218903410
other_node                 10605           17966           15906           11150
Sep 5 2018, 11:49 AM · System administration
ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on beaubourg, for reference:

                           node0           node1
numa_hit             24194141993     34805632693
numa_miss             6528825760       313114704
numa_foreign           313114704      6528825760
interleave_hit             44068           43188
local_node           24194499550     34805370119
other_node            6528468203       313377278
Sep 5 2018, 11:42 AM · System administration
ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on louvre, for reference:

                           node0           node1           node2           node3
numa_hit             13497023257     14081211989     14852512306     17957276918
numa_miss             8599494310      7372048640      2126471863      2510901890
numa_foreign          4832232163      2059616197      4052656329      9664412014
interleave_hit             21033           20998           21022           20991
local_node           13497008321     14081146133     14852428849     17957202370
other_node            8599509246      7372114496      2126555320      2510976437
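
For reading these counters, a small sketch like the following can help. It parses the numastat table above and computes, per node, the share of allocations that landed there although another node was preferred, that is `numa_miss / (numa_hit + numa_miss)`. The function names are hypothetical, and this ratio is just one convenient summary, not an official numastat metric.

```python
LOUVRE = """
                           node0           node1           node2           node3
numa_hit             13497023257     14081211989     14852512306     17957276918
numa_miss             8599494310      7372048640      2126471863      2510901890
numa_foreign          4832232163      2059616197      4052656329      9664412014
interleave_hit             21033           20998           21022           20991
local_node           13497008321     14081146133     14852428849     17957202370
other_node            8599509246      7372114496      2126555320      2510976437
"""


def parse_numastat(text: str) -> dict:
    """Turn numastat output into {counter: {node: value}}."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    nodes = rows[0]  # header row: node0 node1 ...
    counters = {}
    for fields in rows[1:]:
        name, values = fields[0], [int(v) for v in fields[1:]]
        counters[name] = dict(zip(nodes, values))
    return counters


def miss_ratio(counters: dict) -> dict:
    """Per-node share of allocations that were intended for another node."""
    ratios = {}
    for node, hits in counters["numa_hit"].items():
        misses = counters["numa_miss"][node]
        total = hits + misses
        ratios[node] = misses / total if total else 0.0
    return ratios


counters = parse_numastat(LOUVRE)
ratios = miss_ratio(counters)
for node, ratio in ratios.items():
    print(f"{node}: {ratio:.1%} of allocations were NUMA misses")
```

On the orsay figures, where `numa_miss` is zero on every node, the same computation yields 0 for all nodes.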
Sep 5 2018, 11:38 AM · System administration
ftigeot triaged T1176: Enable NUMA and PCID options on all VMs as Unbreak Now! priority.
Sep 5 2018, 11:01 AM · System administration

Sep 4 2018

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

Moma storage migrated from Ceph to SSD storage on Beaubourg.
CPU and memory sizes were way overkill and have been cut in half.
If this VM has to be migrated to louvre again, it will definitely require fewer hypervisor resources.

Sep 4 2018, 4:10 PM · System administration
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

Kibana0 migrated from Ceph to local SSD storage on louvre.
It was the last VM running on louvre that was still experiencing visible I/O wait.

Sep 4 2018, 3:39 PM · System administration
ftigeot committed rSPSITE82aa877ec064: dbreplica1.euwest.azure: Make Postgres listen on default port (authored by ftigeot).
dbreplica1.euwest.azure: Make Postgres listen on default port
Sep 4 2018, 2:11 PM
ftigeot changed the status of T1166: Split up pergamon to smaller VMs from Open to Work in Progress.
Sep 4 2018, 12:03 PM · System administration
ftigeot changed the status of T1166: Split up pergamon to smaller VMs, a subtask of T1173: Huge slowdowns on louvre since 2018-08-20, from Open to Work in Progress.
Sep 4 2018, 12:03 PM · System administration

Sep 3 2018

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

The inability of Ceph storage to sustain random I/O workloads doesn't explain all issues on louvre: many VMs immediately experience huge performance improvements when migrated to beaubourg while keeping the same storage backend.

Sep 3 2018, 2:29 PM · System administration
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

munin0 disk image moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.

Sep 3 2018, 2:26 PM · System administration
ftigeot changed the status of T1173: Huge slowdowns on louvre since 2018-08-20 from Open to Work in Progress.

Some of the slow-downs are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:

  • Most VMs suffer from I/O wait issues since August 20, 2018
  • Ceph nodes are not limited by network bandwidth and only sustain about 120Mb/s of peak bandwidth
  • Ceph nodes suffer from I/O wait

The last point is not very surprising since their storage mostly consists of rotating disk drives.

Sep 3 2018, 2:25 PM · System administration