Sep 20 2018

ftigeot closed T1200: point azure hosts to DNS running on azure, a subtask of T1179: Create an independent DNS resolver on Azure, as Resolved.
Sep 20 2018, 4:18 PM · System administration
ftigeot closed T1200: point azure hosts to DNS running on azure as Resolved.

Puppet configuration changed in rSPSITE62784f5462586adb44541b6382b41c1863f8938c.
Changes applied to Azure hosts.

Sep 20 2018, 4:18 PM · System administration
ftigeot committed rSPSITE62784f546258: DNS resolvers: Make forward_zones location-specific (authored by ftigeot).
DNS resolvers: Make forward_zones location-specific
Sep 20 2018, 2:36 PM

Sep 18 2018

ftigeot closed T1179: Create an independent DNS resolver on Azure as Resolved.

Task finished by @olasd.

Sep 18 2018, 11:08 AM · System administration
ftigeot closed T1179: Create an independent DNS resolver on Azure, a subtask of T1178: Make Azure infrastructure independent from Rocquencourt, as Resolved.
Sep 18 2018, 11:08 AM · System administration

Sep 17 2018

ftigeot committed rSPSITEbc559a13ddd9: Nameserver: allow zone transfers from 192.168.101.0/24 (authored by ftigeot).
Nameserver: allow zone transfers from 192.168.101.0/24
Sep 17 2018, 11:14 AM
ftigeot committed rSPSITE032132f6bb0f: Nameservers: allow zone transfers from 192.168.100.0/24 (authored by ftigeot).
Nameservers: allow zone transfers from 192.168.100.0/24
Sep 17 2018, 10:59 AM
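
The two commits above extend the set of networks allowed to request zone transfers. As a rough illustration of the membership check such an access list implies (not the nameserver's actual implementation), the sketch below tests an address against both subnets. The function name `may_transfer` is hypothetical; only the two subnets come from the commit titles.

```python
import ipaddress

# Subnets taken from the two commit titles above.
ALLOWED_TRANSFER_NETWORKS = [
    ipaddress.ip_network("192.168.100.0/24"),
    ipaddress.ip_network("192.168.101.0/24"),
]


def may_transfer(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the allowed subnets."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_TRANSFER_NETWORKS)
```

A secondary nameserver whose address lies in either /24 would pass this check; any other address would be refused.
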

Sep 13 2018

ftigeot committed rSPSITE3e6856c59c89: moma storage db: Temporarily use prado (authored by ftigeot).
moma storage db: Temporarily use prado
Sep 13 2018, 11:38 AM

Sep 12 2018

ftigeot added a comment to T1166: Split up pergamon to smaller VMs.

An Apache instance on pergamon is providing http and/or https services for the following hosts:

  • annex.softwareheritage.org_non-ssl
  • debian.softwareheritage.org
  • docs.softwareheritage.org
  • grafana.softwareheritage.org
  • icinga.softwareheritage.org
  • pergamon:8140 (puppet)
Sep 12 2018, 4:01 PM · System administration
ftigeot claimed T979: Migrate TLS certificates away from the *.softwareheritage.org wildcards.
Sep 12 2018, 3:39 PM · System administration
ftigeot added a subtask for T1175: renews SSL certificats for {www,}softwareheritage.org: T979: Migrate TLS certificates away from the *.softwareheritage.org wildcards.
Sep 12 2018, 3:39 PM · System administration
ftigeot added a parent task for T979: Migrate TLS certificates away from the *.softwareheritage.org wildcards: T1175: renews SSL certificats for {www,}softwareheritage.org.
Sep 12 2018, 3:39 PM · System administration

Sep 10 2018

ftigeot closed T1176: Enable NUMA and PCID options on all VMs, a subtask of T1173: Huge slowdowns on louvre since 2018-08-20, as Resolved.
Sep 10 2018, 4:33 PM · System administration
ftigeot closed T1176: Enable NUMA and PCID options on all VMs as Resolved.

All VMs restarted with PCID and NUMA flags.

Sep 10 2018, 4:33 PM · System administration

Sep 6 2018

ftigeot changed the status of T1179: Create an independent DNS resolver on Azure, a subtask of T1178: Make Azure infrastructure independent from Rocquencourt, from Open to Work in Progress.
Sep 6 2018, 12:33 PM · System administration
ftigeot changed the status of T1179: Create an independent DNS resolver on Azure from Open to Work in Progress.

ns0 VM created on Azure.

Sep 6 2018, 12:33 PM · System administration
ftigeot triaged T1179: Create an independent DNS resolver on Azure as High priority.
Sep 6 2018, 11:17 AM · System administration
ftigeot triaged T1178: Make Azure infrastructure independent from Rocquencourt as Normal priority.
Sep 6 2018, 11:09 AM · System administration

Sep 5 2018

ftigeot changed the status of T1176: Enable NUMA and PCID options on all VMs from Open to Work in Progress.

All worker VMs on beaubourg restarted with the same settings.

Sep 5 2018, 5:02 PM · System administration
ftigeot changed the status of T1176: Enable NUMA and PCID options on all VMs, a subtask of T1173: Huge slowdowns on louvre since 2018-08-20, from Open to Work in Progress.
Sep 5 2018, 5:02 PM · System administration
ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

All worker VMs on louvre restarted with NUMA and PCID flags.
They were resized from 16 to 12 GB of RAM and from 4 to 3 CPU cores in order to waste fewer hypervisor resources.

Sep 5 2018, 4:14 PM · System administration
ftigeot renamed T1176: Enable NUMA and PCID options on all VMs from Enable NUMA option on all VMs to Enable NUMA and PCID options on all VMs.
Sep 5 2018, 2:39 PM · System administration
ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on orsay, for reference:

                           node0           node1           node2           node3
numa_hit               154258622       106196783       173789251       218914560
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit              6864            6821            6872            6826
local_node             154248017       106178817       173773345       218903410
other_node                 10605           17966           15906           11150
Sep 5 2018, 11:49 AM · System administration
ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on beaubourg, for reference:

                           node0           node1
numa_hit             24194141993     34805632693
numa_miss             6528825760       313114704
numa_foreign           313114704      6528825760
interleave_hit             44068           43188
local_node           24194499550     34805370119
other_node            6528468203       313377278
Sep 5 2018, 11:42 AM · System administration
ftigeot added a comment to T1176: Enable NUMA and PCID options on all VMs.

numastat output on louvre, for reference:

                           node0           node1           node2           node3
numa_hit             13497023257     14081211989     14852512306     17957276918
numa_miss             8599494310      7372048640      2126471863      2510901890
numa_foreign          4832232163      2059616197      4052656329      9664412014
interleave_hit             21033           20998           21022           20991
local_node           13497008321     14081146133     14852428849     17957202370
other_node            8599509246      7372114496      2126555320      2510976437
Sep 5 2018, 11:38 AM · System administration
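
The numastat outputs above are easier to compare as ratios. The sketch below is one possible way to do that: it parses numastat-style text and computes, for each node, the share of its allocations that were originally intended for another node (`numa_miss` relative to `numa_hit + numa_miss`). The helper names are hypothetical, and the sample embeds only two counters copied from the louvre output.

```python
def parse_numastat(text: str) -> dict:
    """Parse numastat output into {counter: {node: value}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    nodes = lines[0].split()
    stats = {}
    for line in lines[1:]:
        name, *values = line.split()
        stats[name] = dict(zip(nodes, (int(v) for v in values)))
    return stats


def miss_ratio(stats: dict) -> dict:
    """Per node: numa_miss / (numa_hit + numa_miss)."""
    ratios = {}
    for node, hits in stats["numa_hit"].items():
        misses = stats["numa_miss"][node]
        total = hits + misses
        ratios[node] = misses / total if total else 0.0
    return ratios


# Two counters copied from the louvre output above.
LOUVRE = """
                           node0           node1           node2           node3
numa_hit             13497023257     14081211989     14852512306     17957276918
numa_miss             8599494310      7372048640      2126471863      2510901890
"""

for node, ratio in miss_ratio(parse_numastat(LOUVRE)).items():
    print(f"{node}: {ratio:.1%} of allocations were intended for another node")
```

Feeding the orsay output through the same functions would give a ratio of zero on every node, since its `numa_miss` counters are all 0.
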
ftigeot triaged T1176: Enable NUMA and PCID options on all VMs as Unbreak Now! priority.
Sep 5 2018, 11:01 AM · System administration

Sep 4 2018

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

Moma storage migrated from Ceph to SSD storage on Beaubourg.
CPU and memory sizes were overkill and have been cut in half.
If this VM has to be migrated to louvre again, it will definitely require fewer hypervisor resources.

Sep 4 2018, 4:10 PM · System administration
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

Kibana0 migrated from Ceph to local SSD storage on louvre.
It was the last VM on louvre that was still experiencing visible I/O wait.

Sep 4 2018, 3:39 PM · System administration
ftigeot committed rSPSITE82aa877ec064: dbreplica1.euwest.azure: Make Postgres listen on default port (authored by ftigeot).
dbreplica1.euwest.azure: Make Postgres listen on default port
Sep 4 2018, 2:11 PM
ftigeot changed the status of T1166: Split up pergamon to smaller VMs from Open to Work in Progress.
Sep 4 2018, 12:03 PM · System administration
ftigeot changed the status of T1166: Split up pergamon to smaller VMs, a subtask of T1173: Huge slowdowns on louvre since 2018-08-20, from Open to Work in Progress.
Sep 4 2018, 12:03 PM · System administration

Sep 3 2018

ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

The inability of Ceph storage to sustain random I/O workloads doesn't explain all issues on louvre: many VMs immediately experience huge performance improvements when migrated to beaubourg while keeping the same storage backend.

Sep 3 2018, 2:29 PM · System administration
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

munin0 disk image moved to local SSD storage on beaubourg; I/O wait numbers have vastly decreased.

Sep 3 2018, 2:26 PM · System administration
ftigeot changed the status of T1173: Huge slowdowns on louvre since 2018-08-20 from Open to Work in Progress.

Some of the slow-downs are definitely I/O-related and caused by the switch to Ceph for VM disk image storage:

  • Most VMs suffer from I/O wait issues since August 20, 2018
  • Ceph nodes are not limited by network bandwidth, yet only sustain about 120Mb/s of peak bandwidth
  • Ceph nodes suffer from I/O wait

The last point is not very surprising since their storage mostly consists of rotating disk drives.

Sep 3 2018, 2:25 PM · System administration
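
The I/O wait observations above can be quantified from the aggregate `cpu` line of `/proc/stat` on Linux. The following is a minimal sketch under that assumption, not the tooling actually used for this investigation; the function names are hypothetical. It takes the text of two samples and returns the fraction of CPU time spent in iowait between them.

```python
def cpu_times(proc_stat_text: str) -> list:
    """Return the aggregate CPU counters from the 'cpu' line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_share(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between two /proc/stat samples."""
    first, second = cpu_times(before), cpu_times(after)
    deltas = [b - a for a, b in zip(first, second)]
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal, ...
    # Guest time is already included in user/nice, so only the first eight
    # fields are summed to avoid counting it twice.
    total = sum(deltas[:8])
    return deltas[4] / total if total else 0.0
```

On a Linux host, the two samples would come from reading `/proc/stat` a few seconds apart.
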

Aug 31 2018

ftigeot closed T1000: Reindex old data on banco to put it into swh_worker indexes as Resolved.

All documents reindexed.
Some legacy logstash-* indexes containing tens of thousands of invalid documents have been kept for further analysis.

Aug 31 2018, 5:37 PM · System administration
ftigeot closed T1000: Reindex old data on banco to put it into swh_worker indexes, a subtask of T792: Make the elasticsearch logging cluster actually a cluster, as Resolved.
Aug 31 2018, 5:37 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a parent task for T1166: Split up pergamon to smaller VMs: T1173: Huge slowdowns on louvre since 2018-08-20.
Aug 31 2018, 4:23 PM · System administration
ftigeot added a subtask for T1173: Huge slowdowns on louvre since 2018-08-20: T1166: Split up pergamon to smaller VMs.
Aug 31 2018, 4:23 PM · System administration
ftigeot added a comment to T1168: Move away the Munin service from Pergamon.

Some existing Puppet manifests were causing important Apache modules to be removed from former Munin machines no longer providing Munin services.
This briefly caused all http/https services on Pergamon to fail but has since been fixed in rSPSITEa1ecf58d5d157d94bcb019a432e221da9a798f34.

Aug 31 2018, 11:00 AM · System administration
ftigeot closed T1168: Move away the Munin service from Pergamon as Resolved.
  • munin0 VM created
  • munin service added to munin0, with munstrap alternative template
  • legacy pergamon data copied to munin0
  • Puppet manifests changed to use munin0 instead of pergamon for the Munin service
  • Many useless statistics removed in order to reduce I/O load on the new VM
  • All of that was done while pairing with ardumont@
Aug 31 2018, 10:57 AM · System administration
ftigeot closed T1168: Move away the Munin service from Pergamon, a subtask of T1166: Split up pergamon to smaller VMs, as Resolved.
Aug 31 2018, 10:57 AM · System administration
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

logstash0 migrated to beaubourg as well (complete shutdown and restart included).

Aug 31 2018, 10:45 AM · System administration
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

One of the impacted VMs is logstash0:

[Mon Aug 20 15:45:38 2018] sd 2:0:0:0: [sda] tag#0 abort
...
[Tue Aug 21 02:38:39 2018] INFO: task java:1321 blocked for more than 120 seconds.
[Tue Aug 21 02:38:39 2018]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u3
...
[Fri Aug 24 13:26:00 2018] sd 2:0:0:0: [sda] tag#108 abort
...
Aug 31 2018, 10:39 AM · System administration
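
Kernel messages like the ones quoted above can be picked out of a longer log mechanically. The sketch below is one hypothetical way to do so: it recognises the two kinds of lines shown in the excerpt (SCSI aborts and hung-task warnings) and ignores everything else. The pattern and the labels `scsi_abort` and `hung_task` are assumptions made for illustration.

```python
import re

# Matches "[Mon Aug 20 15:45:38 2018] message". The opening bracket is
# optional so that lines which lost their "[" still parse.
LINE_RE = re.compile(
    r"^\[?(?P<when>\w{3} \w{3} +\d+ [\d:]{8} \d{4})\]\s+(?P<msg>.*)$"
)


def classify(line: str):
    """Return (timestamp, kind) for storage-related kernel lines, else None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    message = match.group("msg")
    if "blocked for more than" in message:
        return match.group("when"), "hung_task"
    if message.startswith("sd ") and message.endswith("abort"):
        return match.group("when"), "scsi_abort"
    return None
```

Applied to the output of `dmesg -T` on a VM, the collected timestamps give a quick picture of when the storage stalls occurred.
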

Aug 30 2018

ftigeot committed rSPSITEa1ecf58d5d15: Pergamon role: :apache::mod::rewrite MUST be included (authored by ftigeot).
Pergamon role: :apache::mod::rewrite MUST be included
Aug 30 2018, 4:37 PM
ftigeot committed rSPSITE467881fbfadd: Pergamon role: remove munin-master service (authored by ftigeot).
Pergamon role: remove munin-master service
Aug 30 2018, 3:59 PM
ftigeot committed rSPSITE2bfec1840036: Munin node: Stop allowing Pergamon connections (authored by ftigeot).
Munin node: Stop allowing Pergamon connections
Aug 30 2018, 3:59 PM
ftigeot committed rSPSITE6ff03f261a21: Munin master: switch to munin0.internal (authored by ftigeot).
Munin master: switch to munin0.internal
Aug 30 2018, 3:42 PM
ftigeot committed rSPSITEc0edc18fa018: Munin node: remove more borderline useless data (authored by ftigeot).
Munin node: remove more borderline useless data
Aug 30 2018, 2:07 PM
ftigeot added a comment to T1173: Huge slowdowns on louvre since 2018-08-20.

At least three important changes were made on 2018-08-20:

  • Uffizi has been converted from a QEMU VM to an LXC container
  • VM storage has been migrated to a Ceph backend
  • The Proxmox PVE Linux kernel has been updated
Aug 30 2018, 11:08 AM · System administration
ftigeot triaged T1173: Huge slowdowns on louvre since 2018-08-20 as Normal priority.
Aug 30 2018, 11:06 AM · System administration

Aug 29 2018

ftigeot committed rSPSITE03d62de868ee: Munin node: remove more useless statistics by default (authored by ftigeot).
Munin node: remove more useless statistics by default
Aug 29 2018, 5:49 PM
ftigeot committed rSPSITEda7f7d8afd31: Azure euwest: some DNS forwarders do not support DNSSEC (authored by ftigeot).
Azure euwest: some DNS forwarders do not support DNSSEC
Aug 29 2018, 1:16 PM

Aug 28 2018

ftigeot committed rSPSITEecd79a590cdd: Git loader: decrease concurrency to 2 (authored by ftigeot).
Git loader: decrease concurrency to 2
Aug 28 2018, 3:49 PM
ftigeot committed rSPSITEcf2bd41d01c4: Deposit backend workers: increase concurrency to 8 (authored by ftigeot).
Deposit backend workers: increase concurrency to 8
Aug 28 2018, 3:44 PM
ftigeot committed rSPSITEbd51a2fecd05: Scheduler backend workers: increase concurrency to 16 (authored by ftigeot).
Scheduler backend workers: increase concurrency to 16
Aug 28 2018, 3:38 PM
ftigeot committed rSPSITEf78b9e1d1b64: Munin master: Stop auto-generating CNAME records (authored by ftigeot).
Munin master: Stop auto-generating CNAME records
Aug 28 2018, 2:56 PM
ftigeot committed rSPSITE782fabad1b99: Munin: Allow munin0 connections to nodes (authored by ftigeot).
Munin: Allow munin0 connections to nodes
Aug 28 2018, 2:38 PM
ftigeot committed rSPSITE3b5b70f73d6e: swh_munin_master role: No need for public IP address (authored by ftigeot).
swh_munin_master role: No need for public IP address
Aug 28 2018, 2:27 PM
ftigeot committed rSPSITE262f6e29e433: Fix munin_master role (authored by ftigeot).
Fix munin_master role
Aug 28 2018, 2:19 PM
ftigeot committed rSPSITE586486b4d057: Add a munin_master role (authored by ftigeot).
Add a munin_master role
Aug 28 2018, 2:15 PM
ftigeot committed rSPSITE680470cbc810: data/defaults.yaml: Add munin0.internal.softwareheritge.org (authored by ftigeot).
data/defaults.yaml: Add munin0.internal.softwareheritge.org
Aug 28 2018, 1:14 PM

Aug 27 2018

ftigeot triaged T1168: Move away the Munin service from Pergamon as Normal priority.
Aug 27 2018, 2:29 PM · System administration

Aug 24 2018

ftigeot renamed T1166: Split up pergamon to smaller VMs from Split up pergamon in smaller VMs to Split up pergamon to smaller VMs.
Aug 24 2018, 3:01 PM · System administration
ftigeot triaged T1166: Split up pergamon to smaller VMs as Normal priority.
Aug 24 2018, 2:48 PM · System administration
ftigeot closed T1165: Fix lack of disk space on louvre:/ as Resolved.

The current situation looks good enough for now, closing.

Aug 24 2018, 2:37 PM · System administration
ftigeot closed T1165: Fix lack of disk space on louvre:/, a subtask of T1164: Dar backups fill up disk space on client machines, as Resolved.
Aug 24 2018, 2:37 PM · System administration

Aug 23 2018

ftigeot changed the status of T1165: Fix lack of disk space on louvre:/ from Open to Work in Progress.

Root filesystem successfully resized:

lvextend -v -L+20G /dev/vg-louvre/louvre-root
resize2fs /dev/mapper/vg--louvre-louvre--root
Aug 23 2018, 2:18 PM · System administration
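
After a resize like the one above, the free space on the root filesystem can be checked programmatically. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the 5 GiB threshold are hypothetical and not part of the original task.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str = "/") -> float:
    """Free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / GIB


def needs_attention(path: str = "/", minimum_gib: float = 5.0) -> bool:
    """True when less than minimum_gib of free space remains (assumed threshold)."""
    return free_gib(path) < minimum_gib
```

A monitoring job could call `needs_attention("/")` periodically and raise an alert before backups fill the disk again.
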
ftigeot changed the status of T1165: Fix lack of disk space on louvre:/, a subtask of T1164: Dar backups fill up disk space on client machines, from Open to Work in Progress.
Aug 23 2018, 2:18 PM · System administration
ftigeot triaged T1165: Fix lack of disk space on louvre:/ as High priority.
Aug 23 2018, 1:49 PM · System administration

Aug 20 2018

ftigeot triaged T1164: Dar backups fill up disk space on client machines as High priority.
Aug 20 2018, 11:36 AM · System administration
ftigeot added a comment to T1000: Reindex old data on banco to put it into swh_worker indexes.

logstash appears to be a good substitute for the failing Elasticsearch reindex API.
It naturally skips documents whose contents appear to be invalid when sent to an Elasticsearch cluster.

Aug 20 2018, 11:36 AM · System administration
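
The skip-and-continue behaviour described above can be illustrated without any Elasticsearch or logstash code. The sketch below is only a model of the idea, not of how logstash works internally: each document is handed to an indexing callback, and documents the callback rejects are set aside instead of aborting the whole run. `reindex`, `strict_index` and the `message` field are hypothetical names.

```python
def reindex(documents, index_document):
    """Send each document to index_document; collect the ones it rejects."""
    indexed, skipped = 0, []
    for document in documents:
        try:
            index_document(document)
        except ValueError as error:
            # Keep the rejected document for later analysis and move on.
            skipped.append((document, str(error)))
            continue
        indexed += 1
    return indexed, skipped


def strict_index(document):
    """Toy stand-in for a cluster that rejects malformed documents."""
    if not isinstance(document.get("message"), str):
        raise ValueError("missing or non-string 'message' field")
```

Keeping the rejected documents, rather than discarding them, leaves room for the kind of follow-up analysis a failed reindex usually calls for.
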

Aug 3 2018

ftigeot closed T1126: Move away non-gunicorn services from banco as Resolved.

A new Kibana-dedicated VM, kibana0.internal.softwareheritage.org, has also been created.
Closing.

Aug 3 2018, 5:42 PM · System administration
ftigeot added a comment to T1048: Clean striped object storages from objects they should not be containing.

Given Uffizi already uses 64GB of RAM (more than some physical machines), this should be a no-brainer.
I am not sure if this would really improve I/O performance, though.

Aug 3 2018, 3:03 PM · Object storage

Aug 2 2018

ftigeot committed rSPSITE154ffdabc829: kibana role: Add configuration file (authored by ftigeot).
kibana role: Add configuration file
Aug 2 2018, 4:06 PM
ftigeot committed rSPSITEf483b3c33c2b: groups: Add anlambert to swhwebapp (authored by ftigeot).
groups: Add anlambert to swhwebapp
Aug 2 2018, 2:20 PM

Aug 1 2018

ftigeot committed rSPSITE84fa04e727e0: kibana role: trivial bugfix (authored by ftigeot).
kibana role: trivial bugfix
Aug 1 2018, 2:32 PM
ftigeot committed rSPSITEcbfbe17d06ac: Add a kibana role (authored by ftigeot).
Add a kibana role
Aug 1 2018, 2:24 PM
ftigeot committed rSPSITEb09ce754d02e: dns: Add kibana0.internal.softwareheritage.org (authored by ftigeot).
dns: Add kibana0.internal.softwareheritage.org
Aug 1 2018, 1:30 PM
ftigeot changed the status of T1126: Move away non-gunicorn services from banco from Open to Work in Progress.

Logstash service moved to a new VM, logstash0.internal.softwareheritage.org.

Aug 1 2018, 11:13 AM · System administration

Jul 31 2018

ftigeot committed rSPSITEa8b60051ddc6: Apply the swh_lsi_storage_adapter role to all relevant hosts (authored by ftigeot).
Apply the swh_lsi_storage_adapter role to all relevant hosts
Jul 31 2018, 4:29 PM
ftigeot added a parent task for T1028: deposit: Push logs to elasticsearch: T791: Ship more logs to logstash/elasticsearch.
Jul 31 2018, 4:19 PM · SWORD deposit
ftigeot added a subtask for T791: Ship more logs to logstash/elasticsearch: T1028: deposit: Push logs to elasticsearch.
Jul 31 2018, 4:19 PM · System administration
ftigeot added a parent task for T1005: webapp: Push logs to elasticsearch cluster: T791: Ship more logs to logstash/elasticsearch.
Jul 31 2018, 4:18 PM · System administration, Web app
ftigeot added a subtask for T791: Ship more logs to logstash/elasticsearch: T1005: webapp: Push logs to elasticsearch cluster.
Jul 31 2018, 4:18 PM · System administration
ftigeot changed the status of T792: Make the elasticsearch logging cluster actually a cluster, a subtask of T1028: deposit: Push logs to elasticsearch, from Open to Work in Progress.
Jul 31 2018, 4:13 PM · SWORD deposit
ftigeot changed the status of T792: Make the elasticsearch logging cluster actually a cluster from Open to Work in Progress.
Jul 31 2018, 4:13 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot changed the status of T792: Make the elasticsearch logging cluster actually a cluster, a subtask of T1005: webapp: Push logs to elasticsearch cluster, from Open to Work in Progress.
Jul 31 2018, 4:13 PM · System administration, Web app
ftigeot changed the status of T792: Make the elasticsearch logging cluster actually a cluster, a subtask of T986: Scheduler: Automate completed oneshot or disabled recurring tasks archival, from Open to Work in Progress.
Jul 31 2018, 4:13 PM · Scheduling utilities
ftigeot closed T1127: dbreplica1 2018-06-30 event postmortem as Resolved.

No new problems noticed for about a month; closing.

Jul 31 2018, 4:03 PM · Web app, System administration
ftigeot closed T1127: dbreplica1 2018-06-30 event postmortem, a subtask of T1069: fully host the web UI on Azure, as Resolved.
Jul 31 2018, 4:03 PM · Web app, System administration
ftigeot added a comment to T791: Ship more logs to logstash/elasticsearch.

With the new VM and its logstash-6.3.2 service,

/var/log/apache2/archive.softwareheritage.org_non-ssl_access.log

contents are now successfully stored into Elasticsearch indexes.

Jul 31 2018, 4:02 PM · System administration
ftigeot closed T1160: Create a dedicated logstash VM as Resolved.

A 6.3.x version of logstash is now running on logstash0.internal.softwareheritage.org, a brand new VM created on Proxmox/louvre.

Jul 31 2018, 2:38 PM · System administration
ftigeot closed T1160: Create a dedicated logstash VM, a subtask of T791: Ship more logs to logstash/elasticsearch, as Resolved.
Jul 31 2018, 2:38 PM · System administration
ftigeot closed T1160: Create a dedicated logstash VM, a subtask of T1126: Move away non-gunicorn services from banco, as Resolved.
Jul 31 2018, 2:38 PM · System administration
ftigeot committed rSPSITE36ecf9cf4631: Elastic packages: Upgrade to version 6.3.2 (authored by ftigeot).
Elastic packages: Upgrade to version 6.3.2
Jul 31 2018, 2:05 PM
ftigeot committed rSPSITEdd33071544f5: dns: Make logstash0 the new default logstash instance (authored by ftigeot).
dns: Make logstash0 the new default logstash instance
Jul 31 2018, 11:55 AM
ftigeot committed rSPSITE4f76c0fb1d21: logstash: Add configuration files, enable service (authored by ftigeot).
logstash: Add configuration files, enable service
Jul 31 2018, 11:14 AM

Jul 30 2018

ftigeot committed rSPSITE61cdc0325944: Add a logstash role (authored by ftigeot).
Add a logstash role
Jul 30 2018, 11:10 AM

Jul 27 2018

ftigeot changed the status of T1160: Create a dedicated logstash VM, a subtask of T791: Ship more logs to logstash/elasticsearch, from Open to Work in Progress.
Jul 27 2018, 4:58 PM · System administration
ftigeot changed the status of T1160: Create a dedicated logstash VM from Open to Work in Progress.
Jul 27 2018, 4:58 PM · System administration