
May 25 2018

ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

Some of the first logstash-${date} indexes became empty (zero non-deleted documents) and could simply be deleted as-is.
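A minimal sketch of how such empty indexes could be spotted, assuming per-index document counts have already been fetched from the cluster (for instance through the cat indices API). The function name and the sample data are hypothetical and not taken from the task.

```python
from typing import Iterable, List, Tuple


def empty_logstash_indexes(stats: Iterable[Tuple[str, int]]) -> List[str]:
    """Return the names of logstash-* indexes holding zero live documents."""
    return sorted(
        name
        for name, doc_count in stats
        if name.startswith("logstash-") and doc_count == 0
    )


# Hypothetical sample data; real names and counts would come from the cluster.
sample = [
    ("logstash-2017.01.02", 0),
    ("logstash-2017.01.03", 1523),
    ("systemlogs-2018.05.24", 0),
    ("logstash-2017.01.01", 0),
]
print(empty_logstash_indexes(sample))
```

Only indexes matching the logstash- prefix are returned, so an empty index from another family is never proposed for deletion.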

May 25 2018, 2:55 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T1000: Reindex old data on banco to put it into swh_worker indexes.

Many logstash-$date indexes effectively have 0 documents left after the initial systemlog data deletion phase of T977 and can simply be deleted.

May 25 2018, 2:37 PM · System administration
ftigeot committed rSPSITEb2c449aaa238: data/defaults: Remove unused elasticsearch-swh-log CNAME (authored by ftigeot).
data/defaults: Remove unused elasticsearch-swh-log CNAME
May 25 2018, 12:35 PM

May 24 2018

ftigeot closed T983: Make logstash on banco store documents on Elasticsearch version 6.x nodes, a subtask of T792: Make the elasticsearch logging cluster actually a cluster, as Resolved.
May 24 2018, 12:47 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot closed T983: Make logstash on banco store documents on Elasticsearch version 6.x nodes as Resolved.

Logstash configuration on banco changed to inject data on the esnode1 and 2 Elasticsearch instances:

May 24 2018, 12:47 PM · System administration
ftigeot committed rSPSITE04690023893c: data/defaults: Do not backup Elasticsearch data at all (authored by ftigeot).
data/defaults: Do not backup Elasticsearch data at all
May 24 2018, 10:38 AM

May 23 2018

ftigeot added a comment to T983: Make logstash on banco store documents on Elasticsearch version 6.x nodes.

Elasticsearch 6.x is also unable to write new data to indexes created with more than one mapping type (the default on previous versions).

May 23 2018, 4:03 PM · System administration
ftigeot renamed T983: Make logstash on banco store documents on Elasticsearch version 6.x nodes from Upgrade logstash on banco to version 6.x to Make logstash on banco store documents on Elasticsearch version 6.x nodes.
May 23 2018, 3:50 PM · System administration
ftigeot added a comment to T792: Make the elasticsearch logging cluster actually a cluster.

Two new cluster nodes have been added to the swh-logging-prod cluster: esnode1 and esnode2.internal.softwareheritage.org.
Due to the Kafka requirement, only three disks in RAID0 are used per new node.

May 23 2018, 3:37 PM · System administration (Elasticsearch consolidation (W24/2018))

May 22 2018

ftigeot committed rSPSITE93ef72d007a5: Revert "data/defaults: add apt-transport-https package" (authored by ftigeot).
Revert "data/defaults: add apt-transport-https package"
May 22 2018, 4:19 PM
ftigeot added a reverting change for rSPSITE3294fe96630f: data/defaults: add apt-transport-https package: rSPSITE93ef72d007a5: Revert "data/defaults: add apt-transport-https package".
May 22 2018, 4:19 PM
ftigeot committed rSPSITE3294fe96630f: data/defaults: add apt-transport-https package (authored by ftigeot).
data/defaults: add apt-transport-https package
May 22 2018, 4:13 PM

May 17 2018

ftigeot committed rSPSITEe5b1ea39247f: data/defaults.yaml: Add esnode1/2/3 records (authored by ftigeot).
data/defaults.yaml: Add esnode1/2/3 records
May 17 2018, 3:05 PM

May 14 2018

ftigeot added a comment to T1007: Monitor nfs mount points on orangerie.internal.softwareheritage.org.

Hacked the df_inode Munin plugin on orangerie.internal.softwareheritage.org in the same way, since the remote filesystem on /srv/softwareheritage is currently running out of free inodes.

May 14 2018, 3:44 PM · System administration

Apr 20 2018

ftigeot added a comment to T792: Make the elasticsearch logging cluster actually a cluster.

Elasticsearch disk requirements should thus be modified to only use 3 of the 4 disks in a RAID0 volume.

Apr 20 2018, 3:08 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T1017: Estimate for Kafka cluster specifications.

The "buffer cache" is managed by the operating system and, as far as I know, there isn't a way to dedicate some of it to a particular application.
This will be one more shared resource.

Apr 20 2018, 3:02 PM · System administration
ftigeot added a comment to T1017: Estimate for Kafka cluster specifications.

We have 3x 1U servers which will also be used for an Elasticsearch cluster.
Sharing hardware with Elasticsearch is generally a bad idea, especially for storage.
I propose the following setup:

  • One separate Kafka instance per server
  • One dedicated 2TB Kafka HDD per server
  • 2GB of JVM memory per Kafka instance
Apr 20 2018, 2:59 PM · System administration
ftigeot added a comment to T792: Make the elasticsearch logging cluster actually a cluster.

Adding T1017 since there is no choice but to use the same underlying hardware for both Kafka and Elasticsearch.

Apr 20 2018, 2:43 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T1017: Estimate for Kafka cluster specifications.

Adding a relation to T792 since there is no choice but to use the same underlying hardware for both Kafka and Elasticsearch.

Apr 20 2018, 2:42 PM · System administration
ftigeot added a subtask for T792: Make the elasticsearch logging cluster actually a cluster: T1017: Estimate for Kafka cluster specifications.
Apr 20 2018, 2:41 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a parent task for T1017: Estimate for Kafka cluster specifications: T792: Make the elasticsearch logging cluster actually a cluster.
Apr 20 2018, 2:41 PM · System administration
ftigeot closed T883: set up a replica of the main DB on azure as Resolved.
Apr 20 2018, 11:30 AM · Restricted Project, System administration
ftigeot closed T883: set up a replica of the main DB on azure, a subtask of T888: Deploy the Vault and a DB replica on Azure, as Resolved.
Apr 20 2018, 11:30 AM · System administration, Restricted Project, Vault
ftigeot added a comment to T883: set up a replica of the main DB on azure.

Rough setup steps:

Apr 20 2018, 11:27 AM · Restricted Project, System administration
ftigeot added a comment to T883: set up a replica of the main DB on azure.

Replication has been running fine since yesterday.
PostgreSQL master server is somerset.internal.softwareheritage.org:5433.

Apr 20 2018, 11:02 AM · Restricted Project, System administration

Apr 13 2018

ftigeot added a comment to T883: set up a replica of the main DB on azure.

The existing replica using pglogical is unable to stay in sync with its master database.
Replication technology changed to streaming replication (WAL shipping).

Apr 13 2018, 11:31 AM · Restricted Project, System administration
ftigeot added a parent task for T969: Azure database replica doesn't sustain writes: T883: set up a replica of the main DB on azure.
Apr 13 2018, 11:29 AM · System administration
ftigeot added a subtask for T883: set up a replica of the main DB on azure: T969: Azure database replica doesn't sustain writes.
Apr 13 2018, 11:29 AM · Restricted Project, System administration
ftigeot claimed T791: Ship more logs to logstash/elasticsearch.
Apr 13 2018, 11:18 AM · System administration
ftigeot changed the status of T883: set up a replica of the main DB on azure from Open to Work in Progress.
Apr 13 2018, 10:57 AM · Restricted Project, System administration
ftigeot changed the status of T883: set up a replica of the main DB on azure, a subtask of T888: Deploy the Vault and a DB replica on Azure, from Open to Work in Progress.
Apr 13 2018, 10:57 AM · System administration, Restricted Project, Vault
ftigeot closed T755: icinga notification spam: pending package upgrades as Resolved.
Apr 13 2018, 10:53 AM · System administration

Apr 12 2018

ftigeot added a comment to T1023: scheduler: tasks archival: Test run on test elasticsearch cluster.

Shard corruption issues were caused by manually restarting one of the cluster nodes during a heavy indexing period. Nothing unexpected.
Now that mmap(2) is no longer used by this particular node, shard corruption risks should also be lower.

Apr 12 2018, 10:09 AM · Scheduling utilities
ftigeot added a comment to T792: Make the elasticsearch logging cluster actually a cluster.

Some of the test nodes exhibited memory leak symptoms. It seems they were related to the use of mmap() to access files.
Adding "index.store.type: niofs" in elasticsearch.yml seemed to fix this particular problem.

Apr 12 2018, 10:02 AM · System administration (Elasticsearch consolidation (W24/2018))

Apr 11 2018

ftigeot committed rSPSITEaa443628b110: storage0: Switch database back to somerset (authored by ftigeot).
storage0: Switch database back to somerset
Apr 11 2018, 4:49 PM

Apr 5 2018

ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

All remaining system logs from 2017 cleaned up this day. 31,214,858 documents deleted.

Apr 5 2018, 2:24 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs up to 2017-11 purged. 46,110,900 documents deleted.

Apr 5 2018, 2:24 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs up to 2017-10 cleaned up this day. 23,889,499 documents deleted.

Apr 5 2018, 2:24 PM · System administration (Elasticsearch consolidation (W24/2018))

Apr 3 2018

ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs up to 2017-09 cleaned up this day. 15,998,132 documents deleted.

Apr 3 2018, 5:56 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs up to 2017-08 cleaned up this day. 13,175,880 documents deleted.

Apr 3 2018, 3:56 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs up to 2017-07 cleaned up this day. 24,191,557 documents deleted.

Apr 3 2018, 2:02 PM · System administration (Elasticsearch consolidation (W24/2018))

Mar 30 2018

ftigeot added a comment to T792: Make the elasticsearch logging cluster actually a cluster.

This ticket is about an Elasticsearch cluster; any other service is out of scope here.

Mar 30 2018, 3:12 PM · System administration (Elasticsearch consolidation (W24/2018))

Mar 29 2018

ftigeot added a comment to T1007: Monitor nfs mount points on orangerie.internal.softwareheritage.org.

Hacked the df_ Munin plugin as a first step to remove the local filesystems limitation:

-my $dfopts  = "-P -l ".join(' -x ',('',split('\s+',$exclude)));
+my $dfopts  = "-P ".join(' -x ',('',split('\s+',$exclude)));
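The effect of the change can be illustrated with a rough Python equivalent of the patched line. This is a sketch only: the function name, the `local_only` switch and the sample exclusion list are hypothetical, and Python's `str.split()` treats leading whitespace slightly differently from the Perl `split` used in the plugin.

```python
def build_df_opts(exclude: str, local_only: bool) -> str:
    """Build a df option string the way the plugin line does."""
    # "-l" restricts df to local filesystems; the patch drops it so that
    # NFS mounts are reported as well.
    prefix = "-P -l " if local_only else "-P "
    # Joining with a leading empty element puts " -x " before every type.
    return prefix + " -x ".join([""] + exclude.split())


# Hypothetical exclusion list, for illustration only.
print(build_df_opts("none unknown", local_only=True))
print(build_df_opts("none unknown", local_only=False))
```

The only difference between the two results is the `-l` flag; the `-x` exclusions are unchanged.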
Mar 29 2018, 5:08 PM · System administration
ftigeot changed the status of T1007: Monitor nfs mount points on orangerie.internal.softwareheritage.org from Open to Work in Progress.
Mar 29 2018, 4:53 PM · System administration
ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs up to 2017-06 cleaned up this day. 25,492,157 documents deleted.

Mar 29 2018, 2:23 PM · System administration (Elasticsearch consolidation (W24/2018))

Mar 28 2018

ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs up to 2017-05 cleaned up this day. 25,844,268 documents deleted.

Mar 28 2018, 1:00 PM · System administration (Elasticsearch consolidation (W24/2018))

Mar 26 2018

ftigeot added a comment to T792: Make the elasticsearch logging cluster actually a cluster.

Detailed cluster creation proposal

Mar 26 2018, 3:51 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs from 2017-03 cleaned up this day. 13,474,622 documents deleted.

Mar 26 2018, 1:29 PM · System administration (Elasticsearch consolidation (W24/2018))

Mar 21 2018

ftigeot added a comment to T1000: Reindex old data on banco to put it into swh_worker indexes.

If we do it in batch and stop+reapply the regular template before midnight there shouldn't be any issue.
Templates are only used at index creation time.

Mar 21 2018, 5:28 PM · System administration
ftigeot added a comment to T1000: Reindex old data on banco to put it into swh_worker indexes.

In order to improve reindexing speed, replicas are initially disabled for new indexes.
Do not delete historical logstash-* indexes until the new indexes have been reconfigured to use at least one replica per shard and the shards are properly replicated.
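The procedure above can be sketched as follows, assuming replicas are toggled through the index update-settings API and that a "green" health status is used as the sign that every replica shard is allocated. The helper names are hypothetical; the actual commands used on banco are not shown in this comment.

```python
import json


def replica_settings(replicas: int) -> str:
    """Return a JSON body setting the number of replicas per shard."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return json.dumps({"index": {"number_of_replicas": replicas}})


def safe_to_delete_source(number_of_replicas: int, health_status: str) -> bool:
    """True only if the new index has a replica and all shards are allocated."""
    return number_of_replicas >= 1 and health_status == "green"


print(replica_settings(0))  # body to apply before reindexing
print(replica_settings(1))  # body to apply once reindexing is done
print(safe_to_delete_source(1, "yellow"))
```

A "yellow" status means some replica shards are still unassigned, so the check refuses deletion until replication has caught up.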

Mar 21 2018, 5:16 PM · System administration
ftigeot triaged T1000: Reindex old data on banco to put it into swh_worker indexes as Normal priority.
Mar 21 2018, 5:13 PM · System administration
ftigeot closed T990: Tune index parameters as Resolved.
Mar 21 2018, 4:38 PM · System administration
ftigeot closed T990: Tune index parameters, a subtask of T792: Make the elasticsearch logging cluster actually a cluster, as Resolved.
Mar 21 2018, 4:38 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T990: Tune index parameters.

This template has been applied to the existing cluster on banco.internal.softwareheritage.org.

Mar 21 2018, 4:38 PM · System administration
ftigeot added a comment to T990: Tune index parameters.

Since we do not need to perform immediate analysis on incoming data, we can relax the refresh_interval parameter. This will create fewer Lucene segments per index and ultimately reduce the amount of IOPS a bit.
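A minimal sketch of what such a template body might look like. The index pattern and the 30s value are assumptions chosen for illustration (the Elasticsearch default is 1s); the template actually applied is not reproduced in this feed. The `index_patterns` key is the Elasticsearch 6.x form, whereas 5.x templates use a single `template` key instead.

```python
import json


def index_template(pattern: str, refresh_interval: str) -> dict:
    """Build an index template body that relaxes refresh_interval."""
    return {
        "index_patterns": [pattern],
        "settings": {"index": {"refresh_interval": refresh_interval}},
    }


# Hypothetical pattern and interval, for illustration only.
print(json.dumps(index_template("swh_workers-*", "30s"), indent=2))
```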

Mar 21 2018, 4:36 PM · System administration
ftigeot renamed T990: Tune index parameters from Tune the number of shards per index to Tune index parameters.
Mar 21 2018, 4:23 PM · System administration
ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs from 2017-02 cleaned up this day. 13,435,713 documents deleted.

Mar 21 2018, 2:10 PM · System administration (Elasticsearch consolidation (W24/2018))

Mar 12 2018

ftigeot added a comment to T990: Tune index parameters.

Change applied on banco today.

Mar 12 2018, 10:08 AM · System administration

Mar 9 2018

ftigeot added a comment to T990: Tune index parameters.

systemlogs-* indexes are supposed to be deleted after three months, so their total number of shards will stay limited and they shouldn't have a strong detrimental impact on future cluster health.
There is no need to bother changing the default number of shards for them.
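As a rough worked example of that reasoning, the steady-state shard count can be estimated as below. It assumes one systemlogs index per day, about 90 days of retention, and the defaults of 5 primary shards and 1 replica used by Elasticsearch versions before 7.0; the function name is hypothetical.

```python
def total_shards(retention_days: int, primaries_per_index: int = 5,
                 replicas: int = 1) -> int:
    """Shards held at steady state with one index created per day."""
    return retention_days * primaries_per_index * (1 + replicas)


print(total_shards(90))              # 900 shards, replicas included
print(total_shards(90, replicas=0))  # 450 primary shards only
```

Because old indexes are deleted on schedule, this figure stays bounded instead of growing with the age of the cluster.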

Mar 9 2018, 4:33 PM · System administration
ftigeot triaged T990: Tune index parameters as Normal priority.
Mar 9 2018, 4:31 PM · System administration
ftigeot added a comment to T983: Make logstash on banco store documents on Elasticsearch version 6.x nodes.

Reference documentation wrt document type removal in Elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/master/removal-of-types.html

Mar 9 2018, 4:21 PM · System administration
ftigeot closed T945: Separate system logs from application logs as Resolved.
Mar 9 2018, 4:03 PM · System administration
ftigeot closed T945: Separate system logs from application logs, a subtask of T792: Make the elasticsearch logging cluster actually a cluster, as Resolved.
Mar 9 2018, 4:03 PM · System administration (Elasticsearch consolidation (W24/2018))

Mar 7 2018

ftigeot changed the status of T977: Delete old system log data from the Elasticsearch cluster from Open to Work in Progress.
Mar 7 2018, 4:58 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot changed the status of T977: Delete old system log data from the Elasticsearch cluster, a subtask of T792: Make the elasticsearch logging cluster actually a cluster, from Open to Work in Progress.
Mar 7 2018, 4:58 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot created T987: Add an Icinga alert for high queue levels on saatchi.
Mar 7 2018, 4:06 PM · System administration
ftigeot added a comment to T945: Separate system logs from application logs.

Production logstash configuration on banco.internal.softwareheritage.org changed today according to the above pattern.

Mar 7 2018, 3:22 PM · System administration

Mar 6 2018

ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

System logs from 2017-01 cleaned up this day. 15,530 documents deleted.

Mar 6 2018, 4:35 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot added a comment to T977: Delete old system log data from the Elasticsearch cluster.

Test data cleaned up this day.

Mar 6 2018, 4:27 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot claimed T945: Separate system logs from application logs.
Mar 6 2018, 3:37 PM · System administration
ftigeot triaged T983: Make logstash on banco store documents on Elasticsearch version 6.x nodes as Normal priority.
Mar 6 2018, 3:35 PM · System administration
ftigeot added a comment to T945: Separate system logs from application logs.

This Logstash configuration appears to behave as expected:

output {
    if "swh-worker@" in [systemd_unit] {
        elasticsearch {
            hosts => ["petitpalais.internal.softwareheritage.org:9200"]
            index => "swh_workers-%{+YYYY.MM.dd}"
        }
    } else {
        elasticsearch {
            hosts => ["petitpalais.internal.softwareheritage.org:9200"]
            index => "systemlogs-%{+YYYY.MM.dd}"
        }
    }
}

However, Logstash applies a default template to logstash-* indices and does no such thing for indices named differently.
It is possible systemlogs-* and swh_workers-* indices will end up with suboptimal mappings without further configuration.
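The routing rule in the configuration above can be mirrored in a few lines of Python, which may help when checking which index a given event should land in. This is an illustrative sketch: the function name and the sample unit names are hypothetical, and Logstash derives the date from the event's @timestamp rather than from an explicit argument.

```python
from datetime import date


def target_index(systemd_unit: str, day: date) -> str:
    """Return the daily index name an event would be routed to."""
    # Same substring test as the Logstash conditional.
    prefix = "swh_workers" if "swh-worker@" in systemd_unit else "systemlogs"
    return f"{prefix}-{day:%Y.%m.%d}"


# Hypothetical unit names, for illustration only.
print(target_index("swh-worker@loader_git.service", date(2018, 3, 6)))
print(target_index("ssh.service", date(2018, 3, 6)))
```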

Mar 6 2018, 3:27 PM · System administration

Feb 21 2018

ftigeot updated the task description for T977: Delete old system log data from the Elasticsearch cluster.
Feb 21 2018, 4:49 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot created T977: Delete old system log data from the Elasticsearch cluster.
Feb 21 2018, 4:22 PM · System administration (Elasticsearch consolidation (W24/2018))

Feb 16 2018

ftigeot updated the task description for T964: 2018-02-16 worker disk full postmortem.
Feb 16 2018, 11:25 AM · Mercurial loader
ftigeot created T964: 2018-02-16 worker disk full postmortem.
Feb 16 2018, 11:24 AM · Mercurial loader

Feb 1 2018

ftigeot added a parent task for T945: Separate system logs from application logs: T792: Make the elasticsearch logging cluster actually a cluster.
Feb 1 2018, 3:02 PM · System administration
ftigeot added a subtask for T792: Make the elasticsearch logging cluster actually a cluster: T945: Separate system logs from application logs.
Feb 1 2018, 3:02 PM · System administration (Elasticsearch consolidation (W24/2018))
ftigeot created T945: Separate system logs from application logs.
Feb 1 2018, 3:00 PM · System administration

Jan 24 2018

ftigeot committed rSPSITEec4306facbd2: data/defaults: Do not backup scratch spaces on Giverny (authored by ftigeot).
data/defaults: Do not backup scratch spaces on Giverny
Jan 24 2018, 4:47 PM

Jan 23 2018

ftigeot added a comment to T883: set up a replica of the main DB on azure.

Wiki Pglogical documentation added.

Jan 23 2018, 11:56 AM · Restricted Project, System administration

Jan 18 2018

ftigeot committed rSPSITEb663c414a614: data/defaults: Do not try to directly backup PostgreSQL files (authored by ftigeot).
data/defaults: Do not try to directly backup PostgreSQL files
Jan 18 2018, 2:24 PM
ftigeot committed rSPPROF8c7a4af1adde: rsyslog: Match lines independently of whitespace (authored by ftigeot).
rsyslog: Match lines independently of whitespace
Jan 18 2018, 1:22 PM
ftigeot committed rSPPROF10c8a51dd87b: rsyslog: Add some whitespace (authored by ftigeot).
rsyslog: Add some whitespace
Jan 18 2018, 12:10 PM

Jan 17 2018

ftigeot closed T895: Limit size of most common log files as Resolved.

Should be fixed by these commits:

Jan 17 2018, 5:31 PM · Easy hack, System administration
ftigeot committed rSPROLEa61fb89f87db: roles/swh_server: Add rsyslog profile (authored by ftigeot).
roles/swh_server: Add rsyslog profile
Jan 17 2018, 5:10 PM
ftigeot committed rSPPROFc05e0b64acd7: swh-profile: Create rsyslog profile (authored by ftigeot).
swh-profile: Create rsyslog profile
Jan 17 2018, 5:09 PM
ftigeot added a comment to T883: set up a replica of the main DB on azure.

Subscriber setup:

Jan 17 2018, 11:17 AM · Restricted Project, System administration
ftigeot added a comment to T883: set up a replica of the main DB on azure.

Provider setup (was already done, instructions may be unreliable):

createuser replicator --replication
su - postgres
psql template1
\c softwareheritage postgres localhost 5433
create extension pglogical;
select pglogical.create_node(
	node_name := 'prado',
	dsn := 'host=prado.internal.softwareheritage.org port=5433 dbname=softwareheritage'
);
select pglogical.replication_set_add_table('default', 'content', true);
Jan 17 2018, 11:11 AM · Restricted Project, System administration
ftigeot added a comment to T883: set up a replica of the main DB on azure.

Pglogical replication requirements:

Jan 17 2018, 11:09 AM · Restricted Project, System administration

Jan 16 2018

ftigeot committed rSPSITEd153de0e5fef: Revert "data/defaults: Also resolve euwest.azure... names by default" (authored by ftigeot).
Revert "data/defaults: Also resolve euwest.azure... names by default"
Jan 16 2018, 3:05 PM
ftigeot added a reverting change for rSPSITE87912869285b: data/defaults: Also resolve euwest.azure... names by default: rSPSITEd153de0e5fef: Revert "data/defaults: Also resolve euwest.azure... names by default".
Jan 16 2018, 3:05 PM
ftigeot committed rSPSITE87912869285b: data/defaults: Also resolve euwest.azure... names by default (authored by ftigeot).
data/defaults: Also resolve euwest.azure... names by default
Jan 16 2018, 2:57 PM

Jan 12 2018

ftigeot updated subscribers of T883: set up a replica of the main DB on azure.
  • Built-in Postgres replication can't replicate schema changes (as of PostgreSQL 10.1)
  • For this reason, it is best to use pglogical
  • It has a pglogical.replicate_ddl_command function, which creates a synchronization point to pause replication, applies a DDL change on the primary, sends that DDL change to the replicas within the replication stream, and then resumes replication (according to @olasd)
Jan 12 2018, 4:22 PM · Restricted Project, System administration

Jan 11 2018

ftigeot committed rSPPROF38d5e2940fe0: Revert a4ac4669eff6332e1f69c1211c1db26cb5e8d207 systemd_journal: Limit size to… (authored by ftigeot).
Revert a4ac4669eff6332e1f69c1211c1db26cb5e8d207 systemd_journal: Limit size to…
Jan 11 2018, 2:00 PM
ftigeot added a reverting change for rSPPROFa4ac4669eff6: systemd_journal: Limit size to 400MB: rSPPROF38d5e2940fe0: Revert a4ac4669eff6332e1f69c1211c1db26cb5e8d207 systemd_journal: Limit size to….
Jan 11 2018, 2:00 PM
ftigeot committed rSPPROFcc77d4c4d913: systemd_journal: Fix indentation (authored by ftigeot).
systemd_journal: Fix indentation
Jan 11 2018, 11:03 AM
ftigeot added a comment to T789: Logrotate for systemd-journald.

Systemd journal size now limited to 400MB by Puppet configuration.
Reference commit: https://forge.softwareheritage.org/rSPPROFa4ac4669eff6

Jan 11 2018, 11:02 AM · System administration
ftigeot committed rSPPROFa4ac4669eff6: systemd_journal: Limit size to 400MB (authored by ftigeot).
systemd_journal: Limit size to 400MB
Jan 11 2018, 10:47 AM

Jan 10 2018

ftigeot committed rSPSITEae33c6ffd631: site.pp: Add a new database server, dbreplica0.euwest.azure (authored by ftigeot).
site.pp: Add a new database server, dbreplica0.euwest.azure
Jan 10 2018, 3:18 PM