May 25 2018
Many logstash-$date indexes effectively have 0 documents left after the initial systemlog data deletion phase of T977 and can simply be deleted.
May 24 2018
Logstash configuration on banco changed to send data to the esnode1 and esnode2 Elasticsearch instances:
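A sketch of the corresponding output section (a guess at the final form; the port and any other option are assumptions):

output {
  elasticsearch {
    hosts => ["esnode1.internal.softwareheritage.org:9200",
              "esnode2.internal.softwareheritage.org:9200"]
  }
}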
May 23 2018
Elasticsearch 6.x is also unable to write new data to indexes created with more than one mapping type (which previous versions allowed by default).
Two new cluster nodes have been added to the swh-logging-prod cluster: esnode1.internal.softwareheritage.org and esnode2.internal.softwareheritage.org.
Due to the Kafka requirement, only three disks in RAID0 are used per new node.
May 14 2018
Hacked the df_inode Munin plugin on orangerie.internal.softwareheritage.org in the same way, since the remote filesystem on /srv/softwareheritage is currently running out of free inodes.
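Presumably the analogous one-line change, assuming df_inode builds its df options the same way as the df plugin patched on March 29 below (this diff is a reconstruction, not copied from the host):

-my $dfopts = "-P -i -l ".join(' -x ',('',split('\s+',$exclude)));
+my $dfopts = "-P -i ".join(' -x ',('',split('\s+',$exclude)));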
Apr 20 2018
Elasticsearch disk requirements should thus be modified to only use 3 of the 4 disks in a RAID0 volume.
The "buffer cache" is managed by the operating system and, as far as I know, there isn't a way to dedicate some of it to a particular application.
This will be one more shared resource.
We have 3x 1U servers which will also be used for an Elasticsearch cluster.
Sharing hardware with Elasticsearch is generally a bad idea, especially for storage.
I propose the following setup (a configuration sketch follows the list):
- One separate Kafka instance per server
- One dedicated 2TB Kafka HDD per server
- 2GB of JVM memory per Kafka instance
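A minimal sketch of the per-broker configuration under that proposal (file paths, broker ids, and the service environment variable are assumptions):

# /etc/kafka/server.properties — one broker per server
broker.id=1                 # unique per server (1, 2, 3)
log.dirs=/srv/kafka/data    # assumed mount point of the dedicated 2TB HDD

# JVM heap for the Kafka instance, e.g. in the service environment:
KAFKA_HEAP_OPTS="-Xms2g -Xmx2g"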
Adding T1017 since there is no choice but to use the same underlying hardware for both Kafka and Elasticsearch.
Adding a relation to T792 since there is no choice but to use the same underlying hardware for both Kafka and Elasticsearch.
Rough setup steps:
Replication has been running fine since yesterday.
PostgreSQL master server is somerset.internal.softwareheritage.org:5433.
Apr 13 2018
The existing replica using pglogical is unable to stay in sync with its master database.
Replication technology changed to streaming replication (WAL shipping).
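For reference, a minimal streaming-replication setup of that era (PostgreSQL 10) looks roughly like this on the standby; the master host/port come from the entry above and the replicator user from the January notes, while the paths and remaining options are assumptions:

# Clone the master, streaming WAL during the base backup
pg_basebackup -h somerset.internal.softwareheritage.org -p 5433 \
    -U replicator -D /var/lib/postgresql/10/main -X stream -P

# recovery.conf on the standby (PostgreSQL 10 syntax)
standby_mode = 'on'
primary_conninfo = 'host=somerset.internal.softwareheritage.org port=5433 user=replicator'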
Apr 12 2018
Shard corruption issues were caused by manually restarting one of the cluster nodes during a heavy indexing period. Nothing unexpected.
Now that mmap(2) is no longer used by this particular node, shard corruption risks should also be lower.
Some of the test nodes exhibited memory leak symptoms. It seems they were related to the use of mmap() to access files.
Adding "index.store.type: niofs" in elasticsearch.yml seemed to fix this particular problem.
Apr 5 2018
All remaining system logs from 2017 cleaned up this day. 31,214,858 documents deleted.
System logs up to 2017-11 purged. 46,110,900 documents deleted.
System logs up to 2017-10 cleaned up this day. 23,889,499 documents deleted.
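The exact purge commands aren't recorded in these entries; assuming an Elasticsearch version with the _delete_by_query API (5.x+), something along these lines would match the description, with the systemd_unit field taken from the Logstash configuration quoted under March 6 below, and everything else (index pattern, host, .keyword suffix) assumed:

curl -XPOST 'http://banco.internal.softwareheritage.org:9200/logstash-2017.10.*/_delete_by_query' \
     -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must_not": { "prefix": { "systemd_unit.keyword": "swh-worker@" } }
    }
  }
}'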
Apr 3 2018
System logs up to 2017-09 cleaned up this day. 15,998,132 documents deleted.
System logs up to 2017-08 cleaned up this day. 13,175,880 documents deleted.
System logs up to 2017-07 cleaned up this day. 24,191,557 documents deleted.
Mar 30 2018
This ticket is about an Elasticsearch cluster; any other service is out of scope here.
Mar 29 2018
Hacked the df_ Munin plugin as a first step to remove the local filesystems limitation:
-my $dfopts = "-P -l ".join(' -x ',('',split('\s+',$exclude)));
+my $dfopts = "-P ".join(' -x ',('',split('\s+',$exclude)));
System logs up to 2017-06 cleaned up this day. 25,492,157 documents deleted.
Mar 28 2018
System logs up to 2017-05 cleaned up this day. 25,844,268 documents deleted.
Mar 26 2018
Detailed cluster creation proposal
System logs from 2017-03 cleaned up this day. 13,474,622 documents deleted.
Mar 21 2018
If we do it in batches, and stop and reapply the regular template before midnight, there shouldn't be any issue.
Templates are only used at index creation time.
To improve reindexing speed, replicas are initially disabled for new indexes.
Do not delete historical logstash-* indexes before making sure the new indexes have been reconfigured to use at least one replica per shard and the shards have been properly replicated.
This template has been applied to the existing cluster on banco.internal.softwareheritage.org.
Since we do not need to perform immediate analysis on incoming data, we can relax the refresh_interval parameter. This will create fewer Lucene segments per index and ultimately reduce IOPS a bit.
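A sketch of what such a template could look like, using the 5.x _template API; the template name and the exact values are assumptions, not copied from the cluster:

curl -XPUT 'http://banco.internal.softwareheritage.org:9200/_template/logstash-reindex' \
     -H 'Content-Type: application/json' -d '
{
  "template": "logstash-*",
  "settings": {
    "number_of_replicas": 0,
    "refresh_interval": "30s"
  }
}'

Once reindexing is done and before any historical index is deleted, replicas can be re-enabled on the new indexes with a settings update (PUT <index>/_settings raising number_of_replicas back to at least 1), per the warning above.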
System logs from 2017-02 cleaned up this day. 13,435,713 documents deleted.
Mar 12 2018
Change applied on banco today.
Mar 9 2018
systemlogs-* indexes are supposed to be deleted after three months, so their total number of shards will stay limited and they shouldn't have a strong detrimental impact on future cluster health.
There is no need to bother changing the default number of shards for them.
Reference documentation wrt document type removal in Elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/master/removal-of-types.html
Mar 7 2018
Production logstash configuration on banco.internal.softwareheritage.org changed today according to the above pattern.
Mar 6 2018
System logs from 2017-01 cleaned up this day. 15,530 documents deleted.
Test data cleaned up this day.
This Logstash configuration appears to behave as expected:
output { if "swh-worker@" in [systemd_unit] { elasticsearch { hosts => ["petitpalais.internal.softwareheritage.org:9200"] index => "swh_workers-%{+YYYY.MM.dd}" } } else { elasticsearch { hosts => ["petitpalais.internal.softwareheritage.org:9200"] index => "systemlogs-%{+YYYY.MM.dd}" } } }
However, Logstash applies a default template to logstash-* indices and does no such thing for indices named differently.
It is possible systemlogs-* and swh_workers-* indices will end up with suboptimal mappings without further configuration.
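One way to avoid that (not recorded in the ticket, just a sketch mimicking the string handling of the default logstash-* template on 5.x; the template and dynamic-template names are invented) is to install explicit templates for the new patterns before their first index is created:

curl -XPUT 'http://petitpalais.internal.softwareheritage.org:9200/_template/systemlogs' \
     -H 'Content-Type: application/json' -d '
{
  "template": "systemlogs-*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_fields": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
            }
          }
        }
      ]
    }
  }
}'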
Jan 23 2018
Wiki Pglogical documentation added.
Jan 17 2018
Should be fixed by these commits:
Subscriber setup:
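(The subscriber-side commands were not preserved in this entry; a typical pglogical subscriber setup looks like the following, with the node name, subscription name, and dsn being assumptions:)

create extension pglogical;
select pglogical.create_node(
    node_name := 'somerset',
    dsn := 'host=somerset.internal.softwareheritage.org port=5433 dbname=softwareheritage'
);
select pglogical.create_subscription(
    subscription_name := 'swh_subscription',
    provider_dsn := 'host=prado.internal.softwareheritage.org port=5433 dbname=softwareheritage'
);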
Provider setup (was already done, instructions may be unreliable):
createuser replicator --replication
su - postgres
psql template1
\c softwareheritage postgres localhost 5433
create extension pglogical;
select pglogical.create_node(
    node_name := 'prado',
    dsn := 'host=prado.internal.softwareheritage.org port=5433 dbname=softwareheritage'
);
select pglogical.replication_set_add_table('default', 'content', true);
Pglogical replication requirements:
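(The list itself was not preserved here; pglogical's documented requirements amount to PostgreSQL 9.4 or later, the extension installed on both nodes, replicated tables having a primary key or other replica identity, and postgresql.conf settings along these lines — the numbers below are the README's examples, not our values:)

wal_level = 'logical'
shared_preload_libraries = 'pglogical'
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10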
Jan 12 2018
- Built-in Postgres logical replication can't replicate schema changes (as of PostgreSQL 10.1)
- For this reason, it is best to use pglogical
- pglogical has a pglogical.replicate_ddl_command function that creates a synchronization point: it pauses replication, runs the DDL change on the primary, sends the change to the replicas within the replication stream, and then resumes replication (per @olasd); see the sketch after this list
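A hypothetical invocation (the added column is invented; content is the table replicated in the January 17 setup above; note the table name must be schema-qualified):

select pglogical.replicate_ddl_command(
    'alter table public.content add column example_col text;'
);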
Jan 11 2018
Systemd journal size now limited to 400MB by Puppet configuration.
Reference commit: https://forge.softwareheritage.org/rSPPROFa4ac4669eff6
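Presumably via something like the following in /etc/systemd/journald.conf (the exact Puppet-managed file is in the commit above; this is just the matching journald knob):

[Journal]
SystemMaxUse=400M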