Page MenuHomeSoftware Heritage

Make the elasticsearch logging cluster actually a cluster
Started, Work in Progress, NormalPublic

Description

The current elasticsearch logging cluster is configured by hand on a single machine, banco.

We should make that an actual cluster with at least a replica.

Current data usage is 470GiB of logs spanning six months of history (february 2017 - now), which means we probably want to size it around 1 TiB of storage space to account for growth.

Event Timeline

olasd created this task.Oct 3 2017, 1:00 PM
ftigeot changed the status of subtask T977: Delete old system log data from the Elasticsearch cluster from Open to Work in Progress.Mar 7 2018, 4:58 PM

Detailled cluster creation proposal

1. Install new nodes

The machines have 4x 2TB HDDs
Partitions:
- RAID 1	/			40GB
- RAID 0	/srv/elasticsearch	all the rest

Using a RAID 0 volume will improve I/O performance
We don't need redundancy on /srv/elasticsearch since
Elasticsearch itself will provide it

Name the hosts with easy to remember names in the DNS like "esnode0.internal.softwareheritage.org"

2. Install Elasticsearch on the new nodes

[This stage could be puppetized]

The new hosts have 32GB of RAM.

Change etc/elasticsearch/jvmoptions :
	# Use a maximum of 50% of available memory for the jvm heap
	# Keep it <32GB in order to benefit from jvm pointer compression
	# Around 50% of memory should be kept free for disk caching purposes
	-Xms15g
	-Xmx15g

	# The default openjdk 1.8 garbage collector allows itself to stop all
	# operations for indefinite amounts of time, causing cluster issues after
	# a while.
	# Use G1GC instead, it has more realtime performance characteristics and
	# will help to keep the cluster green and responsive
	-XX:+UseG1GC 

	# Comment out or remove -XX CMS options

Change etc/elasticsearch/elasticsearch.yml

	# Store indexes in /srv/elasticsearch
	path.data:	/srv/elasticsearch

	# Since the es cluster is mostly used as an archive and few
	# search requests are being made daily, increase available
	# memory for indexing. (default is 10% of heap size)
	indices.memory.index_buffer_size: 50%

3. install logstash on one of the new nodes

We should be using a dedicated hardware node but budget is tight
and performance requirements low.
Having logstash running on one of the Elasticsearch nodes shouldn't
cause many issues with the current workload as of 2018-03.

Logstash 6.x is the new version
We need to use it for new indexes in order to avoid compatibility
issues with Logstash/Journalbeat 5.x and Elasticsearch 6.x

Configure it to send data to the Elasticsearch instance on Banco

4. Configure Journalbeat on Banco to send data to the new logstash instance

Edit banco:/etc/journalbeat
Change the hosts: parameter

5. Configure the new Elasticsearch nodes in a cluster with Banco

Change etc/elasticsearch.yml

cluster.name:	swh-logging-prod
node.name:	esnode${node}
network.host:	192.168.100.${node_last_ip_digit}
discovery.zen.ping.unicast.hosts: [
	"banco.internal.softwareheritage.org",
	"esnode0.internal.softwareheritage.org",
	"esnode1.internal.softwareheritage.org",
	"esnode2.internal.softwareheritage.org"
]

# We have three nodes, avoid split-brains
discovery.zen.minimum_master_nodes: 2

This parameter is extremely important. Without it, we can get split-brains.

Wait for the cluster to turn green

6. Configure the cluster

# Limit the maximum amount of I/Os dedicated to Lucene segment mergings
# (because we use slow HDDs)
curl -XPUT 'http://localhost:9200/_all/_settings?preserve_existing=true' -d '{
	"index.merge.scheduler.max_thread_count" : "1"
}'

7. Configure the new logstash instance to send data to the new nodes

Some load-balancer / error detection system could be useful if the cluster
gets big.
For three nodes, hardcoding all of them should be fine.

Logstash outputs accepts a list of hosts, this should be enough for
a small cluster.

8. Remove Banco from the cluster

Turn off Elasticsearch on Banco
Get back former Elasticsearch disk space

Do not do this until the cluster gets green

9. Reindex old logstash-* indexes

Details in T1000 .

olasd added a comment.Mar 27 2018, 6:36 PM

Just a reminder that those machines will not be dedicated to elasticsearch: they are also intended to be used to setup a kafka cluster for the journal / mirroring system.

The only adaptations I see from your plan are:

  • make a LVM physical volume on top of the RAID0 so we can manage partitions on the fly
  • don't allocate all the space on the RAID0 for the elasticsearch cluster (start with 25% or so?)
  • don't allocate all the RAM for elasticsearch (again, probably start with 25% of the nodes' ram for the JVM heap)

This ticket is about an Elasticsearch cluster, any other service is out of scope here.

Besides, the 1U machines we just received were budgeted for Elasticsearch.
Since nothing is known about resource requirements of an eventual Kafka service, the hardware was only tailored according to Elasticsearch needs.

Given the excessive prices of Dell SSD options, the new servers are also using old-school HDDs, which severely impacts I/O performance under load.
Elasticsearch requires as much memory as possible in order to compensate for the slow disks and not degrade operations too much.

Why should another service be allowed to run on the same machines ?
I suggest opening a separate ticket for Kafka and budget the required hardware.

Some of the test nodes exhibited memory leak symptoms. It seems they were related to the use of mmap() to access files.
Adding "index.store.type: niofs" in elasticsearch.yml seemed to fix this particular problem.

Adding T1017 since there is no choice but to use the same underlying hardware for both Kafka and Elasticsearch.

Elasticsearch disk requirements should thus be modified to only use 3 of the 4 disks in a RAID0 volume.

Two new cluster nodes have been added to the swh-logging-prod cluster: esnode1 and esnode2.internal.softwareheritage.org.
Due to the Kafka requirement, only three disks in RAID0 are used per new node.

The cluster now has three brand new nodes; all existing data has been copied and the original Banco node removed.
Issue technically fixed but left open in order to track related Elasticsearch requests.

ftigeot changed the status of subtask T793: Move elasticsearch log cluster configuration inside puppet from Open to Work in Progress.Jun 13 2018, 5:04 PM
ftigeot claimed this task.Jun 20 2018, 10:43 AM
ftigeot changed the task status from Open to Work in Progress.Jul 31 2018, 4:13 PM