
Tune index parameters
Closed, Migrated

Description

Given the goal of keeping swh_workers indexes forever, we will have more than 1000 indexes after 3 years.
With the default number of shards per index, this will mean more than 5400 shards.
Each shard consumes Elasticsearch resources; having so many is unreasonable.
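
As a rough illustration of how the shard count scales, here is a minimal sketch of the arithmetic. It assumes 5 primary shards per index (the Elasticsearch default before version 7.0) and a round figure of 1000 indexes; `shard_total` is a hypothetical helper, and replicas are not counted.

```shell
# Total primary shards = number of indexes x primary shards per index.
# shard_total is a hypothetical helper used only for this illustration.
shard_total() {
    echo $(( $1 * $2 ))
}

shard_total 1000 5   # 1000 indexes with 5 primary shards each -> 5000
shard_total 1000 2   # the same indexes with 2 primary shards each -> 2000
```

The total grows linearly with the per-index shard count, so lowering that count is the most direct lever.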

After collecting a few days of statistics with the new index patterns, we can see that:

  • Each swh_workers index only uses 6-7GB at most, well below the recommended maximum shard size of 40-50GB.
  • Some swh_workers indexes contain more than 6 million documents, a number which could still grow in the future.

It is recommended not to have more than 3-5 million documents per shard.

Given all the above data, having two shards per swh_workers index is reasonable.
The following shell code adds a template to the Elasticsearch cluster and sets the number of shards per swh_workers index to 2:

curl -i -H'Content-Type: application/json' -XPUT http://localhost:9200/_template/template_swh_workers -d '
{
    "template" : "swh_workers-*",
    "settings" : { "number_of_shards" : 2 }
}'
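
The sizing argument above can be checked with integer arithmetic. This is only a sketch: `min_shards` is a hypothetical helper, and the inputs are the round figures quoted in the description (6 million documents, a ceiling of 3 to 5 million documents per shard).

```shell
# Smallest shard count that keeps every shard at or below a document ceiling,
# i.e. ceil(docs / max_per_shard) using integer arithmetic.
# min_shards is a hypothetical helper used only for this illustration.
min_shards() {
    docs=$1
    max_per_shard=$2
    echo $(( (docs + max_per_shard - 1) / max_per_shard ))
}

min_shards 6000000 5000000   # ceiling of 5 million documents per shard -> 2
min_shards 6000000 3000000   # ceiling of 3 million documents per shard -> 2
```

Both ends of the recommended range give 2 shards for 6 million documents. An index growing well past that figure would need more under the stricter ceiling.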

Event Timeline

ftigeot triaged this task as Normal priority. Mar 9 2018, 4:31 PM
ftigeot created this task.

systemlogs-* indexes are supposed to be deleted after three months, so their total number of shards will stay limited and they shouldn't have a strong detrimental impact on future cluster health.
There is no need to bother changing the default number of shards for them.

Change applied on banco today.

Shards only consume resources when the indexes are open. We don't need indexes with historical data to be open at all times, just when we occasionally need to query them.

In addition to this change, I think it would make sense to configure https://github.com/elastic/curator to close old indexes automatically, keeping live resources for recent data.
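
For reference, closing an index by hand goes through the Elasticsearch close index API (`POST /<index>/_close`). The sketch below uses that API directly, not Curator. The index name `swh_workers-2018.01.01` is hypothetical (the actual suffix format is not stated here), as are the `ES_URL` and `DRY_RUN` variables and the `close_index` helper. By default it only prints the command it would run.

```shell
# Assumed endpoint; adjust to the real cluster address.
ES_URL="${ES_URL:-http://localhost:9200}"
# Dry-run by default: print the command instead of contacting the cluster.
DRY_RUN="${DRY_RUN:-1}"

# close_index is a hypothetical helper wrapping POST /<index>/_close.
close_index() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "curl -s -XPOST $ES_URL/$1/_close"
    else
        curl -s -XPOST "$ES_URL/$1/_close"
    fi
}

# Hypothetical index name, for illustration only.
close_index "swh_workers-2018.01.01"
```

A closed index can be reopened with `POST /<index>/_open` when it needs to be queried again.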

ftigeot renamed this task from Tune the number of shards per index to Tune index parameters. Mar 21 2018, 4:23 PM

Since we do not need to perform immediate analysis on incoming data, we can relax the refresh_interval parameter. This will create fewer Lucene segments per index and ultimately reduce the number of I/O operations a bit.

Data is also compressed with the LZ4 algorithm by default. We can use the best_compression (DEFLATE) codec instead and get smaller indexes.

Shell code describing the new template:

curl -i -H'Content-Type: application/json' -XPUT http://192.168.101.58:9200/_template/template_swh_workers -d '
{
    "template" : "swh_workers-*",
    "settings" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 1,
        "refresh_interval" : "30s",
        "codec" : "best_compression"
    }
}'

This template has been applied to the existing cluster on banco.internal.softwareheritage.org.
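
To check the result, the stored template and the effective settings of matching indexes can be read back with `GET /_template/<name>` and `GET /<index>/_settings`. Keep in mind that index templates only apply to indexes created after the template was stored. The sketch below only prints the read-only commands to run. `ES_URL`, `template_url` and `settings_url` are hypothetical names, and the localhost address is an assumption.

```shell
# Assumed endpoint; adjust to the real cluster address.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helpers building the read-only URLs.
template_url() { echo "$ES_URL/_template/$1?pretty"; }
settings_url() { echo "$ES_URL/$1/_settings?pretty"; }

# Print the verification commands instead of running them.
echo "curl -s '$(template_url template_swh_workers)'"
echo "curl -s '$(settings_url 'swh_workers-*')'"
```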