
Reindex old data on banco to put it into swh_worker indexes
Closed, Migrated

Description

Historical data on banco has been stored in generic logstash-${date} indexes.
These indexes now contain deleted documents and are not compressed as well as they could be, wasting precious storage space.

This is the proposed reindexing process:

1. Change index template to improve reindexing speed
-----------------------------------------------------

curl -i -H'Content-Type: application/json' -XPUT http://192.168.101.58:9200/_template/template_swh_workers -d '
{
    "template" : "swh_workers-*",
    "settings" : {
	"number_of_shards" : 2,
	"number_of_replicas" : 0,
	"refresh_interval" : -1,
	"codec" : "best_compression"
    }
}'
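
To confirm the template was registered as intended, it can be read back (a quick sanity check, same node assumed):

curl -XGET 'http://192.168.101.58:9200/_template/template_swh_workers?pretty'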

2. Reindex
----------

time curl -i -H'Content-Type: application/json' -XPOST http://192.168.101.58:9200/_reindex -d '
{
    "source": { "index": "logstash-2017.03.08" },
    "dest":   { "index": "swh_workers-2017.03.08" }
}'
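
For large indexes, the synchronous call above can run for a long time. As an alternative (a sketch, same node assumed), the reindex can be launched in the background:

curl -i -H'Content-Type: application/json' -XPOST 'http://192.168.101.58:9200/_reindex?wait_for_completion=false' -d '
{
    "source": { "index": "logstash-2017.03.08" },
    "dest":   { "index": "swh_workers-2017.03.08" }
}'

This returns a {"task": "<node_id>:<task_number>"} handle; progress can then be polled through the tasks API (the task id below is a placeholder):

curl 'http://192.168.101.58:9200/_tasks/<task_id>?pretty'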

3. Add back replicas to index shards
------------------------------------

curl -i -H'Content-Type: application/json' -XPUT http://192.168.101.58:9200/swh_workers-2017.03.08/_settings -d '
{
    "index" : { "number_of_replicas" : 1 }
}'
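
Because the index was created with refresh disabled by the temporary template, its refresh interval should be restored as well; replica allocation can then be waited on via the cluster health API (a sketch covering both steps):

curl -i -H'Content-Type: application/json' -XPUT http://192.168.101.58:9200/swh_workers-2017.03.08/_settings -d '
{
    "index" : { "refresh_interval" : "30s" }
}'

curl 'http://192.168.101.58:9200/_cluster/health/swh_workers-2017.03.08?wait_for_status=green&pretty'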

4. Change index template back to sane defaults
----------------------------------------------

curl -i -H'Content-Type: application/json' -XPUT http://192.168.101.58:9200/_template/template_swh_workers -d '
{
    "template" : "swh_workers-*",
    "settings" : {
	"number_of_shards" : 2,
	"number_of_replicas" : 1,
	"refresh_interval" : "30s",
	"codec" : "best_compression"
    }
}'
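
As an optional extra step (an assumption, not part of the original procedure; _forcemerge is I/O-heavy and best run off-peak), a reindexed index can be merged down to a single segment so that best_compression and the expunging of deleted documents yield their full space savings:

curl -XPOST 'http://192.168.101.58:9200/swh_workers-2017.03.08/_forcemerge?max_num_segments=1'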

Event Timeline

ftigeot triaged this task as Normal priority. Mar 21 2018, 5:13 PM
ftigeot created this task.

In order to improve reindexing speed, replicas are initially disabled for new indexes.
Do not delete the historical logstash-* indexes before making sure the new indexes have been reconfigured to use at least one replica per shard and the shards have actually been replicated.

Do we really care that much about reindexing speed? This reindexing is a one-shot, unattended process.

If we really do, could we make the replica-less template pattern narrower, so it only applies to indexes holding historical data? As the reindexing will likely take days, it'd be nice to avoid creating the new daily indexes with an unsafe configuration; see the sketch below.
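
A narrower variant could look like this sketch (the template name is hypothetical; the higher "order" makes it override the default swh_workers-* template where both match):

curl -i -H'Content-Type: application/json' -XPUT http://192.168.101.58:9200/_template/template_swh_workers_reindex -d '
{
    "template" : "swh_workers-2017.*",
    "order" : 1,
    "settings" : {
        "number_of_replicas" : 0,
        "refresh_interval" : -1
    }
}'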

If we do it in batches and re-apply the regular template before midnight, there shouldn't be any issue.
Templates are only used at index creation time.

Many logstash-$date indexes effectively have 0 documents left after the initial systemlog data deletion phase of T977 and can simply be deleted; see the sketch below.
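
One way to spot and remove them is to list per-index document counts with the _cat API, then delete the empty ones (the date below is just an example):

curl 'http://esnode1.internal.softwareheritage.org:9200/_cat/indices/logstash-*?h=index,docs.count'
curl -XDELETE 'http://esnode1.internal.softwareheritage.org:9200/logstash-2017.01.01'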

Some logstash-xxx indexes appear to still contain information unrelated to swh_workers.
Blindly reindexing them as described in this task will not be enough.

Example of an Elasticsearch query showing non-swh_worker data:

curl -i -H'Content-Type: application/json' -XGET http://esnode1.internal.softwareheritage.org:9200/swh_workers-2018.03.07/_search -d '
{
	"query" : { "bool" : { "must_not" : [{ "match" : { "systemd_unit" : "swh-worker@" }}] }}
}'

ftigeot changed the task status from Open to Work in Progress. Jun 20 2018, 11:14 AM

Some logstash indexes cannot be reindexed because documents hold values that conflict with the destination mapping (here, a string in a field mapped as boolean).
Part of the Elasticsearch error message:

"mapper_parsing_exception","reason":"failed to parse [swh_logging_args_return_value]","caused_by":{"type":"illegal_argument_exception","reason":"Failed to parse value [None] as only [true] or [false] are allowed."}}

Logstash appears to be a good substitute for the failing Elasticsearch reindex API:
it simply skips documents that the destination Elasticsearch cluster rejects as invalid.

Logstash invocation:

logstash -f logstash-reindex.conf

Configuration used:

input {
    # Pull every document from the legacy index, keeping its
    # metadata (_id, _index, _type) available under @metadata.
    elasticsearch {
        hosts => "http://esnode1.internal.softwareheritage.org:9200"
        index => "logstash-2017.04.19"
        query => '{ "query": { "match_all": {} } }'
        scroll => "30m"
        docinfo => true
    }
}

output {
    # Write to the new index, reusing the original document id
    # so that re-runs overwrite instead of duplicating documents.
    elasticsearch {
        hosts => "http://esnode2.internal.softwareheritage.org:9200"
        index => "swh_workers-2017.04.19"
        document_id => "%{[@metadata][_id]}"
    }
}
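
After the run, source and destination document counts can be compared to see how many invalid documents were skipped (a sanity check, hosts as in the configuration above):

curl 'http://esnode1.internal.softwareheritage.org:9200/_cat/count/logstash-2017.04.19?v'
curl 'http://esnode2.internal.softwareheritage.org:9200/_cat/count/swh_workers-2017.04.19?v'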

All documents reindexed.
Some legacy logstash-* indexes containing tens of thousands of invalid documents have been kept for further analysis.