Delete old system log data from the Elasticsearch cluster
Closed, Migrated

Description

The elasticsearch cluster on banco.internal.softwareheritage.org contains

  • old system log data
  • test data injected by mistake from my laptop

It would be nice to delete unneeded documents at some point.

Proposed request to clean up test data:

curl -i -H'Content-Type: application/json' -XPOST "http://localhost:9200/_all/_delete_by_query/?pretty=true" -d '
{
    "query" : {
        "match" : { "hostname" : "hplaptopft0" }
    }
}'

Proposed request to clean up old system log data:

curl -i -H'Content-Type: application/json' -XPOST "http://localhost:9200/_all/_delete_by_query/?pretty=true" -d '
{
    "query" : {
	"bool": {
	    "must_not": [{ "match" : { "systemd_unit" : "swh-worker@" } }],
	    "must": { "range" : { "@timestamp" : { "lt" : "now-3M" }}}
	}
    }
}'
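
Before deleting anything, the same query body can be sent to the _count endpoint to preview how many documents would match. The following is a minimal sketch, not something that was run as part of this task; the ES_URL variable and the count_matching helper are hypothetical names introduced for illustration.

```shell
# Assumption: ES_URL points at any node of the cluster (localhost as in the requests above).
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helper: count documents matching a query without deleting them.
#   $1: index name or pattern (e.g. _all)
#   $2: JSON query body
count_matching() {
    curl -s -H 'Content-Type: application/json' \
         -XPOST "${ES_URL}/$1/_count?pretty=true" -d "$2"
}

# Same query as the old system log cleanup request above.
OLD_LOGS_QUERY='{
  "query": {
    "bool": {
      "must_not": [{ "match": { "systemd_unit": "swh-worker@" } }],
      "must": { "range": { "@timestamp": { "lt": "now-3M" } } }
    }
  }
}'
```

Running count_matching _all "$OLD_LOGS_QUERY" should report the number of matching documents, which can then be compared with the "deleted" figure returned by the actual deletion request.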

Remark: closed Elasticsearch indices are not processed.
In order to delete documents from closed indices, we have to reopen them first.
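
Reopening and re-closing an index can be done with the _open and _close index endpoints. This is a sketch under assumptions: ES_URL and the two helper names are hypothetical, and the index name in the usage note is only an example.

```shell
# Assumption: ES_URL points at any node of the cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Reopen a closed index so that _delete_by_query can process it.
#   $1: index name
open_index() {
    curl -s -XPOST "${ES_URL}/$1/_open?pretty=true"
}

# Close the index again once the cleanup is done.
#   $1: index name
close_index() {
    curl -s -XPOST "${ES_URL}/$1/_close?pretty=true"
}
```

For example, open_index logstash-2017.01.15 would be followed by the deletion request for that index, then by close_index logstash-2017.01.15 if the index should stay closed.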

Event Timeline

The queries look reasonable; you should go ahead with them.

Test data cleaned up this day.

System logs from 2017-01 cleaned up this day. 15,530 documents deleted.

ftigeot changed the task status from Open to Work in Progress. Mar 7 2018, 4:58 PM

System logs from 2017-02 cleaned up this day. 13,435,713 documents deleted.

System logs from 2017-03 cleaned up this day. 13,474,622 documents deleted.

System logs up to 2017-05 cleaned up this day. 25,844,268 documents deleted.

System logs up to 2017-06 cleaned up this day. 25,492,157 documents deleted.

System logs up to 2017-07 cleaned up this day. 24,191,557 documents deleted.

System logs up to 2017-08 cleaned up this day. 13,175,880 documents deleted.

System logs up to 2017-09 cleaned up this day. 15,998,132 documents deleted.

System logs up to 2017-10 cleaned up this day. 23,889,499 documents deleted.

System logs up to 2017-11 purged. 46,110,900 documents deleted.

All remaining system logs from 2017 cleaned up this day. 31,214,858 documents deleted.

Some of the first logstash-${date} indexes became empty (zero non-deleted documents) and could simply be deleted as-is.
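
One way to spot such empty indexes is to list live and deleted document counts with the _cat/indices endpoint. The sketch below is illustrative only: ES_URL and the helper names are hypothetical, and the output of empty_indices should be reviewed by hand before anything is dropped.

```shell
# Assumption: ES_URL points at any node of the cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print one line per index: name, live documents, deleted-but-not-yet-merged documents.
#   $1: index pattern (e.g. 'logstash-*')
list_doc_counts() {
    curl -s "${ES_URL}/_cat/indices/$1?h=index,docs.count,docs.deleted"
}

# Print the names of indexes that report exactly zero live documents.
# Lines without all three columns (e.g. closed indexes) are skipped.
empty_indices() {
    list_doc_counts "$1" | awk 'NF == 3 && $2 == "0" { print $1 }'
}

# Delete a whole index. This is irreversible.
#   $1: index name
delete_index() {
    curl -s -XDELETE "${ES_URL}/$1?pretty=true"
}
```

Deleting an entire index is much cheaper for the cluster than deleting its documents one by one, which is why empty indexes are worth removing this way.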

The swh_workers-2018.03.07 index contained non-swh-workers documents and was cleaned this way:

curl -i -H'Content-Type: application/json' \
     -XPOST "http://esnode3.internal.softwareheritage.org:9200/swh_workers-2018.03.07/_delete_by_query?pretty=true" -d '
{
	"query" : { "bool" : { "must_not" : [{ "match" : { "systemd_unit" : "swh-worker@" }}] }}
}'

Even though all delete requests were previously successfully processed, non-swh-workers data remain in the legacy logstash-* indexes.
This is not entirely unexpected: resource limitations may have prevented the old Banco node from processing all deletion requests within a bounded time frame.
Deletion queries will be rerun index by index in this way:

curl -i -H'Content-Type: application/json' \
     -XPOST "http://esnode2.internal.softwareheritage.org:9200/logstash-2018.02.31/_delete_by_query?pretty=true" -d '
{
	"query" : { "bool" : { "must_not" : [{ "match" : { "systemd_unit" : "swh-worker@" }}] }}
}'
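
The index-by-index rerun could be scripted as a loop over the index names returned by _cat/indices. This is a sketch, not the script actually used: ES_URL, NON_WORKER_QUERY and both function names are hypothetical, while the query body is the one from the request above.

```shell
# Assumption: ES_URL points at any node of the cluster.
ES_URL="${ES_URL:-http://esnode2.internal.softwareheritage.org:9200}"

# Same query as above: everything that is not an swh-worker log.
NON_WORKER_QUERY='{
  "query": { "bool": { "must_not": [{ "match": { "systemd_unit": "swh-worker@" } }] } }
}'

# Delete non-swh-worker documents from a single index.
#   $1: index name
delete_non_worker_docs() {
    curl -s -H 'Content-Type: application/json' \
         -XPOST "${ES_URL}/$1/_delete_by_query?pretty=true" \
         -d "$NON_WORKER_QUERY" </dev/null
}

# Run the deletion on every index matching a pattern, one index at a time.
#   $1: index pattern (e.g. 'logstash-2018.02.*')
clean_indices() {
    curl -s "${ES_URL}/_cat/indices/$1?h=index" | sort | while read -r index; do
        echo "Cleaning ${index}"
        delete_non_worker_docs "$index"
    done
}
```

Because each request blocks until its index is processed, the loop never has more than one deletion in flight, which keeps the load on the cluster bounded.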

Deleting old documents seems to take a heavy toll on the cluster.
So far, for every month of old logstash indexes cleaned, at least one cluster node started to misbehave and had to be restarted after excessive timeouts and/or other issues, including constant garbage collection and disk thrashing.
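
A possible mitigation, which was not tried in this task, is to throttle the deletion and run it as a background task. The sketch below assumes the cluster runs an Elasticsearch version whose _delete_by_query endpoint accepts the requests_per_second and wait_for_completion parameters; ES_URL, the helper names and the default rate of 1000 are hypothetical.

```shell
# Assumption: ES_URL points at any node of the cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Start a throttled deletion in the background; the response contains a task id.
#   $1: index name
#   $2: JSON query body
#   $3: maximum documents per second (illustrative default: 1000)
throttled_delete() {
    curl -s -H 'Content-Type: application/json' \
         -XPOST "${ES_URL}/$1/_delete_by_query?requests_per_second=${3:-1000}&wait_for_completion=false" \
         -d "$2"
}

# Check the progress of a background deletion.
#   $1: task id as returned by throttled_delete (node_id:task_number)
task_status() {
    curl -s "${ES_URL}/_tasks/$1?pretty=true"
}
```

A lower rate lengthens the cleanup but gives the nodes more room for indexing and garbage collection between delete batches.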

All remaining non-swh-worker logs deleted from legacy logstash-* indexes.