Monitor daily indexes are present on the log cluster and logs are correctly ingested
Closed, Migrated

Description

Raise an alert if the daily indexes have not been created on the Elasticsearch cluster or if no logs have been received in the last XX minutes.

Event Timeline

vsellier triaged this task as Normal priority. (Apr 8 2021, 4:32 PM)
vsellier created this task.
vsellier changed the task status from Open to Work in Progress. (Apr 23 2021, 4:09 PM)
vsellier claimed this task.
vsellier edited projects, added System administration; removed System administrators.
vsellier moved this task from Backlog to in-progress on the System administration board.
vsellier removed a subscriber: vsellier.

I checked the icinga_logstash plugin[1] to see if it could help, but it is more oriented towards logstash instances used to ingest data from log files. There is no option to check the number of events received/sent, for example.

[1] https://exchange.icinga.com/twidhalm/check_logstash

logstash now exposes an API server[1], which seems to return some interesting metrics about plugin behavior.
For example, there is a section for the elasticsearch output plugin:

  "outputs": [
    {
      "id": "62d11c4234b8981da77a97955da92ac9de92b9a6dcd4582f407face31fd5c664",
      "events": {
        "duration_in_millis": 160089636,
        "in": 72818126,
        "out": 72818046
      },
      "bulk_requests": {
        "responses": {
          "200": 3860888
        },
        "successes": 3860888
      },
      "documents": {
        "successes": 72818046
      },
      "name": "elasticsearch"
    }
  ]
},

As a first step, I'll try to implement a small Python script that checks whether there are response codes other than 200, to identify the behavior.
It may also be interesting to check other properties, like the queue size:

"queue": {
  "type": "memory",
  "events_count": 0,
  "queue_size_in_bytes": 0,
  "max_queue_size_in_bytes": 0
},

[1] curl http://localhost:9600/_node/stats/ | jq ''
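
The first idea above (looking for bulk response codes other than 200) could be sketched in Python roughly as follows. This is only an illustration, not the deployed script: the function name unexpected_bulk_responses is made up here, and it assumes the "outputs" list has already been loaded from the stats JSON shown earlier.

```python
def unexpected_bulk_responses(outputs):
    """Return {output_id: {code: count}} for every non-200 bulk response.

    `outputs` is the "outputs" list from the logstash stats JSON.
    Only the elasticsearch output plugin is inspected.
    """
    unexpected = {}
    for output in outputs:
        if output.get("name") != "elasticsearch":
            continue
        # Response codes are JSON keys, so they are strings ("200").
        responses = output.get("bulk_requests", {}).get("responses", {})
        bad = {code: count for code, count in responses.items() if code != "200"}
        if bad:
            unexpected[output["id"]] = bad
    return unexpected
```

An empty result means that only 200 responses were recorded since logstash started.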

I have simulated different situations locally in the vagrant environment:

root@logstash0:~# curl -s http://localhost:9600/_node/stats/pipelines | jq '.pipelines.main.plugins.outputs'
[
  {
    "id": "c49a6902391a456022af4c89f0972781900d01d70cd5f312b292cb20c0d345eb",
    "documents": {
      "non_retryable_failures": 112,
      "successes": 103692
    },
    "events": {
      "out": 103804,
      "in": 103804,
      "duration_in_millis": 3529049
    },
    "name": "elasticsearch",
    "bulk_requests": {
      "responses": {
        "200": 2028
      },
      "failures": 3,
      "with_errors": 110,
      "successes": 1918
    }
  }
]

When the logs can't be ingested because elasticsearch is unavailable, the bulk_requests.failures counter increases.
When ES is available but the index is not, for example with a closed or frozen index, or in the T3219 case, the documents.non_retryable_failures and bulk_requests.with_errors counters increase.

After looking into how this can be integrated with the icinga checks, the simplest way I have found is to create a script that periodically queries logstash to get the statistics and returns one of these statuses:

  • GREEN: none of the non_retryable_failures, with_errors or failures fields is found in the JSON
  • WARNING: the failures field is found
  • CRITICAL: the non_retryable_failures or with_errors field is found
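
A minimal Python sketch of this status logic, under a few assumptions: the function names fetch_outputs and classify are hypothetical, the pipeline is named "main" as in the curl output above, and GREEN is mapped to the usual monitoring plugin exit code 0 (OK), with 1 for WARNING and 2 for CRITICAL.

```python
import json
import urllib.request

# Same endpoint as the curl command above.
STATS_URL = "http://localhost:9600/_node/stats/pipelines"

# Conventional monitoring plugin exit codes; GREEN is mapped to OK.
OK, WARNING, CRITICAL = 0, 1, 2


def fetch_outputs(url=STATS_URL, pipeline="main", timeout=10):
    """Fetch the output plugin statistics of one pipeline from logstash."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        stats = json.load(response)
    return stats["pipelines"][pipeline]["plugins"]["outputs"]


def classify(outputs):
    """Map the presence of the error fields to a status code."""
    status = OK
    for output in outputs:
        if output.get("name") != "elasticsearch":
            continue
        documents = output.get("documents", {})
        bulk = output.get("bulk_requests", {})
        if "non_retryable_failures" in documents or "with_errors" in bulk:
            return CRITICAL  # ES answered but refused some documents
        if "failures" in bulk:
            status = WARNING  # ES could not be reached at some point
    return status
```

A wrapper script could then pass classify(fetch_outputs()) to sys.exit() so that icinga reads the status from the exit code.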

The drawback is that logstash needs to be restarted to reset the status of the check. If there are too many alerts, we could improve the script by introducing some kind of state, so that the alert is raised only when the counters are increasing.
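
The stateful variant could look like the sketch below. It is an assumption of how such a state might be kept, not something that was implemented: the counters from the previous run are stored in a JSON file (the state_file path is chosen by the caller) and only counters that grew since then are reported. The function names are hypothetical.

```python
import json
from pathlib import Path

# (section, field) pairs of the error counters in one output plugin entry.
COUNTERS = (
    ("documents", "non_retryable_failures"),
    ("bulk_requests", "with_errors"),
    ("bulk_requests", "failures"),
)


def read_counters(output):
    """Extract the error counters, treating a missing field as 0."""
    return {field: output.get(section, {}).get(field, 0)
            for section, field in COUNTERS}


def increased_counters(previous, current):
    """Return the sorted names of the counters that grew since the last run."""
    return sorted(name for name, value in current.items()
                  if value > previous.get(name, 0))


def check_with_state(output, state_file):
    """Compare the counters with the saved ones, then save the new values."""
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    current = read_counters(output)
    path.write_text(json.dumps(current))
    return increased_counters(previous, current)
```

With this approach a logstash restart (counters back to 0) does not raise an alert, and an old error stops alerting as soon as the counters are stable between two runs.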

The new probe is deployed but nothing is displayed in icinga. Let's start a configuration debug session.

According to the API (TIL the catalog can be queried this way), journal0 doesn't have the new plugins declared, so the check is presumably disabled, since its filter uses this field:

curl -k -s -u root:<in the cred store> https://pergamon.internal.softwareheritage.org:5665/v1/objects/hosts\?hosts\=logstash0.internal.softwareheritage.org | jq '.results[].attrs.vars.plugins' 
[
  "check_journal",
  "check_newest_file_age"
]
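
For reference, the same lookup as the jq filter can be done in Python on the decoded API response. This is a small sketch: the function name declared_plugins is made up, and it only assumes the results[].attrs.vars.plugins layout visible in the command above.

```python
def declared_plugins(api_response):
    """Collect vars.plugins from every host in a decoded API response."""
    plugins = set()
    for result in api_response.get("results", []):
        # vars (or plugins) can be missing or null on a host.
        host_vars = result.get("attrs", {}).get("vars") or {}
        plugins.update(host_vars.get("plugins") or [])
    return plugins
```

Testing whether a given plugin name is in the returned set tells whether the filter will enable the corresponding check on that host.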

The check is now active.
An alert will be raised by icinga if:

  • logstash is not responding to the API call
  • at least one error is detected when the logs are sent to elasticsearch (ES is responding, but an error is detected when the log is stored in the index).