Monitor daily indexes are present on the log cluster and logs are correctly ingested
Closed, Migrated

Assigned To

Authored By

	vsellier
	Apr 8 2021, 4:32 PM

Description

Raise an alert if the daily indexes were not created on the elasticsearch cluster or if there are no logs in the last XXmn

Revisions and Commits

rSPSITE puppet-swh-site
	D5718	rSPSITE39dd72bd4820 Concatenate global and agent plugins list
	D5716	rSPSITE4dd46189457a monitoring: activate the logstash probe via a filter on the plugins
	D5709	rSPSITE6657bf88897d Add a monitoring alert when logstash is failing to send logs to ES

Related Objects

		Status	Assigned	Task
		Migrated	gitlab-migration	T3219 No logs are ingested on elasticsearch since 2021-03-26
		Migrated	gitlab-migration	T3222 Monitor daily indexes are present on the log cluster and logs are correctly ingested

Event Timeline

vsellier triaged this task as Normal priority.Apr 8 2021, 4:32 PM

vsellier created this task.

vsellier changed the task status from Open to Work in Progress.Apr 23 2021, 4:09 PM

vsellier claimed this task.

vsellier edited projects, added System administration; removed System administrators.

vsellier moved this task from Backlog to in-progress on the System administration board.

vsellier removed a subscriber: vsellier.

I checked the icinga_logstash plugin[1] to see whether it could help, but it is geared toward logstash instances that ingest data from log files. It has no option to check the number of events received or sent, for example.

[1] https://exchange.icinga.com/twidhalm/check_logstash

logstash now exposes an api server[1] which seems to return some interesting metrics on the plugin behaviors.
For example, there is a section for the elasticsearch output plugin:

  "outputs": [
    {
      "id": "62d11c4234b8981da77a97955da92ac9de92b9a6dcd4582f407face31fd5c664",
      "events": {
        "duration_in_millis": 160089636,
        "in": 72818126,
        "out": 72818046
      },
      "bulk_requests": {
        "responses": {
          "200": 3860888
        },
        "successes": 3860888
      },
      "documents": {
        "successes": 72818046
      },
      "name": "elasticsearch"
    }
  ]
},

As a first step, I'll try to implement a small python script that checks whether there are response codes other than 200, to identify the behavior.
It may also be worth checking other properties such as the queue size:

"queue": {
  "type": "memory",
  "events_count": 0,
  "queue_size_in_bytes": 0,
  "max_queue_size_in_bytes": 0
},

[1] curl http://localhost:9600/_node/stats/ | jq ''
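
A minimal sketch of that first check, assuming the stats JSON has already been fetched and parsed into a dict. The function names are hypothetical and are not taken from the deployed probe; only the field names come from the output shown above.

```python
from typing import Any, Dict, Optional


def non_200_responses(output: Dict[str, Any]) -> Dict[str, int]:
    """Return the bulk response codes other than 200, with their counts."""
    responses = output.get("bulk_requests", {}).get("responses", {})
    return {code: count for code, count in responses.items() if code != "200"}


def queue_fill_ratio(queue: Dict[str, Any]) -> Optional[float]:
    """Return the used fraction of the queue, or None when no maximum is reported."""
    max_bytes = queue.get("max_queue_size_in_bytes", 0)
    if not max_bytes:
        # A memory queue reports 0 here, so a ratio is not meaningful.
        return None
    return queue.get("queue_size_in_bytes", 0) / max_bytes


# Values copied from the elasticsearch output section shown above.
sample_output = {
    "events": {"duration_in_millis": 160089636, "in": 72818126, "out": 72818046},
    "bulk_requests": {"responses": {"200": 3860888}, "successes": 3860888},
    "documents": {"successes": 72818046},
    "name": "elasticsearch",
}
unexpected = non_200_responses(sample_output)
```

With the sample above, unexpected is empty because only 200 responses were recorded.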

I have simulated different situations locally on the vagrant environment:

root@logstash0:~# curl -s http://localhost:9600/_node/stats/pipelines | jq '.pipelines.main.plugins.outputs'
[
  {
    "id": "c49a6902391a456022af4c89f0972781900d01d70cd5f312b292cb20c0d345eb",
    "documents": {
      "non_retryable_failures": 112,
      "successes": 103692
    },
    "events": {
      "out": 103804,
      "in": 103804,
      "duration_in_millis": 3529049
    },
    "name": "elasticsearch",
    "bulk_requests": {
      "responses": {
        "200": 2028
      },
      "failures": 3,
      "with_errors": 110,
      "successes": 1918
    }
  }
]

When the logs can't be ingested because elasticsearch is unavailable, the failures counter increases.
When ES is available but the index is not, for example with a closed or frozen index, or in the T3219 case, the counters documents.non_retryable_failures and with_errors increase.

After looking at how this can be integrated with the icinga checks, the simplest way I have found is to create a script that periodically queries logstash for its statistics and returns one of these statuses:

GREEN: none of the non_retryable_failures, with_errors or failures fields is found in the json
WARNING: failures field found
CRITICAL: non_retryable_failures or with_errors field found
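
The mapping above could be sketched as follows. This is an illustration under assumptions, not the script that was deployed: the function names and the "main" pipeline default are hypothetical, a missing field is treated as zero, and GREEN is mapped to the usual OK exit code (0) of Nagios-style plugins, with 1 for WARNING and 2 for CRITICAL.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.request import urlopen

OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def fetch_outputs(url: str = "http://localhost:9600/_node/stats/pipelines",
                  pipeline: str = "main",
                  timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Fetch the output plugin statistics of one pipeline from the logstash API."""
    with urlopen(url, timeout=timeout) as response:
        stats = json.load(response)
    return stats["pipelines"][pipeline]["plugins"]["outputs"]


def evaluate(outputs: List[Dict[str, Any]]) -> Tuple[int, str]:
    """Map the elasticsearch output counters to an exit code and a message."""
    status, details = OK, []
    for output in outputs:
        if output.get("name") != "elasticsearch":
            continue
        documents = output.get("documents", {})
        bulk = output.get("bulk_requests", {})
        non_retryable = documents.get("non_retryable_failures", 0)
        with_errors = bulk.get("with_errors", 0)
        failures = bulk.get("failures", 0)
        if non_retryable or with_errors:
            status = max(status, CRITICAL)
        elif failures:
            status = max(status, WARNING)
        details.append(
            f"non_retryable_failures={non_retryable} "
            f"with_errors={with_errors} failures={failures}"
        )
    summary = "; ".join(details) or "no elasticsearch output found"
    return status, f"{LABELS[status]} - {summary}"
```

A wrapper would call evaluate(fetch_outputs()), print the message and exit with the returned code. How a connection error to the logstash API should be reported is left to that wrapper.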

The drawback is that logstash needs to be restarted to reset the status of the check. If there are too many alerts, we could improve the script by introducing some kind of state so that the alert is raised only when the counters are increasing.
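
A possible shape for that stateful variant, as a sketch only: the previous counter values are kept in a JSON file and only the increases since the last run are reported. The function names and the state file are hypothetical. A counter that goes down is assumed to mean logstash was restarted, so the comparison restarts from zero. On the very first run there is no stored state, so any non-zero counter is reported.

```python
import json
from pathlib import Path
from typing import Dict

COUNTERS = ("non_retryable_failures", "with_errors", "failures")


def increased_counters(previous: Dict[str, int],
                       current: Dict[str, int]) -> Dict[str, int]:
    """Return, for each error counter that grew, how much it grew."""
    increases = {}
    for name in COUNTERS:
        before = previous.get(name, 0)
        now = current.get(name, 0)
        if now < before:
            # Counters went backwards: assume a logstash restart reset them.
            before = 0
        if now > before:
            increases[name] = now - before
    return increases


def check_with_state(current: Dict[str, int], state_file: Path) -> Dict[str, int]:
    """Compare current counters with the stored ones, then store the new values."""
    previous = json.loads(state_file.read_text()) if state_file.exists() else {}
    state_file.write_text(json.dumps(current))
    return increased_counters(previous, current)
```

An empty result means nothing changed since the previous run, so the check could return OK without restarting logstash.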

vsellier added a revision: D5709: Add a monitoring alert when logstash is failing to send logs to ES.May 7 2021, 12:09 PM

vsellier added a commit: rSPSITE6657bf88897d: Add a monitoring alert when logstash is failing to send logs to ES.May 7 2021, 2:54 PM

The new probe is deployed but nothing is displayed in icinga. Let's start a configuration debug session.

According to the API (TIL the catalog can be requested like that), journal0 doesn't have the new plugins declared, so the check is presumably disabled because the filter relies on this field:

curl -k -s -u root:<in the cred store> https://pergamon.internal.softwareheritage.org:5665/v1/objects/hosts\?hosts\=logstash0.internal.softwareheritage.org | jq '.results[].attrs.vars.plugins' 
[
  "check_journal",
  "check_newest_file_age"
]
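
The same verification can be scripted once the API response has been parsed. The sketch below assumes the response layout implied by the jq filter above (results[].attrs.vars.plugins) and that each result carries a name field. The function name and the plugin name used in the example are hypothetical.

```python
from typing import Any, Dict, List


def hosts_missing_plugin(api_response: Dict[str, Any], plugin: str) -> List[str]:
    """List the hosts whose vars.plugins does not contain the given plugin."""
    missing = []
    for result in api_response.get("results", []):
        host_vars = result.get("attrs", {}).get("vars") or {}
        plugins = host_vars.get("plugins") or []
        if plugin not in plugins:
            missing.append(result.get("name", "<unknown>"))
    return missing


# Response shaped like the output above; "check_logstash_errors" is a made-up name.
response = {
    "results": [
        {
            "name": "logstash0.internal.softwareheritage.org",
            "attrs": {"vars": {"plugins": ["check_journal", "check_newest_file_age"]}},
        }
    ]
}
missing = hosts_missing_plugin(response, "check_logstash_errors")
```

Here missing contains the logstash0 host, which matches the observation that a filter on this field would not select it.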

vsellier added a revision: D5716: monitoring: activate the logstash probe via a filter on the plugins.May 8 2021, 3:42 PM

vsellier added a commit: rSPSITE4dd46189457a: monitoring: activate the logstash probe via a filter on the plugins.May 10 2021, 11:27 AM

vsellier added a revision: D5718: Concatenate global and agent plugins list.May 10 2021, 12:06 PM

vsellier added a commit: rSPSITE39dd72bd4820: Concatenate global and agent plugins list.May 10 2021, 2:11 PM

vsellier mentioned this in rSPSITE12982f3dcbf9: Declare the icinga check command also on the master.May 10 2021, 2:53 PM

The check is now active.
An alert will be raised by icinga if:

logstash is not responding to the api call
at least one error is detected when the logs are sent to elasticsearch (ES is responding, but an error is detected when the log is stored in the index).

vsellier mentioned this in T3223: Elasticsearch: Monitor the max opened shards on a cluster.May 10 2021, 3:02 PM

ardumont moved this task from deployed/landed/monitoring to done on the System administration board.Jul 29 2021, 1:22 PM

This task has been migrated to GitLab.
