Page MenuHomeSoftware Heritage

Ensure icinga alerts are raised if the scheduler journal client is down
AbandonedPublic

Authored by ardumont on Aug 27 2021, 5:01 PM.

Details

Summary

That service is critical for runners to schedule correctly origins to visit. So raise an
alert if something is wrong with its startup.

Related to T3502

Test Plan

octo-diff:

bin/octocatalog-diff --octocatalog-diff-args --no-truncate-details --to staging saatchi
...
+ Concat::Fragment[icinga2::object::Service::check_scheduler_journal_client] =>
   parameters =>
      "order": 60
      "target": "/etc/icinga2/zones.d/global-templates/services.conf"
      "content": >>>

apply Service "check_scheduler_journal_client" {
  import "generic-service"

  check_command = "check_systemd"
  command_endpoint = host.name
  vars.check_systemd_unit = "swh-scheduler-journal-client.service"
  assign where host.vars.os == "Linux"
  ignore where host.vars.noagent
}
<<<
*******************************************
+ Concat[/etc/icinga2/zones.d/global-templates/services.conf] =>
   parameters =>
      "backup": "puppet"
      "ensure": "present"
      "ensure_newline": false
      "force": false
      "format": "plain"
      "group": "nagios"
      "mode": "0640"
      "notify": ["Class[Icinga2::Service]"]
      "order": "alpha"
      "owner": "root"
      "path": "/etc/icinga2/zones.d/global-templates/services.conf"
      "replace": true
      "show_diff": true
      "tag": "icinga2::config::file"
      "warn": true
*******************************************
+ Concat_file[/etc/icinga2/zones.d/global-templates/services.conf] =>
   parameters =>
      "backup": "puppet"
      "ensure_newline": false
      "force": false
      "format": "plain"
      "group": "nagios"
      "mode": "0640"
      "order": "alpha"
      "owner": "root"
      "replace": true
      "show_diff": true
      "tag": "_etc_icinga2_zones.d_global-templates_services.conf"
*******************************************
+ Concat_fragment[/etc/icinga2/zones.d/global-templates/services.conf_header] =>
   parameters =>
      "order": "0"
      "tag": "_etc_icinga2_zones.d_global-templates_services.conf"
      "target": "/etc/icinga2/zones.d/global-templates/services.conf"
      "content": >>>
# This file is managed by Puppet. DO NOT EDIT.
<<<
*******************************************
+ Concat_fragment[icinga2::object::Service::check_scheduler_journal_client] =>
   parameters =>
      "order": 60
      "tag": "_etc_icinga2_zones.d_global-templates_services.conf"
      "target": "/etc/icinga2/zones.d/global-templates/services.conf"
      "content": >>>

apply Service "check_scheduler_journal_client" {
  import "generic-service"

  check_command = "check_systemd"
  command_endpoint = host.name
  vars.check_systemd_unit = "swh-scheduler-journal-client.service"
  assign where host.vars.os == "Linux"
  ignore where host.vars.noagent
}
<<<
*******************************************
+ Icinga2::Object::Service[check_scheduler_journal_client] =>
   parameters =>
      "apply": true
      "assign": ["host.vars.os == Linux"]
      "check_command": "check_systemd"
      "command_endpoint": "host.name"
      "ensure": "present"
      "ignore": ["host.vars.noagent"]
      "import": ["generic-service"]
      "name": "Check swh scheduler journal client service"
      "order": 60
      "prefix": false
      "service_name": "check_scheduler_journal_client"
      "target": "/etc/icinga2/zones.d/global-templates/services.conf"
      "template": false
      "vars": {"check_systemd_unit"=>"swh-scheduler-journal-client.service"}
*******************************************
+ Icinga2::Object[icinga2::object::Service::check_scheduler_journal_client] =>
   parameters =>
      "apply": true
      "assign": ["host.vars.os == Linux"]
      "attrs": {"check_command"=>"check_systemd", "command_endpoint"=>"host.name", "vars"=>{"check_systemd_unit"=>"swh-scheduler-journal-client.service"}}
      "attrs_list": ["display_name", "host_name", "check_command", "check_timeout", "check_interval", "check_period", "retry_interval", "max_check_attempts", "groups", "enable_notifications", "enable_active_checks", "enable_passive_checks", "enable_event_handler", "enable_flapping", "enable_perfdata", "event_command", "flapping_threshold_low", "flapping_threshold_high", "volatile", "zone", "command_endpoint", "notes", "notes_url", "action_url", "icon_image", "icon_image_alt", "vars", "Acknowledgement", "ApiBindHost", "ApiBindPort", "ApiEnvironment", "ApplicationType", "AttachDebugger", "BuildCompilerName", "BuildCompilerVersion", "BuildHostName", "Concurrency", "Critical", "Custom", "Deprecated", "Down", "DowntimeEnd", "DowntimeRemoved", "DowntimeStart", "Environment", "FlappingEnd", "FlappingStart", "HostDown", "HostUp", "IncludeConfDir", "Internal", "Json", "LocalStateDir", "LogCritical", "LogDebug", "LogInformation", "LogNotice", "LogWarning", "Math", "MaxConcurrentChecks", "ModAttrPath", "NodeName", "OK", "ObjectsPath", "PidPath", "PkgDataDir", "PlatformArchitecture", "PlatformKernel", "PlatformKernelVersion", "PlatformName", "PlatformVersion", "PrefixDir", "Problem", "Recovery", "RunAsGroup", "RunAsUser", "RunDir", "ServiceCritical", "ServiceOK", "ServiceUnknown", "ServiceWarning", "StatePath", "SysconfDir", "System", "Types", "Unknown", "Up", "UseVfork", "VarsPath", "Warning", "ZonesDir", "NodeName", "ZoneName", "TicketSalt", "PluginDir", "PluginContribDir", "ManubulonPluginDir", "name", "NodeName", "ZoneName", "TicketSalt", "PluginDir", "PluginContribDir", "ManubulonPluginDir", "name"]
      "ensure": "present"
      "ignore": ["host.vars.noagent"]
      "import": ["generic-service"]
      "object_name": "check_scheduler_journal_client"
      "object_type": "Service"
      "order": 60
      "prefix": false
      "target": "/etc/icinga2/zones.d/global-templates/services.conf"
      "template": false
*******************************************
*** End octocatalog-diff on saatchi.internal.softwareheritage.org

Diff Detail

Repository
rSPSITE puppet-swh-site
Branch
staging
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 23240
Build 36260: arc lint + arc unit

Event Timeline

ardumont created this revision.

For information, the generic check_systemd does not detect when the
swh-scheduler-journal-client is not in a happy place (process underneath killed):

root@scheduler0:~# /usr/lib/nagios/plugins/check_systemd
SYSTEMD CRITICAL - startup_time is 176.4 (outside range 0:120) | count_units=223 startup_time=176.353;60;120 units_activating=0 units_active=142 units_failed=0 units_inactive=81  # <------ current state
root@scheduler0:~# /usr/lib/nagios/plugins/check_systemd --unit swh-scheduler-journal-client.service
SYSTEMD OK - swh-scheduler-journal-client.service: active
root@scheduler0:~# systemctl status swh-scheduler-journal-client.service
● swh-scheduler-journal-client.service - Software Heritage Scheduler Journal Client
     Loaded: loaded (/etc/systemd/system/swh-scheduler-journal-client.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-08-30 08:56:28 UTC; 32min ago
   Main PID: 1890467 (swh)
      Tasks: 5 (limit: 9537)
     Memory: 36.7M
        CPU: 12.903s
     CGroup: /system.slice/swh-scheduler-journal-client.service
             └─1890467 /usr/bin/python3 /usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/journal-client.yml journal-client

Aug 30 08:56:28 scheduler0 systemd[1]: Started Software Heritage Scheduler Journal Client.
root@scheduler0:~# kill -9 1890467
root@scheduler0:~# systemctl status swh-scheduler-journal-client.service
● swh-scheduler-journal-client.service - Software Heritage Scheduler Journal Client
     Loaded: loaded (/etc/systemd/system/swh-scheduler-journal-client.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: signal) since Mon 2021-08-30 09:29:31 UTC; 992ms ago
    Process: 1890467 ExecStart=/usr/bin/swh scheduler --config-file /etc/softwareheritage/scheduler/journal-client.yml journal-client (code=killed, signal=KILL)
   Main PID: 1890467 (code=killed, signal=KILL)
        CPU: 12.941s

Aug 30 09:29:31 scheduler0 systemd[1]: swh-scheduler-journal-client.service: Failed with result 'signal'.
Aug 30 09:29:31 scheduler0 systemd[1]: swh-scheduler-journal-client.service: Consumed 12.941s CPU time.
root@scheduler0:~# /usr/lib/nagios/plugins/check_systemd
SYSTEMD CRITICAL - startup_time is 176.4 (outside range 0:120) | count_units=223 startup_time=176.353;60;120 units_activating=1 units_active=141 units_failed=0 units_inactive=81  # <---- undetected

root@scheduler0:~# /usr/lib/nagios/plugins/check_systemd --unit swh-scheduler-journal-client.service
SYSTEMD CRITICAL - swh-scheduler-journal-client.service: activating  # <---- detected
ardumont planned changes to this revision.EditedAug 30 2021, 3:58 PM

Actual test in vm says "no", so investigating why.

Debugging this to make it work properly...

I've been able to filter out on the host.name (and nothing else seems interesting to install it properly)
I found this redundant with our puppet profile though so i'm unsure about this.

apply Service "check_scheduler_journal_client" {
  import "generic-service"

  check_command = "check_systemd"
  command_endpoint = host.name
  vars.check_systemd_unit = "swh-scheduler-journal-client.service"
  assign where host.vars.os == "Linux" && host.name in ["scheduler0.internal.staging.swh.network", "saatchi.internal.softwareheritage.org", "pergamon.softwareheritage.org"]  # <--- pergamon for testing purposes in vagrant vms
  ignore where host.vars.noagent
}

I'd like to try back to install collected resources but as far as i recall my previous attempts
seemed to work in surface but the checks ended up being executed on pergamon (icinga2 master)
and not the nodes...

I'd like to try back to install collected resources but as far as i recall my previous attempts
seemed to work in surface but the checks ended up being executed on pergamon (icinga2 master)
and not the nodes...

D6163 confirms this ^

I'd like to try back to install collected resources but as far as i recall my previous attempts
seemed to work in surface but the checks ended up being executed on pergamon (icinga2 master)
and not the nodes...

D6163 confirms this ^

Using the proper missing instruction, it now works.
So i'm closing this diff as D6163 now supersedes this!