
Investigate end-to-end monitoring which no longer reports issues
Closed, Migrated

Description

I no longer see any icinga alerts from the end-to-end checks.

I noticed that the latest deposit was not properly reloaded/restarted after the last deployment (probably my mistake).
In any case, the icinga notifications should have flagged it (a timeout on the deposit end-to-end check, or something similar).

Checking the icinga node (pergamon), the following logs can be seen:

ardumont@pergamon:/var/log/icinga2% tail -f icinga2.log icinga2.log.1 error.log error.log.1 | grep -i deposit
[2020-10-21 21:28:54 +0000] information/Checkable: Checkable 'deposit.internal.staging.swh.network!apt' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2020-10-21 21:28:54 +0000] information/Notification: Sending 'Problem' notification 'deposit.internal.staging.swh.network!apt!irc-notify-all-services' for user 'root'
[2020-10-21 21:28:54 +0000] information/Notification: Completed sending 'Problem' notification 'deposit.internal.staging.swh.network!apt!irc-notify-all-services' for checkable 'deposit.internal.staging.swh.network!apt' and user 'root' using command 'irc-service-notification'.
[2020-10-21 22:58:01 +0000] information/Notification: Sending reminder 'Problem' notification 'pergamon.softwareheritage.org!Check deposit end-to-end!irc-notify-all-services' for user 'root'
[2020-10-21 22:58:01 +0000] information/Notification: Completed sending 'Problem' notification 'pergamon.softwareheritage.org!Check deposit end-to-end!irc-notify-all-services' for checkable 'pergamon.softwareheritage.org!Check deposit end-to-end' and user 'root' using command 'irc-service-notification'.
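To narrow the grep above to notification events only, here is a minimal Python sketch. It assumes the `[timestamp] severity/Component: message` layout seen in the excerpt; the names `parse_icinga_line` and `notifications_for` are illustrative helpers, not part of icinga2.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Layout assumed from the excerpt above:
# "[<timestamp>] <severity>/<component>: <message>"
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<severity>\w+)/(?P<component>\w+): (?P<message>.*)$"
)


class LogLine(NamedTuple):
    timestamp: str
    severity: str
    component: str
    message: str


def parse_icinga_line(line: str) -> Optional[LogLine]:
    """Split one log line into its parts, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogLine(m["ts"], m["severity"], m["component"], m["message"])


def notifications_for(lines: Iterable[str], needle: str) -> Iterator[LogLine]:
    """Yield Notification entries whose message mentions `needle` (case-insensitive)."""
    for line in lines:
        parsed = parse_icinga_line(line)
        if (
            parsed is not None
            and parsed.component == "Notification"
            and needle.lower() in parsed.message.lower()
        ):
            yield parsed
```

Fed with the lines of icinga2.log and the needle "deposit end-to-end", this would keep the two reminder entries from 22:58:01 and drop the apt-related ones.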

Event Timeline

ardumont created this task.
ardumont renamed this task from Investigate deposit monitoring no longer running end-to-end tests to Investigate end-to-end monitoring which no longer reports issues. Oct 22 2020, 10:30 AM

Finally found the service through the UI [1].
The issue also affects the vault end-to-end check (so I renamed the task).

There is a traceback in there:

Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.2.2', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 122, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1135, in invoke
    sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)
AttributeError: module 'swh.icinga_plugins.cli' has no attribute 'make_context'

[1] https://icinga.softwareheritage.org/monitoring/list/services?service_problem=1&sort=service_severity&dir=desc&page=2#!/monitoring/service/show?host=pergamon.softwareheritage.org&service=Check%20deposit%20end-to-end

swh.core was stuck at an old version, which triggered the stack trace above.
saatchi, the scheduler, was also suffering (swh-scheduler-runner and listener) because of that missed upgrade.
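A missed upgrade like this can be spotted by comparing installed package versions against expected minimums. The following is only a sketch using the standard library's `importlib.metadata`; the helper names and the minimum version in the usage note are hypothetical, and the naive version parsing ignores pre-release suffixes.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '0.2.2' into (0, 2, 2); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def outdated(
    minimums: Dict[str, str],
    installed_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {package: (installed, required)} for packages that are missing
    or older than the required minimum."""
    stale = {}
    for name, required in minimums.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package not installed at all
        if installed is None or version_tuple(installed) < version_tuple(required):
            stale[name] = (installed, required)
    return stale
```

For example, `outdated({"swh.core": "0.3.0"})` (the minimum here is made up for illustration) would report swh.core on a host still running 0.2.2.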

Fixed.

The icinga alerts are now back on track (and green, no more yellow) [1].

[1] excerpt from irc channel #swh-sysadm (where icinga alerts are reported):

10:46 <swhbot> icinga RECOVERY: service Check deposit end-to-end on pergamon.softwareheritage.org is OK: DEPOSIT OK - Deposit took 8.22s and succeeded.DEPOSIT OK - Deposit Metadata update took 0.59s and succeeded.
10:46 <swhbot> icinga RECOVERY: service Check vault end-to-end on pergamon.softwareheritage.org is OK: VAULT OK - cooking directory 9f262422500704cd54678394d9c8be4fab1be1a6 took 51.14s and succeeded.
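To track how long the end-to-end checks take, the durations can be pulled out of such status messages. A small Python sketch, assuming the "<what> took <seconds>s and succeeded" wording seen in the two messages above; the `durations` helper is illustrative and not part of swh.icinga_plugins.

```python
import re
from typing import List, Tuple

# Pattern assumed from the messages above: "- <what> took <seconds>s and succeeded"
TOOK_RE = re.compile(r"- (?P<what>.+?) took (?P<seconds>\d+(?:\.\d+)?)s and succeeded")


def durations(message: str) -> List[Tuple[str, float]]:
    """Return (description, seconds) pairs found in a plugin status message."""
    return [(m["what"], float(m["seconds"])) for m in TOOK_RE.finditer(message)]
```

Applied to the deposit message above, this returns one pair for the deposit itself and one for the metadata update.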
ardumont claimed this task.