Page MenuHomeSoftware Heritage

Date overflow error in scheduler journal client
Closed, ResolvedPublic

Description

In production and staging, the scheduler journal clients are failing with this error:

Aug 25 16:02:32 scheduler0 systemd[1]: Stopped Software Heritage Scheduler Journal Client.
Aug 25 16:02:32 scheduler0 systemd[1]: swh-scheduler-journal-client.service: Consumed 1.006s CPU time.
Aug 25 16:02:32 scheduler0 systemd[1]: Started Software Heritage Scheduler Journal Client.
Aug 25 16:02:36 scheduler0 swh[419915]: Traceback (most recent call last):
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/bin/swh", line 11, in <module>
Aug 25 16:02:36 scheduler0 swh[419915]:     load_entry_point('swh.core==0.14.4', 'console_scripts', 'swh')()
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
Aug 25 16:02:36 scheduler0 swh[419915]:     return swh(auto_envvar_prefix="SWH")
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
Aug 25 16:02:36 scheduler0 swh[419915]:     return self.main(*args, **kwargs)
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
Aug 25 16:02:36 scheduler0 swh[419915]:     rv = self.invoke(ctx)
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
Aug 25 16:02:36 scheduler0 swh[419915]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
Aug 25 16:02:36 scheduler0 swh[419915]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
Aug 25 16:02:36 scheduler0 swh[419915]:     return ctx.invoke(self.callback, **ctx.params)
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
Aug 25 16:02:36 scheduler0 swh[419915]:     return callback(*args, **kwargs)
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
Aug 25 16:02:36 scheduler0 swh[419915]:     return f(get_current_context(), *args, **kwargs)
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/swh/scheduler/cli/journal.py", line 52, in visit_stats_journal_client
Aug 25 16:02:36 scheduler0 swh[419915]:     nb_messages = client.process(worker_fn)
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/swh/journal/client.py", line 265, in process
Aug 25 16:02:36 scheduler0 swh[419915]:     batch_processed, at_eof = self.handle_messages(messages, worker_fn)
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/swh/journal/client.py", line 292, in handle_messages
Aug 25 16:02:36 scheduler0 swh[419915]:     worker_fn(dict(objects))
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/swh/scheduler/journal_client.py", line 263, in process_journal_objects
Aug 25 16:02:36 scheduler0 swh[419915]:     queue_position_per_visit_type, visit_stats_d
Aug 25 16:02:36 scheduler0 swh[419915]:   File "/usr/lib/python3/dist-packages/swh/scheduler/journal_client.py", line 95, in next_visit_queue_position
Aug 25 16:02:36 scheduler0 swh[419915]:     return current_position + visit_interval
Aug 25 16:02:36 scheduler0 swh[419915]: OverflowError: date value out of range
Aug 25 16:02:36 scheduler0 systemd[1]: swh-scheduler-journal-client.service: Main process exited, code=exited, status=1/FAILURE

there is a lot of lag in the message consumption in both environments (11025670 messages in production, 888188 in staging)

Event Timeline

vsellier created this task.
ardumont changed the task status from Open to Work in Progress.Aug 26 2021, 9:57 AM
ardumont raised the priority of this task from High to Unbreak Now!.

I bumped the priority since scheduler runners (next-gen) are depending on the results of journal client (scheduler metrics as well).


In any case, status on staging:

  • scheduler0.staging patched with D6136 (db migrated [1]).
  • swh-scheduler-journal-client restarted without issue.

As a result, the consumer lag subsided [2]

[1]

10:04:32 swh-scheduler@db1:5432=> update origin_visit_stats
  set next_position_offset = 10
  where next_position_offset > 10;
UPDATE 278513
Time: 8047.875 ms (00:08.048)

[2] https://grafana.softwareheritage.org/goto/WSSj0WV7z?orgId=1

status:

  • scheduler v0.17.1 deployed on production [1] (db migrated) and staging.
  • then swh-scheduler-journal-client service restarted.

[1] Monitoring consumer lag with
https://grafana.softwareheritage.org/goto/3cOsBZVnk?orgId=1

Gist of it is deployed.
I'll close it when it's monitored by at least that diff ^.

ardumont claimed this task.

The monitoring icinga alerts have been deployed:

+object Service "check_scheduler_journal_client_saatchi.internal.softwareheritage.org" {
+  import "generic-service"
+
+  host_name = "saatchi.internal.softwareheritage.org"
+  check_command = "check_systemd"
+  command_endpoint = "saatchi.internal.softwareheritage.org"
+  vars.check_systemd_unit = "swh-scheduler-journal-client.service"
+}
+
+object Service "check_scheduler_journal_client_scheduler0.internal.staging.swh.network" {
+  import "generic-service"
+
+  host_name = "scheduler0.internal.staging.swh.network"
+  check_command = "check_systemd"
+  command_endpoint = "scheduler0.internal.staging.swh.network"
+  vars.check_systemd_unit = "swh-scheduler-journal-client.service"
+}
+

This can be closed now.