Page MenuHomeSoftware Heritage
Paste P231

scheduler-listener: Issue during update, service repeats trying starting and failing until giving up
ActivePublic

Authored by ardumont on Mar 8 2018, 9:52 AM.
Mar 08 00:39:41 saatchi python3[16096]: Traceback (most recent call last):
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
Mar 08 00:39:41 saatchi python3[16096]: "__main__", mod_spec)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
Mar 08 00:39:41 saatchi python3[16096]: exec(code, run_globals)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/celery_backend/listener.py", line 156, in <module>
Mar 08 00:39:41 saatchi python3[16096]: event_monitor(main_app, main_backend)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/celery_backend/listener.py", line 151, in event_monitor
Mar 08 00:39:41 saatchi python3[16096]: recv.capture(limit=None, timeout=None, wakeup=True)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/celery/events/__init__.py", line 339, in capture
Mar 08 00:39:41 saatchi python3[16096]: return list(self.consume(limit=limit, timeout=timeout, wakeup=wakeup))
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/kombu/mixins.py", line 199, in consume
Mar 08 00:39:41 saatchi python3[16096]: conn.drain_events(timeout=safety_interval)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/kombu/connection.py", line 275, in drain_events
Mar 08 00:39:41 saatchi python3[16096]: return self.transport.drain_events(self.connection, **kwargs)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/kombu/transport/pyamqp.py", line 95, in drain_events
Mar 08 00:39:41 saatchi python3[16096]: return connection.drain_events(**kwargs)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/amqp/connection.py", line 326, in drain_events
Mar 08 00:39:41 saatchi python3[16096]: return amqp_method(channel, args, content)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/amqp/channel.py", line 1906, in _basic_deliver
Mar 08 00:39:41 saatchi python3[16096]: fun(msg)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/kombu/messaging.py", line 598, in _receive_callback
Mar 08 00:39:41 saatchi python3[16096]: return on_m(message) if on_m else self.receive(decoded, message)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/kombu/messaging.py", line 564, in receive
Mar 08 00:39:41 saatchi python3[16096]: [callback(body, message) for callback in callbacks]
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/kombu/messaging.py", line 564, in <listcomp>
Mar 08 00:39:41 saatchi python3[16096]: [callback(body, message) for callback in callbacks]
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/celery_backend/listener.py", line 38, in _receive
Mar 08 00:39:41 saatchi python3[16096]: self.process(type, body, message)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/celery_backend/listener.py", line 44, in process
Mar 08 00:39:41 saatchi python3[16096]: handler and handler(event, message)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/celery_backend/listener.py", line 125, in task_succeeded
Mar 08 00:39:41 saatchi python3[16096]: 'message': message,
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/celery_backend/listener.py", line 88, in queue_action
Mar 08 00:39:41 saatchi python3[16096]: try_perform_actions()
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/celery_backend/listener.py", line 62, in try_perform_actions
Mar 08 00:39:41 saatchi python3[16096]: perform_actions(actions)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/celery_backend/listener.py", line 78, in perform_actions
Mar 08 00:39:41 saatchi python3[16096]: function(*args, **kwargs)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/backend.py", line 37, in wrapped
Mar 08 00:39:41 saatchi python3[16096]: ret = fn(self, *args, **kwargs)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/swh/scheduler/backend.py", line 432, in end_task_run
Mar 08 00:39:41 saatchi python3[16096]: (backend_id, status, metadata, timestamp)
Mar 08 00:39:41 saatchi python3[16096]: File "/usr/lib/python3/dist-packages/psycopg2/extras.py", line 223, in execute
Mar 08 00:39:41 saatchi python3[16096]: return super(RealDictCursor, self).execute(query, vars)
Mar 08 00:39:41 saatchi python3[16096]: psycopg2.IntegrityError: null value in column "next_run" violates not-null constraint
Mar 08 00:39:41 saatchi python3[16096]: DETAIL: Failing row contains (8599, origin-update-git, {"args": ["https://github.com/torvalds/linux"], "kwargs": {}}, null, 12:00:00, next_run_not_scheduled, recurring, 4).
Mar 08 00:39:41 saatchi python3[16096]: CONTEXT: SQL statement "update task
Mar 08 00:39:41 saatchi python3[16096]: set status = 'next_run_not_scheduled',
Mar 08 00:39:41 saatchi python3[16096]: next_run = now() + cur_task_type.retry_delay,
Mar 08 00:39:41 saatchi python3[16096]: retries_left = cur_task.retries_left - 1
Mar 08 00:39:41 saatchi python3[16096]: where id = cur_task.id"
Mar 08 00:39:41 saatchi python3[16096]: PL/pgSQL function swh_scheduler_update_task_on_task_end() line 42 at SQL statement
Mar 08 00:39:41 saatchi python3[16096]: SQL function "swh_scheduler_end_task_run" statement 1
Mar 08 00:39:41 saatchi systemd[1]: swh-scheduler-listener.service: Main process exited, code=exited, status=1/FAILURE
Mar 08 00:39:41 saatchi systemd[1]: swh-scheduler-listener.service: Unit entered failed state.
Mar 08 00:39:41 saatchi systemd[1]: swh-scheduler-listener.service: Failed with result 'exit-code'.
Mar 08 00:39:41 saatchi systemd[1]: swh-scheduler-listener.service: Service hold-off time over, scheduling restart.
Mar 08 00:39:41 saatchi systemd[1]: Stopped Software Heritage scheduler listener.
-- Subject: Unit swh-scheduler-listener.service has finished shutting down
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit swh-scheduler-listener.service has finished shutting down.
Mar 08 00:39:41 saatchi systemd[1]: swh-scheduler-listener.service: Start request repeated too quickly.
Mar 08 00:39:41 saatchi systemd[1]: Failed to start Software Heritage scheduler listener.
-- Subject: Unit swh-scheduler-listener.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- Unit swh-scheduler-listener.service has failed.
--
-- The result is failed.
Mar 08 00:39:41 saatchi systemd[1]: swh-scheduler-listener.service: Unit entered failed state.
Mar 08 00:39:41 saatchi systemd[1]: swh-scheduler-listener.service: Failed with result 'exit-code'.

Event Timeline

ardumont changed the title of this paste from scheduler-listener: Issue during update, repeats all night then service stops trying to restart to scheduler-listener: Issue during update, service repeats trying starting and failing until giving up.Mar 8 2018, 10:41 AM

Try and reproduce:

softwareheritage-scheduler=> begin;
softwareheritage-scheduler=> select * from swh_scheduler_end_task_run('5924ead8-55ec-4212-ad70-4237a29cd7c8', 'failed', '{}'::jsonb, '2018-03-08T10:04:01.193193+00:00');
ERROR:  null value in column "next_run" violates not-null constraint
DETAIL:  Failing row contains (8599, origin-update-git, {"args": ["https://github.com/torvalds/linux"], "kwargs": {}}, null, 12:00:00, next_run_not_scheduled, recurring, 4).
CONTEXT:  SQL statement "update task
          set status = 'next_run_not_scheduled',
              next_run = now() + cur_task_type.retry_delay,
              retries_left = cur_task.retries_left - 1
          where id = cur_task.id"
PL/pgSQL function swh_scheduler_update_task_on_task_end() line 42 at SQL statement
SQL function "swh_scheduler_end_task_run" statement 1

Irc discussion:

11:45:30    +olasd | ardumont: I'm not sure this fix is correct, recurring tasks shouldn't be retried like that
11:49:12    +olasd | (they'll be retried eventually, there's no obvious need to make them jump the queue)
12:03:33 +ardumont | i was mainly trying to unstuck the listener, now it is and yes, if a better solution exists, i'm all for ti
12:03:34 +ardumont | it
12:04:12 +ardumont | does that mean that we need, in that function to discriminate with the policy as well
12:04:12 +ardumont | ?
12:06:40 +ardumont | (also, btw, that must not have happened often, since it's the first time we have this; the way i read the db, only the debian loader would have been ok, all other task types would have broken the same way)
12:07:20    +olasd | ardumont: I don't know why that task had a number of retries, it shouldn't have

And indeed:

softwareheritage-scheduler=> select type, count(*) from task where policy='recurring' and retries_left != 0 group by type;
       type        | count
-------------------+-------
 origin-update-git |     1
(1 row)

That's the only one!

12:13:40 +ardumont | olasd: so the real fix is, revert the default retry_delay in the trigger updated stored proc, and update that specific task with retries_left to 0
12:13:59    +olasd | I don't know off hand

So, a posteriori, I was on the right track.

As it seems my initial attempt seems to hold its peace anyway, i am keeping it as is for now.
And reference this for later.