Page MenuHomeSoftware Heritage

Scheduler listener doesn't notice tasks failing with an uncaught exception
Open, HighPublic

Description

There is something wrong with the recent submitted save code now requests in production: associated tasks are marked as scheduled since at least a week and should already have been executed (see https://archive.softwareheritage.org/browse/origin/save/#requests).

Looks like the messages from tasks that fail with an exception never reach the scheduler listener and they end up in limbo

Event Timeline

anlambert triaged this task as High priority.
olasd added a subscriber: olasd.Feb 27 2019, 4:41 PM

There's clearly an issue in prod which I haven't pinpointed yet. The scheduler database records that the task was sent to RabbitMQ, but no worker logs show them exiting the other end.

For the dev issue, the commit referenced depends on changes that happened in (all) the repos that provide tasks; have you updated them?

If so, and the issue is still here, could you send a trace of the celery worker showing which tasks have been setup and which queues it's listening to?

I have updated the configuration of my git loader with the following:

cls: remote
  args:
    url: http://localhost:5002/

send_contents: true
send_directories: true
send_revisions: true
send_releases: true
send_occurrences: true

content_packet_size: 10000
content_packet_size_bytes: 104857600
directory_packet_size: 25000
revision_packet_size: 100000
release_packet_size: 100000
occurrence_packet_size: 100000

celery:
  task_broker: amqp://guest@localhost//
  task_modules:
    - swh.loader.git.tasks
  task_queues: 
    - swh_loader_git
  task_soft_time_limit: 0

When I start my celery worker, I got the following debug output:

 -------------- celery@guggenheim v4.2.1 (windowlicker)
---- **** ----- 
--- * ***  * -- Linux-4.19.0-0.bpo.2-amd64-x86_64-with-debian-9.8 2019-02-27 16:46:13
-- * - **** --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x7f82658110f0
- ** ---------- .> transport:   amqp://guest:**@localhost:5672//
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 4 (prefork)
-- ******* ---- .> task events: ON
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery
                .> swh_loader_git   exchange=swh_loader_git(direct) key=swh_loader_git
                .> swh_loader_mercurial exchange=swh_loader_mercurial(direct) key=swh_loader_mercurial
                .> swh_loader_pypi  exchange=swh_loader_pypi(direct) key=swh_loader_pypi
                .> swh_loader_svn   exchange=swh_loader_svn(direct) key=swh_loader_svn
                .> swh_loader_svn_dump_mount_and_load exchange=swh_loader_svn_dump_mount_and_load(direct) key=swh_loader_svn_dump_mount_and_load
                .> swh_loader_tar   exchange=swh_loader_tar(direct) key=swh_loader_tar
                .> swh_vault_batch_cooking exchange=swh_vault_batch_cooking(direct) key=swh_vault_batch_cooking
                .> swh_vault_cooking exchange=swh_vault_cooking(direct) key=swh_vault_cooking

[tasks]
  . celery.accumulate
  . celery.backend_cleanup
  . celery.chain
  . celery.chord
  . celery.chord_unlock
  . celery.chunks
  . celery.group
  . celery.map
  . celery.starmap
  . swh.deposit.loader.tasks.ChecksDepositTsk
  . swh.deposit.loader.tasks.LoadDepositArchiveTsk
  . swh.loader.git.tasks.LoadDiskGitRepository
  . swh.loader.git.tasks.UncompressAndLoadDiskGitRepository
  . swh.loader.git.tasks.UpdateGitRepository
  . swh.loader.mercurial.tasks.LoadArchiveMercurial
  . swh.loader.mercurial.tasks.LoadMercurial
  . swh.loader.pypi.tasks.LoadPyPI
  . swh.loader.svn.tasks.DumpMountAndLoadSvnRepository
  . swh.loader.svn.tasks.LoadSvnRepository
  . swh.loader.svn.tasks.MountAndLoadSvnRepository
  . swh.loader.tar.tasks.LoadTarRepository
  . swh.vault.cooking_tasks.SWHBatchCookingTask
  . swh.vault.cooking_tasks.SWHCookingTask

[2019-02-27 16:46:13,828: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'capabilities': {'per_consumer_qos': True, 'publisher_confirms': True, 'consumer_cancel_notify': True, 'basic.nack': True, 'direct_reply_to': True, 'consumer_priorities': True, 'connection.blocked': True, 'exchange_exchange_bindings': True, 'authentication_failure_close': True}, 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'version': '3.6.6', 'cluster_name': 'rabbit@guggenheim', 'platform': 'Erlang/OTP', 'product': 'RabbitMQ'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
[2019-02-27 16:46:13,830: INFO/MainProcess] Connected to amqp://guest:**@127.0.0.1:5672//
[2019-02-27 16:46:13,838: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'capabilities': {'per_consumer_qos': True, 'publisher_confirms': True, 'consumer_cancel_notify': True, 'basic.nack': True, 'direct_reply_to': True, 'consumer_priorities': True, 'connection.blocked': True, 'exchange_exchange_bindings': True, 'authentication_failure_close': True}, 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'version': '3.6.6', 'cluster_name': 'rabbit@guggenheim', 'platform': 'Erlang/OTP', 'product': 'RabbitMQ'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
[2019-02-27 16:46:13,840: INFO/MainProcess] mingle: searching for neighbors
[2019-02-27 16:46:13,841: DEBUG/MainProcess] using channel_id: 1
[2019-02-27 16:46:13,843: DEBUG/MainProcess] Channel open
[2019-02-27 16:46:13,855: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'capabilities': {'per_consumer_qos': True, 'publisher_confirms': True, 'consumer_cancel_notify': True, 'basic.nack': True, 'direct_reply_to': True, 'consumer_priorities': True, 'connection.blocked': True, 'exchange_exchange_bindings': True, 'authentication_failure_close': True}, 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'version': '3.6.6', 'cluster_name': 'rabbit@guggenheim', 'platform': 'Erlang/OTP', 'product': 'RabbitMQ'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
[2019-02-27 16:46:13,856: DEBUG/MainProcess] using channel_id: 1
[2019-02-27 16:46:13,857: DEBUG/MainProcess] Channel open
[2019-02-27 16:46:14,865: INFO/MainProcess] mingle: all alone
[2019-02-27 16:46:14,866: DEBUG/MainProcess] using channel_id: 2
[2019-02-27 16:46:14,869: DEBUG/MainProcess] Channel open
[2019-02-27 16:46:14,880: DEBUG/MainProcess] using channel_id: 1
[2019-02-27 16:46:14,891: DEBUG/MainProcess] Channel open
[2019-02-27 16:46:14,916: DEBUG/MainProcess] using channel_id: 3
[2019-02-27 16:46:14,917: DEBUG/MainProcess] Channel open
[2019-02-27 16:46:14,923: INFO/MainProcess] celery@guggenheim ready.

After having created a git loading task through the webapp, I got the following output when trying to send the task to the worker:

$ swh-scheduler -C ~/.config/swh/runner.yml -l DEBUG runner
[INFO] swh.core.config -- Loading config file /home/antoine/.config/swh/runner.yml
[DEBUG] swh.scheduler.cli -- Instanciating scheduler with {'args': {'url': 'http://localhost:5008/'}, 'cls': 'remote'}
[DEBUG] swh.scheduler.celery_backend.config -- Creating a Celery app with {'task_queues': [<unbound Queue celery -> <unbound Exchange celery(direct)> -> celery>, <unbound Queue swh_loader_git -> <unbound Exchange swh_loader_git(direct)> -> swh_loader_git>, <unbound Queue swh_loader_svn -> <unbound Exchange swh_loader_svn(direct)> -> swh_loader_svn>, <unbound Queue swh_loader_tar -> <unbound Exchange swh_loader_tar(direct)> -> swh_loader_tar>, <unbound Queue swh_loader_mercurial -> <unbound Exchange swh_loader_mercurial(direct)> -> swh_loader_mercurial>, <unbound Queue swh_loader_svn_dump_mount_and_load -> <unbound Exchange swh_loader_svn_dump_mount_and_load(direct)> -> swh_loader_svn_dump_mount_and_load>, <unbound Queue swh_vault_cooking -> <unbound Exchange swh_vault_cooking(direct)> -> swh_vault_cooking>, <unbound Queue swh_vault_batch_cooking -> <unbound Exchange swh_vault_batch_cooking(direct)> -> swh_vault_batch_cooking>, <unbound Queue swh_loader_pypi -> <unbound Exchange swh_loader_pypi(direct)> -> swh_loader_pypi>], 'task_broker': 'amqp://guest@localhost//', 'task_modules': ['swh.loader.git.tasks', 'swh.loader.mercurial.tasks', 'swh.loader.svn.tasks', 'swh.deposit.loader.tasks', 'swh.vault.cooking_tasks', 'swh.loader.tar.tasks', 'swh.loader.pypi.tasks'], 'task_soft_time_limit': 0}
[DEBUG] swh.scheduler.cli.runner -- Scheduler <swh.scheduler.api.client.RemoteScheduler object at 0x7f4a1d9e2a90>
[DEBUG] swh.scheduler.cli.runner -- Run ready tasks
[INFO] swh.scheduler.celery_backend.runner -- Grabbed 1 tasks origin-update-git
[DEBUG] swh.scheduler.celery_backend.runner -- Sent 1 celery tasks
[DEBUG] amqp -- Start from server, version: 0.9, properties: {'platform': 'Erlang/OTP', 'product': 'RabbitMQ', 'capabilities': {'per_consumer_qos': True, 'publisher_confirms': True, 'consumer_cancel_notify': True, 'authentication_failure_close': True, 'connection.blocked': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_priorities': True, 'direct_reply_to': True}, 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'cluster_name': 'rabbit@guggenheim', 'version': '3.6.6', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
[DEBUG] amqp -- using channel_id: 1
[DEBUG] amqp -- Channel open
[INFO] swh.scheduler.cli.runner -- Scheduled 1 tasks

Nevertheless, the issue persists and the loading task is not executed.

olasd added a comment.Feb 27 2019, 4:58 PM

The queue names have changed (the queue that the runner sends the jobs to is exactly the same as the fully qualified class name); removing the task_queues entry from your config should do the right thing.

However your log trace doesn't seem to match what I expect from the config file (I see too many queue names here), so are you sure it's loading the right config?

So the right way to fix my issue locally was to update the worker configuration to the following:

task_broker: amqp://guest@localhost//
task_modules:
  - swh.loader.git.tasks
  - swh.loader.mercurial.tasks
  - swh.loader.svn.tasks
  - swh.deposit.loader.tasks
  - swh.vault.cooking_tasks
  - swh.loader.tar.tasks
  - swh.loader.pypi.tasks
task_queues:
  - swh.loader.git.tasks.LoadDiskGitRepository
  - swh.loader.git.tasks.UncompressAndLoadDiskGitRepository
  - swh.loader.git.tasks.UpdateGitRepository
  - swh.loader.mercurial.tasks.LoadArchiveMercurial
  - swh.loader.mercurial.tasks.LoadMercurial
  - swh.loader.pypi.tasks.LoadPyPI
  - swh.loader.svn.tasks.DumpMountAndLoadSvnRepository
  - swh.loader.svn.tasks.LoadSvnRepository
  - swh.loader.svn.tasks.MountAndLoadSvnRepository
  - swh.loader.tar.tasks.LoadTarRepository
  - swh.vault.cooking_tasks.SWHBatchCookingTask
  - swh.vault.cooking_tasks.SWHCookingTask
task_soft_time_limit: 0

As I explained in T1553, we should centralize default configuration somewhere.

olasd added a comment.Feb 27 2019, 5:36 PM

Yeah, sorry I misled you, I did write the code for the worker to automatically listen to the proper queues when the modules are loaded, but it got removed when the old Task base class was removed.

Coming back to the production issue:

  1. the tasks were properly scheduled by the runner (as seen on the swh-scheduler database)
  2. they made their way to a worker (as seen on the worker logs)
  3. the worker task raised an exception due to a RPC issue on the backend (as seen on the worker logs again)
  4. that exception never made its way back to the listener (as seen on the swh-scheduler database)

The next step would be to make a minimal "failing task" test case and see whether the events that the listener expects are properly sent.

anlambert renamed this task from Save code now requests scheduled but not executed in production to Save code now requests status report seems wrong in production.Feb 27 2019, 6:19 PM
anlambert updated the task description. (Show Details)
anlambert updated the task description. (Show Details)Feb 27 2019, 6:22 PM
olasd renamed this task from Save code now requests status report seems wrong in production to Scheduler listener doesn't notice tasks failing with an uncaught exception.Apr 19 2019, 1:51 PM
olasd updated the task description. (Show Details)