Page MenuHomeSoftware Heritage

scheduler's cron cleanup error when filtering tasks to archive
Closed, ResolvedPublic

Description

From checking saatchi's node as swhscheduler user (through mail or mutt):

Date: Wed, 24 Jul 2019 08:52:05 +0000 (UTC)
From: Cron Daemon <root@saatchi.internal.softwareheritage.org>
To: swhscheduler@saatchi.internal.softwareheritage.org
Subject: Cron <swhscheduler@saatchi> /usr/bin/swh scheduler --config-file /etc/softwareheritage/backend/elastic.yml task archive

INFO:swh.core.config:Loading config file /etc/softwareheritage/global.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/worker.ini
INFO:swh.core.config:Loading config file /etc/softwareheritage/backend/elastic.yml
[INFO] swh.core.config -- Loading config file /etc/softwareheritage/backend/elastic.yml
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==0.0.63', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 56, in main
    return swh(auto_envvar_prefix='SWH')
  File "/usr/lib/python3/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/scheduler/cli/task.py", line 553, in archive_tasks
    for task_ids in grouper(gen, n=batch_clean):
  File "/usr/lib/python3/dist-packages/swh/core/utils.py", line 48, in grouper
    for _data in itertools.zip_longest(*args, fillvalue=stop_value):
  File "/usr/lib/python3/dist-packages/swh/scheduler/cli/task.py", line 538, in index_data
    after, before, last_id=last_id, limit=batch_index)
  File "/usr/lib/python3/dist-packages/swh/scheduler/api/client.py", line 102, in filter_task_to_archive
    'last_id': last_id,
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 198, in post
    return self._decode_response(response)
  File "/usr/lib/python3/dist-packages/swh/core/api/__init__.py", line 235, in _decode_response
    response.content,
swh.core.api.RemoteException: Unexpected status code for API request: 504 (b'<html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body bgcolor="white">\r\n<center><h1>504 Gateway
+Time-out</h1></center>\r\n<hr><center>nginx/1.10.3</center>\r\n</body>\r\n</html>\r\n')

Event Timeline

ardumont triaged this task as Normal priority.Jul 24 2019, 10:56 AM
ardumont created this task.
ardumont updated the task description. (Show Details)
ftigeot changed the task status from Open to Work in Progress.Jul 24 2019, 4:36 PM
ftigeot added a subscriber: ftigeot.

Fwiw, a manual connection to esnode1:9200 doesn't show this error

Fwiw, a manual connection to esnode1:9200 doesn't show this error

Indeed (forgot to mention it here).

And the configuration mentioned in the log:

/etc/softwareheritage/backend/elastic.yml

---
scheduler:
  cls: remote
  args:
    url: http://saatchi.internal.softwareheritage.org:5008/
storage_nodes:
- host: esnode2.internal.softwareheritage.org
  port: 9200
- host: esnode3.internal.softwareheritage.org
  port: 9200
- host: esnode1.internal.softwareheritage.org
  port: 9200
client_options:
  sniff_on_start: false
  sniff_on_connection_fail: true
  http_compress: false

Oh, now i'm wondering if it's not the scheduler api which timeouts... ;)

Oh, now i'm wondering if it's not the scheduler api which timeouts... ;)

File "/usr/lib/python3/dist-packages/swh/scheduler/api/client.py", line 102, in filter_task_to_archive
    'last_id': last_id,

agrees with me!
Ok, now that makes more sense, i'll rename the task then.

I don't know what made me think the connection error was with elasticsearch...¯\_(ツ)_/¯

ardumont renamed this task from scheduler's cron cleanup no longer able to connect to elastic node to scheduler's cron cleanup error when filtering tasks to archive.Jul 26 2019, 9:53 AM

(That cron is currently stopped to stop harassing the scheduler for nothing)

As we did some cleanup (one staging node was concurrently archiving by mistake) and migration (that improves some queries) on the scheduler, I thought i'd try this again.
So i triggered back manually a run.

That still fails the same way.
We definitely need to deal more appropriately with the client call to that api.

ardumont changed the task status from Work in Progress to Open.Dec 13 2019, 11:55 AM

We definitely need to deal more appropriately with the client call to that api.

Changed the scheduler's api endpoint into a paginated endpoint.
Delegating the pagination resolution client side.
So that should take care of the timeout seen here.

Done in D2457

Actually make that cli actually work again after that (D2458).

ardumont changed the task status from Open to Work in Progress.Dec 14 2019, 11:39 AM
ardumont closed this task as Resolved.EditedDec 17 2019, 4:19 PM
ardumont claimed this task.

Developed, testes, diffed, reviewed, adapted, unit/integration tested, reviewed, landed, packaged, deployed!