
unexpectedly long delay between a task being received and its loading starting, for a few tasks
Closed, Migrated

Description

Sometimes, a task scheduled by the "old" generation scheduler (deposit, scn, vault) and received by a worker takes a long time to actually start:

Jan 17 16:26:10 worker14 python3[2180500]: [2022-01-17 16:26:10,683: INFO/MainProcess] Received task: swh.loader.git.tasks.UpdateGitRepository[eb038b42-713e-4c23-a655-17d04f23870a]  

<----------------- Nothing happens here between 16:26 and 19:50

Jan 17 19:50:02 worker14 python3[2736039]: [2022-01-17 19:50:02,768: INFO/ForkPoolWorker-134] Load origin 'https://github.com/webtorrent/webtorrent-desktop' with type 'git'
Jan 17 19:50:10 worker14 python3[2736039]: Enumerating objects: 11937, done.
Jan 17 19:50:10 worker14 python3[2736039]: [3.2K blob data]
Jan 17 19:50:10 worker14 python3[2736039]: [3.5K blob data]
Jan 17 19:50:11 worker14 python3[2736039]: Total 11937 (delta 374), reused 337 (delta 235), pack-reused 11362
Jan 17 19:50:11 worker14 python3[2736039]: [2022-01-17 19:50:11,432: INFO/ForkPoolWorker-134] Listed 1069 refs for repo https://github.com/webtorrent/webtorrent-desktop
Jan 17 19:51:02 worker14 python3[2736039]: [2022-01-17 19:51:02,168: INFO/ForkPoolWorker-134] Task swh.loader.git.tasks.UpdateGitRepository[eb038b42-713e-4c23-a655-17d04f23870a] succeeded in 59.69298444222659s: {'status': 'eventful'}

This mostly impacts the SCN (Save Code Now), as the task unexpectedly remains in the 'scheduled' status during all this time.

Event Timeline

vsellier triaged this task as Normal priority. Jan 18 2022, 11:46 AM
vsellier created this task.

After analyzing the logs further, it seems the celery process accepts the task eb038b42-713e-4c23-a655-17d04f23870a while a previous loading is still running.
More contextual logs:

  • Previous task is received and started:
Jan 17 16:19:31 worker14 python3[2180500]: [2022-01-17 16:19:31,868: INFO/MainProcess] Received task: swh.loader.git.tasks.UpdateGitRepository[70c13ea8-19f4-4ac7-ae4f-ea71f80fc25a]  
Jan 17 16:19:32 worker14 python3[2721731]: [2022-01-17 16:19:32,467: INFO/ForkPoolWorker-133] Load origin 'https://git.libreoffice.org/translations' with type 'git'
  • The new task is accepted while the previous one is not yet finished:
Jan 17 16:26:10 worker14 python3[2180500]: [2022-01-17 16:26:10,683: INFO/MainProcess] Received task: swh.loader.git.tasks.UpdateGitRepository[eb038b42-713e-4c23-a655-17d04f23870a]
  • The first task ends with a "pack file too big" error (a reconstruction sketch of the guard follows after this list):
Jan 17 19:48:20 worker14 python3[2721731]: [51B blob data]
Jan 17 19:48:20 worker14 python3[2721731]: [3.6K blob data]
Jan 17 19:49:57 worker14 python3[2721731]: [2022-01-17 19:49:57,085: ERROR/ForkPoolWorker-133] Loading failure, updating to `failed` status
                                           Traceback (most recent call last):
                                             File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 338, in load
                                               more_data_to_fetch = self.fetch_data()
                                             File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 277, in fetch_data
                                               self.origin.url, base_repo, do_progress
                                             File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 213, in fetch_pack_from_origin
                                               progress=do_activity,
                                             File "/usr/lib/python3/dist-packages/dulwich/client.py", line 2001, in fetch_pack
                                               progress,
                                             File "/usr/lib/python3/dist-packages/dulwich/client.py", line 845, in _handle_upload_pack_tail
                                               SIDE_BAND_CHANNEL_PROGRESS: progress,
                                             File "/usr/lib/python3/dist-packages/dulwich/client.py", line 604, in _read_side_band64k_data
                                               cb(pkt)
                                             File "/usr/lib/python3/dist-packages/swh/loader/git/loader.py", line 201, in do_pack
                                               f"Pack file too big for repository {origin_url}, "
                                           OSError: Pack file too big for repository https://git.libreoffice.org/translations, limit is 4294967296 bytes, current size is 4294966867, would write 65515
Jan 17 19:49:59 worker14 python3[2721731]: [2022-01-17 19:49:59,468: INFO/ForkPoolWorker-133] Task swh.loader.git.tasks.UpdateGitRepository[70c13ea8-19f4-4ac7-ae4f-ea71f80fc25a] succeeded in 12627.452697345056s: {'status': 'failed'}
  • The second task's loading starts after more than 3 hours:
Jan 17 19:50:02 worker14 python3[2736039]: [2022-01-17 19:50:02,768: INFO/ForkPoolWorker-134] Load origin 'https://github.com/webtorrent/webtorrent-desktop' with type 'git'
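
For context, here is a hedged reconstruction of the guard that raised the OSError in the traceback above, based only on that traceback (the actual swh.loader.git code may differ): dulwich streams the pack data through a callback, and the loader refuses the write once the accumulated size would exceed its limit.

PACK_SIZE_LIMIT = 4 * 1024**3  # 4294967296 bytes, the limit seen in the log

def make_pack_sink(origin_url: str, pack_buffer):
    """Hypothetical pack-data callback: dulwich feeds pack chunks here,
    and the write is refused once the total would exceed the limit,
    producing the error message seen in the traceback."""

    def do_pack(data: bytes) -> None:
        current_size = pack_buffer.tell()
        would_write = len(data)
        if current_size + would_write > PACK_SIZE_LIMIT:
            raise OSError(
                f"Pack file too big for repository {origin_url}, "
                f"limit is {PACK_SIZE_LIMIT} bytes, current size is "
                f"{current_size}, would write {would_write}"
            )
        pack_buffer.write(data)

    return do_pack

This failure is incidental here; the point of the timeline is that the second task stayed reserved behind this 3.5-hour load the whole time.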
vsellier changed the task status from Open to Work in Progress. Jan 18 2022, 11:53 AM

It looks like this is the normal behavior of Celery with the current configuration:

From https://docs.celeryproject.org/en/latest/userguide/optimizing.html#reserve-one-task-at-a-time

When using the default of early acknowledgment, having a prefetch multiplier setting of one, means the worker will reserve at most one extra task for every worker process: or in other words, if the worker is started with -c 10, the worker may reserve at most 20 tasks (10 acknowledged tasks executing, and 10 unacknowledged reserved tasks) at any time.

You can enable this behavior by using the following configuration options:
task_acks_late = True
worker_prefetch_multiplier = 1
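
For illustration, a minimal sketch of applying these two options to a Celery application (the app name and broker URL are hypothetical; the real settings are managed through the configuration referenced further down):

from celery import Celery

# Hypothetical standalone app; the real broker URL comes from the
# deployment configuration.
app = Celery("swh_workers", broker="amqp://localhost")
app.conf.update(
    task_acks_late=True,           # acknowledge after execution, not on receipt
    worker_prefetch_multiplier=1,  # reserve at most one extra task per process
)

With early acknowledgment (the default), the reserved task is already acknowledged to the broker, so it cannot be redelivered elsewhere, which matches the timeline above: the second task sat reserved behind the long libreoffice load.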

The task_acks_late option was disabled before the new scheduler migration.
It may be worth reactivating it for the high_priority_loaders, as they handle tasks created by users.
These loaders are also monitored by the end-to-end probes.

It might be possible to solve this through the loader configuration [1], by adding task_acks_late: true [2].

[1] https://forge.softwareheritage.org/source/swh-scheduler/browse/master/swh/scheduler/celery_backend/config.py$347-365

[2] https://forge.softwareheritage.org/source/puppet-swh-site/browse/production/data/common/common.yaml$2400-2401
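
As a hedged illustration of what [1] and [2] amount to (names and structure are assumed, not copied from the linked files): the puppet-managed YAML would carry the flag, and the scheduler's Celery setup would forward it into the app configuration:

from celery import Celery

def build_app(config: dict) -> Celery:
    # Hypothetical sketch mirroring the apparent intent of
    # swh/scheduler/celery_backend/config.py [1]: operator-supplied
    # options from the YAML [2] are applied on top of the defaults.
    app = Celery("swh_workers", broker=config.get("broker_url", "amqp://localhost"))
    if config.get("task_acks_late"):
        app.conf.task_acks_late = True
    return app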

vsellier claimed this task.

The acks-late property is deployed in staging and production.
The number of SCN tasks in the 'scheduled' status should decrease.

It did: prior to the fix, there were 3 tasks systematically stuck in 'scheduled' from earlier.
The count went to 0 on its own after the deployment, so great!