Fast track save code now requests
Closed, Migrated

Authored By

	rdicosmo
	Mar 4 2021, 5:35 PM

Description

Currently, save code now requests may require a few hours to complete, sometimes more.

We want to reduce this waiting time to a few minutes (the average time we observed when the global ingestion process was paused a few months ago was a handful of seconds).

To this end, save code now request should be handled by a dedicated ingestion queue.

Revisions and Commits

rDLDG Git loader
	Abandoned		D5488 Define high level load-git-high task
rDSCH Scheduling utilities
	Abandoned		D5493 scheduler: Redirect priority task in their own dedicated task queue
		D5552	rDSCHbefccb94d608 scheduler: Clean up priority/ratio task dead code
		D5535	rDSCH974c0c2e0512 tests: Complete checks on message with priority consumption
		D5520	rDSCH17052c4cfa39 Route priority tasks to dedicated save code now queues
		D5503	rDSCH3e2ae3d46d61 backend: Open endpoints to peek/grab tasks with any priority
rDENV Development environment
		D5526	rDENV2f08ed2d4f4e conf/loader: Declare save code now queues to consume from
rSPSITE puppet-swh-site
		D5486	rSPSITE28c626f24b0c common/common: No need to deploy basic worker instance
		D5486	rSPSITE0d7673fef414 Declare new service worker to consume save code now queues

Related Objects

		Status	Assigned	Task
		Migrated	gitlab-migration	T3082 Improve Save Code Now handling
		Migrated	gitlab-migration	T3084 Fast track save code now requests

Event Timeline

rdicosmo triaged this task as Normal priority.Mar 4 2021, 5:35 PM

rdicosmo created this task.

We already have a priority queue system in place in the scheduler. For example, the
archive schedules save code now requests with a high priority [1].

As an implementation detail, those scheduled messages are currently merged into the same
dedicated queue (per loader).

Incremental improvements would be to:

split priority queues into dedicated queues (per loader type)
dedicate systemd workers to consume from those queues

That should limit the changes to the swh-scheduler and the swh-site repositories.

Prior to this task though, it might be wise to have metrics [2] first so we can compare
what's comparable.

[1] https://forge.softwareheritage.org/source/swh-web/browse/master/swh/web/common/origin_save.py$320

[2] T1481

ardumont claimed this task.Apr 7 2021, 12:34 PM

@ardumont we briefly discussed this a while ago with @olasd. I think the proposed solution was indeed to have a separate queue (and workers) for "save code now" request, but not necessarily one separate queue per loader, because the current priority system wasn't considered to be "fast enough". Maybe we can discuss this briefly with him and synthesize here what you come up with?

You're spot on about the fact that we need metrics to measure the current status and, ideally, to define some relevant KPI (e.g., what's the maximum acceptable delay to process a save code now request).

@ardumont we briefly discussed this a while ago with @olasd. I think the proposed
solution was indeed to have a separate queue (and workers) for "save code now"
request, but not necessarily one separate queue per loader,

I recall something like that but I was of a mind to try and kill two birds with one
stone. Our current priority queue is not quite effective because it's not really a
priority queue in the end... Well, the higher priority messages do bypass not yet
scheduled origins. But they do not bypass already scheduled origins (which may be a
large number of already enqueued messages, hence the apparent slowness of it all).

If my memory serves me right, it computes a given ratio of messages (priority dependent)
to push into the existing queue (of the loader). If said queue is already quite full, it
can still take a long time because the messages already present in the queue are
consumed in order before the newly added higher priority one gets ingested. That is the
case for the git loader, for example, whose queue is always saturated.

If we add those separate new priority queues, that should give it a boost. It also
serves the purpose of fast tracking the save code now requests (since they are already
sent with the high priority today).
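
To make the difference concrete, here is a minimal sketch (a toy model, not the scheduler's code) that counts how many tasks get processed before a save code now task when it shares the loader's FIFO queue versus when it sits in its own queue. The function name and the task labels are made up for illustration, and the dedicated worker is modelled simply by draining the priority queue first.

```python
from collections import deque


def tasks_processed_before(backlog: int, dedicated_queue: bool) -> int:
    """Count the tasks processed before the save code now task runs.

    Toy model: `backlog` regular tasks are already enqueued when the
    save code now task arrives.
    """
    regular = deque(f"load-git-{i}" for i in range(backlog))
    priority = deque()
    target = "save-code-now"  # hypothetical label for the priority task

    # Either append behind the backlog (shared queue) or use a separate queue.
    (priority if dedicated_queue else regular).append(target)

    processed = 0
    while priority or regular:
        # The dedicated worker is modelled by serving the priority queue first.
        queue = priority if priority else regular
        if queue.popleft() == target:
            return processed
        processed += 1
    raise ValueError("target task was never enqueued")
```

With a backlog of 2084 tasks, the shared queue makes the request wait for all 2084 of them, while the dedicated queue serves it right away.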

With your feedback, I realize and agree it's not necessary to separate per loader.

We can always start like this ^ and separate more if we ever feel the need (for some
reason, starvation maybe, quite unsure there is a need actually, ...)

The only caveat I see with one priority queue (for all loaders) is that the "priority"
workers (including save code now workers but not limited to it) will have to consume
from all sorts of loaders. Then again, that may not be a problem at all.

because the current priority system wasn't considered to be "fast enough".

Yes, see my previous detailed comment ;)

Maybe we can discuss this briefly with him and synthesize here what you come up with?

Sure, we discussed it this morning on #swh-devel. I synthesized my understanding of our
discussion which may be biased by my own view on the matter.

@olasd what do you think?

You're spot on about the fact that we need metrics to measure the current status and,
ideally, to define some relevant KPI (e.g., what's the maximum acceptable delay to
process a save code now request).

Yes, I started digging that way in the dedicated task.

ardumont mentioned this in T1481: add metric to monitor "save code now" efficiency.Apr 7 2021, 3:18 PM

Operationally, there are two axes we can play with:

queues, which set the FIFO order in which the tasks are processed;
workers (and the queues they subscribe to), which set the amount of parallel processing we dedicate to the given queues.

We currently have one queue per job type, and currently (in staging and production) we group the worker instances by task type.

I think it makes sense to create one "high priority" queue per task type (because that maps to a metric that's easy to understand: we have N pending save code now tasks for git repositories).

But, to start with and considering the (somewhat) low load related to save code now, I also think it would be sensible to deploy a single new systemd unit (on each worker[01-16] vm), that would listen to all these high priority queues and have all the task types available (a single celery worker for all loaders is what we use in docker, and it works fine).
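
As a rough sketch of that layout (plain Python, not actual celery or puppet configuration): one high priority queue per task type, a single worker subscribed to all of them, and a per task type pending count derived from the queue lengths. The `save_code_now:` prefix and all function names below are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical prefix used to name the high priority queues.
PREFIX = "save_code_now:"


def high_priority_queue(task_type: str) -> str:
    """Name of the high priority queue for one task type."""
    return f"{PREFIX}{task_type}"


def worker_subscriptions(task_types: Iterable[str]) -> List[str]:
    """Queues a single catch-all high priority worker would listen to."""
    return [high_priority_queue(task_type) for task_type in task_types]


def pending_per_task_type(queue_lengths: Dict[str, int]) -> Dict[str, int]:
    """Keep only high priority queues and report their length per task type."""
    return {
        name[len(PREFIX):]: length
        for name, length in queue_lengths.items()
        if name.startswith(PREFIX)
    }
```

For instance, `pending_per_task_type({"save_code_now:load-git": 3, "load-git": 2084})` only reports the 3 pending save code now tasks for git, which is the kind of easy-to-read metric mentioned above.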

Conclusion:

swh-site: Deploy one systemd unit (per worker) which is able to deal with all the existing save code now requests and is subscribed to the one high priority queue. Loaders are: loader-git, loader-svn, loader-mercurial for now.

Adapt "somehow" the scheduling so the current high priority tasks end up in the new priority queue. The adaptation would be somewhere along the lines of having the scheduler runner redirect those messages into that priority queue (and not the actual default one). I'll need to dig in a bit.

Note that from the save code now point of view, nothing changes.
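
The redirection described in the conclusion could look roughly like the sketch below. It is only an illustration of the idea, not the scheduler runner's code: the `Task` class, the `target_queue` function and the `save_code_now:` queue prefix are all hypothetical, and it assumes that only tasks with the "high" priority get redirected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Simplified stand-in for a scheduled task (hypothetical)."""

    type: str  # e.g. "load-git"
    priority: Optional[str] = None  # "high", "normal", "low" or None


def target_queue(task: Task) -> str:
    """Pick the queue a task message should be sent to.

    High priority tasks go to a dedicated queue; everything else keeps
    using the default queue of its task type.
    """
    if task.priority == "high":
        return f"save_code_now:{task.type}"  # hypothetical queue name
    return task.type
```

Since the archive already schedules save code now requests with the high priority, nothing needs to change on the caller side for them to land in the dedicated queue.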

After this, I gather we can refactor the scheduler to drop the priority ratio (high,
normal, low) implementations and only keep the high/no priority queue system. The high
priority is mostly used by save code now and some listers, but it could also serve
something else; the future "save forge now" comes to mind.

ardumont changed the task status from Open to Work in Progress.Apr 12 2021, 3:53 PM

ardumont edited projects, added System administration; removed System administrators.

ardumont moved this task from Backlog to in-progress on the System administration board.Apr 12 2021, 3:55 PM

ardumont added a revision: D5486: Declare new service worker to consume save code now queues.Apr 12 2021, 5:02 PM

Thanks for this!

Question: will the shared high priority queue be used by tasks other than "save code now" tasks?

It could, but not immediately.
Let's see if I can actually pull it off ;)

ardumont added a revision: D5488: Define high level load-git-high task.Apr 13 2021, 10:22 AM

ardumont added a revision: D5493: scheduler: Redirect priority task in their own dedicated task queue.Apr 13 2021, 1:22 PM

ardumont added a revision: D5503: backend: Open endpoints to peek/grab tasks with any priority.Apr 13 2021, 6:03 PM

ardumont added a commit: rDSCH3e2ae3d46d61: backend: Open endpoints to peek/grab tasks with any priority.Apr 14 2021, 10:51 AM

ardumont added a revision: D5520: Route priority tasks to dedicated save code now queues.Apr 14 2021, 12:51 PM

ardumont added a revision: D5526: conf/loader: Declare save code now queues to consume from.Apr 14 2021, 5:19 PM

ardumont added a commit: rSPSITE0d7673fef414: Declare new service worker to consume save code now queues.Apr 14 2021, 6:05 PM

ardumont added a commit: rSPSITE28c626f24b0c: common/common: No need to deploy basic worker instance.

A tryout deployment of D5520 is currently running on staging and I'm happy to
report it's working as expected.

The staging workers are currently busy running on git.kernel.org and have done little
work on latest save code now requests so far.

The new save code now requests I triggered there have already been ingested.

So what remains is to actually add tests on that diff (it's missing some conditional
coverage), and this should be it.

Great news :-)

ardumont added a revision: D5535: tests: Complete checks on message with priority consumption.Apr 15 2021, 1:26 PM

ardumont added a commit: rDSCH17052c4cfa39: Route priority tasks to dedicated save code now queues.Apr 15 2021, 1:31 PM

ardumont added a commit: rDENV2f08ed2d4f4e: conf/loader: Declare save code now queues to consume from.

ardumont added a commit: rDSCH974c0c2e0512: tests: Complete checks on message with priority consumption.Apr 15 2021, 3:04 PM

Pushed, packaged, deployed.

The scheduler runner happily continues to schedule existing tasks, along with some new tasks with priority:

Apr 15 13:12:51 saatchi swh[234257]: INFO:swh.scheduler.celery_backend.runner:Grabbed 2084 tasks load-git
Apr 15 13:12:54 saatchi swh[234257]: INFO:swh.scheduler.cli.admin.runner:Scheduled 4128 tasks
Apr 15 13:14:06 saatchi swh[234257]: INFO:swh.scheduler.celery_backend.runner:Grabbed 1 tasks load-pypi
Apr 15 13:14:06 saatchi swh[234257]: INFO:swh.scheduler.celery_backend.runner:Grabbed 1 tasks load-git (priority)
...

That task got done almost immediately...
So there you go ;)

ardumont moved this task from in-progress to deployed/landed/monitoring on the System administration board.Apr 15 2021, 3:18 PM


Great job, thanks!

I saw a parmap origin which got scheduled (la la la ;)

ardumont closed this task as Resolved.Apr 15 2021, 6:00 PM

ardumont moved this task from deployed/landed/monitoring to done on the System administration board.

ardumont mentioned this in T3255: save_code_now: Investigate why svn loading tasks are stuck.Apr 16 2021, 11:01 AM

ardumont mentioned this in rSPSITEe933748ae1cb: Make high priority queue worker consumes svn tasks.Apr 16 2021, 11:07 AM

Is there a Grafana dashboard dedicated to this queue?

ardumont mentioned this in T3271: scheduler: Clean up dead code about priority/ratio.Apr 19 2021, 11:30 AM

ardumont added a revision: D5552: scheduler: Clean up priority/ratio task dead code.Apr 19 2021, 12:36 PM