Pause background ingestion until we get more local storage space
Closed, Migrated (Edits Locked)

Authored By

	olasd
	Oct 2 2020, 11:13 AM

Description

The main objstorage is almost out of disk space (4.x TB left).

We ordered more storage in August, but preparing the order is taking longer than we anticipated.

We need to stop background ingestion until we're able to extend the main storage, while leaving high-priority jobs such as save code now and the deposit enabled.

Event Timeline

olasd changed the task status from Open to Work in Progress. Oct 2 2020, 11:13 AM

olasd triaged this task as High priority.

olasd created this task.

I have made the following change to the scheduler database:

create or replace function swh_scheduler_peek_ready_tasks (task_type text, ts timestamptz default now(),
                                                           num_tasks bigint default NULL, num_tasks_priority bigint default NULL)
  returns setof task
  language sql
as $$
    select * from swh_scheduler_peek_priority_tasks(task_type, ts, num_tasks)
             order by priority, next_run
$$;

This makes the scheduler pick up only tasks that have a priority set, which is the case for the tasks created by save code now...
...however, that's not the case for deposit tasks.

What I really want is to let through *oneshot* tasks but it looks like we don't really have such provisions; I'm afraid filtering on that will mess up the index hits.

To do what I actually wanted to do (let through oneshot tasks and tasks with priority set) I've done the following:

restored swh_scheduler_peek_ready_tasks to its original definition
updated swh_scheduler_peek_no_priority_tasks as follows:

create or replace function swh_scheduler_peek_no_priority_tasks (task_type text, ts timestamptz default now(),
                                                                 num_tasks bigint default NULL)
  returns setof task
  language sql
  stable
as $$
select * from task
  where next_run <= ts
        and type = task_type
        and status = 'next_run_not_scheduled'
        and policy = 'oneshot'
        and priority is null
  order by next_run
  limit num_tasks
  for update skip locked;
$$;

comment on function swh_scheduler_peek_no_priority_tasks (text, timestamptz, bigint)
is 'Retrieve tasks without priority';

The only difference is the added condition "and policy = 'oneshot'".

As expected, this makes the function pretty darn slow: there are very few oneshot tasks, so the scheduler has to scan and discard a lot of non-matching rows while filtering.

I've added the following index:

create index concurrently on task(type, next_run) where status = 'next_run_not_scheduled' and policy = 'oneshot' and priority is null;

And now the queries are quasi-instantaneous.
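One way to confirm that the planner really uses the new partial index is to run the body of the function under EXPLAIN. This is only a sketch: the task type 'load-git' and the limit of 100 are placeholder values, and the "for update skip locked" clause is left out so the check takes no row locks.

```sql
-- Sketch: check that the partial index on task(type, next_run) is picked up.
-- 'load-git' and the limit are placeholders; substitute real values.
-- The WHERE clause repeats the index predicate (status, policy, priority),
-- which the planner needs in order to consider a partial index.
explain (analyze, buffers)
select * from task
  where next_run <= now()
        and type = 'load-git'
        and status = 'next_run_not_scheduled'
        and policy = 'oneshot'
        and priority is null
  order by next_run
  limit 100;
```

If the index is used, the plan should show an index scan on the auto-generated index name rather than a scan that filters rows on policy afterwards.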

It looks like we have a bunch of recurrent tasks with priorities (I suspect from listers run with non-default parameters); I'm creating the following index:

create index concurrently on task(type, priority, next_run) where status = 'next_run_not_scheduled' and priority is not null;

to find them and potentially fix them.
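A query along these lines could list the affected tasks per type. It is a sketch that assumes recurring tasks carry the policy value 'recurring'; the filter on status and priority matches the predicate of the index above so that the index can serve it.

```sql
-- Sketch: count pending recurring tasks that have a priority set, per type.
-- Assumption: the policy value for recurring tasks is 'recurring'.
select type, priority, count(*) as pending
  from task
  where status = 'next_run_not_scheduled'
        and priority is not null
        and policy = 'recurring'
  group by type, priority
  order by pending desc;
```

Whether to reset the priority on the rows this returns is a separate decision that depends on why each lister set it.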

When this is done (I guess in large part it is already, judging from the ingestion dashboard), please also update https://status.softwareheritage.org/ with a suitable message.

In T2656#50087, @zack wrote:

When this is done (I guess in large part it is already, judging from the ingestion dashboard), please also update https://status.softwareheritage.org/ with a suitable message.

I've thought about it, but I don't know which component to mark as "degraded" currently. If anything, user-facing features like save code now and the vault should be going faster than usual :-)

In T2656#50089, @olasd wrote:

I've thought about it, but I don't know which component to mark as "degraded" currently. If anything, user-facing features like save code now and the vault should be going faster than usual :-)

Agreed. As part of my previous message I thought about adding that this might require rewording/reworking what is presented on status.s.o, but I didn't want to make that a precondition of this (because it's extra work).

Anyway: what I consider important is being transparent about the message "crawling is temporarily stopped, but save code now and deposit work as usual". Whatever way we find to convey that is fine. If as a result we rework/improve the breakdown on status.s.o, even better! :-)

After your suggestion on IRC I've added the following two components to status.io:

Source Code Crawlers
Save Code Now

status.io can't easily do "sub-services", so I don't think we'll get more granularity than that.

Storage has been extended. I've restored the function to its original version (without policy = 'oneshot').
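For reference, dropping the extra condition from the definition posted earlier in this task gives the following. It is reconstructed from that comment, not copied from the deployed schema, so treat it as a sketch.

```sql
-- Sketch: swh_scheduler_peek_no_priority_tasks without the temporary
-- "policy = 'oneshot'" filter, rebuilt from the earlier comment.
create or replace function swh_scheduler_peek_no_priority_tasks (task_type text, ts timestamptz default now(),
                                                                 num_tasks bigint default NULL)
  returns setof task
  language sql
  stable
as $$
select * from task
  where next_run <= ts
        and type = task_type
        and status = 'next_run_not_scheduled'
        and priority is null
  order by next_run
  limit num_tasks
  for update skip locked;
$$;
```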

This task has been migrated to GitLab.
