
Analyze low performance scheduling
Closed, Migrated

Description

The scheduler takes some time to schedule new origins.

It's now more apparent since we added some more workers to consume the standard git
queue [1]. It was a bit apparent with the gitlab.com origins (still lagging but less
than earlier [2])

[1] T3825: the queue is now drained completely but the scheduler takes too much time to
fill it.

[2] still 409k origins lag (vs 3M at first) with an apparent plateau around the
01/01/2022 - https://grafana.softwareheritage.org/goto/hNSnn-Ank?orgId=1

Event Timeline

ardumont triaged this task as High priority.Jan 3 2022, 5:36 PM
ardumont created this task.

The first diagnosis is that it seems related to the origin selection query taking a long time to respond (usually > 30 min):

1245882 softwar...eduler  swhscheduler  192.168.100.210/  60.4  1.2  216.89M  2.35K  19:12.25  N  active

WITH selected_origins AS (
    SELECT enabled, extra_loader_arguments, first_seen, last_seen,
           last_update, lister_id, url, visit_type, next_visit_queue_position
    FROM listed_origins
    LEFT JOIN origin_visit_stats USING (url, visit_type)
    WHERE (enabled)
      AND (visit_type = 'git')
      AND (origin_visit_stats.last_scheduled IS NULL
           OR origin_visit_stats.last_scheduled < GREATEST(
               '2021-12-27T16:24:02.974525+00:00'::timestamptz,
               origin_visit_stats.last_visit))
      AND (origin_visit_stats.last_visit_status is distinct from 'failed'
           or origin_visit_stats.last_visit < '2021-12-20T16:24:02.974525+00:00'::timestamptz)
      AND (origin_visit_stats.last_visit_status is distinct from 'not_found'
           or origin_visit_stats.last_visit < '2021-12-03T16:24:02.974525+00:00'::timestamptz)
      AND (origin_visit_stats.last_snapshot IS NULL)
      AND (li

An EXPLAIN on the query shows that the filter

(enabled)
AND (visit_type = 'git')
AND (last_update IS NULL)

performs a long sequential scan on the listed_origins table.
The query takes longer as there are more tasks to send to RabbitMQ.

On staging, adding an index

create index CONCURRENTLY test_T3827 on listed_origins (enabled, visit_type, last_update) where last_update is null;

removes the parallel seq scan, but there is not enough data to show a difference in execution time.

We are trying to create the same index on production to test; if it improves the query duration, we will add it to the migration files.

Result of EXPLAIN ANALYZE on the subquery with a LIMIT of 100:

softwareheritage-scheduler=> explain analyze SELECT enabled,
    extra_loader_arguments,
    first_seen,
    last_seen,
    last_update,
    lister_id,
    url,
    visit_type,
    next_visit_queue_position
FROM listed_origins
    LEFT JOIN origin_visit_stats USING (url, visit_type)
WHERE (enabled)
    AND (visit_type = 'git')
    AND (
        origin_visit_stats.last_scheduled IS NULL
        OR origin_visit_stats.last_scheduled < GREATEST(
            '2021-12-27T14:10:48.509526+00:00'::timestamptz,
            origin_visit_stats.last_visit
        )
    )
    AND (
        origin_visit_stats.last_visit_status is distinct
        from 'failed'
            or origin_visit_stats.last_visit < '2021-12-20T14:10:48.509526+00:00'::timestamptz
    )
    AND (
        origin_visit_stats.last_visit_status is distinct
        from 'not_found'
            or origin_visit_stats.last_visit < '2021-12-03T14:10:48.509526+00:00'::timestamptz
    )
    AND (last_update IS NULL)
ORDER BY origin_visit_stats.next_visit_queue_position nulls first,
    listed_origins.first_seen
LIMIT 100;


               QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=1000.70..9471.30 rows=100 width=103) (actual time=1.170..1489.579 rows=100 loops=1)
   ->  Gather  (cost=1000.70..12012827.75 rows=141806 width=103) (actual time=1.162..1489.528 rows=100 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Nested Loop Left Join  (cost=0.70..11997647.15 rows=59086 width=103) (actual time=3.350..1375.986 rows=35 loops=3)
               Filter: (((origin_visit_stats.last_scheduled IS NULL) OR (origin_visit_stats.last_scheduled < GREATEST('2021-12-27 14:10:48.509526+00'::timestamp with time zone, origin_visit_stats.last_visit))) AND ((origin_visit_stats.last_visit_status IS DISTINCT FROM 'failed'::last_visit_status) OR (origin_visit_stats.last_visit < '2021-12-20 14:10:48.509526+00'::timestamp with time zone)) AND ((origin_visit_stats.last_visit_status IS DISTINCT FROM 'not_found'::last_visit_status) OR (origin_visit_stats.last_visit < '2021-12-03 14:10:48.509526+00'::timestamp with time zone)))
               ->  Parallel Seq Scan on listed_origins  (cost=0.00..11798985.67 rows=65077 width=95) (actual time=2.866..1213.327 rows=35 loops=3)
                     Filter: (enabled AND (last_update IS NULL) AND (visit_type = 'git'::text))
                     Rows Removed by Filter: 337880
               ->  Index Scan using origin_visit_stats_pkey on origin_visit_stats  (cost=0.70..3.03 rows=1 width=75) (actual time=4.684..4.684 rows=1 loops=104)
                     Index Cond: ((url = listed_origins.url) AND (visit_type = listed_origins.visit_type) AND (visit_type = 'git'::text))
 Planning Time: 0.468 ms
 Execution Time: 1489.993 ms
(13 rows)

The index won't help as half of the rows of the table have this property, and you'll still have to filter and sort all the results afterwards (so the index will, at best, be used to fetch the rows less efficiently than the seq scan did). I doubt PostgreSQL will even consider the index.

If you want to speed up scheduling (we should just make the rabbitmq buffer larger first), you could enable the tablesample flag to avoid scanning the whole table.

Adding a bunch of indexes will also make the updates by the listers slower, which may have cascading effects, so we should really avoid that if possible.
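As a sketch of the suggestion, PostgreSQL's TABLESAMPLE clause samples a fraction of the table instead of scanning it fully. The query below is illustrative only (the column list and filter come from the plans above; the choice of SYSTEM sampling is an assumption, not what swh-scheduler necessarily uses):

```sql
-- Illustrative: sample ~0.1% of listed_origins' pages instead of a full
-- sequential scan.  SYSTEM is page-level and fast; BERNOULLI is row-level
-- and more uniform, but slower.
SELECT url, visit_type
FROM listed_origins TABLESAMPLE SYSTEM (0.1)
WHERE enabled
  AND visit_type = 'git'
  AND last_update IS NULL
LIMIT 100;
```

The tradeoff is that sampling is non-deterministic, so some eligible origins may be missed in a given run and picked up in a later one.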

The initial index was an attempt to improve the query for the origins_without_last_update policy. Indeed, it is not used by the other policies (never_visited_oldest_update_first / already_visited_order_by_lag), so it is not sufficient.

We tried creating the index with NULLS FIRST to see if it helps, but it does not: it is simply ignored by the never_visited_oldest_update_first and already_visited_order_by_lag policies.

Adding the simple index create index CONCURRENTLY test_T3827_3 on listed_origins (enabled, visit_type, last_update); greatly improves the execution time of the never_visited_oldest_update_first policy (with the same effect as the IS NULL index for origins_without_last_update).
Unfortunately, this index is not used by the third policy, and those queries still run for ages.

never_visited_oldest_update_first without the index:

 Limit  (cost=23441250.92..23441274.25 rows=200 width=111)
   ->  Gather Merge  (cost=23441250.92..36762763.54 rows=114176422 width=111)
         Workers Planned: 2
         ->  Sort  (cost=23440250.89..23582971.42 rows=57088211 width=111)
               Sort Key: listed_origins.last_update
               ->  Parallel Hash Join  (cost=6185052.16..20972939.46 rows=57088211 width=111)
                     Hash Cond: (listed_origins.url = origin_visit_stats.url)
                     ->  Parallel Seq Scan on listed_origins  (cost=0.00..11308672.00 rows=91224877 width=95)
                           Filter: (enabled AND (last_update IS NOT NULL) AND (visit_type = 'git'::text))
                     ->  Parallel Hash  (cost=4891960.28..4891960.28 rows=58075670 width=55)
                           ->  Parallel Seq Scan on origin_visit_stats  (cost=0.00..4891960.28 rows=58075670 width=55)
                                 Filter: ((last_snapshot IS NOT NULL) AND (visit_type = 'git'::text) AND ((last_scheduled IS NULL) OR (last_scheduled < GREATEST('2021-12-27 14:10:48.509526+00'::timestamp with time zone, last_visit))) AND ((last_visit_status IS DISTINCT FROM 'failed'::last_visit_status) OR (last_visit < '2021-12-20 14:10:48.509526+00'::timestamp with time zone)) AND ((last_visit_status IS DISTINCT FROM 'not_found'::last_visit_status) OR (last_visit < '2021-12-03 14:10:48.509526+00'::timestamp with time zone)))
 JIT:
   Functions: 15
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(15 rows)

never_visited_oldest_update_first with the simple index:

                  QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=184963.00..184986.33 rows=200 width=131)
   ->  Gather Merge  (cost=184963.00..189953.18 rows=42770 width=131)
         Workers Planned: 2
         ->  Sort  (cost=183962.97..184016.43 rows=21385 width=131)
               Sort Key: origin_visit_stats.next_visit_queue_position NULLS FIRST, listed_origins.first_seen
               ->  Parallel Hash Left Join  (cost=63879.70..183038.73 rows=21385 width=131)
                     Hash Cond: ((listed_origins.visit_type = origin_visit_stats.visit_type) AND (listed_origins.url = origin_visit_stats.url))
                     Filter: (((origin_visit_stats.last_scheduled IS NULL) OR (origin_visit_stats.last_scheduled < GREATEST('2021-12-27 14:10:48.509526+00'::timestamp with time zone, origin_visit_stats.last_visit))) AND ((origin_visit_stats.last_visit_status IS DISTINCT FROM 'failed'::last_visit_status) OR (origin_visit_stats.last_visit < '2021-12-20 14:10:48.509526+00'::timestamp with time zone)) AND ((origin_visit_stats.last_visit_status IS DISTINCT FROM 'not_found'::last_visit_status) OR (origin_visit_stats.last_visit < '2021-12-03 14:10:48.509526+00'::timestamp with time zone)))
                     ->  Parallel Index Scan using test_t3827_3 on listed_origins  (cost=0.43..114828.68 rows=32719 width=107)
                           Index Cond: ((enabled = true) AND (visit_type = 'git'::text) AND (last_update IS NULL))
                     ->  Parallel Hash  (cost=57180.61..57180.61 rows=241844 width=78)
                           ->  Parallel Seq Scan on origin_visit_stats  (cost=0.00..57180.61 rows=241844 width=78)
                                 Filter: (visit_type = 'git'::text)
 JIT:
   Functions: 19
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(16 rows)

We should give the tablesampling method suggested by olasd a try.
The response times look better, even if still not perfect. For example, for the never_visited_oldest_update_first policy:

  • with a tablesample at 0.05 (0.05%): 108796.726 ms (01:48.797)
  • with a tablesample at 0.1 (0.1%): 163393.448 ms (02:43.393)
  • with a tablesample at 0.5 (0.5%): 567148.719 ms (09:27.149)
  • with a tablesample at 1 (1%): 1031728.699 ms (17:11.729)
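A quick sanity check on the timings above (rounded to seconds) shows the execution time growing almost proportionally with the sample size, which is why the smallest workable sample wins:

```python
# Measured execution times (seconds) for the never_visited_oldest_update_first
# policy at each tablesample percentage, from the runs above.
timings_s = {0.05: 108.8, 0.1: 163.4, 0.5: 567.1, 1.0: 1031.7}

# Query time per percent of table sampled: larger samples are slightly
# cheaper per percent, but total time still grows with the sample size.
cost_per_percent = {pct: t / pct for pct, t in timings_s.items()}

for pct in sorted(timings_s):
    print(f"sample {pct}%: {timings_s[pct]:.1f} s total, "
          f"~{cost_per_percent[pct]:.0f} s per percent")
```
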
ardumont changed the task status from Open to Work in Progress.Jan 4 2022, 4:07 PM
ardumont moved this task from Backlog to in-progress on the System administration board.

A dirty fix forcing the table sampling in the code looks effective:

-- a/swh/scheduler/celery_backend/recurrent_visits.py
+++ b/swh/scheduler/celery_backend/recurrent_visits.py
@@ -103,7 +103,8 @@ def grab_next_visits_policy_weights(
     for policy, ratio in policy_ratio.items():
         num_tasks_to_send = int(num_visits * ratio)
         fetched_origins[policy] = scheduler.grab_next_visits(
-            visit_type, num_tasks_to_send, policy=policy
+            visit_type, num_tasks_to_send, policy=policy,
+            tablesample=0.1,
         )
 
     all_origins: List[ListedOrigin] = list(

A run of the scheduler on the git visit type now completes in ~5 min, and the queue no longer becomes empty between two runs.

Let's now prepare a cleaner fix in the swh-scheduler codebase.
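A cleaner version could thread the sampling rate through as a parameter instead of hardcoding 0.1. The sketch below is hypothetical: the names grab_next_visits_policy_weights and grab_next_visits come from the diff above, but the plumbing of a tablesample argument is an assumption about how the real fix might look.

```python
from typing import Dict, Optional


def grab_next_visits_policy_weights(
    scheduler,
    visit_type: str,
    num_visits: int,
    policy_ratio: Dict[str, float],
    tablesample: Optional[float] = None,
):
    """Split num_visits across policies according to policy_ratio.

    tablesample, if set, is forwarded to the scheduler backend so the
    origin selection query samples the listed_origins table instead of
    scanning it fully (hypothetical plumbing; configurable per deployment).
    """
    fetched_origins = {}
    for policy, ratio in policy_ratio.items():
        num_tasks_to_send = int(num_visits * ratio)
        fetched_origins[policy] = scheduler.grab_next_visits(
            visit_type,
            num_tasks_to_send,
            policy=policy,
            tablesample=tablesample,
        )
    return fetched_origins
```

Making the rate a parameter lets staging and production tune it independently, and leaving it None preserves the previous full-scan behavior.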

vsellier claimed this task.

The fix was released in version v0.23.0 and deployed to staging and production.
Everything looks good.