Page MenuHomeSoftware Heritage

clean up swh-scheduler model db
Closed, MigratedEdits Locked

Description

The current scheduler holds lots of information that are no longer necessary:

  • beside load-nixguix recurrent tasks all other are no longer scheduled by this old model
  • only oneshot tasks are interesting but no more than 2 months old (save code now, deposit, vault, ...)

Those can be archived (and later on be dropped if we don't read in those [1]
This will make the current mechanisms that needs read/write in those tables faster.

[1] we usually don't.

Event Timeline

ardumont triaged this task as Normal priority.Jan 10 2022, 10:48 AM
ardumont created this task.

wip script [1]

Ongoing dump [2] from the staging db to check the script [1] is ok (within equivalent db mounted in docker)

[1] P1253

[2]

postgres@db1:/srv/softwareheritage/postgres/12/main/backup$ pg_dump --host=db1.internal.staging.swh.network --dbname=swh-scheduler --port=5432 --username=swh-scheduler --table task_type --table priority_ratio --table task --table task_run --data-only --no-owner | gzip -c - > $(date +%Y-%m-%dT%H:%M:%SZ)-staging-swh-scheduler-first-model.sql.gz
ardumont changed the task status from Open to Work in Progress.Jan 11 2022, 9:58 AM
ardumont moved this task from Backlog to in-progress on the System administration board.

So script is ok and landed.
I'll trigger it on staging first.

staging:

Triggered the migration (using a transaction as the dataset is not that huge) and all
went well.

Add to do the extra steps:

alter table task owner to "swh-scheduler";
alter table task_run owner to "swh-scheduler";
grant select on all tables in schema public to guest;

Result:

swh-scheduler=# \d+
                                                                                                         List of relations
 Schema |              Name              |   Type   |     Owner     | Persistence | Access method |    Size    |                                                    Description
--------+--------------------------------+----------+---------------+-------------+---------------+------------+-------------------------------------------------------------------------------------------------------------------
 public | archive_task                   | table    | swh-scheduler | permanent   | heap          | 14 GB      | Schedule of recurring tasks
 public | archive_task_run               | table    | swh-scheduler | permanent   | heap          | 12 GB      | History of task runs sent to the job-running backend
 public | dbversion                      | table    | swh-scheduler | permanent   | heap          | 48 kB      | Schema update tracking
 public | listed_origins                 | table    | swh-scheduler | permanent   | heap          | 1327 MB    | Origins known to the origin visit scheduler
 public | listers                        | table    | swh-scheduler | permanent   | heap          | 2344 kB    | Lister instances known to the origin visit scheduler
 public | origin_visit_stats             | table    | swh-scheduler | permanent   | heap          | 370 MB     |
 public | priority_ratio                 | table    | swh-scheduler | permanent   | heap          | 40 kB      | Oneshot task's reading ratio per priority
 public | scheduler_metrics              | table    | swh-scheduler | permanent   | heap          | 56 kB      | Cache of per-lister metrics for the scheduler, collated between the listed_origins and origin_visit_stats tables.
 public | task                           | table    | swh-scheduler | permanent   | heap          | 137 MB     |
 public | task_id_seq                    | sequence | swh-scheduler | permanent   |               | 8192 bytes |
 public | task_run                       | table    | swh-scheduler | permanent   | heap          | 192 MB     |
 public | task_run_id_seq                | sequence | swh-scheduler | permanent   |               | 8192 bytes |
 public | task_type                      | table    | swh-scheduler | permanent   | heap          | 56 kB      | Types of schedulable tasks
 public | visit_scheduler_queue_position | table    | swh-scheduler | permanent   | heap          | 48 kB      | Current queue position for the recurrent visit scheduler
(14 rows)

And same goes for production [1].

We gain a bit of space in those tables:

  • task: 209G -> 864M
  • task_run: 190G -> 84M

At some point, we can reclaim space by dropping the archive_* tables.

[1]

15:05:07 softwareheritage-scheduler@belvedere:5432=> \d+
                                                                                          List of relations
+--------+--------------------------------+----------+--------------+------------+-------------------------------------------------------------------------------------------------------------------+
| Schema |              Name              |   Type   |    Owner     |    Size    |                                                    Description                                                    |
+--------+--------------------------------+----------+--------------+------------+-------------------------------------------------------------------------------------------------------------------+
| public | archive_task                   | table    | swhscheduler | 209 GB     | Schedule of recurring tasks                                                                                       |
| public | archive_task_run               | table    | swhscheduler | 190 GB     | History of task runs sent to the job-running backend                                                              |
| public | dbversion                      | table    | swhscheduler | 48 kB      | Schema update tracking                                                                                            |
| public | listed_origins                 | table    | swhscheduler | 79 GB      | Origins known to the origin visit scheduler                                                                       |
| public | listers                        | table    | swhscheduler | 12 MB      | Lister instances known to the origin visit scheduler                                                              |
| public | origin_visit_stats             | table    | swhscheduler | 23 GB      |                                                                                                                   |
| public | pg_stat_statements             | view     | postgres     | 0 bytes    |                                                                                                                   |
| public | priority_ratio                 | table    | swhscheduler | 40 kB      | Oneshot task's reading ratio per priority                                                                         |
| public | scheduler_metrics              | table    | swhscheduler | 56 kB      | Cache of per-lister metrics for the scheduler, collated between the listed_origins and origin_visit_stats tables. |
| public | task                           | table    | swhscheduler | 864 MB     |                                                                                                                   |
| public | task_id_seq                    | sequence | swhscheduler | 8192 bytes |                                                                                                                   |
| public | task_run                       | table    | swhscheduler | 84 MB      |                                                                                                                   |
| public | task_run_id_seq                | sequence | swhscheduler | 8192 bytes |                                                                                                                   |
| public | task_type                      | table    | swhscheduler | 56 kB      | Types of schedulable tasks                                                                                        |
| public | visit_scheduler_queue_position | table    | swhscheduler | 56 kB      | Current queue position for the recurrent visit scheduler                                                          |
+--------+--------------------------------+----------+--------------+------------+-------------------------------------------------------------------------------------------------------------------+
(15 rows)
ardumont claimed this task.
ardumont moved this task from deployed/landed/monitoring to done on the System administration board.