
Scheduler: Automate completed oneshot or disabled recurring tasks archival
Closed, Migrated

Description

Right now, there is no archival tooling for the scheduler db (no cli, no automation, etc.).

Having such a mechanism would allow us to schedule more oneshot tasks (4B contents indexation, googlecode import, etc.) since we could archive them later on.

Event Timeline

ardumont triaged this task as High priority. Mar 7 2018, 4:06 PM
ardumont created this task.
ardumont lowered the priority of this task from High to Normal.
ardumont renamed this task from Permit cleaning up oneshot tasks after a retention policy period exceeded to Permit oneshot tasks archival. Mar 14 2018, 3:25 PM
ardumont renamed this task from Permit oneshot tasks archival to Scheduler: Permit oneshot tasks archival.
ardumont updated the task description.

Need

Archive 'completed' 'oneshot' tasks and 'disabled' 'recurring' tasks.

Possible solutions

  • Postgresql's partitioning scheme
  • Logstash/Elasticsearch (matching what we do for workers' logs, journalbeat -> logstash -> elasticsearch)

Postgresql's partitioning scheme

Here is a summary of my take on this; I found more cons than pros to
this solution.

There is no automatic way of doing either variant: we would need to
write stored procedures to actually use this. As we already do a lot
of work with stored procedures and triggers, that would mean a
solution which is hard to maintain for users who are not fluent in
postgresql (including me).

|----------------+-------------------------------------------+----------------------------------------------------+---------------------------------------------|
| Solution       | pros                                      | cons                                               | Cons can be improved with                   |
|----------------+-------------------------------------------+----------------------------------------------------+---------------------------------------------|
| 1. Declarative | - improve performance query on main table | - Initial table must be a partition type           |                                             |
|                | - can partition on multiple columns       | (-> means migration first)                         |                                             |
|                | - bulk drop/delete faster                 | - No automatic way (docs demo manual partitioning) | -> maintain our own stored procedure        |
|                |                                           | - High maintenance (create new partition,          |                                             |
|                |                                           | create subpartition indexes, etc...)               |                                             |
|                |                                           | - Clutter the db with a gazillion tables           | -> tablespace separation, schema visibility |
|                |                                           |                                                    | or something                                |
|----------------+-------------------------------------------+----------------------------------------------------+---------------------------------------------|
| 2. Inheritance | - improve performance query on main table | - No automatic way (docs demo manual partitioning) | -> maintain our own stored procedure        |
|                | - can partition on multiple columns       | - High maintenance (create new partition,          |                                             |
|                | - can add partition constraint check (e.g | create subpartition indexes, etc...)               |                                             |
|                | policy is oneshot or (policy is recurring | - Clutter the db with a gazillion tables           | -> tablespace separation, schema visibility |
|                | and status is disabled))                  |                                                    | or something                                |
|                | - bulk drop/delete faster                 |                                                    |                                             |
|----------------+-------------------------------------------+----------------------------------------------------+---------------------------------------------|

The approach could be implemented in its current state with
solution 2: we need more than multiple-column partitioning, we also
need constraint checks (task_status, task_policy).
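
For illustration, here is a minimal sketch of what solution 2 could look like: an inheritance child table carrying such a constraint check, executed through psycopg2. The table and column names (task, policy, status) are assumptions for the sake of the example, not the actual schema.

# Sketch only: archive partition as an inheritance child of a hypothetical
# "task" table, constrained to the rows we want to archive
# (completed oneshot tasks, disabled recurring tasks).
import psycopg2

DDL = """
CREATE TABLE task_archived (
    CHECK ((policy = 'oneshot' AND status = 'completed')
           OR (policy = 'recurring' AND status = 'disabled'))
) INHERITS (task)
"""

with psycopg2.connect(dbname='softwareheritage-scheduler') as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)

Rows would then have to be moved into the child table (and new partitions, indexes, etc. created) by stored procedures we maintain ourselves, which is exactly the maintenance burden listed in the cons above.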

source:
Postgresql's Table partitioning documentation

Logstash/Elasticsearch

|------------------+------------------------------------------+-----------------------------------------+-----------------------------------------------------|
| Solution         | pros                                     | cons                                    | Cons can be improved with                           |
|------------------+------------------------------------------+-----------------------------------------+-----------------------------------------------------|
| 1. Logstash      | - Reuse: unify with the way we push logs | - Will trigger vacuum on scheduler db   |                                                     |
|                  | today                                    | - Possible serialization issue (bytes)  |                                                     |
|                  | - Simple to implement (read db, push,    | - Not everything in the same location   | Can be improved with a specific scheduling admin ui |
|                  | delete old values in db)                 | (one part in db, other in e.s)          |                                                     |
|------------------+------------------------------------------+-----------------------------------------+-----------------------------------------------------|
| 2. Elasticsearch | - Simple to implement (read db, push,    | - Will trigger vacuum on scheduler db   |                                                     |
|                  | delete old values in db)                 | - Possible serialization issues (bytes) |                                                     |
|                  | - python3-elasticsearch debian           | - Not everything in the same location   | Can be improved with a specific scheduling admin ui |
|                  | package (stretch, buster)                | (one part in db, other in e.s)          |                                                     |
|------------------+------------------------------------------+-----------------------------------------+-----------------------------------------------------|

As we are currently converging on having a real elasticsearch cluster
anyway, this approach sounds more reasonable and maintainable
long-term. Also, it would be in sync with the initial need of the parent task.

And, as said there, it is also more in sync with the current way of
archiving worker logs.

It remains to be determined whether we want to push to 1. logstash
(which supports data as well, not only logs ;) or 2. elasticsearch
directly.

I'm leaning more towards the logstash/elasticsearch direction now.
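
To make option 2 (pushing to elasticsearch directly) concrete, here is a rough sketch of the read-db / push / delete-later loop. The index name (swh-tasks), query, and column names are assumptions for illustration; the actual implementation will likely differ.

# Sketch only: select archivable tasks, bulk-index them into elasticsearch,
# and only delete them from the scheduler db once the push is verified.
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(['esnode1:9200', 'esnode2:9200'])

QUERY = """
SELECT * FROM task
WHERE (policy = 'oneshot' AND status = 'completed')
   OR (policy = 'recurring' AND status = 'disabled')
"""

with psycopg2.connect(dbname='softwareheritage-scheduler') as conn:
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(QUERY)
        actions = ({'_index': 'swh-tasks',
                    '_type': 'task',
                    '_id': row['id'],
                    # bytes/datetime values would need explicit serialization
                    # here (the "possible serialization issue" noted above)
                    '_source': dict(row)}
                   for row in cur)
        bulk(es, actions)
# the deletion step would only run after checking the documents landed

Going through logstash instead would mean shipping the same documents to a logstash input rather than calling elasticsearch directly.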

ardumont changed the task status from Open to Work in Progress. Mar 22 2018, 2:25 PM

Status on this: things are converging.

I have been using local instances so as not to break current production, which means:

  • a local elasticsearch instance. This helped in designing the index to hold the task data (P240); a rough sketch of such a mapping is shown after this list.
  • a local db holding the current swh-scheduler's state to have consistent production data (dump is at prado:/srv/remote-backups/postgres/dumps/softwareheritage-scheduler-2018-03-14.tar.gz).
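
For context, here is a rough sketch of the kind of index mapping this leads to (elasticsearch 5.x style); field names and types are illustrative assumptions only, the actual design lives in P240.

# Sketch only: a possible mapping for the task archive index.
# Field names/types are assumptions, not the actual P240 design.
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

es.indices.create(index='swh-tasks', body={
    'mappings': {
        'task': {
            'properties': {
                'type':      {'type': 'keyword'},
                'policy':    {'type': 'keyword'},
                'status':    {'type': 'keyword'},
                'arguments': {'type': 'object', 'enabled': False},
                'next_run':  {'type': 'date'},
            }
        }
    }
})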

Tests so far are good (without the deletion step yet though; it's coded but not production-tested, production meaning my local reproduction).
Now heading for the clean-up tests.
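
The clean-up itself should boil down to deleting the rows that were confirmed as indexed; a minimal sketch, using the same hypothetical table and column names as above:

# Sketch only: remove tasks from the scheduler db once elasticsearch has
# acknowledged them. Table/column names are assumptions for illustration.
import psycopg2

def cleanup(conn, archived_ids):
    """Delete the tasks whose ids were successfully archived."""
    with conn.cursor() as cur:
        cur.execute("DELETE FROM task WHERE id = ANY(%s)",
                    (list(archived_ids),))
    conn.commit()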

ardumont renamed this task from Scheduler: Permit oneshot tasks archival to Scheduler: Permit tasks archival (completed oneshot or disabled recurring). Mar 26 2018, 11:31 AM

After re-packaging a version of python3-elasticsearch for debian stable in our debian repository (using testing's 5.4.0 version), it's now deployed and running using @ftigeot's elasticsearch cluster instance.

It uses the actual swh-scheduler db (without cleanup as this is a tryout run).

This was a success btw.

Now, the elasticsearch cluster is ready (holding esnode1 and esnode2 for now; a new node will be added soon).
Started a swh-scheduler db dump prior to triggering the actual archival:

postgres@prado:/srv/remote-backups/postgres/dumps$ pg_dump -p 5434 --format tar softwareheritage-scheduler | gzip -c - > ./swh-scheduler.tar.gz

Well, that dump failed: pg_dump needs to create temporary files along the way, which saturated / on prado.
This is quite unsettling, as this is not documented in man pg_dump...

So, I finally settled on:

postgres@prado:/srv/remote-backups/postgres/dumps/T986 $ pg_dump -p 5434 --format directory --file swh-scheduler-$(date -d 'now' +"%Y-%m-%d") softwareheritage-scheduler

And now the dump is running, with prado's remaining disk space on / staying stable \m/

ardumont renamed this task from Scheduler: Permit tasks archival (completed oneshot or disabled recurring) to Scheduler: Automate completed oneshot or disabled recurring tasks archival. May 25 2018, 8:08 AM