
Scheduler: Automate completed oneshot or disabled recurring tasks archival
Closed, Migrated

Description

Right now, there is no archival tooling for the scheduler db (no cli, no automation, etc.).

Having such a mechanism would allow us to schedule more oneshot tasks (4B contents indexation, googlecode import, etc.) since we could archive them later on.

Event Timeline

ardumont triaged this task as High priority. Mar 7 2018, 4:06 PM
ardumont created this task.
ardumont lowered the priority of this task from High to Normal.
ardumont renamed this task from Permit cleaning up oneshot tasks after a retention policy period exceeded to Permit oneshot tasks archival. Mar 14 2018, 3:25 PM
ardumont renamed this task from Permit oneshot tasks archival to Scheduler: Permit oneshot tasks archival.
ardumont updated the task description.

Need

Archive 'completed' 'oneshot' tasks and 'disabled' 'recurring' tasks.

Possible solutions

  • Postgresql's partitioning scheme
  • Logstash/Elasticsearch (matching what we do for workers' logs, journalbeat -> logstash -> elasticsearch)

Postgresql's partitioning scheme

Here is a summary of my take on this; I found more cons than pros to
this solution.

There is no automatic way of doing either variant: we would need to
write stored procedures to actually use this. As we already do a lot
of work with stored procedures and triggers, that would mean a
solution which is hard to maintain for users who are not fluent in
postgresql (including me).

|----------------+-------------------------------------------+----------------------------------------------------+---------------------------------------------|
| Solution       | pros                                      | cons                                               | Cons can be improved with                   |
|----------------+-------------------------------------------+----------------------------------------------------+---------------------------------------------|
| 1. Declarative | - improve performance query on main table | - Initial table must be a partition type           |                                             |
|                | - can partition on multiple columns       | (-> means migration first)                         |                                             |
|                | - bulk drop/delete faster                 | - No automatic way (docs demo manual partitioning) | -> maintain our own stored procedure        |
|                |                                           | - High maintenance (create new partition,          |                                             |
|                |                                           | create subpartition indexes, etc...)               |                                             |
|                |                                           | - Clutter the db with a gazillion tables           | -> tablespace separation, schema visibility |
|                |                                           |                                                    | or something                                |
|----------------+-------------------------------------------+----------------------------------------------------+---------------------------------------------|
| 2. Inheritance | - improve performance query on main table | - No automatic way (docs demo manual partitioning) | -> maintain our own stored procedure        |
|                | - can partition on multiple columns       | - High maintenance (create new partition,          |                                             |
|                | - can add partition constraint check (e.g | create subpartition indexes, etc...)               |                                             |
|                | policy is oneshot or (policy is recurring | - Clutter the db with a gazillion tables           | -> tablespace separation, schema visibility |
|                | and status is disabled))                  |                                                    | or something                                |
|                | - bulk drop/delete faster                 |                                                    |                                             |
|----------------+-------------------------------------------+----------------------------------------------------+---------------------------------------------|

The approach could be implemented in its current state with
solution 2: we need more than multiple-column partitioning, we also
need constraint checks (task_status, task_policy).
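
For illustration, here is a minimal sketch of what solution 2 could look like: an inheritance child table carrying such a constraint check, executed through psycopg2. The table and column names (task, policy, status) are assumptions for the sake of the example, not the actual schema.

# Sketch only: archive partition as an inheritance child of a hypothetical
# "task" table, constrained to the rows we want to archive
# (completed oneshot tasks, disabled recurring tasks).
import psycopg2

DDL = """
CREATE TABLE task_archived (
    CHECK ((policy = 'oneshot' AND status = 'completed')
           OR (policy = 'recurring' AND status = 'disabled'))
) INHERITS (task)
"""

with psycopg2.connect(dbname='softwareheritage-scheduler') as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)

Rows would then have to be moved into the child table (and new partitions, indexes, etc. created) by stored procedures we maintain ourselves, which is exactly the maintenance burden listed in the cons above.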

source:
Postgresql's Table partitioning documentation

Logstash/Elasticsearch

|------------------+------------------------------------------+-----------------------------------------+-----------------------------------------------------|
| Solution         | pros                                     | cons                                    | Cons can be improved with                           |
|------------------+------------------------------------------+-----------------------------------------+-----------------------------------------------------|
| 1. Logstash      | - Reuse: unify with the way we push logs | - Will trigger vacuum on scheduler db   |                                                     |
|                  | today                                    | - Possible serialization issue (bytes)  |                                                     |
|                  | - Simple to implement (read db, push,    | - Not everything in the same location   | Can be improved with a specific scheduling admin ui |
|                  | delete old values in db)                 | (one part in db, other in e.s)          |                                                     |
|------------------+------------------------------------------+-----------------------------------------+-----------------------------------------------------|
| 2. Elasticsearch | - Simple to implement (read db, push,    | - Will trigger vacuum on scheduler db   |                                                     |
|                  | delete old values in db)                 | - Possible serialization issues (bytes) |                                                     |
|                  | - python3-elasticsearch debian           | - Not everything in the same location   | Can be improved with a specific scheduling admin ui |
|                  | package (stretch, buster)                | (one part in db, other in e.s)          |                                                     |
|------------------+------------------------------------------+-----------------------------------------+-----------------------------------------------------|

As we are currently converging on having a real elasticsearch cluster
anyway, this approach sounds more reasonable and maintainable
long-term. Also, it would be in sync with the initial need of the parent task.

And, as said there, it is also more in sync with the current way of
archiving worker logs.

It remains to be determined whether we want to push to 1. logstash
(which supports data as well, not only logs ;) or 2. elasticsearch
directly.

I'm leaning more towards the logstash/elasticsearch direction now.
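
To make option 2 (pushing to elasticsearch directly) concrete, here is a rough sketch of the read-db / push / delete-later loop. The index name (swh-tasks), query, and column names are assumptions for illustration; the actual implementation will likely differ.

# Sketch only: select archivable tasks, bulk-index them into elasticsearch,
# and only delete them from the scheduler db once the push is verified.
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(['esnode1:9200', 'esnode2:9200'])

QUERY = """
SELECT * FROM task
WHERE (policy = 'oneshot' AND status = 'completed')
   OR (policy = 'recurring' AND status = 'disabled')
"""

with psycopg2.connect(dbname='softwareheritage-scheduler') as conn:
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(QUERY)
        actions = ({'_index': 'swh-tasks',
                    '_type': 'task',
                    '_id': row['id'],
                    # bytes/datetime values would need explicit serialization
                    # here (the "possible serialization issue" noted above)
                    '_source': dict(row)}
                   for row in cur)
        bulk(es, actions)
# the deletion step would only run after checking the documents landed

Going through logstash instead would mean shipping the same documents to a logstash input rather than calling elasticsearch directly.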

ardumont changed the task status from Open to Work in Progress. Mar 22 2018, 2:25 PM

Status on this: things are converging.

I have been using local instances so as not to break current production, which means:

  • a local elasticsearch instance. This helped in designing the index to hold the task data (P240); a rough sketch of such a mapping is shown after this list.
  • a local db holding the current swh-scheduler's state to have consistent production data (dump is at prado:/srv/remote-backups/postgres/dumps/softwareheritage-scheduler-2018-03-14.tar.gz).
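
For context, here is a rough sketch of the kind of index mapping this leads to (elasticsearch 5.x style); field names and types are illustrative assumptions only, the actual design lives in P240.

# Sketch only: a possible mapping for the task archive index.
# Field names/types are assumptions, not the actual P240 design.
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

es.indices.create(index='swh-tasks', body={
    'mappings': {
        'task': {
            'properties': {
                'type':      {'type': 'keyword'},
                'policy':    {'type': 'keyword'},
                'status':    {'type': 'keyword'},
                'arguments': {'type': 'object', 'enabled': False},
                'next_run':  {'type': 'date'},
            }
        }
    }
})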

Tests so far are good (without the deletion step yet though; it's coded but not production-tested, production meaning my local reproduction).
Now heading for the clean-up tests.
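
The clean-up itself should boil down to deleting the rows that were confirmed as indexed; a minimal sketch, using the same hypothetical table and column names as above:

# Sketch only: remove tasks from the scheduler db once elasticsearch has
# acknowledged them. Table/column names are assumptions for illustration.
import psycopg2

def cleanup(conn, archived_ids):
    """Delete the tasks whose ids were successfully archived."""
    with conn.cursor() as cur:
        cur.execute("DELETE FROM task WHERE id = ANY(%s)",
                    (list(archived_ids),))
    conn.commit()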

ardumont renamed this task from Scheduler: Permit oneshot tasks archival to Scheduler: Permit tasks archival (completed oneshot or disabled recurring). Mar 26 2018, 11:31 AM

After re-packaging a version of python3-elasticsearch for debian stable in our debian repository (using testing's 5.4.0 version), it's now deployed and running using @ftigeot's elasticsearch cluster instance.

It uses the actual swh-scheduler db (without cleanup as this is a tryout run).

This was a success btw.

Now, the elasticsearch cluster is ready (holding esnode1 and esnode2 for now; a new node will be added soon).
Started a swh-scheduler db dump prior to triggering the actual archival:

postgres@prado:/srv/remote-backups/postgres/dumps$ pg_dump -p 5434 --format tar softwareheritage-scheduler | gzip -c - > ./swh-scheduler.tar.gz

Well, that dump failed: pg_dump needs to create temporary files along the way, which saturated / on prado.
This is quite unsettling, as this is not documented in man pg_dump...

So, I finally settled on:

postgres@prado:/srv/remote-backups/postgres/dumps/T986 $ pg_dump -p 5434 --format directory --file swh-scheduler-$(date -d 'now' +"%Y-%m-%d") softwareheritage-scheduler

And now the dump is running, with prado's remaining disk space on / staying stable \m/

ardumont renamed this task from Scheduler: Permit tasks archival (completed oneshot or disabled recurring) to Scheduler: Automate completed oneshot or disabled recurring tasks archival. May 25 2018, 8:08 AM