Handle multiple "database profiles" in the swh-storage (/...) SQL configurations
Closed, Migrated

Authored By

	olasd
	Sep 16 2020, 5:32 PM

Description

swh.storage and its underlying PostgreSQL database are deployed under several operational conditions:

- "ingestion" database: the main read-write backend, with lots of indexes for efficient deduplication (as well as indexes for read queries, for when we need to re-point the frontends away from the read replicas).
- "read replica" database, with
  - a limited set of constraints (those already checked on the ingestion database aren't needed on the replica)
  - a limited set of indexes for read queries
- "mirrored" database: filled by the journal replayer, very similar to the "ingestion" database, but it needs to support out-of-order insertion of objects (e.g. origin_visit_status before origin_visit, see P770).

In the current deployments, we cherry-pick which SQL statements we run for migrations. This yields inconsistent and hard-to-reproduce results, which is most obvious when trying to set up new mirrors.

After a short discussion with @douardda, a few potential solutions have emerged to solve this issue:

1. Use a templating system to generate SQL files for each database profile. This would have to cover the main SQL files *and* the migrations. Probably overkill.
2. Have subdirectories of the main SQL files with additional queries to run for each database profile. This needs a separate set of migrations for all objects.
3. Add a new "meta" table in the database (with contents defaulting to "ingestion database"), and guard the relevant statements with a select on this table. This can be implemented with a single set of SQL files and a single set of SQL migration scripts. It should also be compatible with any other SQL-based way of managing schema migrations.
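
A minimal sketch of what the "meta" table from the third option could look like. The type name, table name, column names and the 'default' and 'mirror' values are assumptions made for illustration; only the swh_get_dbflavor() function name and the 'read_replica' value appear later in this task.

```sql
-- Hypothetical schema: type, table and column names are illustrative only.
create type database_flavor as enum (
  'default',       -- assumed name for the "ingestion" database profile
  'mirror',
  'read_replica'
);

-- Single-row table recording which profile this database follows.
-- The boolean primary key with a check constraint allows at most one row.
create table dbflavor (
  flavor      database_flavor not null,
  single_row  boolean primary key default true check (single_row)
);

-- Returns the recorded flavor, falling back to 'default' (ingestion)
-- when the table is empty.
create or replace function swh_get_dbflavor() returns database_flavor
language sql stable as $$
  select coalesce((select flavor from dbflavor), 'default'::database_flavor);
$$;
```

With something like this in place, a deployment would only need to insert one row (for instance 'read_replica') before running the shared SQL files, and every guarded statement could consult swh_get_dbflavor().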

I'll take a stab at the third option and see where that gets us.

Revisions and Commits

rDSTO Storage manager
	D3981	rDSTO829118a63f37 Add the SQL commands used to set up the logical replication publication
	D3981	rDSTO5d3de067e838 Support different database flavors in the SQL scripts
	D3981	rDSTO63426e6cda5f pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2

Related Objects

Mentioned Here: P770 tb in the mirror

Event Timeline

olasd triaged this task as Normal priority. Sep 16 2020, 5:32 PM

olasd created this task.

vsellier added a subscriber: vsellier. Sep 16 2020, 5:41 PM

olasd added a revision: D3981: Support different database flavors in the SQL scripts. Sep 17 2020, 8:05 PM

So, I've first attempted something along the lines of https://www.depesz.com/2008/06/18/conditional-ddl/, which uses a function to execute DDL commands stored in a string.

This would have led to SQL scripts looking like:

select execute(
  case swh_get_dbflavor()
    when 'read_replica' then $$
      create index concurrently [...];
    $$ else $$
      create unique index concurrently [...];
      alter table xxx add primary key using yyy;
    $$
  end
);

This syntax is very clunky, and in any case the statements would execute inside a transaction, which prevents the "concurrently" variants from running (create index concurrently cannot run inside a transaction block).

Fortunately, PostgreSQL 10 introduced a conditional block syntax in the psql shell: \if, \elif, \else, \endif (https://www.postgresql.org/docs/12/app-psql.html#PSQL-METACOMMAND-IF)

It only supports "static expressions" that evaluate to something that looks like a boolean, so the setup is a bit annoying (you have to \gset some boolean constants instead of using proper sql expressions), but the syntax itself doesn't look that bad. The same boilerplate would work on upgrade scripts too.
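
As a rough sketch of that boilerplate, a psql script could look like the following. The example_object table and the index names are hypothetical, and the psql variable name is an assumption; swh_get_dbflavor() and the 'read_replica' value are taken from the snippet above.

```sql
-- \if only accepts boolean-like constants, so evaluate the condition in SQL
-- and store the result in a psql variable with \gset.
select swh_get_dbflavor() = 'read_replica' as dbflavor_read_replica \gset

\if :dbflavor_read_replica
  -- Read replica: a plain index is enough, uniqueness is enforced upstream.
  create index concurrently example_object_id_idx
    on example_object (id);
\else
  -- Other flavors: build the unique index, then promote it to primary key.
  create unique index concurrently example_object_pkey
    on example_object (id);
  alter table example_object
    add primary key using index example_object_pkey;
\endif
```

Such a script has to be loaded with psql (for example psql -f) and without --single-transaction, so that each statement commits on its own and the concurrently variants are allowed to run.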

Now my main problem is that I don't know how to set up tests for these new database flavors. I think we should at least add a "database initialization" smoke test for each flavor.

Btw, sqitch uses the native scripting for the database engine (in case of postgres, psql scripts), so if we ever end up using it, we can keep this approach.

olasd added a commit: rDSTO63426e6cda5f: pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2. Sep 24 2020, 4:25 PM

olasd added a commit: rDSTO5d3de067e838: Support different database flavors in the SQL scripts.

olasd added a commit: rDSTO829118a63f37: Add the SQL commands used to set up the logical replication publication.

This task has been migrated to GitLab.
