Page MenuHomeSoftware Heritage

Handle multiple "database profiles" in the swh-storage (/...) SQL configurations
Open, NormalPublic

Description, and the underlying PostgreSQL database, is deployed under various operational conditions:

  • "ingestion" database, the main read-write backend, with lots of indexes for efficient deduplication (as well as indexes for read queries when we need to re-point the frontends away from the read replicas).
  • "read replica" database, with
    • a limited set of constraints which are already checked on the ingestion database, and aren't needed on it
    • a limited set of indexes for read queries
  • "mirrored" database, filled by the journal replayer, which is very much alike the "ingestion" database, but which needs to support out of order insertion of objects (e.g. origin_visit_status before origin_visit, P770).

In the current deployments, we cherry pick which SQL statements we run for migrations. This yields inconsistent and hard to reproduce results, which is most obvious when trying to set up new mirrors.

After a short discussion with @douardda, a few potential solutions have emerged to solve this issue:

  • use a templating system to generate SQL files for each database profile. Needs to be genericized for the main SQL files *and* migrations. Probably overkill.
  • have subdirectories of the main SQL files with additional queries to run for each database profile. Needs to have a separate set of migrations for all objects.
  • add a new "meta" table in the database (with contents defaulting to "ingestion database"), and guard the relevant statements with a select on this table. This can be implemented with a single set of SQL files and a single set of SQL migration scripts; This should also be compatible with any other SQL-based way of managing schema migrations.

I'll give a stab at the third option and see where that gets us.

Related Objects

Event Timeline

olasd triaged this task as Normal priority.Sep 16 2020, 5:32 PM
olasd created this task.

So, I've first attempted something along the lines of, which uses a function to execute DDL commands stored in a string.

This would have led to SQL scripts looking like:

select execute(
  case swh_get_dbflavor()
    when 'read_replica' then $$
      create index concurrently [...];
    $$ else $$
      create unique index concurrently [...];
      alter table xxx add primary key using yyy;

This is very clunky syntax, and in any case this would execute in a transaction, preventing the execution of the concurrently clauses.

Fortunately, PostgreSQL 11 has introduced a conditional block syntax in the psql shell: \if \elif \else \endif (

It only supports "static expressions" that evaluate to something that looks like a boolean, so the setup is a bit annoying (you have to \gset some boolean constants instead of using proper sql expressions), but the syntax itself doesn't look that bad. The same boilerplate would work on upgrade scripts too.

Now my main problem is that I don't know how to set up tests for these new database flavors. I think we should at least add a "database initialization" smoke test for each flavor.

Btw, sqitch uses the native scripting for the database engine (in case of postgres, psql scripts), so if we ever end up using it, we can keep this approach.