swh.storage, and the underlying PostgreSQL database, is deployed under various operational conditions:
- "ingestion" database, the main read-write backend, with lots of indexes for efficient deduplication (as well as indexes for read queries when we need to re-point the frontends away from the read replicas).
- "read replica" database, with
- a limited set of constraints which are already checked on the ingestion database, and aren't needed on it
- a limited set of indexes for read queries
- "mirrored" database, filled by the journal replayer, which is very much alike the "ingestion" database, but which needs to support out of order insertion of objects (e.g. origin_visit_status before origin_visit, P770).
In the current deployments, we cherry pick which SQL statements we run for migrations. This yields inconsistent and hard to reproduce results, which is most obvious when trying to set up new mirrors.
After a short discussion with @douardda, a few potential solutions have emerged to solve this issue:
- use a templating system to generate SQL files for each database profile. Needs to be genericized for the main SQL files *and* migrations. Probably overkill.
- have subdirectories of the main SQL files with additional queries to run for each database profile. Needs to have a separate set of migrations for all objects.
- add a new "meta" table in the database (with contents defaulting to "ingestion database"), and guard the relevant statements with a select on this table. This can be implemented with a single set of SQL files and a single set of SQL migration scripts; This should also be compatible with any other SQL-based way of managing schema migrations.
I'll give a stab at the third option and see where that gets us.