Page MenuHomeSoftware Heritage

Support different database flavors in the SQL scripts
ClosedPublic

Authored by olasd on Sep 17 2020, 8:05 PM.

Details

Summary

This uses a new database table and some psql conditionals to introduce three
different flavors for the swh.storage Postgres database:

  • the 'default' flavor has all the deduplication features, foreign keys and read indexes
  • the 'mirror' flavor has all the deduplication features and read indexes; it drops some foreign keys to allow for out of order addition of some object types
  • the 'read_replica' flavor has the minimal set of indexes to support read queries, and replication using the PostgreSQL logical replication feature

We still need to introduce a way to initialize the "flavor" of the database in
swh db-init, between playing 00-swh-flavor.sql and 60-swh-indexes.sql.

Related to T2604.

(To support this feature, a preceding commit replaces our use of psycopg2 with
psql for initialization of the test databases)

Test Plan

For now, only tested with tox (using the default, full-fat database flavor).

Diff Detail

Repository
rDSTO Storage manager
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 15496
Build 23870: Phabricator diff pipeline on jenkinsJenkins console · Jenkins
Build 23869: arc lint + arc unit

Event Timeline

remove spurious flag on an \else block

Build is green

Patch application report for D3981 (id=14036)

Rebasing onto b0027abc34...

First, rewinding head to replay your work on top of it...
Applying: pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2
Using index info to reconstruct a base tree...
M	swh/storage/pytest_plugin.py
Falling back to patching base and 3-way merge...
Auto-merging swh/storage/pytest_plugin.py
CONFLICT (content): Merge conflict in swh/storage/pytest_plugin.py
Patch failed at 0001 pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto b0027abc34...

Already up to date.
Changes applied before test
commit 1f8172baa6549e1655d4fec22d1a4378c3494b1d
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 17 19:55:58 2020 +0200

    Support different database flavors in the SQL scripts
    
    This uses a new database table and some psql conditionals to introduce three
    different flavors for the swh.storage Postgres database:
    
     - the 'default' flavor has all the deduplication features, foreign keys and
     read indexes
     - the 'mirror' flavor has all the deduplication features and read indexes; it
     drops some foreign keys to allow for out of order addition of some object types
     - the 'read_replica' flavor has the minimal set of indexes to support
     read queries, and replication using the PostgreSQL logical replication feature
    
    We still need to introduce a way to initialize the "flavor" of the database in
    `swh db-init`, between playing 00-swh-flavor.sql and 60-swh-indexes.sql.
    
    Related to T2604.

commit aff747c14609f48c449ab834801d5e761676e16c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 17 19:53:45 2020 +0200

    pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2
    
    This avoids running into issues when the SQL files contain psql-specific
    features like backslash-escapes.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/949/ for more details.

Build is green

Patch application report for D3981 (id=14038)

Rebasing onto b0027abc34...

First, rewinding head to replay your work on top of it...
Applying: pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2
Using index info to reconstruct a base tree...
M	swh/storage/pytest_plugin.py
Falling back to patching base and 3-way merge...
Auto-merging swh/storage/pytest_plugin.py
CONFLICT (content): Merge conflict in swh/storage/pytest_plugin.py
Patch failed at 0001 pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2

Resolve all conflicts manually, mark them as resolved with
"git add/rm <conflicted_files>", then run "git rebase --continue".
You can instead skip this commit: run "git rebase --skip".
To abort and get back to the state before "git rebase", run "git rebase --abort".

Rebase failed (ret=1)!

Could not rebase; Attempt merge onto b0027abc34...

Already up to date.
Changes applied before test
commit 775cdb8d686a3be6343eae2ab895f69da22ceef6
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 17 19:55:58 2020 +0200

    Support different database flavors in the SQL scripts
    
    This uses a new database table and some psql conditionals to introduce three
    different flavors for the swh.storage Postgres database:
    
     - the 'default' flavor has all the deduplication features, foreign keys and
     read indexes
     - the 'mirror' flavor has all the deduplication features and read indexes; it
     drops some foreign keys to allow for out of order addition of some object types
     - the 'read_replica' flavor has the minimal set of indexes to support
     read queries, and replication using the PostgreSQL logical replication feature
    
    We still need to introduce a way to initialize the "flavor" of the database in
    `swh db-init`, between playing 00-swh-flavor.sql and 60-swh-indexes.sql.
    
    Related to T2604.

commit aff747c14609f48c449ab834801d5e761676e16c
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 17 19:53:45 2020 +0200

    pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2
    
    This avoids running into issues when the SQL files contain psql-specific
    features like backslash-escapes.

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/950/ for more details.

Is this a WIP? If yes, what are the -- ? comments?

And as discussed on IRC, we might want to use unique indices in read replicas too, so we have less differences between flavors.

Is this a WIP? If yes, what are the -- ? comments?

And as discussed on IRC, we might want to use unique indices in read replicas too, so we have less differences between flavors.

No, this is not a WIP.

The -- ? comments are indexes that don't seem relevant to any existing queries, and that we should consider dropping from the main db if they're really not.

what about the deduplication flavor? it's not described in the ci message. Also this description of flavors should be documented somewhere (I mean somewhere else than the ci message)

@douardda deduplication means it's not read_replica, so either mirror or default.

We want deduplication for default (it's the one written to by loaders), and also for mirror so they filter out duplicates from the journal.

I'm a little conflicted about this diff, because it adds complexity / different combinations in the code. But on the other hand, it only reflects complexity we already have in production, and lays it out explicitly, so I think it's good overall.

I didn't review the actual changes yet, though.

@douardda deduplication means it's not read_replica, so either mirror or default.

We want deduplication for default (it's the one written to by loaders), and also for mirror so they filter out duplicates from the journal.

I'll rename the variable dbflavor_does_deduplication which will be clearer.

I'll also handle your unique index comments.

In D3981#98779, @olasd wrote:

I'll rename the variable dbflavor_does_deduplication which will be clearer.

👍

Also this description of flavors should be documented somewhere (I mean somewhere else than the ci message)

I agree. There's some comments in the 00-swh-flavors.sql, but I can't do an actual PostgreSQL *comment* on the enum entries. I'll add a file to the sphinx documentation about the different database flavors and how to use them.

ardumont added a subscriber: ardumont.

lgtm

Don't really know what make of the:

  • -- ? comment though. Is it reasonable to just drop those (in another diff or something) and check where it goes?
  • testing of the db flavors? Maybe running daily jobs with different db flavor each day would be enough (subscribing to -devel so we can pick it up if it fails)?
swh/storage/sql/60-swh-indexes.sql
120 ↗(On Diff #14038)

Wondering whether those indexes (and the one before marked as -- ?) were not
installed in the first place to ease some introspection done independently from
the production (with the guest user a long time ago)...

Which finally made it through here so we got no more discrepancy between code
and production... (going the wrong way though... that is, it should have been
the other way around, dropping the unused indexes instead of adding in the
code...)

This revision is now accepted and ready to land.Sep 21 2020, 2:28 PM
  • Rename some variables to be clarer;
  • Clean up some discrepancies between default and read_replica;
  • Add a setup script for the logical publisher;
  • Rebase on top of latest swh.storage

Build has FAILED

Patch application report for D3981 (id=14217)

Rebasing onto e37f63930b...

Current branch diff-target is up to date.
Changes applied before test
commit bbf474d5d1a49757fe40a9404336031cb9e7488f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 24 12:44:17 2020 +0200

    Add the SQL commands used to set up the logical replication publication

commit 40c80222c4e734f24c7ca36a0e46fa60e24a51b1
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 17 19:55:58 2020 +0200

    Support different database flavors in the SQL scripts
    
    This uses a new database table and some psql conditionals to introduce three
    different flavors for the swh.storage Postgres database:
    
     - the 'default' flavor has all the deduplication features, foreign keys and
     read indexes
     - the 'mirror' flavor has all the deduplication features and read indexes; it
     drops some foreign keys to allow for out of order addition of some object types
     - the 'read_replica' flavor has the minimal set of indexes to support
     read queries, and replication using the PostgreSQL logical replication feature
    
    Related to T2604.

commit bf81e4bdc22ac92dd15b9bc95e41a31652171d92
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 17 19:53:45 2020 +0200

    pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2
    
    This avoids running into issues when the SQL files contain psql-specific
    features like backslash-escapes.

Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/970/
See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/970/console

Properly bump swh.storage.postgresql.db.Db.current_version

Build is green

Patch application report for D3981 (id=14221)

Could not rebase; Attempt merge onto e37f63930b...

Updating e37f6393..829118a6
Fast-forward
 sql/upgrades/163.sql                               |  29 +++
 swh/storage/postgresql/db.py                       |  15 +-
 swh/storage/pytest_plugin.py                       |  30 ++-
 swh/storage/sql/15-flavor.sql                      |  22 +++
 swh/storage/sql/30-schema.sql                      |   2 +-
 swh/storage/sql/60-indexes.sql                     | 215 ++++++++++++++-------
 .../sql/logical_replication/replication_source.sql |  25 +++
 7 files changed, 255 insertions(+), 83 deletions(-)
 create mode 100644 sql/upgrades/163.sql
 create mode 100644 swh/storage/sql/15-flavor.sql
 create mode 100644 swh/storage/sql/logical_replication/replication_source.sql
Changes applied before test
commit 829118a63f37151fdc0db82eee312d6a6eb78e10
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 24 12:44:17 2020 +0200

    Add the SQL commands used to set up the logical replication publication

commit 5d3de067e838fec32a4864219ab3ca70ff39dd52
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 17 19:55:58 2020 +0200

    Support different database flavors in the SQL scripts
    
    This uses a new database table and some psql conditionals to introduce three
    different flavors for the swh.storage Postgres database:
    
     - the 'default' flavor has all the deduplication features, foreign keys and
     read indexes
     - the 'mirror' flavor has all the deduplication features and read indexes; it
     drops some foreign keys to allow for out of order addition of some object types
     - the 'read_replica' flavor has the minimal set of indexes to support
     read queries, and replication using the PostgreSQL logical replication feature
    
    Related to T2604.

commit 63426e6cda5f1ad5899035d656295a12322d625b
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 17 19:53:45 2020 +0200

    pytest_plugin: Use psql to load SQL files instead of connecting with psycopg2
    
    This avoids running into issues when the SQL files contain psql-specific
    features like backslash-escapes.

commit 38b1dbf38c3d1b0e008fd0c7e6942f7e1300518f
Author: Nicolas Dandrimont <nicolas@dandrimont.eu>
Date:   Thu Sep 24 13:53:55 2020 +0200

    Output a warning when the version of the database is different than expected

See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/972/ for more details.

This revision is now accepted and ready to land.Sep 24 2020, 2:35 PM

Ah, nice the new replication instructions.