⚓ T1534 PostgreSQL replication issues between prado and somerset

olasd triaged this task as High priority.Feb 19 2019, 2:10 PM

olasd created this task.

olasd added a subscriber: ftigeot.Feb 19 2019, 2:26 PM

olasd renamed this task from PostgreSQL replication issues between prado and beaubourg to PostgreSQL replication issues between prado and somerset.Feb 19 2019, 3:22 PM

olasd updated the task description. (Show Details)

Logs on primary:

2019-02-19 14:27:44 UTC [15973]: [1-1] user=postgres,db=softwareheritage LOG:  starting logical decoding for slot "pgl_softwareheritage_prado_somerset"
2019-02-19 14:27:44 UTC [15973]: [2-1] user=postgres,db=softwareheritage DETAIL:  streaming transactions committing after 18607/9189A578, reading WAL from 18607/9189A578
2019-02-19 14:27:44 UTC [15973]: [3-1] user=postgres,db=softwareheritage ERROR:  record with incorrect prev-link 5403A/2E2F1829 at 18607/9189A578 
2019-02-19 14:27:44 UTC [15973]: [4-1] user=postgres,db=softwareheritage LOG:  could not receive data from client: Connection reset by peer

Logs on replica:

2019-02-19 14:27:44 UTC [25074]: [1-1] user=[unknown],db=softwareheritage LOG:  starting apply for subscription somerset
2019-02-19 14:27:44 UTC [25074]: [2-1] user=[unknown],db=softwareheritage ERROR:  data stream ended
2019-02-19 14:27:44 UTC [25074]: [3-1] user=[unknown],db=softwareheritage LOG:  apply worker [25074] at slot 1 generation 6593 exiting with error

After reading some mailing list posts discussing the error message, and discussion with @ftigeot:

at that point the replication stream probably can't be recovered
we may need to re-do replication from scratch

The problem with re-doing replication from scratch is having to send all data to azure again, which is going to take a long time.

action 1: The first step will be migrating readers from somerset back to prado to limit disruption.

Before we do replication again from scratch, there's a (slim) chance we can recover the replicas by doing the following:

restart the database on somerset, hope it recovers by itself
if it doesn't, stop the replication stream from prado to somerset
let prado recover and clean up its WALs
restart a replication stream from prado to somerset, without resetting the tables
- this is possible with pglogical, not with the builtin logical replication mechanism.
- this means that the replica has a hole from the date the replication failed to the date we're doing the streaming restart
fill the gaps in the tables
- configure a foreign data wrapper to access the live tables on prado
- insert the missing lines on somerset

The main unknowns on this procedure are step 3 (whether prado will recover from its I/O issues), 4 (whether the replication will restart properly) and 5 (whether the gap-filling queries will terminate in a reasonable amount of time).

We're currently blocked on step 1 as @grouss is doing heavy work on somerset for a tight deadline (ETA March 1st to know when we'll be able to restart).

If step 3 fails and prado doesn't recover from its I/O issues, one possible solution would probably be to migrate the primary database to somerset, as this is a more recent machine with more consistent performance. However this creates a new bottleneck on the infrastructure (primary db + primary indexer db + tons of VMs on the same machine).

If step 4 fails, or if step 5 takes an unreasonable amount of time, we need to restart replication from scratch. This extended downtime of the database infrastructure would be an opportunity to:

upgrade to PostgreSQL 11
migrate to the built-in logical replication system instead of pglogical (while slightly more flexible, pglogical 2 is now a 2ndQuadrant-customer-only release https://github.com/2ndQuadrant/pglogical, and IMO it makes sense to move away from it).

Considering the time constraints incurred by Guillaume's work, it might make sense to just scrap the recovery plan altogether and do the PG 11 upgrade right now, recreating a replica once he's done with his current work. We'll let this idea mature for some hours before making a final decision :)

In parallel to all of this, we're making plans for migrating the primary postgresql database(s) to a new server that would be dedicated to them.

After some more stewing and discussion with @zack, we'll be going for the "upgrade to pg 11 and restart replication from scratch" route;

I've dropped the replication slot between prado and somerset with the following commands:

on somerset: select pglogical.drop_subscription('somerset');
on prado: select pg_drop_replication_slot('pgl_softwareheritage_prado_somerset');

We'll see whether that improves prado's i/o behavior.

olasd mentioned this in T1591: Upgrade PostgreSQL clusters up to 11.Mar 20 2019, 3:04 PM

The replicated cluster is now clear to be taken down for a rebuild.

The web frontend has been moved to the primary database.

@grouss may still need access to the occurrence/occurrence_history tables, which I will preserve separately by dumping/recreating them to compact them.

I will then initialize a new postgresql 11 cluster and start the logical replication process between prado and this new cluster

olasd updated the task description. (Show Details)Mar 22 2019, 4:52 PM

olasd updated the task description. (Show Details)Mar 22 2019, 5:31 PM

olasd updated the task description. (Show Details)Mar 22 2019, 5:51 PM

olasd updated the task description. (Show Details)Mar 22 2019, 6:00 PM

olasd updated the task description. (Show Details)Mar 22 2019, 6:32 PM

olasd updated the task description. (Show Details)Mar 22 2019, 11:26 PM

olasd updated the task description. (Show Details)Mar 23 2019, 10:02 AM

olasd updated the task description. (Show Details)Mar 23 2019, 2:29 PM

olasd updated the task description. (Show Details)Mar 25 2019, 10:32 AM

The replication process from prado to somerset is now complete, and the archive frontend has been switched over to this database.

This task has been migrated to GitLab.

PostgreSQL replication issues between prado and somerset
Closed, MigratedEdits Locked
Actions

Description

Related Objects

Event Timeline

PostgreSQL replication issues between prado and somersetClosed, MigratedEdits LockedActions

Description

Related Objects

Event Timeline

PostgreSQL replication issues between prado and somerset
Closed, MigratedEdits Locked
Actions