Page MenuHomeSoftware Heritage

PostgreSQL replication issues between prado and somerset
Open, HighPublic

Description

The PostgreSQL replication of the graph database between prado and somerset is currently hanging. We need to make a plan to fix it, and to implement that plan.

Event Timeline

olasd created this task.Tue, Feb 19, 2:10 PM
olasd triaged this task as High priority.
olasd added a subscriber: ftigeot.Tue, Feb 19, 2:26 PM
olasd renamed this task from PostgreSQL replication issues between prado and beaubourg to PostgreSQL replication issues between prado and somerset.Tue, Feb 19, 3:22 PM
olasd updated the task description. (Show Details)
olasd added a comment.Tue, Feb 19, 3:29 PM

Logs on primary:

2019-02-19 14:27:44 UTC [15973]: [1-1] user=postgres,db=softwareheritage LOG:  starting logical decoding for slot "pgl_softwareheritage_prado_somerset"
2019-02-19 14:27:44 UTC [15973]: [2-1] user=postgres,db=softwareheritage DETAIL:  streaming transactions committing after 18607/9189A578, reading WAL from 18607/9189A578
2019-02-19 14:27:44 UTC [15973]: [3-1] user=postgres,db=softwareheritage ERROR:  record with incorrect prev-link 5403A/2E2F1829 at 18607/9189A578 
2019-02-19 14:27:44 UTC [15973]: [4-1] user=postgres,db=softwareheritage LOG:  could not receive data from client: Connection reset by peer

Logs on replica:

2019-02-19 14:27:44 UTC [25074]: [1-1] user=[unknown],db=softwareheritage LOG:  starting apply for subscription somerset
2019-02-19 14:27:44 UTC [25074]: [2-1] user=[unknown],db=softwareheritage ERROR:  data stream ended
2019-02-19 14:27:44 UTC [25074]: [3-1] user=[unknown],db=softwareheritage LOG:  apply worker [25074] at slot 1 generation 6593 exiting with error
olasd added a subscriber: grouss.Tue, Feb 19, 6:22 PM

After reading some mailing list posts discussing the error message, and discussion with @ftigeot:

  • at that point the replication stream probably can't be recovered
  • we may need to re-do replication from scratch

The problem with re-doing replication from scratch is having to send all data to azure again, which is going to take a long time.

action 1: The first step will be migrating readers from somerset back to prado to limit disruption.

Before we do replication again from scratch, there's a (slim) chance we can recover the replicas by doing the following:

  1. restart the database on somerset, hope it recovers by itself
  2. if it doesn't, stop the replication stream from prado to somerset
  3. let prado recover and clean up its WALs
  4. restart a replication stream from prado to somerset, without resetting the tables
    • this is possible with pglogical, not with the builtin logical replication mechanism.
    • this means that the replica has a hole from the date the replication failed to the date we're doing the streaming restart
  5. fill the gaps in the tables
    • configure a foreign data wrapper to access the live tables on prado
    • insert the missing lines on somerset

The main unknowns on this procedure are step 3 (whether prado will recover from its I/O issues), 4 (whether the replication will restart properly) and 5 (whether the gap-filling queries will terminate in a reasonable amount of time).

We're currently blocked on step 1 as @grouss is doing heavy work on somerset for a tight deadline (ETA March 1st to know when we'll be able to restart).

If step 3 fails and prado doesn't recover from its I/O issues, one possible solution would probably be to migrate the primary database to somerset, as this is a more recent machine with more consistent performance. However this creates a new bottleneck on the infrastructure (primary db + primary indexer db + tons of VMs on the same machine).

If step 4 fails, or if step 5 takes an unreasonable amount of time, we need to restart replication from scratch. This extended downtime of the database infrastructure would be an opportunity to:

  • upgrade to PostgreSQL 11
  • migrate to the built-in logical replication system instead of pglogical (while slightly more flexible, pglogical 2 is now a 2ndQuadrant-customer-only release https://github.com/2ndQuadrant/pglogical, and IMO it makes sense to move away from it).

Considering the time constraints incurred by Guillaume's work, it might make sense to just scrap the recovery plan altogether and do the PG 11 upgrade right now, recreating a replica once he's done with his current work. We'll let this idea mature for some hours before making a final decision :)

In parallel to all of this, we're making plans for migrating the primary postgresql database(s) to a new server that would be dedicated to them.

olasd added a subscriber: zack.Wed, Feb 20, 2:32 PM

After some more stewing and discussion with @zack, we'll be going for the "upgrade to pg 11 and restart replication from scratch" route;

I've dropped the replication slot between prado and somerset with the following commands:

  • on somerset: select pglogical.drop_subscription('somerset');
  • on prado: select pg_drop_replication_slot('pgl_softwareheritage_prado_somerset');

We'll see whether that improves prado's i/o behavior.