Page MenuHomeSoftware Heritage

PostgreSQL replication issues between prado and somerset
Closed, ResolvedPublic

Description

The PostgreSQL replication of the graph database between prado and somerset is currently hanging. We need to make a plan to fix it, and to implement that plan.

Current status of the new replica's recovery:

Data copy:

  • content
  • directory
  • directory_entry_dir
  • directory_entry_file
  • directory_entry_rev
  • metadata_provider
  • object_counts
  • origin
  • origin_metadata
  • origin_visit
  • person
  • release
  • revision
  • revision_history
  • skipped_content
  • snapshot
  • snapshot_branch
  • snapshot_branches
  • tool

Indexes created on replica:

  • content
  • directory
  • directory_entry_dir
  • directory_entry_file
  • directory_entry_rev
  • metadata_provider
  • object_counts
  • origin
  • origin_metadata
  • origin_visit
  • person
  • release
  • revision
  • revision_history
  • skipped_content
  • snapshot
  • snapshot_branch
  • snapshot_branches
  • tool

Event Timeline

olasd created this task.Feb 19 2019, 2:10 PM
olasd triaged this task as High priority.
olasd added a subscriber: ftigeot.Feb 19 2019, 2:26 PM
olasd renamed this task from PostgreSQL replication issues between prado and beaubourg to PostgreSQL replication issues between prado and somerset.Feb 19 2019, 3:22 PM
olasd updated the task description. (Show Details)
olasd added a comment.Feb 19 2019, 3:29 PM

Logs on primary:

2019-02-19 14:27:44 UTC [15973]: [1-1] user=postgres,db=softwareheritage LOG:  starting logical decoding for slot "pgl_softwareheritage_prado_somerset"
2019-02-19 14:27:44 UTC [15973]: [2-1] user=postgres,db=softwareheritage DETAIL:  streaming transactions committing after 18607/9189A578, reading WAL from 18607/9189A578
2019-02-19 14:27:44 UTC [15973]: [3-1] user=postgres,db=softwareheritage ERROR:  record with incorrect prev-link 5403A/2E2F1829 at 18607/9189A578 
2019-02-19 14:27:44 UTC [15973]: [4-1] user=postgres,db=softwareheritage LOG:  could not receive data from client: Connection reset by peer

Logs on replica:

2019-02-19 14:27:44 UTC [25074]: [1-1] user=[unknown],db=softwareheritage LOG:  starting apply for subscription somerset
2019-02-19 14:27:44 UTC [25074]: [2-1] user=[unknown],db=softwareheritage ERROR:  data stream ended
2019-02-19 14:27:44 UTC [25074]: [3-1] user=[unknown],db=softwareheritage LOG:  apply worker [25074] at slot 1 generation 6593 exiting with error
olasd added a subscriber: grouss.Feb 19 2019, 6:22 PM

After reading some mailing list posts discussing the error message, and discussion with @ftigeot:

  • at that point the replication stream probably can't be recovered
  • we may need to re-do replication from scratch

The problem with re-doing replication from scratch is having to send all data to azure again, which is going to take a long time.

action 1: The first step will be migrating readers from somerset back to prado to limit disruption.

Before we do replication again from scratch, there's a (slim) chance we can recover the replicas by doing the following:

  1. restart the database on somerset, hope it recovers by itself
  2. if it doesn't, stop the replication stream from prado to somerset
  3. let prado recover and clean up its WALs
  4. restart a replication stream from prado to somerset, without resetting the tables
    • this is possible with pglogical, not with the builtin logical replication mechanism.
    • this means that the replica has a hole from the date the replication failed to the date we're doing the streaming restart
  5. fill the gaps in the tables
    • configure a foreign data wrapper to access the live tables on prado
    • insert the missing lines on somerset

The main unknowns on this procedure are step 3 (whether prado will recover from its I/O issues), 4 (whether the replication will restart properly) and 5 (whether the gap-filling queries will terminate in a reasonable amount of time).

We're currently blocked on step 1 as @grouss is doing heavy work on somerset for a tight deadline (ETA March 1st to know when we'll be able to restart).

If step 3 fails and prado doesn't recover from its I/O issues, one possible solution would probably be to migrate the primary database to somerset, as this is a more recent machine with more consistent performance. However this creates a new bottleneck on the infrastructure (primary db + primary indexer db + tons of VMs on the same machine).

If step 4 fails, or if step 5 takes an unreasonable amount of time, we need to restart replication from scratch. This extended downtime of the database infrastructure would be an opportunity to:

  • upgrade to PostgreSQL 11
  • migrate to the built-in logical replication system instead of pglogical (while slightly more flexible, pglogical 2 is now a 2ndQuadrant-customer-only release https://github.com/2ndQuadrant/pglogical, and IMO it makes sense to move away from it).

Considering the time constraints incurred by Guillaume's work, it might make sense to just scrap the recovery plan altogether and do the PG 11 upgrade right now, recreating a replica once he's done with his current work. We'll let this idea mature for some hours before making a final decision :)

In parallel to all of this, we're making plans for migrating the primary postgresql database(s) to a new server that would be dedicated to them.

olasd added a subscriber: zack.Feb 20 2019, 2:32 PM

After some more stewing and discussion with @zack, we'll be going for the "upgrade to pg 11 and restart replication from scratch" route;

I've dropped the replication slot between prado and somerset with the following commands:

  • on somerset: select pglogical.drop_subscription('somerset');
  • on prado: select pg_drop_replication_slot('pgl_softwareheritage_prado_somerset');

We'll see whether that improves prado's i/o behavior.

olasd claimed this task.Mar 22 2019, 2:55 PM
olasd changed the task status from Open to Work in Progress.

The replicated cluster is now clear to be taken down for a rebuild.

The web frontend has been moved to the primary database.

@grouss may still need access to the occurrence/occurrence_history tables, which I will preserve separately by dumping/recreating them to compact them.

I will then initialize a new postgresql 11 cluster and start the logical replication process between prado and this new cluster

olasd updated the task description. (Show Details)Mar 22 2019, 4:52 PM
olasd updated the task description. (Show Details)Mar 22 2019, 5:31 PM
olasd updated the task description. (Show Details)Mar 22 2019, 5:51 PM
olasd updated the task description. (Show Details)Mar 22 2019, 6:00 PM
olasd updated the task description. (Show Details)Mar 22 2019, 6:32 PM
olasd updated the task description. (Show Details)Mar 22 2019, 11:26 PM
olasd updated the task description. (Show Details)Mar 23 2019, 10:02 AM
olasd updated the task description. (Show Details)Mar 23 2019, 2:29 PM
olasd updated the task description. (Show Details)Mar 25 2019, 10:32 AM
olasd updated the task description. (Show Details)Mar 25 2019, 6:07 PM
olasd closed this task as Resolved.

The replication process from prado to somerset is now complete, and the archive frontend has been switched over to this database.