Page MenuHomeSoftware Heritage

prado -> somerset replication failed since the 2018-09-11 crash
Closed, MigratedEdits Locked

Description

The replication from prado to somerset on the primary database failed with the following messages:

2018-09-11 12:35:31 UTC [31397]: [1-1] user=postgres,db=softwareheritage LOG:  starting logical decoding for slot "pgl_softwareheritage_prado_somerset"
2018-09-11 12:35:31 UTC [31397]: [2-1] user=postgres,db=softwareheritage DETAIL:  streaming transactions committing after 15DB5/898105E0, reading WAL from 15DB5/1444F1E0
2018-09-11 12:35:31 UTC [31397]: [3-1] user=postgres,db=softwareheritage ERROR:  requested WAL segment 0000000100015DB500000014 has already been removed
2018-09-11 12:35:31 UTC [31397]: [4-1] user=postgres,db=softwareheritage LOG:  could not receive data from client: Connection reset by peer

The missing WALs have been restored from the wal archive on banco using the following command:

postgres@prado:/srv/softwareheritage/postgres/10/main/pg_wal$ for i in `seq 20 94`; do segment=0000000100015DB5000000`printf %02X $i`; if [ -f $segment ]; then echo $segment present; else ssh barman@banco barman get-wal swh-10 $segment > $segment; fi; done

Replication has restarted.

Event Timeline

olasd triaged this task as High priority.
olasd created this task.

After being prompted by a PostgreSQL developer, I digged a bit further in the logs:

Postgres startup logs :

2018-09-11 09:17:21 UTC [429]: [1-1] LOG:  database system was interrupted; last known up at 2018-09-11 01:05:16 UTC
[... monitoring things trying to connect and spamming with "FATAL:  the database system is starting up" ...]
2018-09-11 09:17:48 UTC [429]: [2-1] LOG:  database system was not properly shut down; automatic recovery in progress
2018-09-11 09:17:48 UTC [429]: [3-1] LOG:  redo starts at 15DB5/961A3480
2018-09-11 09:17:48 UTC [429]: [4-1] LOG:  invalid record length at 15DB5/9763DAC0: wanted 24, got 0
2018-09-11 09:17:48 UTC [429]: [5-1] LOG:  redo done at 15DB5/9763DA50
2018-09-11 09:17:48 UTC [429]: [6-1] LOG:  last completed transaction was at log time 2018-09-11 06:29:10.207916+00
2018-09-11 09:17:50 UTC [251]: [7-1] LOG:  database system is ready to accept connections
2018-09-11 09:17:50 UTC [824]: [1-1] LOG:  starting pglogical supervisor
2018-09-11 09:17:50 UTC [826]: [1-1] user=[unknown],db=postgres LOG:  manager worker [826] at slot 0 generation 1 detaching cleanly
2018-09-11 09:17:50 UTC [827]: [1-1] user=[unknown],db=softwareheritage LOG:  starting pglogical database manager for database softwareheritage
2018-09-11 09:17:50 UTC [828]: [1-1] user=[unknown],db=[unknown] LOG:  incomplete startup packet
2018-09-11 09:17:51 UTC [829]: [1-1] user=[unknown],db=template1 LOG:  manager worker [829] at slot 1 generation 1 detaching cleanly
2018-09-11 09:20:24 UTC [1318]: [1-1] user=postgres,db=softwareheritage LOG:  starting logical decoding for slot "pgl_softwareheritage_prado_somerset"
2018-09-11 09:20:24 UTC [1318]: [2-1] user=postgres,db=softwareheritage DETAIL:  streaming transactions committing after 15DB5/898105E0, reading WAL from 15DB5/1444F1E0
2018-09-11 09:20:24 UTC [1318]: [3-1] user=postgres,db=softwareheritage ERROR:  requested WAL segment 0000000100015DB500000014 has already been removed
2018-09-11 09:20:24 UTC [1318]: [4-1] user=postgres,db=softwareheritage LOG:  could not receive data from client: Connection reset by peer
[ repeat last four lines until WALs from 0000000100015DB500000014 to 0000000100015DB500000095 got restored from the archive ]

Fsck logs:

olasd@louvre:~$ journalctl -b -t systemd-fsck | grep ssd-prado--postgres
Sep 11 08:47:11 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: recovering journal
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733412 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310734618 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310734823 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733466 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310731956 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310734454 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310734203 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733355 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733805 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733971 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733506 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733288 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733323 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310732627 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310733195 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: Clearing orphaned inode 310734992 (uid=109, gid=115, mode=0100600, size=16777216)
Sep 11 08:47:19 louvre systemd-fsck[5358]: /dev/mapper/ssd-prado--postgres-part1: clean, 44235/402653184 files, 2498839839/3221224972 blocks

Looks like fsck just removed the WAL files and the PostgreSQL replication mechanism is not at fault.