
Azure database replica doesn't sustain writes
Closed, Migrated · Edits Locked

Description

The replication lag of the Azure database replica is currently growing.

SELECT
    COALESCE(pid, 0) AS pid,
    COALESCE(application_name, ' ')::text AS application_name,
    COALESCE(usename, ' ')::text AS usename,
    COALESCE(client_addr::text, 'local')::text AS client_addr,
    COALESCE(client_port::text, ' ') AS client_port,
    COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'), 0) AS current_xlog_lsn_bytes,
    COALESCE(pg_wal_lsn_diff(sent_lsn, '0/0'), 0) AS sent_location_bytes,
    COALESCE(pg_wal_lsn_diff(flush_lsn, '0/0'), 0) AS flush_location_bytes,
    COALESCE(pg_wal_lsn_diff(replay_lsn, '0/0'), 0) AS replay_location_bytes,
    COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn), 0)   AS send_lag_bytes,
    COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn), 0)  AS flush_lag_bytes,
    COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn), 0) AS replay_lag_bytes
FROM pg_stat_replication FULL JOIN (VALUES (0)) filler(i) ON TRUE;

In the output of this query, the send_lag_bytes value keeps increasing.

Looking at the Azure dashboard, the disk writes seem to be sporadically peaking at 250 MB/s (which is the maximum sustained write speed for a single disk). Those peaks correspond to around 400 operations per second, well below the IOPS limit set for the disk.

As PostgreSQL is careful to properly flush its writes to persistent storage, I've enabled Read/Write caching on the disks to see if this improves or smooths out the writes.
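For reference, the caching change can be made on the running VM with the Azure CLI; a minimal sketch, assuming the data disks sit on LUNs 0 and 1 (the LUN numbers are an assumption):

# Switch host caching to Read/Write on the existing data disks
az vm update -g euwest-db -n dbreplica0 --disk-caching 0=ReadWrite 1=ReadWrite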

We could also look at splitting the WALs onto another striped partition, so that writes are spread over two separate stripe sets.

Event Timeline

As a baseline, the network traffic from prado to somerset (local replica) is 2-3 Mbps sustained.

Using iperf between prado and dbreplica0.euwest.azure shows the link sustains 255 Mbps, which rules out a network issue.
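For completeness, a typical way to run that measurement (the exact flags are an assumption):

# On dbreplica0.euwest.azure: run the iperf server
iperf -s
# On prado: push traffic for 60 seconds and report sustained throughput
iperf -c dbreplica0.euwest.azure -t 60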

olasd changed the task status from Open to Work in Progress. Mar 14 2018, 10:10 AM
olasd claimed this task.

This also causes issues with the main database: the replica lag prevents the main database's catalog tables from being properly vacuumed, which in turn makes our whole system slow as it heavily depends on temporary tables.
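The effect can be checked on the primary: a lagging logical replication slot pins its catalog_xmin, which keeps autovacuum from removing dead catalog rows. A quick look at the standard view:

# Run on the primary; xmin/catalog_xmin stuck far in the past means vacuum is held back
psql -c "SELECT slot_name, active, xmin, catalog_xmin, restart_lsn FROM pg_replication_slots;"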

I have implemented the separate WAL partition by adding four 128GB uncached data disks (smaller disks have lower throughput/IOPS limits) and creating a striped LVM volume on top of them. I'm also looking into monitoring so we catch this kind of replication issue earlier.
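For reference, a rough sketch of that setup, in the same spirit as the commands shown further down; the device names, volume group name, paths and exact az parameters are assumptions, not the commands actually used:

# Attach four 128 GB uncached premium disks (names assumed)
for i in `seq 0 3`; do az vm disk attach --disk dbreplica0_pgwal$i -g euwest-db --vm-name dbreplica0 --new --size-gb 128 --sku Premium_LRS; done

# Build a 4-way striped logical volume on top of them
pvcreate /dev/sd[ghij]
vgcreate pgwal /dev/sd[ghij]
lvcreate -l100%FREE -i4 -n lvol0 pgwal
mkfs.ext4 /dev/pgwal/lvol0
mount /dev/pgwal/lvol0 /srv/postgresql/wal

# With the cluster stopped, move pg_wal onto the new volume and symlink it back
systemctl stop postgresql
mv /srv/postgresql/10/main/pg_wal /srv/postgresql/wal/pg_wal
ln -s /srv/postgresql/wal/pg_wal /srv/postgresql/10/main/pg_wal
systemctl start postgresql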

I've added configuration for the Prometheus SQL exporter to our PostgreSQL servers. I've also adapted the Grafana Postgres Server Overview dashboard from elephant-shed (https://github.com/credativ/elephant-shed), available at https://grafana.softwareheritage.org/d/PEKz-Ygiz/postgresql-server-overview.

Yes, the default view looks like a poorly decorated Christmas tree, but when you drill down and enable/disable the lines you care about, it makes it easy to correlate the relevant data.

The database seems to struggle when checkpointing (that is, when writing WALs back to the actual tables), which is very surprising as a run of bonnie++ on the same disk seems to manage around five times the IOPS and around twenty times the byte throughput.
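To quantify the checkpoint pressure, the accumulated checkpoint activity can be read from the standard statistics collector; a sketch (the bonnie++ parameters below are assumptions, not the exact run used):

# Checkpoint counts and write/sync times since the last stats reset
psql -c "SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time, buffers_checkpoint, buffers_backend FROM pg_stat_bgwriter;"

# Raw-disk comparison on the same volume (directory, size and user assumed)
bonnie++ -d /srv/postgresql -s 64g -u postgres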

I think there may be a mismatch between the stripe size chosen for the data disks and the block size PostgreSQL uses for its writes. I'll keep investigating.
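A quick way to compare the two; PostgreSQL's block size is fixed at build time (8 kB on the Debian packages), while the stripe size can be read back from LVM:

psql -tc "SHOW block_size;"            # typically 8192 bytes
lvs -o lv_name,stripes,stripe_size     # stripe layout of the data volume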

To test the striping hypothesis, I have attached some new drives to the virtual machine, with the goal of moving the data to larger stripes:

for i in `seq 0 3`; do az vm disk attach --disk dbreplica0_pgdata$i -g euwest-db --vm-name dbreplica0 --caching ReadOnly --new --size-gb 1536 --sku Premium_LRS; done
pvcreate /dev/sd[klmn]
vgcreate pgdata1 /dev/sd[klmn]

# -i4: stripe across 4 PVs
# -I4096: use a 4 MiB stripe size (the value is in KiB)
lvcreate -l100%FREE -i4 -I4096 -n lvol0 pgdata1

# -b 4096: 4096-byte filesystem blocks
# -E stride=1024: one stripe unit = 1024 blocks (4 MiB stripe / 4 KiB block)
# -E stripe-width=4096: one full stripe = 4096 blocks (= 4 disks * 1024 blocks per stripe unit), so data and metadata are spread evenly across the stripes
mkfs.ext4 -b 4096 -E stride=1024 -E stripe-width=4096 /dev/pgdata1/lvol0

The postgres data is being copied now.
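A sketch of how such a copy can be done (paths, service name and rsync flags are assumptions):

# Stop the cluster, copy the data directory onto the new striped volume
systemctl stop postgresql
mount /dev/pgdata1/lvol0 /mnt/pgdata1
rsync -aHAX /srv/postgresql/ /mnt/pgdata1/    # preserve hard links, ACLs and xattrs
# swap the mount points in /etc/fstab so the new LV ends up at /srv/postgresql
systemctl start postgresql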

The change seems to have helped a little, but not a lot: we're still writing faster on the primary than the replica can keep up with.

Running pgbench on the replica shows that we can sustain around 1k raw transactions per second. During those tests, the pglogical lag pattern didn't change at all, which suggests to me that pglogical really is the culprit here.
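For reference, a pgbench run of that shape (scale factor, client and thread counts are assumptions):

createdb bench
pgbench -i -s 1000 bench            # initialise the standard test tables
pgbench -c 16 -j 4 -T 300 bench     # 16 clients, 4 threads, 5-minute run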

I've finally made the last change that brings the Azure replica exactly in line with the one on our own infrastructure: using the postgres user instead of the dedicated replica user. No change at all.

I'm now completely out of ideas about what to test: the server works fine, it performs correctly under a PostgreSQL benchmark, and the other replica has no such issue...

A last-ditch attempt at turning off fsync yielded no change in performance at all.
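For the record, fsync is a reload-time setting, so this kind of test doesn't need a restart; a sketch (unsafe outside of a throwaway test, since a crash with fsync off can corrupt the cluster):

psql -c "ALTER SYSTEM SET fsync = off;"
psql -c "SELECT pg_reload_conf();"
# ... run the workload, then revert:
psql -c "ALTER SYSTEM RESET fsync;"
psql -c "SELECT pg_reload_conf();"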

olasd reassigned this task from olasd to ftigeot.
olasd added a subscriber: ftigeot.

This has been solved by @ftigeot setting up a physical replica between somerset and dbreplica0.
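For the record, a physical (streaming) replica of that kind is typically bootstrapped roughly as follows on PostgreSQL 10; the paths, replication user and service name below are assumptions, not the actual procedure used:

# On dbreplica0: replace the existing cluster with a base backup taken from somerset
systemctl stop postgresql
rm -rf /srv/postgresql/10/main
pg_basebackup -h somerset -U replication -D /srv/postgresql/10/main -X stream -R -P
systemctl start postgresql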