The replication lag of the azure database replica is currently growing.
SELECT COALESCE(pid, 0) AS pid, COALESCE(application_name, ' ')::text AS application_name, COALESCE(usename, ' ')::text AS usename, COALESCE(client_addr::text, 'local')::text AS client_addr, COALESCE(client_port::text, ' ') AS client_port, COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'), 0) AS current_xlog_lsn_bytes, COALESCE(pg_wal_lsn_diff(sent_lsn, '0/0'), 0) AS sent_location_bytes, COALESCE(pg_wal_lsn_diff(flush_lsn, '0/0'), 0) AS flush_location_bytes, COALESCE(pg_wal_lsn_diff(replay_lsn, '0/0'), 0) AS replay_location_bytes, COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn), 0) AS send_lag_bytes, COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn), 0) AS flush_lag_bytes, COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn), 0) AS replay_lag_bytes FROM pg_stat_replication FULL JOIN (VALUES(0)) filler(i) ON TRUE;
the send_lag_bytes value is increasing.
Looking at the azure dashboard, the disk writes seem to be sporadically peaking at 250MB/s (which is the maximum sustained write speed for one single disk). Those correspond to around 400 OPS (which is way way below the limit set for the disk).
As PostgreSQL is careful to flush writes to persistent disk properly I've enabled Read/Write caching on the disks to see if this improves / smoothes out the writes.
We could also be looking at splitting the WALs to another striped partition so writes even out to two separate stripesets.