Resync issues with Postgres servers after long backups

If you run a large PostgreSQL database across multiple servers, your backup process may cause one or more secondary servers to fall out of step with the primary, particularly if the backup runs for 30 minutes or more.

Normally you can simply resync the servers, but if the backup runs long enough, the primary may delete write-ahead log (WAL) records before the secondary has had time to finish the backup and then catch up with the primary.

This could cause your secondary server to stall because it cannot catch up quickly enough. A server in this state will typically throw an error like this one when you try to restart it: Job for postgresql.service failed because the control process exited with error code.

In this case, the short-term fix is to take all the load off the primary and allow the rsync process to run uninterrupted until the servers are fully caught up.
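
One way to tell whether a secondary has caught up is to compare WAL positions (LSNs). The sketch below is a minimal illustration, not a monitoring tool. It assumes you have already read one LSN from the primary (for example with pg_current_wal_lsn() on PostgreSQL 10 or later) and one from the secondary (pg_last_wal_replay_lsn()), and it converts the difference into bytes. The function names lsn_to_int and lag_bytes and the sample values are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits, then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, secondary_lsn: str) -> int:
    """Return how many bytes of WAL the secondary is behind (0 if caught up)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(secondary_lsn))


if __name__ == "__main__":
    # Hypothetical sample values, not taken from a real server.
    primary = "16/B374D848"
    secondary = "16/B3000000"
    behind = lag_bytes(primary, secondary)
    print(f"Secondary is {behind / (1024 * 1024):.1f} MB behind the primary")
```

If the reported lag keeps shrinking while the load is off the primary, the secondary is catching up. If it stays flat or grows, the secondary is probably stalled.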

As a more permanent fix, consider increasing the wal_keep_segments setting (replaced by wal_keep_size in PostgreSQL 13 and later) to give the secondary server(s) more time to catch up after the backup completes. You could also consider adjusting max_wal_size, although this should be handled with particular care.
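
As a rough way to pick a starting value, you can estimate how much WAL the primary produces while the secondary is busy with the backup and the catch-up afterwards. The sketch below assumes the default 16 MB WAL segment size and uses made-up figures for the WAL rate and durations. The function name segments_to_keep and the safety factor are illustrative assumptions, so measure your own workload before changing anything.

```python
import math

WAL_SEGMENT_MB = 16  # default PostgreSQL WAL segment size; yours may differ


def segments_to_keep(wal_mb_per_minute: float,
                     backup_minutes: float,
                     catchup_minutes: float,
                     safety_factor: float = 1.5) -> int:
    """Estimate how many WAL segments the primary should retain.

    The estimate covers the WAL written during the backup plus the time
    the secondary needs to catch up, padded by a safety factor.
    """
    total_mb = wal_mb_per_minute * (backup_minutes + catchup_minutes) * safety_factor
    return math.ceil(total_mb / WAL_SEGMENT_MB)


if __name__ == "__main__":
    # Hypothetical workload: 20 MB of WAL per minute, a 30 minute backup,
    # and 10 minutes of catch-up afterwards.
    segments = segments_to_keep(20, 30, 10)
    print(f"Suggested starting point: {segments} segments "
          f"(about {segments * WAL_SEGMENT_MB} MB of disk on the primary)")
```

Keep in mind that every retained segment uses disk space on the primary, so check that the WAL directory has room for the value you choose.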

Please be sure to read the official documentation about these settings before adjusting them.