Stop Punishing Your Postgres for a Crash That Won't Happen

Β· 714 words Β· 4 minute read

There is a misconception I keep running into, and it causes real harm in production. People are afraid to increase checkpoint_timeout. They think a longer timeout means a longer recovery after a crash. So they set it to 5 minutes. Some set it to 1 minute. And then they wonder why their Postgres is struggling.

Let me dismantle this fear, argument by argument.

First: serious people have a replica πŸ”—

If you care about availability, you have at least one replica you can fail over to. A crash on your primary means you promote a replica. Recovery time from a crash becomes a non-issue when you have a standby ready.

If you don’t have a replica, checkpoint_timeout is the least of your problems.

Second: the math doesn’t work the way you think πŸ”—

Here is what Postgres actually does during recovery after a crash. It finds the last checkpoint, then replays all the WAL records that happened after it. That’s it. The time it takes depends on how many WAL records need to be replayed, not on what checkpoint_timeout is set to.

Now think about when checkpoint_timeout actually triggers a checkpoint. It fires when time has elapsed and no checkpoint was triggered by max_wal_size first. In other words: if your instance is hitting checkpoint_timeout, it means you did not generate enough WALs to fill max_wal_size. Your write activity is low.

And here is the key point. Postgres spreads checkpoint writes over 90% of the time between two checkpoints (controlled by checkpoint_completion_target, which defaults to 0.9). So if your instance writes little enough that checkpoint_timeout fires every 5 minutes, recovery after a crash will replay those few WALs in a fraction of that time. Not 5 minutes. Much less.

The fear is based on a misunderstanding of what drives recovery duration.

Third: a short timeout actively makes things worse πŸ”—

Frequent checkpoints are not free. Every checkpoint triggers full-page writes: the first modification to a page after a checkpoint writes the entire page into WAL, not just the changed bytes. This is the full_page_writes mechanism, and it is there to protect you from partial page writes during a crash. You should not disable it.

The consequence is straightforward. More checkpoints mean more full-page writes. More full-page writes mean more WAL. More WAL can mean more checkpoints. You can enter what I call the vicious circle of huge writes, which I described in detail here.

Setting checkpoint_timeout to 1 minute does not protect you. It adds unnecessary write pressure to your system.

How to know what your checkpoints are actually doing πŸ”—

Before tuning anything, you need to see what is happening. Two settings are required:

log_checkpoints = on
log_destination = 'csvlog'
logging_collector = on

With log_checkpoints enabled, Postgres logs a line when each checkpoint starts and a line when it completes, including duration and how much WAL was written. That is your raw material.

From there, you can use any log analysis tool you like. If you use pglog, my extension that lets you query Postgres logs with SQL, the checkpoints_stats view gives you directly:

  • average and median frequency between checkpoints
  • average and median checkpoint duration

If your median frequency is close to your checkpoint_timeout value, your instance is not writing enough WALs to trigger checkpoints by max_wal_size. That is your signal that checkpoint_timeout is doing the work, and that you have room to increase it safely.

If your median duration is close to your checkpoint_timeout value, your checkpoints are barely finishing before the next one starts. That is a different problem, and increasing checkpoint_timeout might actually help there too, by giving each checkpoint more room to breathe.

What you actually want πŸ”—

You want checkpoints to be triggered by max_wal_size, not by checkpoint_timeout. That means your instance is writing at a healthy pace and checkpoints happen when they are needed.

checkpoint_timeout should be a safety net, not a knob you turn down out of fear. A value between 15 and 30 minutes is reasonable for most workloads. If your checkpoints are always triggered by timeout, look at max_wal_size instead and ask yourself whether it is sized correctly for your write volume.

The Postgres documentation covers both parameters in detail.

Tune checkpoints based on your write workload, not based on a recovery scenario that your replica already handles for you.