There is a misconception I keep running into, and it causes real harm in
production. People are afraid to increase checkpoint_timeout. They think a
longer timeout means a longer recovery after a crash. So they set it to 5
minutes. Some set it to 1 minute. And then they wonder why their Postgres is
struggling.
Let me dismantle this fear, argument by argument.
First: serious people have a replica π
If you care about availability, you have at least one replica you can fail over to. A crash on your primary means you promote a replica. Recovery time from a crash becomes a non-issue when you have a standby ready.
If you don’t have a replica, checkpoint_timeout is the least of your problems.
Second: the math doesn’t work the way you think π
Here is what Postgres actually does during recovery after a crash. It finds the
last checkpoint, then replays all the WAL records that happened after it. That’s
it. The time it takes depends on how many WAL records need to be replayed, not
on what checkpoint_timeout is set to.
Now think about when checkpoint_timeout actually triggers a checkpoint. It
fires when time has elapsed and no checkpoint was triggered by max_wal_size
first. In other words: if your instance is hitting checkpoint_timeout, it
means you did not generate enough WALs to fill max_wal_size. Your write
activity is low.
And here is the key point. Postgres spreads checkpoint writes over 90% of the
time between two checkpoints (controlled by checkpoint_completion_target,
which defaults to 0.9). So if your instance writes little enough that
checkpoint_timeout fires every 5 minutes, recovery after a crash will replay
those few WALs in a fraction of that time. Not 5 minutes. Much less.
The fear is based on a misunderstanding of what drives recovery duration.
Third: a short timeout actively makes things worse π
Frequent checkpoints are not free. Every checkpoint triggers full-page writes:
the first modification to a page after a checkpoint writes the entire page into
WAL, not just the changed bytes. This is the full_page_writes mechanism, and
it is there to protect you from partial page writes during a crash. You should
not disable it.
The consequence is straightforward. More checkpoints mean more full-page writes. More full-page writes mean more WAL. More WAL can mean more checkpoints. You can enter what I call the vicious circle of huge writes, which I described in detail here.
Setting checkpoint_timeout to 1 minute does not protect you. It adds
unnecessary write pressure to your system.
How to know what your checkpoints are actually doing π
Before tuning anything, you need to see what is happening. Two settings are required:
log_checkpoints = on
log_destination = 'csvlog'
logging_collector = on
With log_checkpoints enabled, Postgres logs a line when each checkpoint starts
and a line when it completes, including duration and how much WAL was written.
That is your raw material.
From there, you can use any log analysis tool you like. If you use
pglog, my extension that lets you query
Postgres logs with SQL, the checkpoints_stats view gives you directly:
- average and median frequency between checkpoints
- average and median checkpoint duration
If your median frequency is close to your checkpoint_timeout value, your
instance is not writing enough WALs to trigger checkpoints by max_wal_size.
That is your signal that checkpoint_timeout is doing the work, and that you
have room to increase it safely.
If your median duration is close to your checkpoint_timeout value, your
checkpoints are barely finishing before the next one starts. That is a
different problem, and increasing checkpoint_timeout might actually help
there too, by giving each checkpoint more room to breathe.
What you actually want π
You want checkpoints to be triggered by max_wal_size, not by
checkpoint_timeout. That means your instance is writing at a healthy pace and
checkpoints happen when they are needed.
checkpoint_timeout should be a safety net, not a knob you turn down out of
fear. A value between 15 and 30 minutes is reasonable for most workloads. If
your checkpoints are always triggered by timeout, look at max_wal_size instead
and ask yourself whether it is sized correctly for your write volume.
The Postgres documentation covers both parameters in detail.
Tune checkpoints based on your write workload, not based on a recovery scenario that your replica already handles for you.