Mostly Dead is Slightly Alive: Killing Zombie Sessions

Mar 4, 2026

As a PostgreSQL expert, one of the most common “ghosts” I hunt during database audits is the zombie session. You know the one: a backend process that stays active or idle in transaction, holding onto critical locks and preventing vacuum from doing its job, all because the client disappeared without saying goodbye.

In the words of Miracle Max from The Princess Bride, there’s a big difference between mostly dead and all dead. In PostgreSQL, a connection with a 2-hour keepalive default is “mostly dead”: it’s not doing anything useful, but it’s still holding onto your locks and bloating your process list.

If you look at your postgresql.conf, you will see that tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count are all set to 0. This means “defer to the operating system.” On Linux, the default idle time before the first probe is usually 7200 seconds. Two hours. In a world of microservices and cloud networking, waiting two hours to detect a dead connection is an eternity. So why hasn’t the community changed the default?
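To see what “defer to the operating system” means on your own machine, you can read the kernel's keepalive settings directly. The sketch below assumes a Linux host, where these values are exposed under /proc/sys/net/ipv4; the function name and the returned dictionary keys are my own, chosen to mirror the PostgreSQL parameter names.

```python
from pathlib import Path

# Kernel files that back the three keepalive knobs on Linux.
# The dictionary keys are illustrative names, not PostgreSQL identifiers.
KEEPALIVE_FILES = {
    "idle": "tcp_keepalive_time",      # seconds of silence before the first probe
    "interval": "tcp_keepalive_intvl",  # seconds between probes
    "count": "tcp_keepalive_probes",    # unanswered probes before giving up
}


def read_kernel_keepalive_defaults(proc_dir="/proc/sys/net/ipv4"):
    """Return the kernel keepalive settings, or None for any file that is missing."""
    values = {}
    for name, filename in KEEPALIVE_FILES.items():
        path = Path(proc_dir) / filename
        values[name] = int(path.read_text().strip()) if path.exists() else None
    return values


if __name__ == "__main__":
    print(read_kernel_keepalive_defaults())
```

On a non-Linux system the files will not exist and every value comes back as None, which is a hint that you need to look up the equivalent setting for your platform.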

The “Polite” Kernel: A History of RFC 1122

To understand why PostgreSQL is conservative, we have to look at why Linux is conservative. The TCP/IP stack was designed for survivability. RFC 1122 (published in 1989) makes keepalives optional, requires them to default to off, and requires the default idle interval to be no less than two hours.

The kernel developers’ logic is simple: Do no harm.

The “Desert Link” Theory: If a router reboots or a cable is unplugged for 10 minutes, TCP should ideally wait. A “polite” kernel assumes the silence is temporary.
The Middlebox Nightmare: Aggressive pings can overwhelm the state tables of old firewalls or load balancers.
Bandwidth & Battery: For IoT or mobile clients, waking up the radio every 30 seconds to say “I am alive” is a death sentence for battery life.

Linux is being a diplomat. It provides the tools to be fast, but it keeps the defaults slow to ensure it never accidentally breaks a connection that might have recovered.

The Munro Advocacy: Why We Are Still Talking About This

Thomas Munro has long advocated for PostgreSQL to be more proactive in detecting dead clients. However, the debate over default values is far from settled. In fact, as recently as late February 2026, the pgsql-hackers mailing list was still debating whether to change the default for client_connection_check_interval.

The resistance from the community isn’t just about laziness; it is about technical side effects. For example, some contributors noted that aggressive checking can lead to “noisy” logs (e.g., repeatedly logging lock waits) or unnecessary system call overhead on servers with thousands of connections.

There are two ways a connection “dies,” and they require different weapons:

1. The Network Death (TCP Keepalive)

When you tune tcp_keepalives_*, you are asking the OS to send a probe.

The Limit: This only tells you if the remote Network Stack is alive.
The Fail: Probes are answered by the remote kernel, not by the application. If the client application dies or hangs while the machine hosting it stays up, a backend that is busy executing a query is not reading the socket and will not notice. PostgreSQL thinks everything is fine while your locks remain held by a ghost.
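To make “asking the OS to send a probe” concrete, here is a minimal sketch of the per-socket options involved, written in Python rather than taken from PostgreSQL's own code. It assumes Linux, where the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are available; the function name and the default values are illustrative.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Turn on TCP keepalive for one socket and override the kernel defaults.

    Assumes Linux: the three TCP_KEEP* options are not available everywhere.
    """
    # Keepalive is off unless the application asks for it.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of silence before the first probe.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between two probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unanswered probes before the kernel drops the connection.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
    s.close()
```

Note that nothing here involves the application on the other end: the probes and their replies are exchanged between the two kernels, which is exactly the limit described above.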

2. The Application Death (`client_connection_check_interval`)

Introduced in PostgreSQL 14 (see the official Release Notes), this is the real game-changer. Unlike keepalives, this parameter makes the backend “peek” at the socket at regular intervals while a query is running.

The Win: It detects if the client has closed the socket or disappeared, even if the remote OS is still humming along.
The Use Case: A user starts a massive 20-minute report, gets bored, and closes their browser. Without this, Postgres works for 20 minutes for nobody. With client_connection_check_interval = 1s, it detects the departure within about a second and aborts the query.

Why the Keepalive + Check Interval Duo is Mandatory

This is the most critical point: these two parameters are not redundant. They are complementary because they handle two radically different types of “deaths.”

1. The Application Crash (Signal Present)

If your Python script or Java application segfaults or is killed with kill -9, the client’s OS remains alive. It will properly close all sockets belonging to the crashed process by sending a FIN (or RST) segment to the PostgreSQL server.

The Result: At its next check, client_connection_check_interval sees the closed socket via the poll() system call and cancels the running query.
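The mechanism can be reproduced outside PostgreSQL. The sketch below is not the backend's code; it is a small Python illustration, assuming Linux (POLLRDHUP is Linux-specific), of how a zero-timeout poll() can tell that the peer has closed its end without reading anything from the socket. The function names are my own.

```python
import select
import socket

# POLLRDHUP is Linux-specific; this sketch assumes a Linux host.
CLOSED_EVENTS = select.POLLRDHUP | select.POLLHUP | select.POLLERR


def peer_has_closed(sock, timeout_ms=0):
    """Report whether the peer closed the connection, without reading any data."""
    poller = select.poll()
    poller.register(sock, select.POLLRDHUP)
    return any(events & CLOSED_EVENTS for _fd, events in poller.poll(timeout_ms))


def demo():
    """Open a loopback connection, then close the client end as the OS would for a dead process."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    backend_side, _addr = listener.accept()
    try:
        before = peer_has_closed(backend_side)
        client.close()  # the client's kernel sends FIN on its behalf
        # Allow up to one second for the FIN to arrive on the loopback interface.
        after = peer_has_closed(backend_side, timeout_ms=1000)
        return before, after
    finally:
        backend_side.close()
        listener.close()


if __name__ == "__main__":
    print(demo())
```

The important detail is that the check only works because the client's kernel was still alive to send the FIN. If the whole machine disappears, this poll() stays silent.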

2. The OS Crash or Network Outage (Total Silence)

If the client host suffers a power failure or a kernel panic, or someone pulls the network cable, no signal is sent. To the PostgreSQL server, the socket appears “open” but silent.

The Trap: The poll() call used by the connection check interval will see nothing. It will report “all clear” as long as the server’s TCP stack hasn’t declared the connection dead.
The Savior: This is where tcp_keepalives_* come in. The server’s kernel sends probes. If no ACK is received, the kernel eventually “breaks” the connection. Only then will the connection check interval see the error and release the Postgres backend.

In Summary:

Keepalive alone: You risk wasting CPU/IO resources on useless queries if only the application has crashed.
Check Interval alone: You keep zombie sessions for hours if the network or remote OS has failed.

My Expert Setup for PgBouncer & Postgres

If you are running a high-traffic production site (think OLTP), you cannot rely on the 2-hour “diplomacy” of the Linux kernel. Here is my recommended configuration¹:

In `postgresql.conf`:

tcp_keepalives_idle = 60: Start probing after 1 minute of silence.
tcp_keepalives_interval = 10: Probe every 10 seconds after that.
tcp_keepalives_count = 6: Drop the connection after 6 unanswered probes, i.e. 1 minute of failed probes and about 2 minutes of silence in total.
client_connection_check_interval = 2s: Catch application-level crashes during long queries.
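The arithmetic behind these numbers is worth spelling out. The sketch below compares the approximate worst-case detection time of the settings above with the usual Linux defaults. The 75-second interval and 9 probes are the typical Linux values for the two parameters not quoted in this article, so check them on your own system; the function name is illustrative.

```python
def worst_case_detection_seconds(idle, interval, count):
    """Approximate time before the kernel declares a silent peer dead.

    The kernel waits `idle` seconds, then sends `count` probes spaced
    `interval` seconds apart before giving up.
    """
    return idle + interval * count


# Typical Linux defaults (assumption: verify on your own system).
linux_defaults = worst_case_detection_seconds(idle=7200, interval=75, count=9)

# The settings recommended above.
tuned = worst_case_detection_seconds(idle=60, interval=10, count=6)

if __name__ == "__main__":
    print(f"Linux defaults: {linux_defaults} s")
    print(f"Tuned:          {tuned} s")
```

With the defaults, a dead peer is noticed after roughly 7875 seconds, more than two hours and eleven minutes. With the tuned values it takes about 120 seconds.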

In `pgbouncer.ini`:

If you use a pooler (and you probably should), it must be part of the conversation. See the PgBouncer documentation for details:

tcp_keepalive = 1: Tell PgBouncer to use the kernel keepalives on its own sockets.
server_idle_timeout = 600: Close server connections that have sat idle in the pool for 10 minutes, so stale ones do not linger.

Remember, these settings are starting points. Each application is unique, with its own context. Don’t apply them without understanding them.

Conclusion

The PostgreSQL community isn’t being lazy by keeping these values at 0; they are being portable. They expect us, the DBAs, to know our network topology.

Linux’s politeness is your database’s weakness. Don’t wait for a 2-hour timeout to clean up your locks. Be proactive, tune your keepalives, and for heaven’s sake, enable client_connection_check_interval.

¹ Note on Operating Systems: These recommendations specifically target Linux-based environments. While the concepts of TCP keepalives are universal, the specific behavior of the kernel stack and the default values (like the 7200s timeout) vary significantly on Windows or BSD. If you are running on a different OS, the “Why” remains the same, but you should verify your system’s specific network constants and granularity.
