My friend Henrietta Dombrovskaya pinged me on Telegram. Her production cluster
had just been killed by the OOM killer after eating 2 TB of RAM. work_mem was
set to 2 MB.
Something didn’t add up.
Hetty, like me, likes playing with monster hardware. 2 TB of RAM is not unusual in her world. But losing the whole cluster to a single query during peak operations is a very different kind of problem from a 3am outage. When the OOM killer strikes at the worst possible moment, you need answers fast.
One important detail: the memory log I’ll show below is not from the production incident. Hetty reproduced the behavior on a separate server to investigate. She stopped the query before the OOM killer struck that time. The production cluster was not so lucky.
I want to point this out right away: this is the kind of problem you solve faster with a good network than with a good search engine. Hetty is a brilliant Postgres expert. We puzzled through this together. I’m writing it up because you’ll run into it too, and because the behavior of Postgres memory management is genuinely surprising.
The tool that saved the day
Before we dig into the “why”, let me introduce you to a function I didn’t know
existed until that conversation: pg_log_backend_memory_contexts.
Pass it a PID. Postgres will dump the full memory context tree of that backend into the logs. Every allocation. Every context. Sizes and chunk counts included.
select pg_log_backend_memory_contexts(299392);
That’s it. Then check your logs.
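If you don't already know which backend to inspect, pg_stat_activity will give you the PID. A minimal sketch (the filter and the PID value are just examples):

```sql
-- Find the PID of the suspect backend...
select pid, state, query
from pg_stat_activity
where state = 'active'
order by backend_start;

-- ...then dump its memory context tree into the logs (PID is an example):
select pg_log_backend_memory_contexts(299392);
```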
I’m a bit embarrassed I didn’t know about this one. Twenty years of Postgres and I learn something new every week. That’s what I love about this database.
(The function was introduced in Postgres 14. Here’s Luka’s blog post about it if you want to know more.)
What we saw
Here are the relevant lines from the log output:
level: 4; ExecutorState: 234954896 total in 40 blocks; 14989952 free (524059 chunks); 219964944 used
level: 5; HashTableContext: 339738680 total in 47 blocks; 4716320 free (84 chunks); 335022360 used
That’s ~235 MB for ExecutorState and ~340 MB for HashTableContext. With work_mem = 2 MB.
The grand total for the backend:
Grand total: 586713672 bytes in 409 blocks; 20823696 free (527202 chunks); 565889976 used
About 587 MB for a single backend, on the reproduction server. The production cluster had 2 TB of RAM and the OOM killer had to step in.
The 524,059 chunks in ExecutorState caught our eye immediately.
But wait, every DBA knows about work_mem
Any seasoned DBA knows that work_mem is not “memory per query.” As the
Postgres documentation
states, work_mem is the amount of memory that can be used for each hash or
sort operation. One query can have many of them. Add parallel workers and the
number multiplies fast.
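You can see how many potential work_mem consumers a query has by counting the Hash and Sort nodes in its plan. A hypothetical example (table and column names invented for illustration):

```sql
explain (costs off)
select *
from orders o
join customers c using (customer_id)  -- one hash table
join products p using (product_id)    -- a second hash table
order by o.created_at;                -- plus a sort

-- Each Hash and Sort node may use up to work_mem,
-- and each parallel worker gets its own copy.
```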
So 2 MB times many hashes, many sorts, many workers can already get you into trouble. But 2 TB? That’s a different order of magnitude. Even with an absurd number of parallel workers and operations, the math doesn’t work out that way.
Something else was going on.
Why does Postgres ignore work_mem here?
Short answer: it doesn’t ignore it. It just doesn’t control everything.
work_mem limits the memory allocated per hash or sort operation. But here’s
the actual culprit: the memory is only released at the end of the whole
operation, not during.
This is by design. The Postgres memory context system is built around the idea that freeing an entire context in one shot is faster and more reliable than tracking individual allocations. From the source README:
The main advantage of memory contexts over plain use of malloc/free is that the entire contents of a memory context can be freed easily, without having to request freeing of each individual chunk within it.
ExecutorState is the memory context created at the start of query execution.
Everything the executor needs goes in there. It is destroyed when the query
finishes, not before. HashTableContext is a child of ExecutorState. It holds
the hash table data: buckets, entries, all of it. Alive for the full lifetime
of the hash join. Released in one shot at the end.
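Since Postgres 14 you can also inspect your own session's context tree directly, without going through the logs, via the pg_backend_memory_contexts view. A quick sketch (run it in the session you want to inspect; ExecutorState only shows up while a query is executing there):

```sql
-- Top memory contexts of the current backend, largest first.
select name, level, total_bytes, used_bytes, free_chunks
from pg_backend_memory_contexts
order by total_bytes desc
limit 10;
```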
The 524,059 chunks were the culprit
Back to those chunks. The query was a select that called a plpgsql function, which internally
performed a copy operation, and then joined the result with something else.
Perfectly valid Postgres. But as Hetty put it: “just because you can do
something, doesn’t mean you should.” The developers had learned to use functions
well. But then they did a very object-oriented thing: they used that function as
if it were just another table in a join. A classic factoring mistake. Hetty
will actually be talking about exactly this kind of SQL structuring pitfall at
Nordic PGDay 2026.
That’s one giant operation from Postgres’s point of view. One single ExecutorState context, alive from start to finish.
Each of those 524,059 chunks used up to work_mem worth of memory. And because
they all lived inside the same ExecutorState context, none of them were released
until the whole operation completed. The operation never completed. The OOM
killer made sure of that.
That’s the combination that explains the 2 TB: not just the number of hashes
and sorts, but the fact that every single work_mem chunk accumulated inside
one context that was never going to release anything until it was done.
The honest footnote: when we asked the community, the answer was roughly: nobody fully understands every aspect of Postgres memory behavior. The memory context README in the source is the closest thing to authoritative documentation. Worth reading if you want to go deep.
Can you prevent this?
You can’t put a hard cap on Postgres memory per backend. There is no setting for that. Postgres will take what it needs.
What you can do:
- Fix your statistics. If the planner underestimates row counts, it makes bad decisions about when to spill to disk. Run ANALYZE, check pg_stats and pg_statistic. If you have columns with correlated values (which violates higher normal forms and is a modeling smell in itself), use CREATE STATISTICS to capture the correlation. Increasing per-column targets with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS only helps when column values are independent of each other.
- Fix the query. A query that eats 2 TB of RAM is a bad query. Full stop. But sometimes you inherit bad queries. In that case…
- Use query timeouts. You can’t cap memory, but you can kill queries that run too long with statement_timeout.
- Monitor with pg_log_backend_memory_contexts. Now that you know it exists, use it. When something looks wrong, call the function. You’ll know what’s happening before the OOM kills your server.
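The statistics and timeout mitigations can be sketched like this (table, column, and statistics names are invented for illustration):

```sql
-- Refresh planner statistics on the problem table.
analyze orders;

-- Capture a cross-column correlation the planner can't see
-- from independent per-column statistics.
create statistics orders_city_zip_stats (dependencies)
    on city, zip from orders;
analyze orders;

-- Kill anything in this session that runs longer than five minutes.
set statement_timeout = '5min';
```

statement_timeout can also be set per role or in postgresql.conf; a session-level SET like the one above is the least invasive place to start.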
No hardware can compensate for a query gone wrong
That’s what Hetty said at the end of our conversation, and she has a point. Though in fairness, whoever wrote that function probably didn’t anticipate this behavior. Most developers don’t.
The root cause here was a badly written query consuming far more memory than intended. The underlying behavior of Postgres, releasing memory only at the end of execution, is by design, not a bug.
Understanding why Postgres behaves this way doesn’t fix bad queries. But it
does help you explain to the application team why their “simple select” brought
down production during peak operations.
And sometimes that’s enough to get the query rewritten.