I don’t know if you’ve been following the latest hardware shifts, but we are officially entering the era of “Titan-Class Infrastructure.” Take the latest AMD EPYC “Venice” (Zen 6): a single socket can now push a staggering 448 threads. Pair two of these together? That is 896 threads in one box. Add 12 TB of RAM to the mix, and you have a machine where the physical distance between a CPU core and a data byte is no longer negligible. In this world, the real battle is happening inside the motherboard. Its name is NUMA.
The Theory: Understanding Hardware Architectures
Before we dive into how this feels in practice, let’s look at the technical evolution of how CPUs talk to your RAM.
- UMA (Uniform Memory Access): In simpler systems, there is one central memory pool. Every core has the same “uniform” path to any byte. It’s consistent. However, as core counts exploded, this single path became a massive bottleneck.
- NUMA (Non-Uniform Memory Access): To avoid melting the chip by forcing too much data through one path, engineers split the memory and placed it physically closer to specific groups of cores. Accessing memory “local” to the core is instant. Accessing “remote” memory? It’s 2 to 3 times slower.
- CXL (Compute Express Link): The next frontier. Designed for massive virtualization and large-scale memory pooling, it allows RAM to be shared across different physical servers to avoid “stranded memory.” While flexible, it introduces the highest latency of the three: the data essentially lives “outside” the local motherboard boundary.
Just as NUMA evolved to solve the bottlenecks of UMA, CXL is now evolving to push past the limits of a single box. It isn’t a replacement: it’s a new architectural layer that sits on top of NUMA to bridge the gap between local speed and global scale.
The Practical Example: Toddlers and Toyboxes
Now, let’s make this real. To understand why your database might be slow on an 896-thread monster, imagine our CPU cores as toddlers and our RAM as toyboxes.
- The UMA Era (The Single Toybox): All toddlers go to the same big toybox in the middle of the room. The problem? They start bumping heads. The path to the box becomes a traffic jam of crying kids.
- The NUMA Era (The Private Toyboxes): We gave each group of toddlers their own private toybox right next to them.
- Local Access: Grabbing a toy from your own box is instant. Bingo!
- Remote Access: If a toddler needs a toy from a neighbor’s box, they have to walk across the room. It’s slower, and more importantly, it creates conflict. When two toddlers from different sockets fight over the same toy (a memory page), they stop playing and start bickering. These squabbles (contention) are what kill your TPS.
- The CXL Era (The Community Garden): There is now a huge community toybox in the garden. It’s great for sharing, but the walk is long. Imagine 896 toddlers trying to coordinate who gets which toy from the garden… the potential for chaos is massive.
The Virtualization Trap
In these “Titan” environments, your Postgres instance is likely running in a VM or a container. If an erratic scheduler moves your Postgres process (the toddler) to a different socket but leaves your shared_buffers (the toys) on the original socket or out in the garden, every single memory hit becomes a slow “Remote Access.” Trust me, your performance will tank.
PostgreSQL 18: Developing your “Spider-Sense”
Until now, identifying these hidden “walks” and squabbles across the motherboard was pure guesswork. PostgreSQL 18 finally gives us the visibility to see through the hypervisor and the silicon.
pg_shmem_allocations_numa: This new system view is the breakthrough we’ve been waiting for. It allows you to audit exactly how your shared memory is striped across physical NUMA nodes.
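To make that concrete, here is a minimal audit sketch. It assumes the view exposes name, numa_node, and size columns and that your build includes NUMA support; verify both on your own instance before relying on the output.

```sql
-- Minimal sketch: audit how each shared memory allocation is spread
-- across physical NUMA nodes (one row per allocation/node pair).
SELECT name,
       numa_node,
       pg_size_pretty(size) AS size_on_node
FROM pg_shmem_allocations_numa
ORDER BY size DESC
LIMIT 20;
```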
By using this view, you can finally perform a real Hardware & Virtualization Audit:
- Spotting the “Community Garden” (CXL): CXL memory usually appears as a “CPU-less” NUMA node. If you see your “hot” buffers sitting on a node with no attached cores, you’ve found a latency trap.
- Validating the Hypervisor: You can now verify if your VM’s virtual topology actually matches the physical hardware. If the OS thinks it’s spreading data across two nodes but the hypervisor has pinned everything to one socket, you’ve identified a massive source of contention (the sketch below shows one way to check).
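As a hedged illustration of both checks, the sketch below sums the buffer pool per NUMA node. It assumes the shared_buffers arena appears under the name “Buffer Blocks”, as it does in pg_shmem_allocations; confirm the exact name on your instance. A node holding most of the pool, especially one that lscpu reports as having no CPUs, is your “community garden.”

```sql
-- Sketch: how is the buffer pool ("Buffer Blocks") distributed across NUMA nodes?
-- Heavy skew toward one node, or toward a CPU-less (CXL) node, signals a latency trap.
SELECT numa_node,
       pg_size_pretty(sum(size)) AS buffers_on_node,
       round(100.0 * sum(size) / sum(sum(size)) OVER (), 1) AS pct_of_pool
FROM pg_shmem_allocations_numa
WHERE name = 'Buffer Blocks'
GROUP BY numa_node
ORDER BY sum(size) DESC;
```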
By cross-referencing this view with OS-level tools like numastat or lscpu, we can ensure our “toddlers” and “toys” stay in the same house, minimizing those costly remote hops.
Conclusion: Don’t Tune Physics Until You Fix the Logic
Should you dive into NUMA tuning immediately? It depends.
While NUMA and CXL tuning are the “frontier” of performance, they are not a silver bullet. Believe me, I’ve seen enough “firefights” to know that you shouldn’t decide based on assumptions. Before you start messing with numactl or CPU pinning, you must ensure your foundation is solid.
Your audit should follow this hierarchy:
- Scope the Problem: Is it even Postgres? Check for application-side slowness or a “Busy Neighbor” stealing your cycles on the same host.
- Data Volume: Is your query grabbing way more than it needs? (Unused columns, unnecessary rows, or useless joins).
- SQL Writing: Is your SQL written to help Postgres find the best execution plan?
- Statistics: Does Postgres actually know what your data looks like? Ensure your stats are up-to-date so the planner doesn’t fly blind.
- Indexing: Is a missing index forcing a toddler to scan the entire toybox?
- I/O Bottlenecks & Vicious Circles: Is your storage so slow that the CPU is just sitting idle? Or worse, are you caught in a vicious circle of huge writes?
- Connection Storm: Are you drowning in active processes? 896 threads is a lot, but managing hundreds of direct connections creates a “bickering” tax. Use a connection pooler.
- Cache Strategy: Are your shared_buffers tuned for your specific workload? (A few quick sanity-check queries for these last items follow this list.)
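To ground a few of those checklist items (statistics freshness, sequential-scan-heavy tables, connection pressure, and cache efficiency), here is a minimal sketch using the standard statistics views. The thresholds that matter are workload-specific, so treat these as starting points rather than verdicts.

```sql
-- Statistics: tables whose stats are stale (or were never analyzed) starve the planner.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY GREATEST(last_analyze, last_autoanalyze) ASC NULLS FIRST
LIMIT 20;

-- Indexing: tables that are mostly sequentially scanned may be missing an index.
SELECT relname, seq_scan, idx_scan, n_live_tup
FROM pg_stat_user_tables
WHERE seq_scan > coalesce(idx_scan, 0)
ORDER BY seq_scan DESC
LIMIT 20;

-- Connection storm: how many backends are actually doing work right now?
SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;

-- Cache strategy: rough shared_buffers hit ratio for the current database.
SELECT round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();
```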
Only once these are optimized should you dive into the physics of NUMA. Hardware is physics, but software is logic. You need both to win.
If you are hitting a performance wall, I am here to help. I have recently launched my own independent consultancy to help you navigate these architectural waters.
Test, validate, and let’s make your Postgres fly!