
Why decoding is memory-bound for LLMs, and how to optimize it


If you are building infrastructure for Large Language Models (LLMs) or just trying to figure out why your local AI assistant takes a second to reply, you are bumping into one of the most fascinating hardware paradoxes in computer science.

There is a very common, very logical misconception in AI engineering: LLMs do billions of math operations per second, so surely the “decode” phase must be compute-bound.

The truth is that the bottleneck shape-shifts with the workload. For a single user, decoding is brutally memory-bandwidth bound.

The “Memory Wall”

To understand the bottleneck, we have to look at the physical anatomy of a GPU. Think of the GPU as having two main areas:

  • HBM/VRAM — The Library: This is massive but relatively slow. It holds the 140+ gigabytes of model weights.

  • SRAM/Compute Cores — The Desk: This is tiny but blazingly fast. This is where the actual matrix multiplication happens.

The Problem: A model like Llama-3 70B weighs roughly 140 GB in 16-bit precision. You cannot fit a 140 GB model onto a 50 MB desk. Because the whole model doesn’t fit on the desk, the GPU must physically haul the model from the Library to the Desk, layer by layer:

  • Load Layer 1 weights from HBM to SRAM.
  • Multiply by the token.
  • Wipe the desk clean.
  • Load Layer 2 weights from HBM to SRAM.
  • Repeat 80 times for all layers.

So even though the weights are “loaded” in the VRAM, they are electrically far away from the math units. They have to travel across the memory bus (the “wires”), which is the slowest part of the entire system. The compute cores (the SMs) multiply the weights against that single token in a fraction of a microsecond. Then, they sit completely idle, waiting for the next massive layer of weights to travel across the silicon.
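
To put rough numbers on that gap, here is a back-of-the-envelope roofline sketch in Python. The hardware figures (an H100-class GPU with roughly 3.35 TB/s of HBM bandwidth and about 1 PFLOP/s of dense BF16 compute) and the ~140 GB FP16 weight size are illustrative assumptions, not measurements.

# Back-of-the-envelope roofline for single-user decode.
# Assumed numbers (illustrative only): H100-class GPU, ~3.35 TB/s HBM
# bandwidth, ~1 PFLOP/s of dense BF16 compute, 70B params in 16-bit weights.

weights_bytes = 70e9 * 2        # ~140 GB of FP16/BF16 weights
hbm_bandwidth = 3.35e12         # bytes per second
peak_compute  = 1.0e15          # FLOP per second

# Decoding one token has to stream every weight from HBM once...
memory_time = weights_bytes / hbm_bandwidth       # ~0.042 s per token

# ...but only performs ~2 FLOPs per weight (one multiply, one add).
compute_time = 2 * 70e9 / peak_compute            # ~0.00014 s per token

print(f"bandwidth ceiling: ~{1 / memory_time:.0f} tokens/s")            # ~24
print(f"compute finishes ~{memory_time / compute_time:.0f}x sooner")    # ~300

Even at full memory bandwidth, streaming the weights caps single-user decoding at roughly 24 tokens per second on these assumed numbers; the math itself finishes hundreds of times sooner, and the cores idle the rest of the time.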

Speculative Decoding — Trading Free Compute for Speed

Engineers hate idle compute. If the compute cores are sitting there waiting anyway, why not give them more work to do with the data they already loaded?

Enter Speculative Decoding.

During the standard decoding phase, the LLM generates only one token at a time, because generation is autoregressive: the model needs the previous token as input before it can compute the next one. To break this serial bottleneck, a tiny, ultra-fast “Draft Model” is introduced.
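
To make that serial dependency concrete, here is a minimal sketch of the standard greedy decoding loop. The model callable is a hypothetical stand-in for the real forward pass (which streams all of the weights through HBM on every single iteration); it is not a real library API.

# Minimal sketch of standard (non-speculative) greedy decoding.
# `model(tokens)` is a hypothetical callable returning next-token logits.
# Each iteration must wait for the previous token, and each iteration
# pays for one full trip of the weights from HBM to the compute cores.

def greedy_decode(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                    # one full pass over all layers
        next_token = max(range(len(logits)), key=logits.__getitem__)   # argmax
        tokens.append(next_token)                 # the next step depends on this
    return tokens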

This little model runs ahead and guesses the next few tokens, say 5. Then, the massive Target Model loads its 140 GB of weights once and verifies all 5 guessed tokens simultaneously. If a guessed token and all of its predecessors match the Target Model’s own predictions, it is accepted. Otherwise, it is rejected, and the first incorrect token is replaced with the correct one. When x tokens are accepted, we get x + 1 tokens at the price of one memory load. Since a token is accepted only when it exactly matches the Target Model’s prediction, accuracy is not compromised at all.
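
Here is a minimal sketch of one speculative step under that greedy acceptance rule. The draft_model and target_model interfaces are hypothetical simplifications (real implementations work on batched logits), but the accept/reject logic is the one described above.

# One speculative-decoding step with greedy acceptance (sketch).
# Hypothetical interfaces, for illustration only:
#   draft_model(seq)  -> the draft model's single next-token guess for seq
#   target_model(seq) -> the target model's next-token prediction after every
#                        prefix of seq, as a list of len(seq) tokens, computed
#                        in ONE forward pass (i.e. ONE load of the big weights)

def speculative_step(target_model, draft_model, tokens, k=5):
    n = len(tokens)

    # 1. The cheap draft model guesses k tokens autoregressively.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model(draft))

    # 2. The big target model verifies the whole drafted sequence at once.
    #    preds[i] is the target's choice for the token following draft[:i + 1].
    preds = target_model(draft)

    # 3. Accept guesses while they match the target; at the first mismatch,
    #    keep the target's own token instead. If every guess matches, the
    #    final prediction comes along as a free bonus token.
    out = list(tokens)
    for i in range(k):
        target_choice = preds[n - 1 + i]     # target's pick after draft[:n + i]
        out.append(target_choice)            # confirms the guess or corrects it
        if draft[n + i] != target_choice:
            return out                       # first wrong guess: stop here
    out.append(preds[n + k - 1])             # all k accepted: bonus token
    return out

Note that step 2 is the whole trick: however many guesses survive, the giant weight matrix crosses the memory bus exactly once per step.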

We haven’t made the memory bandwidth any faster. Instead, we amortized the cost. We forced the GPU to do 5 times more math per memory load. Because the compute cores were mostly idle anyway, this extra math is effectively “free.” We traded idle compute capacity to bypass the memory bandwidth bottleneck, typically resulting in a 2x to 3x speedup.
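
To get a feel for where the 2x to 3x figure comes from, assume (purely for illustration) that each drafted token independently matches the target with probability alpha. The expected number of tokens emitted per weight load is then a short geometric sum:

# Expected tokens emitted per target-model weight load, assuming each drafted
# token matches the target independently with probability alpha (a made-up,
# illustrative number). We always emit at least one token (the correction or
# the bonus), plus one more for each consecutive match:
#   1 + alpha + alpha**2 + ... + alpha**k

def expected_tokens_per_load(alpha: float, k: int) -> float:
    return sum(alpha ** i for i in range(k + 1))

print(expected_tokens_per_load(0.8, 5))   # ~3.7 tokens per weight load
print(expected_tokens_per_load(0.6, 5))   # ~2.4 tokens per weight load

The draft model’s own (small but nonzero) cost eats into this, which is why real-world speedups tend to land around 2x to 3x rather than at the theoretical ceiling.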

Naturally, increasing the number of tokens accepted per verification step can speed decoding up even further. Extensive research is being done to push up the average accepted length, e.g., by improving the accuracy of the Draft Model or by introducing more sophisticated techniques such as tree-structured speculative tokens.

But as you scale your server up to handle thousands of users, the physics flip. Batching lets one weight load serve many sequences at once, so the amount of math per byte of memory traffic climbs until the compute cores, not the memory bus, become the limit, and the system slams into a compute-bound wall.
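
To see the flip in numbers, extend the earlier roofline sketch with a batch dimension. These are the same assumed H100-class figures as before, and KV-cache traffic is deliberately ignored, which makes the real crossover point messier than this.

# Same assumed H100-class numbers as the earlier sketch. With batching,
# one ~140 GB weight load now serves `batch` sequences, while the FLOPs
# grow linearly with the batch size. (KV-cache traffic is ignored here.)

weights_bytes = 70e9 * 2            # ~140 GB of FP16 weights
hbm_bandwidth = 3.35e12             # bytes/s
peak_compute  = 1.0e15              # FLOP/s

memory_time = weights_bytes / hbm_bandwidth              # fixed per decode step

for batch in (1, 64, 512):
    compute_time = batch * 2 * 70e9 / peak_compute       # grows with the batch
    bound = "memory" if memory_time > compute_time else "compute"
    print(f"batch={batch:4d}  memory={memory_time*1e3:5.1f} ms  "
          f"compute={compute_time*1e3:5.1f} ms  -> {bound}-bound")

Once the GPU crosses into compute-bound territory, the “free” math that speculative decoding relies on is no longer free: every extra verification FLOP now competes with other users’ requests.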

Takeaway

Speculative decoding is a brilliant hack to squeeze more performance out of the GPU when the system is memory-bound. However, it is not a one-size-fits-all solution. As the workload scales and the GPU becomes fully utilized, the bottleneck shifts from memory bandwidth to compute power, and speculative decoding may no longer be effective. Understanding these dynamics is crucial for optimizing LLM performance in different scenarios.