When you hit send, the model runs a forward pass to predict the next token, samples one, appends it, and repeats. This loop generates entire responses token by token.
What happens in the milliseconds after you press send?
Your prompt is tokenized, breaking text into token IDs. These IDs become embeddings: vectors of numbers. The embeddings flow through the neural network, layer by layer. At the end, the model outputs probabilities for every possible next token.
One token is selected from those probabilities. It's appended to the sequence. The whole process repeats with this slightly longer sequence. Token by token, the response emerges.
This is inference: running the model to produce output. Training taught the model what to predict. Inference is prediction happening in real-time.
The autoregressive loop
LLMs generate text autoregressively: each token depends on all previous tokens.
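Formally, the model factors the probability of a whole sequence into a product of per-token conditionals:

P(t1, t2, ..., tn) = P(t1) · P(t2 | t1) · ... · P(tn | t1, ..., tn-1)

Each factor is exactly what one forward pass computes: the distribution over the next token given everything generated so far.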
The generation loop, step by step:
1. Tokenize input: text becomes token IDs
2. Forward pass: the IDs flow through all the model's layers
3. Get probabilities: a distribution over every possible next token
4. Sample a token: selected based on temperature
5. Append and repeat: until generation is done
Each step requires a full forward pass through the network. A 500-token response requires 500 forward passes. This is why generation takes noticeable time.
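Here is a minimal sketch of that loop in Python. The `forward` function is a stub standing in for a real transformer, and the vocabulary size, end-of-sequence ID, and sampling details are illustrative assumptions, not any particular library's API:

```python
import numpy as np

VOCAB_SIZE = 32_000
EOS_ID = 2          # assumed end-of-sequence token id
rng = np.random.default_rng(0)

def forward(token_ids):
    """Stand-in for a full transformer forward pass.
    A real model returns logits for the next token given the whole sequence."""
    local = np.random.default_rng(hash(tuple(token_ids)) % (2**32))
    return local.normal(size=VOCAB_SIZE)   # fake logits

def sample(logits, temperature=0.8):
    """Convert logits to probabilities and draw one token."""
    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB_SIZE, p=probs))

def generate(prompt_ids, max_new_tokens=50):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)        # one full forward pass per generated token
        next_id = sample(logits)
        ids.append(next_id)          # the sequence grows by one token
        if next_id == EOS_ID:
            break
    return ids

print(generate([101, 7592, 2088]))   # made-up prompt token ids
```

The structure is the whole story: one forward pass, one sampled token, one append, repeated until an end condition.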
Why is inference expensive?
Each forward pass involves massive matrix multiplications across billions of parameters. For a 70-billion parameter model, each token requires roughly 70 billion multiply-add operations.
Multiply that by the number of tokens generated: a 1000-token response means roughly 70 trillion operations. This is why inference requires specialized hardware: GPUs or TPUs that can perform trillions of operations per second.
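A quick back-of-the-envelope version of that arithmetic (the sustained throughput figure is an assumed, illustrative number, not a spec for any particular chip):

```python
params = 70e9            # 70B-parameter model
ops_per_token = params   # ~1 multiply-add per parameter per generated token
response_tokens = 1000

total_ops = ops_per_token * response_tokens
print(f"{total_ops:.0e} multiply-adds")         # 7e+13, i.e. 70 trillion

# Assume hardware sustaining ~100 trillion multiply-adds/sec (illustrative)
sustained_ops_per_sec = 100e12
print(f"~{total_ops / sustained_ops_per_sec:.2f} s of pure compute")
```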
The KV cache optimization
Here's a key insight: in autoregressive generation, most of the computation is repeated. When generating token 501, a naive implementation re-runs the forward pass over the whole sequence, recomputing the keys and values for tokens 1-500 even though nothing about them has changed.
The KV cache stores the key and value vectors from previous tokens. On each step, the model computes keys and values only for the new token and reuses the cached ones for everything before it. This dramatically speeds up generation.
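A sketch of the idea for a single attention head, using NumPy. Shapes and the cache layout are simplified for illustration; real implementations cache per layer and per head:

```python
import numpy as np

d = 64                                   # head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend_with_cache(x_new, cache):
    """Process ONE new token's hidden state, reusing cached K/V from earlier tokens."""
    q = x_new @ W_q                      # only the new token's query
    k = x_new @ W_k                      # only the new token's key
    v = x_new @ W_v                      # only the new token's value

    cache["K"] = np.vstack([cache["K"], k[None, :]])   # append instead of recomputing
    cache["V"] = np.vstack([cache["V"], v[None, :]])

    scores = cache["K"] @ q / np.sqrt(d)                # attention over all tokens so far
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"], cache

# Usage: start with an empty cache and feed tokens one at a time.
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
for step in range(5):
    x_new = rng.normal(size=d)           # stand-in for the new token's hidden state
    out, cache = attend_with_cache(x_new, cache)
print(cache["K"].shape)                  # (5, 64): one cached key per generated token
```

The per-step cost of projecting keys and values stays constant; only the attention lookup grows with the sequence.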
Batching and parallelism
A single request doesn't fully utilize a modern GPU. The solution: process multiple requests simultaneously.
Batching: Group requests together, process them in parallel through the same model. More efficient hardware utilization.
Continuous batching: As some requests finish, new ones join the batch dynamically. Keeps the GPU constantly busy.
Speculative decoding: Use a small, fast model to predict several tokens, then verify with the large model in one pass. Can speed up generation significantly.
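A toy sketch of the speculative idea, using stub models and greedy verification (real systems use a probabilistic accept/reject rule and also keep a bonus token when everything is accepted; the names and shapes here are illustrative):

```python
import numpy as np

VOCAB = 1000

def stub_logits(ids, seed):
    """Deterministic fake logits; stands in for a real model's forward pass."""
    rng = np.random.default_rng((hash(tuple(ids)) % (2**32)) + seed)
    return rng.normal(size=(len(ids), VOCAB))   # logits at every position

draft_model  = lambda ids: stub_logits(ids, seed=1)   # small, fast
target_model = lambda ids: stub_logits(ids, seed=2)   # large, accurate

def speculative_step(ids, k=4):
    # 1. Draft model proposes k tokens, one cheap pass each.
    proposal = list(ids)
    for _ in range(k):
        proposal.append(int(draft_model(proposal)[-1].argmax()))
    drafted = proposal[len(ids):]

    # 2. Target model scores the whole proposed extension in ONE forward pass.
    target_logits = target_model(proposal)

    # 3. Accept drafted tokens as long as they match the target's own choice.
    accepted = []
    for i, tok in enumerate(drafted):
        target_choice = int(target_logits[len(ids) + i - 1].argmax())
        if tok != target_choice:
            accepted.append(target_choice)   # take the target's token and stop
            break
        accepted.append(tok)
    return ids + accepted

print(speculative_step([5, 42, 7]))
```

When the draft model guesses well, several tokens are accepted for the price of a single large-model pass.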
These optimizations are a large part of why API inference is often cheaper than running your own GPU: providers achieve efficiency through scale and engineering.
What determines inference speed?
Key factors:
Model size: More parameters means more computation per token
Context length: Longer contexts mean more attention computation (quadratic scaling)
Sampling strategy: How tokens are selected (simple greedy decoding vs. more elaborate schemes like beam search or nucleus sampling)
Smaller models (7B vs 70B parameters) generate much faster but may produce lower quality. Quantized models (reduced precision) trade some quality for speed. The right balance depends on the use case.
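One useful rule of thumb: single-stream decoding is usually limited by memory bandwidth rather than raw compute, because every parameter must be read from memory for each generated token. A rough estimate makes the size and quantization trade-offs concrete (the bandwidth figure and model sizes are assumed, illustrative numbers):

```python
def est_tokens_per_sec(params_billion, bytes_per_param, mem_bandwidth_gb_s):
    """Rough upper bound: every parameter is read once per generated token."""
    model_bytes_gb = params_billion * bytes_per_param
    return mem_bandwidth_gb_s / model_bytes_gb

BW = 2000   # assume ~2 TB/s of GPU memory bandwidth (illustrative)
print(f"70B, fp16: ~{est_tokens_per_sec(70, 2,   BW):.0f} tok/s")   # ~14
print(f"70B, int4: ~{est_tokens_per_sec(70, 0.5, BW):.0f} tok/s")   # ~57
print(f" 7B, fp16: ~{est_tokens_per_sec(7,  2,   BW):.0f} tok/s")   # ~143
```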
Interactive demo (Token Probability Viewer): see the probability distribution over candidate tokens at each generation step.
Inference vs training
Training is much more expensive than inference because:
Training computes gradients (backward pass) in addition to predictions (forward pass)
Training processes the entire dataset many times (epochs)
Training updates parameters, requiring memory for optimizer states
A single inference query costs a tiny fraction of what training cost. But inference scales with users: millions of queries add up. Inference cost is the ongoing expense; training cost is the upfront investment.
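To put rough numbers on that gap: a common approximation is ~6 FLOPs per parameter per training token (forward plus backward pass) versus ~2 FLOPs per parameter per generated token at inference. The token counts below are assumed for illustration:

```python
params = 70e9
training_tokens = 2e12        # assumed training set size (illustrative)
tokens_per_query = 1000       # assumed prompt + response length (illustrative)

train_flops = 6 * params * training_tokens      # ~6 FLOPs/param/token rule of thumb
query_flops = 2 * params * tokens_per_query     # ~2 FLOPs/param/token at inference

print(f"training:  {train_flops:.1e} FLOPs")    # 8.4e+23
print(f"one query: {query_flops:.1e} FLOPs")    # 1.4e+14
print(f"queries to match training: {train_flops / query_flops:.1e}")  # 6.0e+09
```

Under these assumptions it takes on the order of billions of queries before cumulative inference compute rivals the one-time training run.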
The economics of inference
Running inference at scale requires substantial infrastructure. A single H100 GPU costs around $30,000 and can serve maybe tens of queries per second for a large model. Serving millions of users requires thousands of GPUs.
This is why API pricing matters. Providers balance hardware costs, cooling, staff, and profit margins to set per-token prices. Cheaper inference comes from efficiency gains: better hardware, smarter batching, model optimizations.
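A simple back-of-the-envelope way to see how those factors turn into per-token prices (every number here is an assumption for illustration, not a quote of any provider's actual costs):

```python
gpu_cost = 30_000            # purchase price, amortized over the GPU's useful life
lifetime_years = 3
overhead_multiplier = 2.0    # power, cooling, networking, staff (assumed)
tokens_per_sec = 1500        # assumed aggregate throughput with batching

seconds = lifetime_years * 365 * 24 * 3600
total_cost = gpu_cost * overhead_multiplier
total_tokens = tokens_per_sec * seconds

cost_per_million_tokens = total_cost / total_tokens * 1e6
print(f"~${cost_per_million_tokens:.2f} per million tokens")   # ~$0.42 under these assumptions
```

Change any input (utilization, throughput, overhead) and the per-token cost shifts accordingly, which is exactly the lever providers compete on.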