How does text generation actually happen?

When you hit send, the model runs a forward pass to predict the next token, samples one, appends it, and repeats. This loop generates entire responses token by token.

What happens in the milliseconds after you press send?

Your prompt is tokenized, breaking text into token IDs. These IDs become embeddings: vectors of numbers. The embeddings flow through the neural network, layer by layer. At the end, the model outputs probabilities for every possible next token.

One token is selected from those probabilities. It's appended to the sequence. The whole process repeats with this slightly longer sequence. Token by token, the response emerges.
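To make that single step concrete, here is a minimal sketch assuming the Hugging Face transformers library and the small GPT-2 model as a stand-in; the prompt and the top-5 inspection are illustrative choices, not part of any particular production system.

```python
# One prediction step: tokenize, run a forward pass, inspect the
# probability distribution over the vocabulary, sample one token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # token IDs

with torch.no_grad():
    logits = model(input_ids).logits           # one forward pass
probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token

# Look at the top candidates, then sample one.
top = torch.topk(probs, k=5)
for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(tok_id.item()):>12s}  {p.item():.3f}")

next_id = torch.multinomial(probs, num_samples=1)
print("sampled:", tokenizer.decode(next_id))
```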

This is inference: running the model to produce output. Training taught the model what to predict; inference is that prediction happening in real time.

The autoregressive loop

LLMs generate text autoregressively: each token depends on all previous tokens.

Each step requires a full forward pass through the network. A 500-token response requires 500 forward passes. This is why generation takes noticeable time.
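A minimal sketch of that loop, again assuming GPT-2 via transformers (50 generated tokens is an arbitrary choice), makes the cost visible: every iteration reruns the model over the whole sequence so far.

```python
# A naive autoregressive loop: each iteration is a full forward pass
# over the growing sequence, so 50 new tokens means 50 forward passes.
# No KV cache here; that optimization is covered below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

for _ in range(50):                                    # 50 new tokens
    with torch.no_grad():
        logits = model(ids).logits                     # full forward pass
    probs = torch.softmax(logits[0, -1], dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)
    ids = torch.cat([ids, next_id.unsqueeze(0)], dim=1)  # append and repeat

print(tokenizer.decode(ids[0]))
```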

Why is inference expensive?

Each forward pass involves massive matrix multiplications across billions of parameters. For a 70-billion parameter model, each token requires roughly 70 billion multiply-add operations.

Multiply that by the number of tokens generated. A 1,000-token response means roughly 70 trillion multiply-adds. This is why inference requires specialized hardware: GPUs or TPUs that can perform trillions of operations per second.
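The arithmetic is easy to reproduce. The sketch below uses the rough rule of thumb of about one multiply-add per parameter per generated token; the GPU throughput figure is an assumed round number for illustration, not a measured one.

```python
# Back-of-the-envelope arithmetic for the numbers in the text.
params = 70e9            # 70B-parameter model
tokens = 1000            # length of the response

madds = params * tokens          # multiply-adds for the whole response
flops = 2 * madds                # each multiply-add is 2 floating-point ops

gpu_flops_per_s = 1e15           # assumed ~1 PFLOP/s of usable GPU throughput
print(f"{madds:.0e} multiply-adds  (~{flops:.0e} FLOPs)")
print(f"~{flops / gpu_flops_per_s:.2f} s of pure compute at the assumed rate")
```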

The KV cache optimization

Here's a key insight: in naive autoregressive generation, most of the computation is repeated. When generating token 501, you would rerun the forward pass over tokens 1-500 to recompute their keys and values, even though nothing about them has changed.

The KV cache stores the key and value vectors computed for previous tokens. On each new step, the model computes attention inputs only for the new token and reuses the cached keys and values for everything before it. This dramatically speeds up generation.
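Here is a minimal sketch of cached generation, assuming the use_cache / past_key_values interface of the Hugging Face transformers library with GPT-2: the prompt is processed once, and each subsequent step feeds in only the newly generated token.

```python
# Cached generation: previous keys/values come from the cache,
# so each step only runs the forward pass for the new token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, use_cache=True)       # prompt processed once
past = out.past_key_values                 # cached keys/values per layer

generated = ids
for _ in range(50):
    probs = torch.softmax(out.logits[0, -1], dim=-1)
    next_id = torch.multinomial(probs, num_samples=1).unsqueeze(0)  # shape (1, 1)
    generated = torch.cat([generated, next_id], dim=1)
    with torch.no_grad():
        # Only the new token is fed in; the cache supplies everything else.
        out = model(next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values

print(tokenizer.decode(generated[0]))
```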

What determines inference speed?

Key factors:

  • Model size: More parameters means more computation per token
  • Context length: Longer contexts mean more attention computation (quadratic scaling)
  • Hardware: GPU memory, compute speed, interconnect bandwidth
  • Optimization: KV cache, quantization, batching efficiency
  • Sampling: How tokens are selected (simple greedy decoding vs strategies like temperature or top-k sampling; see the sketch below)

Smaller models (7B vs 70B parameters) generate much faster but may produce lower quality. Quantized models (reduced precision) trade some quality for speed. The right balance depends on use case.
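As a sketch of the sampling factor above, here is how the selection step can differ on a toy logits vector; the values, the temperature, and the top-k setting are illustrative, not recommendations.

```python
# Greedy vs temperature vs top-k selection on a toy distribution.
import torch

logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])  # scores for 5 candidate tokens

# Greedy: always take the single most likely token (deterministic).
greedy_id = torch.argmax(logits).item()

# Temperature sampling: rescale logits, then draw from the distribution.
temperature = 0.8
probs = torch.softmax(logits / temperature, dim=-1)
sampled_id = torch.multinomial(probs, num_samples=1).item()

# Top-k: restrict to the k most likely tokens, then sample among them.
k = 3
topk = torch.topk(logits, k)
topk_probs = torch.softmax(topk.values, dim=-1)
topk_id = topk.indices[torch.multinomial(topk_probs, 1)].item()

print(greedy_id, sampled_id, topk_id)
```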

๐Ÿ“Š
Token Probability Viewer
See the probability distribution over candidate tokens at each generation step

Inference vs training

Training is much more expensive than inference because:

  • Training computes gradients (backward pass) in addition to predictions (forward pass)
  • Training processes the entire dataset many times (epochs)
  • Training updates parameters, requiring memory for optimizer states

A single inference query costs a tiny fraction of what training cost. But inference scales with users: millions of queries add up. Inference cost is the ongoing expense; training cost is the upfront investment.
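A rough way to see the gap is the common back-of-the-envelope approximation of about 6 × parameters × training-tokens FLOPs for training and about 2 × parameters FLOPs per generated token for inference; the training-token count below is an assumed illustrative figure, not a real model's.

```python
# Rough training-vs-inference compute comparison.
params = 70e9
training_tokens = 2e12            # assumed 2 trillion training tokens

training_flops = 6 * params * training_tokens
flops_per_generated_token = 2 * params

tokens_equivalent = training_flops / flops_per_generated_token
print(f"training  ≈ {training_flops:.1e} FLOPs")
print(f"inference ≈ {flops_per_generated_token:.1e} FLOPs per generated token")
print(f"training compute ≈ {tokens_equivalent:.1e} generated tokens' worth")
```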

The economics of inference

Running inference at scale requires substantial infrastructure. A single H100 GPU costs around $30,000 and can serve maybe tens of queries per second for a large model. Serving millions of users requires thousands of GPUs.

This is why API pricing matters. Providers balance hardware costs, cooling, staff, and profit margins to set per-token prices. Cheaper inference comes from efficiency gains: better hardware, smarter batching, model optimizations.
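For a feel of the arithmetic, here is an illustrative hardware-only estimate. Every number except the GPU price from the text is an assumption chosen for round figures, and real pricing also folds in power, cooling, staff, and margin.

```python
# Illustrative per-token hardware cost, amortizing a GPU over its lifetime.
gpu_price = 30_000                    # dollars, from the text
lifetime_s = 3 * 365 * 24 * 3600      # assumed 3-year amortization
throughput = 2_000                    # assumed tokens/second served per GPU

hardware_cost_per_s = gpu_price / lifetime_s
cost_per_million_tokens = hardware_cost_per_s / throughput * 1e6
print(f"~${cost_per_million_tokens:.3f} per million tokens (hardware only)")
```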

Sources & Further Reading

๐ŸŽฌ Video
Let's build GPT: from scratch, in code
Andrej Karpathy ยท 2023
๐Ÿ“„ Paper
Fast Inference from Transformers via Speculative Decoding
Leviathan et al. ยท Google ยท 2022
๐Ÿ”— Article