Parameters are the learned numbers inside a neural network. Billions of them encode everything the model knows about language.
When people say GPT-5.1 has "over two trillion parameters," what does that mean?
Parameters are the numbers that define a neural network. Every weight connecting neurons, every bias shifting activations: these are parameters. When a model "learns," it's adjusting these numbers.
A model with 175 billion parameters has 175 billion individual numbers that were tuned during training. Each one contributes, in some small way, to every prediction the model makes.
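To make "every weight and every bias" concrete, here is a minimal Python sketch that counts the parameters of a small fully connected network. The layer sizes are invented for illustration; real language models use different architectures, but the counting logic is the same.

```python
# Minimal sketch: counting parameters in a small fully connected network.
# The layer sizes here are made up for illustration.
layer_sizes = [784, 512, 256, 10]  # input -> hidden -> hidden -> output

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight per connection between the two layers
    biases = n_out           # one bias per neuron in the next layer
    total += weights + biases

print(total)  # 535818 parameters for this toy network
```

Scaling the same bookkeeping up to transformer-sized layers is how figures like "175 billion parameters" are arrived at.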
What do parameters actually store?
This is subtle. Parameters don't store facts like a database stores records. You can't point to a parameter and say "this one knows that Paris is the capital of France."
Instead, knowledge is distributed across parameters. Patterns about language, facts about the world, reasoning heuristics: all encoded as statistical relationships between millions of numbers. The parameter values collectively create a function that maps inputs to outputs in useful ways.
Why do we need so many?
More parameters mean more capacity to store patterns. A small network with thousands of parameters can learn simple rules. A network with billions can learn subtle distinctions.
Consider what language requires:
Grammar rules and exceptions to those rules
Word meanings and how context shifts them
Facts about the world
Reasoning patterns
Style, tone, register
Multiple languages and their interactions
Encoding all of this requires enormous parameter counts. Each additional parameter is another degree of freedom the model can use to capture nuance.
How much space do parameters take?
Each parameter is stored as a number at a chosen precision, typically a floating-point format, though lower-precision integer formats are also used. Common formats:
32-bit (FP32): 4 bytes per parameter, full precision
16-bit (FP16/BF16): 2 bytes per parameter, common for training
8-bit (INT8): 1 byte per parameter, used for efficient inference
4-bit: 0.5 bytes per parameter, aggressive compression
GPT-3's 175 billion parameters at 16-bit precision: about 350 gigabytes just for the weights. This is why running large models requires specialized hardware with substantial memory.
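As a back-of-the-envelope check, here is a small Python sketch of that memory arithmetic, using the bytes-per-parameter figures above and counting only the weights (no activations or other runtime overhead).

```python
# Minimal sketch: memory needed just to hold a model's weights,
# using the bytes-per-parameter figures listed above.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9  # gigabytes (decimal)

gpt3_params = 175e9
for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("4-bit", 0.5)]:
    print(f"{name:>10}: {weight_memory_gb(gpt3_params, bytes_per_param):.0f} GB")
# FP16/BF16 gives ~350 GB, matching the figure above.
```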
Quantization: making models smaller
Quantization reduces the precision of parameters to save memory and speed up computation. Instead of 16-bit numbers, you might use 8-bit or even 4-bit.
This sounds like it should destroy performance. Surprisingly, it often doesn't. Quantization typically happens after training is complete. Training uses high-precision numbers to allow gradual, fine-grained adjustments as the model searches for optimal weights. Once those weights are found, much of that precision becomes redundant.
The network has redundancy. Small precision losses at the individual parameter level average out across billions of parameters.
A 70-billion parameter model at 4-bit quantization fits in about 35 gigabytes: small enough for a pair of high-end consumer GPUs or a workstation with ample unified memory, rather than a datacenter cluster. The same model at full 32-bit precision would need 280 gigabytes. Quantization enables running powerful models on more accessible hardware, with only modest quality degradation.
Think of it like MP3 vs. FLAC audio. Most people can't hear the difference between high-bitrate MP3 and lossless FLAC. Similarly, quantized models are often "good enough."
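To show the mechanics, here is a minimal Python/NumPy sketch of one common scheme, symmetric per-tensor 8-bit quantization. Production quantizers (per-channel scales, group-wise 4-bit formats, calibration data) are more sophisticated, and the layer shape here is made up.

```python
import numpy as np

# Minimal sketch of symmetric 8-bit quantization applied to one weight tensor.
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # made-up layer

scale = np.abs(weights).max() / 127.0          # map the largest weight to +/-127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale     # what inference actually computes with

print(weights.nbytes / 1e6, "MB at FP32")      # ~67 MB
print(q.nbytes / 1e6, "MB at INT8")            # ~17 MB, 4x smaller
print(np.abs(weights - dequantized).mean())    # small average error per weight
```

The per-weight error is tiny relative to typical weight magnitudes, which is why, averaged over billions of parameters, the model's behavior changes only slightly.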
What happens to parameters during training?
Before training, parameters are initialized randomly (with some careful choices about the random distribution). The model outputs nonsense.
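A minimal sketch of what "careful choices about the random distribution" can look like for one layer, using a Kaiming/He-style scale often paired with ReLU activations; the layer sizes are made up and many other initialization schemes exist.

```python
import numpy as np

# Minimal sketch of scaled random initialization for a single layer.
rng = np.random.default_rng(42)
fan_in, fan_out = 512, 256
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))  # scale depends on layer width
b = np.zeros(fan_out)  # biases are often initialized to exactly zero
```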
Training iteratively adjusts parameters to reduce prediction error, as the sketch after this list shows:
A batch of examples flows through the network (forward pass)
The outputs are compared to targets, producing a loss value
Gradients are computed showing how each parameter affects the loss (backward pass)
Each parameter is nudged slightly in the direction that reduces loss
Repeat billions of times
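Here is a minimal NumPy sketch of that loop for a single linear layer on a made-up regression task. Real LLM training runs the same steps through many layers, with backpropagation computing the gradients, but the shape of the loop is identical.

```python
import numpy as np

# Minimal sketch of the training loop above, for one linear layer on a toy task.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))                   # toy input batch
true_w = rng.normal(size=(16, 1))
y = X @ true_w                                   # targets the model should learn

W = rng.normal(0, 0.1, size=(16, 1))             # randomly initialized parameters
lr = 0.01                                        # how far each nudge goes

for step in range(1000):
    pred = X @ W                                 # 1. forward pass
    loss = np.mean((pred - y) ** 2)              # 2. compare outputs to targets
    grad = 2 * X.T @ (pred - y) / len(X)         # 3. gradient of loss w.r.t. each parameter
    W -= lr * grad                               # 4. nudge parameters to reduce loss
                                                 # 5. repeat

print(loss)  # approaches zero as W converges toward true_w
```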
By the end, the random initial values have been sculpted into a configuration that captures something about language.
The parameter mystery
We can count parameters. We can measure what models do. What we can't easily do is understand how specific parameters contribute to specific behaviors.
This is the interpretability challenge. 175 billion numbers, all contributing fractionally to every output. Which parameters encode grammar? Which encode facts about history? The question may not even be well-formed: knowledge is distributed so diffusely that no parameter "knows" anything individually.
Understanding this distributed, high-dimensional encoding is one of the open frontiers in AI research.