What are parameters?

Parameters are the learned numbers inside a neural network. Billions of them encode everything the model knows about language.

When people say GPT-5.1 has "over two trillion parameters," what does that mean?

Parameters are the numbers that define a neural network. Every weight connecting neurons, every bias shifting activations: these are parameters. When a model "learns," it's adjusting these numbers.

A model with 175 billion parameters has 175 billion individual numbers that were tuned during training. Each one contributes, in some small way, to every prediction the model makes.
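
To make that concrete, here is a minimal sketch of counting every weight and bias in a small network. It assumes Python with PyTorch (the article names no framework) and uses arbitrary layer sizes chosen for illustration:

  import torch.nn as nn

  # A toy two-layer network: 1,000 inputs -> 4,096 hidden units -> 50,000 outputs.
  model = nn.Sequential(
      nn.Linear(1000, 4096),    # weight matrix of 4096 x 1000 values, plus 4,096 biases
      nn.ReLU(),
      nn.Linear(4096, 50000),   # weight matrix of 50000 x 4096 values, plus 50,000 biases
  )

  # Every weight and every bias is one parameter: one learned number.
  total = sum(p.numel() for p in model.parameters())
  print(f"{total:,}")  # 208,950,096 -- already about 0.2 billion for this toy model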

What do parameters actually store?

This is subtle. Parameters don't store facts like a database stores records. You can't point to a parameter and say "this one knows that Paris is the capital of France."

Instead, knowledge is distributed across parameters. Patterns about language, facts about the world, reasoning heuristics: all encoded as statistical relationships among billions of numbers. The parameter values collectively create a function that maps inputs to outputs in useful ways.

Why do we need so many?

More parameters mean more capacity to store patterns. A small network with thousands of parameters can learn simple rules; a network with billions can learn subtle distinctions.

Consider what language requires:

  • Grammar rules and exceptions to those rules
  • Word meanings and how context shifts them
  • Facts about the world
  • Reasoning patterns
  • Style, tone, register
  • Multiple languages and their interactions

Encoding all of this requires enormous parameter counts. Each additional parameter is another degree of freedom the model can use to capture nuance.

How much space do parameters take?

Each parameter is stored as a number at some chosen precision, most often floating point. Common formats:

  • 32-bit (FP32): 4 bytes per parameter, full precision
  • 16-bit (FP16/BF16): 2 bytes per parameter, common for training
  • 8-bit (INT8): 1 byte per parameter, used for efficient inference
  • 4-bit: 0.5 bytes per parameter, aggressive compression

GPT-3's 175 billion parameters at 16-bit precision: about 350 gigabytes just for the weights. This is why running large models requires specialized hardware with substantial memory.
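
The arithmetic behind that figure fits in a few lines of Python. The parameter count is GPT-3's, and the byte sizes are the ones listed above; nothing else is assumed:

  PARAMS = 175_000_000_000  # GPT-3-scale parameter count

  BYTES_PER_PARAM = {
      "FP32":      4.0,   # full precision
      "FP16/BF16": 2.0,   # half precision, common for training
      "INT8":      1.0,   # 8-bit quantization for inference
      "4-bit":     0.5,   # aggressive compression
  }

  for fmt, nbytes in BYTES_PER_PARAM.items():
      print(f"{fmt:>10}: {PARAMS * nbytes / 1e9:,.0f} GB for the weights alone")

  # FP32: 700 GB, FP16/BF16: 350 GB, INT8: 175 GB, 4-bit: 88 GB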

What happens to parameters during training?

Before training, parameters are initialized randomly (with some careful choices about the random distribution). The model outputs nonsense.
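
What "careful choices" means in practice: schemes such as Xavier/Glorot initialization scale the random values by the layer's width, so signals neither explode nor vanish as they pass through many layers. A small sketch, again assuming PyTorch:

  import torch.nn as nn

  layer = nn.Linear(4096, 4096)

  # Careless: unit-variance random weights make activations grow layer after layer.
  nn.init.normal_(layer.weight, mean=0.0, std=1.0)

  # Careful: Xavier/Glorot scales the spread by layer width (about 1/sqrt(4096) here),
  # keeping signals in a stable range so training can get off the ground.
  nn.init.xavier_normal_(layer.weight)
  nn.init.zeros_(layer.bias)

  print(layer.weight.std())  # roughly 0.016 instead of 1.0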

Training iteratively adjusts parameters to reduce prediction error:

  1. A batch of examples flows through the network (forward pass)
  2. The outputs are compared to targets, producing a loss value
  3. Gradients are computed showing how each parameter affects the loss (backward pass)
  4. Each parameter is nudged slightly in the direction that reduces loss
  5. Repeat, batch after batch, over hundreds of billions of training tokens

By the end, the random initial values have been sculpted into a configuration that captures something about language.
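
As code, that loop looks roughly like this. It is a deliberately tiny sketch, assuming PyTorch and feeding random stand-in data rather than real text; the five numbered steps above are marked in the comments:

  import torch
  import torch.nn as nn

  # Toy stand-in for a language model: maps a context vector to scores over 50,000 tokens.
  model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 50000))
  loss_fn = nn.CrossEntropyLoss()
  optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

  for step in range(1000):                      # step 5: repeat (real training runs far longer)
      contexts = torch.randn(32, 512)           # a batch of 32 made-up examples
      targets = torch.randint(0, 50000, (32,))  # the "correct" next token for each example

      logits = model(contexts)                  # step 1: forward pass
      loss = loss_fn(logits, targets)           # step 2: compare outputs to targets -> loss

      optimizer.zero_grad()
      loss.backward()                           # step 3: backward pass, a gradient per parameter
      optimizer.step()                          # step 4: nudge every parameter to reduce the loss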

The parameter mystery

We can count parameters. We can measure what models do. What we can't easily do is understand how specific parameters contribute to specific behaviors.

This is the interpretability challenge. 175 billion numbers, all contributing fractionally to every output. Which parameters encode grammar? Which encode facts about history? The question may not even be well-formed: knowledge is distributed so diffusely that no parameter "knows" anything individually.

Understanding this distributed, high-dimensional encoding is one of the open frontiers in AI research.

Sources & Further Reading

📄 Paper
Language Models are Few-Shot Learners
Brown et al. · OpenAI · 2020
📄 Paper
Training Compute-Optimal Large Language Models
Hoffmann et al. · DeepMind · 2022
🎬 Video
But what is a GPT?
3Blue1Brown · 2024