What is a neural network?

Neural networks are computing systems loosely inspired by biological brains. They learn patterns from data by adjusting millions, or even billions, of numerical connections.

What's actually inside an AI model?

A neural network is a mathematical function built from simple, repeated building blocks. Each block takes some numbers in, does basic math, and passes numbers out. Stack thousands of these blocks in careful arrangements, and something remarkable happens: the system can learn.

The "neural" part comes from a loose analogy to brain neurons. But don't take it too literally. These are mathematical operations, not biological cells.

The simplest case: the perceptron

To understand neural networks, start with the simplest possible one: a single unit called a perceptron.

A perceptron takes multiple inputs (numbers), multiplies each by a weight (another number), adds them up, and outputs a result. That's it. Multiply, add, output.

inputs:  [x₁, x₂, x₃]
weights: [w₁, w₂, w₃]
output:  w₁·x₁ + w₂·x₂ + w₃·x₃ + bias
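The same multiply-add-output in runnable form (a minimal sketch; the step activation and the example numbers are illustrative additions, not from the text):

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of inputs and weights plus bias, then a step activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0  # step function: output class 0 or 1

# Two inputs, both 1.0, weights of 0.5 each, bias of -0.8:
# (1.0 * 0.5) + (1.0 * 0.5) + (-0.8) = 0.2 > 0, so the output is 1
print(perceptron([1.0, 1.0], [0.5, 0.5], -0.8))  # 1
```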
[Interactive perceptron demo: sliders for inputs x₁ and x₂, weights, and bias, with points labeled Class 0 and Class 1 separated by a decision boundary. Example computation: (1.0 × 0.50) + (1.0 × 0.50) + (-0.8) = 0.20; the step function maps the weighted sum 0.200 to 1.000, for 100% accuracy.]

Try it: Adjust the weights and bias to move the decision boundary. Can you get 100% accuracy on the sample points? Notice how the boundary is always a straight line; that's the limitation of a single perceptron.

The magic is in the weights. By adjusting them, the perceptron can learn to make different decisions. High weight on an input means "pay attention to this." Low or negative weight means "ignore or invert this."

From one to many: layers

A single perceptron can only learn simple patterns (technically, linear separations). But stack them into layers, where the outputs of one layer become the inputs to the next, and the network can learn complex patterns.

  • Input layer: Your raw data (pixels of an image, tokens of text)
  • Hidden layers: Middle layers that transform and combine features
  • Output layer: The final answer (a classification, a probability distribution over next tokens)
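A forward pass through stacked layers can be sketched in a few lines. The layer sizes, random initialization, and ReLU activation below are illustrative choices, not details from the text:

```python
import random

random.seed(0)

def make_layer(n_inputs, n_outputs):
    """A layer is a weight matrix plus a bias vector, here randomly initialized."""
    weights = [[random.gauss(0, 0.5) for _ in range(n_inputs)] for _ in range(n_outputs)]
    biases = [0.0] * n_outputs
    return weights, biases

def forward(layer, inputs, activation):
    """Each unit computes a weighted sum plus bias, then an activation."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

relu = lambda z: max(0.0, z)   # common hidden-layer activation
identity = lambda z: z          # raw scores at the output

# Input (3 features) -> hidden layer (4 units) -> output layer (2 units)
hidden = make_layer(3, 4)
output = make_layer(4, 2)

x = [0.5, -1.2, 3.0]             # raw data (the input layer)
h = forward(hidden, x, relu)     # transformed features
y = forward(output, h, identity) # the final answer, e.g. two class scores
print(len(h), len(y))  # 4 2
```

The outputs of one layer literally become the inputs to the next, which is all "stacking" means here.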

Each layer extracts more abstract features. In an image network, early layers might detect edges, middle layers might detect shapes, and later layers might detect faces. Nobody programs these features; they emerge from training.

What makes them learn?

A neural network starts with random weights. It makes terrible predictions. Then training begins:

  1. Show the network an example
  2. Compare its output to the correct answer
  3. Calculate how wrong it was (the "loss")
  4. Adjust the weights slightly to be less wrong
  5. Repeat millions of times

This process is called gradient descent. "Gradient" refers to the mathematical slope that tells you which direction to adjust each weight. "Descent" because you're descending toward lower error.
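The five steps above can be sketched for the smallest possible case, a model with one weight fitting y = w·x (the data and learning rate here are illustrative):

```python
# Fit y = w * x by gradient descent, mirroring the five training steps.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true relationship: y = 2x
w = 0.0     # start with an uninformed weight
lr = 0.05   # learning rate: how far to "descend" each step

for _ in range(200):                    # 5. repeat many times
    for x, target in data:              # 1. show the network an example
        pred = w * x                    # 2. compute its output
        loss = (pred - target) ** 2     # 3. how wrong it was (squared error)
        grad = 2 * (pred - target) * x  #    the gradient: slope of loss w.r.t. w
        w -= lr * grad                  # 4. adjust the weight to be less wrong

print(round(w, 3))  # 2.0
```

A real network does exactly this, but for billions of weights at once, with the gradients computed by backpropagation.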

How big are these networks?

Size varies enormously:

  • A perceptron: 10-100 weights
  • A simple image classifier: millions of weights
  • GPT-3: 175 billion weights
  • Frontier models (GPT-5.1): 1-2+ trillion weights

Each weight is a number, typically stored as 16 or 32 bits. GPT-3's weights alone take about 350 gigabytes to store. Running the network requires loading these weights and performing matrix multiplications across them.
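The storage figure is easy to check (assuming the 16-bit, i.e. 2-byte, case mentioned above):

```python
params = 175_000_000_000  # GPT-3's weight count
bytes_per_weight = 2      # 16 bits = 2 bytes per weight
total_gb = params * bytes_per_weight / 1e9
print(total_gb)  # 350.0 gigabytes
```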

Why does this work at all?

Neural networks exploit a mathematical property: sufficiently large networks can approximate any function. This is the "universal approximation theorem." Give a network enough units and it can, in principle, learn any input-output mapping.

But "can in principle" doesn't mean "will in practice." The genius is in architectures (how you arrange the layers), training procedures (how you adjust weights), and data (what examples you show). These determine whether a network actually learns something useful.

Neural networks are not magic. They're math. But math that, stacked deep enough and trained on enough data, produces capabilities that continually surprise us.

Sources & Further Reading

🎬 Video
But what is a neural network?
3Blue1Brown · 2017