How do tokens become numbers?

Embeddings convert tokens into dense vectors of numbers. Similar meanings end up at similar positions in this high-dimensional space.

Neural networks work with numbers. Tokens are still just symbols. How do you bridge that gap?

When a token enters a neural network, it's immediately converted into a list of numbers called an embedding. The token "cat" might become something like [0.2, -0.5, 0.8, 0.1, ...] extending to hundreds or thousands of dimensions.

This isn't arbitrary. The values in each embedding are learned during training. Tokens with similar meanings end up with similar number patterns. The geometry of this number space captures something about meaning.
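A minimal sketch of that lookup, using a toy vocabulary and invented numbers (real models learn these values during training and use hundreds or thousands of dimensions):

```python
import numpy as np

# Toy vocabulary: token -> integer id (real tokenizers have tens of thousands of entries)
vocab = {"the": 0, "cat": 1, "sat": 2, "dog": 3}

# Embedding matrix: one row per token, one column per dimension.
# These values are invented for illustration; a trained model learns them.
embedding_matrix = np.array([
    [0.1,   0.0,  -0.2, 0.3],   # "the"
    [0.2,  -0.5,   0.8, 0.1],   # "cat"
    [-0.4,  0.3,   0.0, 0.2],   # "sat"
    [0.25, -0.45,  0.7, 0.0],   # "dog" (deliberately close to "cat")
])

def embed(token: str) -> np.ndarray:
    """Converting a token into numbers is just a row lookup."""
    return embedding_matrix[vocab[token]]

print(embed("cat"))  # [ 0.2 -0.5  0.8  0.1]
```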

What does the embedding space look like?

Each dimension in an embedding is like a coordinate axis. Two dimensions give you a flat plane. Three give you a 3D space. Modern embeddings have 768, 1536, or even more dimensions. You can't visualize this directly, but the math works the same way.

In this space, tokens aren't scattered randomly. "King" and "queen" are near each other. "King" and "banana" are far apart. "Paris" and "France" are close. The distances and directions encode relationships.
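Closeness in this space is usually measured with cosine similarity: the cosine of the angle between two vectors. A quick sketch with invented low-dimensional vectors (real embeddings come from a trained model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means pointing the same way, 0.0 means unrelated, -1.0 means opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 4-dimensional vectors, purely for illustration.
king   = np.array([0.8, 0.6, 0.1, 0.2])
queen  = np.array([0.7, 0.7, 0.1, 0.3])
banana = np.array([-0.3, 0.1, 0.9, -0.5])

print(cosine_similarity(king, queen))   # high: close in the space
print(cosine_similarity(king, banana))  # low: far apart
```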

The famous example: king - man + woman = queen

This arithmetic actually works (approximately) in embedding space.

Take the vector for "king." Subtract the vector for "man." Add the vector for "woman." The result is a vector very close to "queen."
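You can try this with pretrained vectors. A sketch using gensim's downloader API (the specific model name is an assumption; any word2vec- or GloVe-style vectors will do, and the first run downloads them):

```python
import gensim.downloader as api

# Downloads pretrained GloVe vectors on first use (model name is an assumption).
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman = ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)
# The top result is typically "queen", with a similarity score attached.
```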

What does this mean? The direction from "man" to "woman" captures something about gender. Apply that same direction to "king" and you get the corresponding gendered royal: "queen."

This isn't programmed. It emerges from training on text where these words appear in analogous contexts.

Dimensions don't have obvious meanings

You might hope that dimension 47 means "animate" and dimension 128 means "positive sentiment." It's not that clean.

Each dimension captures some statistical pattern, but these patterns rarely map to human-interpretable concepts. The meaning is distributed across many dimensions. "Animate" might be a direction in the space (a combination of many dimensions) rather than a single axis.
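One way to see this: build a direction out of two word vectors and project other words onto it. A sketch reusing the pretrained vectors from above (same assumed model):

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # same assumed model as above

# A concept as a direction: the difference between two word vectors.
# This direction mixes many dimensions rather than living on a single axis.
gender_direction = vectors["woman"] - vectors["man"]

def project(word: str) -> float:
    """How strongly a word's vector points along the direction."""
    return float(np.dot(vectors[word], gender_direction) / np.linalg.norm(gender_direction))

for word in ["queen", "king", "aunt", "uncle"]:
    print(word, round(project(word), 3))
# Expect "queen" and "aunt" to score higher along this direction than "king" and "uncle".
```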

Every model has its own embedding space

Different models learn different embeddings. GPT-5.1's embedding for "cat" is a completely different list of numbers than BERT's embedding for "cat." The spaces aren't compatible.

This matters when building applications. If you embed text with one model and try to search with another, the geometry doesn't match. You need to use the same embedding model consistently.

Some embedding models are trained specifically for retrieval: they optimize for meaningful similarity rather than text generation. These are what power semantic search.
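A minimal semantic-search sketch with the sentence-transformers library (the model name is an assumption; the important part is that the same model embeds both the documents and the query):

```python
from sentence_transformers import SentenceTransformer, util

# Retrieval-tuned embedding model (name is an assumption; downloads on first use).
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to change the oil in your car",
    "Best recipes for banana bread",
    "Automobile maintenance schedules explained",
]
doc_embeddings = model.encode(documents)

# Embed the query with the SAME model, then compare by cosine similarity.
query_embedding = model.encode("car repair tips")
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # the automobile-related documents should score highest
```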

๐ŸŒ
3D Embedding Explorer
Navigate through embedding space and see how similar words cluster together

Why embeddings matter

Embeddings are where symbols meet geometry. They're why neural networks can process language at all: they convert discrete tokens into a continuous space where similarity, analogy, and relationship become mathematical operations.

When you search for similar documents, you're comparing embeddings. When a model understands that your question about "automobiles" is related to "cars," embeddings make that connection. The entire edifice of modern language AI rests on this conversion of words to meaningful coordinates.

Sources & Further Reading

๐Ÿ“„ Paper
Efficient Estimation of Word Representations in Vector Space
Mikolov et al. ยท Google ยท 2013
๐ŸŽฌ Video
๐Ÿ“– Docs