How do tokens become numbers?

Embeddings convert tokens into dense vectors of numbers. Similar meanings end up at similar positions in this high-dimensional space.

Neural networks work with numbers. Tokens are still just symbols. How do you bridge that gap?

When a token enters a neural network, it's immediately converted into a list of numbers called an embedding. The token "cat" might become something like [0.2, -0.5, 0.8, 0.1, ...] extending to hundreds or thousands of dimensions.

This isn't arbitrary. The values in each embedding are learned during training. Tokens with similar meanings end up with similar number patterns. The geometry of this number space captures something about meaning.
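A minimal sketch of that lookup, using a toy vocabulary and invented numbers (real models learn these values during training and use hundreds or thousands of dimensions):

```python
import numpy as np

# Toy vocabulary: token -> integer id (real tokenizers have tens of thousands of entries)
vocab = {"the": 0, "cat": 1, "sat": 2, "dog": 3}

# Embedding matrix: one row per token, one column per dimension.
# These values are invented for illustration; a trained model learns them.
embedding_matrix = np.array([
    [0.1,   0.0,  -0.2, 0.3],   # "the"
    [0.2,  -0.5,   0.8, 0.1],   # "cat"
    [-0.4,  0.3,   0.0, 0.2],   # "sat"
    [0.25, -0.45,  0.7, 0.0],   # "dog" (deliberately close to "cat")
])

def embed(token: str) -> np.ndarray:
    """Converting a token into numbers is just a row lookup."""
    return embedding_matrix[vocab[token]]

print(embed("cat"))  # [ 0.2 -0.5  0.8  0.1]
```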

What does the embedding space look like?

Each dimension in an embedding is like a coordinate axis. Two dimensions give you a flat plane. Three give you a 3D space. Modern embeddings have 768, 1536, or even more dimensions. You can't visualize this directly, but the math works the same way.

In this space, tokens aren't scattered randomly. "King" and "queen" are near each other. "King" and "banana" are far apart. "Paris" and "France" are close. The distances and directions encode relationships.
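Closeness in this space is usually measured with cosine similarity: the cosine of the angle between two vectors. A quick sketch with invented low-dimensional vectors (real embeddings come from a trained model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means pointing the same way, 0.0 means unrelated, -1.0 means opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 4-dimensional vectors, purely for illustration.
king   = np.array([0.8, 0.6, 0.1, 0.2])
queen  = np.array([0.7, 0.7, 0.1, 0.3])
banana = np.array([-0.3, 0.1, 0.9, -0.5])

print(cosine_similarity(king, queen))   # high: close in the space
print(cosine_similarity(king, banana))  # low: far apart
```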

The famous example: king - man + woman = queen

This arithmetic actually works (approximately) in embedding space.

Take the vector for "king." Subtract the vector for "man." Add the vector for "woman." The result is a vector very close to "queen."
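You can try this with pretrained vectors. A sketch using gensim's downloader API (the specific model name is an assumption; any word2vec- or GloVe-style vectors will do, and the first run downloads them):

```python
import gensim.downloader as api

# Downloads pretrained GloVe vectors on first use (model name is an assumption).
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman = ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)
# The top result is typically "queen", with a similarity score attached.
```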

What does this mean? The direction from "man" to "woman" captures something about gender. Apply that same direction to "king" and you get the corresponding gendered royal: "queen."

This isn't programmed. It emerges from training on text where these words appear in analogous contexts.

Dimensions don't have obvious meanings

You might hope that dimension 47 means "animate" and dimension 128 means "positive sentiment." It's not that clean.

Each dimension captures some statistical pattern, but these patterns rarely map to human-interpretable concepts. The meaning is distributed across many dimensions. "Animate" might be a direction in the space (a combination of many dimensions) rather than a single axis.
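One way to see this: build a direction out of two word vectors and project other words onto it. A sketch reusing the pretrained vectors from above (same assumed model):

```python
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # same assumed model as above

# A concept as a direction: the difference between two word vectors.
# This direction mixes many dimensions rather than living on a single axis.
gender_direction = vectors["woman"] - vectors["man"]

def project(word: str) -> float:
    """How strongly a word's vector points along the direction."""
    return float(np.dot(vectors[word], gender_direction) / np.linalg.norm(gender_direction))

for word in ["queen", "king", "aunt", "uncle"]:
    print(word, round(project(word), 3))
# Expect "queen" and "aunt" to score higher along this direction than "king" and "uncle".
```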

Every model has its own embedding space

Different models learn different embeddings. GPT-5.1's embedding for "cat" is a completely different list of numbers than BERT's embedding for "cat." The spaces aren't compatible.

This matters when building applications. If you embed text with one model and try to search with another, the geometry doesn't match. You need to use the same embedding model consistently.

Some embedding models are trained specifically for retrieval: they optimize for meaningful similarity rather than text generation. These are what power semantic search.
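A minimal semantic-search sketch with the sentence-transformers library (the model name is an assumption; the important part is that the same model embeds both the documents and the query):

```python
from sentence_transformers import SentenceTransformer, util

# Retrieval-tuned embedding model (name is an assumption; downloads on first use).
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to change the oil in your car",
    "Best recipes for banana bread",
    "Automobile maintenance schedules explained",
]
doc_embeddings = model.encode(documents)

# Embed the query with the SAME model, then compare by cosine similarity.
query_embedding = model.encode("car repair tips")
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # the automobile-related documents should score highest
```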

๐ŸŒ
3D Embedding Explorer
Navigate through embedding space and see how similar words cluster together

Why embeddings matter

Embeddings are where symbols meet geometry. They're why neural networks can process language at all: they convert discrete tokens into a continuous space where similarity, analogy, and relationship become mathematical operations.

When you search for similar documents, you're comparing embeddings. When a model understands that your question about "automobiles" is related to "cars," embeddings make that connection. The entire edifice of modern language AI rests on this conversion of words to meaningful coordinates.

Sources & Further Reading

๐Ÿ“„ Paper
Efficient Estimation of Word Representations in Vector Space
Mikolov et al. ยท Google ยท 2013
๐ŸŽฌ Video
๐Ÿ“– Docs