LLMs read tokens, not letters or words. Tokens are chunks of text that the model has learned to recognize.
When you type "Hello, how are you?" what does the model actually see?
Not letters. Not exactly words. The model sees tokens: chunks of text that might be whole words, parts of words, or individual characters.
Before your message reaches the neural network, a tokenizer chops it up. "Hello" might become one token. "Tokenization" might become "Token" + "ization". An unusual word like "cryptographic" might split into "crypt" + "ographic" or even smaller pieces.
This chunking matters because neural networks work with numbers, not text. Each token maps to a numeric ID in a vocabulary. Token 9906 might be "Hello". Token 30 might be "?". These numbers are what the model actually processes.
Example: GPT-4 tokenization of "Hello, how are you?"
"Hello" → 9906, "," → 11, " how" → 1268, " are" → 527, " you" → 499, "?" → 30
That's 6 tokens in total. Notice that the spaces attach to the word that follows them: " how", " are", " you".
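If you want to reproduce this yourself, here's a minimal sketch using the open-source tiktoken library (the cl100k_base encoding it exposes is the one GPT-4 uses, which is where the IDs above come from):

```python
# Minimal sketch: encode a string with tiktoken (pip install tiktoken).
# cl100k_base is the encoding used by GPT-4; other models use different
# vocabularies, so the IDs will differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, how are you?"
ids = enc.encode(text)                 # [9906, 11, 1268, 527, 499, 30]

for token_id in ids:
    piece = enc.decode([token_id])     # turn one ID back into its text chunk
    print(repr(piece), "->", token_id)

print(len(ids), "tokens")              # 6
```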
Why not just use words?
Words seem like the obvious choice, but they create problems. English alone has hundreds of thousands of words. Add names, technical terms, foreign words, typos, and internet slang, and you'd need millions of vocabulary entries.
Worse: any word not in your vocabulary becomes impossible to process. A word-level model trained before "ChatGPT" existed would have no way to represent it at all.
Tokens solve this elegantly. A typical vocabulary has 50,000-100,000 tokens (newer models like GPT-4o use ~200,000). Common words like "the" and "and" get their own tokens. Rare or new words get assembled from pieces. "ChatGPT" might become "Chat" + "G" + "PT". The model never encounters a word it can't represent.
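A quick way to see this assembly in action (again with tiktoken's cl100k_base; the exact splits are tokenizer-specific, so treat the output as illustrative):

```python
# Illustrative sketch: compare how common and rare words tokenize.
# Splits are specific to the cl100k_base vocabulary; other tokenizers
# will cut the same words differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "and", "ChatGPT", "cryptographic", "Tokenization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word:15} -> {pieces}  ({len(ids)} tokens)")

# Common words come back as a single piece; "ChatGPT" comes back as
# something like ['Chat', 'G', 'PT'].
```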
What does this mean for how the model thinks?
Tokenization shapes what the model can easily perceive. Common English words are single tokens. The model has learned rich associations for these during training. But split a word into pieces and the model must reconstruct its meaning from fragments, each learned in potentially different contexts.
The model sees your text much the way you read: not letter by letter, but in familiar shapes and chunks. You don't inspect individual letters; you recognize words and word-parts at a glance, slowing down only for unfamiliar terms. The model does the same with tokens: common words are single, familiar units, while rare words fragment into pieces whose meaning must be reassembled.
Try it: paste common words vs. technical jargon into a tokenizer. Notice how frequently used words become single tokens while rare words get split into pieces.
How many tokens is my conversation?
A rough rule: one token averages about 4 characters of English, or roughly three-quarters of a word. Flip that around and a 100-word paragraph is typically 130-140 tokens, and a 1,000-word essay around 1,300-1,400 tokens.
LLMs have a context window: a maximum number of tokens they can consider at once. Exceed it and older content gets forgotten. Token efficiency directly affects how much context the model can use.
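When the rule of thumb isn't precise enough, counting is cheap. Here's a sketch; the count_tokens helper and the 8,192-token window are illustrative choices, not any provider's official API:

```python
# Sketch: estimate how much of a context window a conversation uses.
# count_tokens is a hypothetical helper; real chat APIs add a few
# formatting tokens per message, so treat the result as an estimate.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages: list[str]) -> int:
    return sum(len(enc.encode(m)) for m in messages)

conversation = [
    "You are a helpful assistant.",
    "Explain tokenization in one short paragraph.",
]

used = count_tokens(conversation)
window = 8192                       # e.g. the original GPT-4 context window
print(f"{used} tokens used, {window - used} left in the window")
```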
When the model behaves unexpectedly with unusual words, names, or non-English text, tokenization is often the underlying cause. The model isn't being stupid; it's working with a different view of your input than you have.