How do LLMs see images?
Multimodal models process images, audio, and video alongside text. They convert everything into tokens, allowing the same architecture to reason across modalities.
How can a language model understand pictures?
Early LLMs processed only text. Modern multimodal models handle images, audio, video, and more. The trick is to convert everything into the format the model already understands: tokens and embeddings.
An image becomes a sequence of visual tokens. Audio becomes a sequence of audio tokens. The transformer processes them just like text, learning relationships between modalities.
Images as tokens
Text is tokenized into words and subwords. Images are tokenized into patches.
An image is divided into a grid of patches (say, 16×16 pixels each). Each patch becomes a token with its own embedding. A 256×256 image split into 16×16 patches produces 256 visual tokens.
These visual tokens enter the same transformer that processes text tokens. The model learns to connect "dog" (text) with patches showing a dog (image).
Image: [patch_1] [patch_2] [patch_3] ... [patch_256]
Text: [What] [animal] [is] [this] [?]
Combined: [patch_1] ... [patch_256] [What] [animal] [is] [this] [?]
The model attends across all tokens, finding which image patches are relevant to which words.
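The patch-to-token step can be sketched in a few lines of numpy. This is a minimal illustration, not any particular model's implementation; it just splits an image into non-overlapping patches and flattens each one into a vector, assuming the image side length is divisible by the patch size.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch vectors."""
    h, w, c = image.shape
    # Carve the image into a (grid_h, patch, grid_w, patch, C) block structure,
    # then reorder so each patch's pixels are contiguous.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch: patch_size * patch_size * channels values.
    return patches.reshape(-1, patch_size * patch_size * c)

image = np.zeros((256, 256, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (256, 768): 256 patch tokens, each 16*16*3 = 768 values
```

In a real model, each 768-dimensional patch vector would then be linearly projected into the transformer's embedding dimension, with a position embedding added.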
Vision encoders
Converting images to useful tokens requires a vision encoder, a neural network that understands visual content.
Common approaches:
CLIP-style: Train an image encoder and text encoder together. Images and their descriptions end up in the same embedding space. "A photo of a cat" and an actual cat photo have similar vectors.
ViT (Vision Transformer): Treat image patches like text tokens. Apply transformer architecture directly to images. Pre-train on massive image datasets.
Connector modules: Bridge between frozen vision encoders and language models. Map visual representations into the LLM's token space.
The vision encoder does the heavy lifting of understanding pixels. The LLM does reasoning about what it sees.
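The connector idea above often reduces to a learned projection between embedding spaces. The sketch below illustrates the shape arithmetic only; the dimensions (1024 for the vision encoder, 4096 for the LLM) and the random weights are assumptions for illustration, not taken from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

vision_dim, llm_dim = 1024, 4096  # illustrative sizes, not from a real model

# Output of a frozen vision encoder: one feature vector per image patch.
patch_features = rng.normal(size=(256, vision_dim))

# The connector: a learned projection into the LLM's embedding space.
# (Real connectors range from a single linear layer to small MLPs.)
W = rng.normal(size=(vision_dim, llm_dim)) * 0.01

visual_tokens = patch_features @ W  # now shaped like text token embeddings
print(visual_tokens.shape)  # (256, 4096)
```

Once projected, these 256 vectors can be concatenated with text embeddings and fed through the language model unchanged.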
Audio and speech
Audio works similarly:
- Waveform to spectrogram: Convert raw audio to frequency representations
- Chunk into frames: Divide into short time segments
- Encode to tokens: Neural encoder produces audio tokens
- Process with LLM: Same transformer handles audio alongside text
This enables:
- Speech transcription (audio → text)
- Voice understanding (audio + text → response)
- Audio generation (text → audio)
Models like Whisper specialize in audio-to-text. GPT-4o processes audio natively alongside text and images.
Training multimodal models
Training combines data from multiple modalities:
- Image-text pairs: photos with captions, diagrams with explanations
- Interleaved documents: web pages with images and text mixed
- Instruction data: "What's in this image?" → description
- Cross-modal tasks: describe images, answer questions about charts, follow visual instructions
The model learns to connect modalities: that "red" in text corresponds to certain pixel values, that a graph line going up means "increasing."
What multimodal enables
With multiple modalities:
- Document understanding: Read PDFs with charts, diagrams, and text together
- Visual reasoning: Analyze images, answer questions about what's shown
- Creative work: Understand and generate visual content
- Robotics and embodiment: Process camera feeds, sensor data
- Accessibility: Describe images for visually impaired users
The same reasoning capabilities that work for text now work for any tokenizable input.
Current limitations
Multimodal models aren't perfect:
- Spatial reasoning: Struggle with precise locations, counts, small details
- OCR errors: May misread text in images
- Hallucination: Can "see" things that aren't there
- Resolution limits: May miss fine details due to patch size
- Compute cost: Visual tokens are expensive; long videos are prohibitive
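A back-of-envelope calculation shows why video is so costly. The figures here are assumptions for illustration (256 visual tokens per frame, sampling at just one frame per second), not measurements from any particular model.

```python
tokens_per_frame = 256  # assumed, e.g. a 16x16 patch grid per frame
fps = 1                 # even a very sparse 1 frame per second
minutes = 10

frames = fps * 60 * minutes
total = frames * tokens_per_frame
print(total)  # 153600 visual tokens for a 10-minute clip
```

That is already far beyond many context windows, before any text tokens are added, which is why long-video understanding typically relies on aggressive frame sampling or token compression.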
Progress is rapid, but modalities beyond text remain harder. Text is the native language; other modalities are learned translations.