How do LLMs see images?
Multimodal models process images, audio, and video alongside text. They convert everything into tokens, allowing the same architecture to reason across modalities.
How can a language model understand pictures?
Early LLMs processed only text. Modern multimodal models handle images, audio, video, and more. The trick is to convert everything into the format the model already understands: tokens and embeddings.
An image becomes a sequence of visual tokens. Audio becomes a sequence of audio tokens. The transformer processes them just like text, learning relationships between modalities.
Images as tokens
Text is tokenized into words and subwords. Images are tokenized into patches.
An image is divided into a grid of patches (say, 16×16 pixels each). Each patch becomes a token with its own embedding. A 256×256 image split into 16×16 patches produces 256 visual tokens.
These visual tokens enter the same transformer that processes text tokens. The model learns to connect "dog" (text) with patches showing a dog (image).
Image: [patch_1] [patch_2] [patch_3] ... [patch_256]
Text: [What] [animal] [is] [this] [?]
Combined: [patch_1] ... [patch_256] [What] [animal] [is] [this] [?]
The model attends across all tokens, finding which image patches are relevant to which words.
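The patch-to-token step can be sketched in a few lines of numpy. This is a minimal illustration, not any particular model's implementation; it just splits an image into non-overlapping patches and flattens each one into a vector, assuming the image side length is divisible by the patch size.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch vectors."""
    h, w, c = image.shape
    # Carve the image into a (grid_h, patch, grid_w, patch, C) block structure,
    # then reorder so each patch's pixels are contiguous.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch: patch_size * patch_size * channels values.
    return patches.reshape(-1, patch_size * patch_size * c)

image = np.zeros((256, 256, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (256, 768): 256 patch tokens, each 16*16*3 = 768 values
```

In a real model, each 768-dimensional patch vector would then be linearly projected into the transformer's embedding dimension, with a position embedding added.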
Vision encoders
Converting images to useful tokens requires a vision encoder, a neural network that understands visual content.
Common approaches:
CLIP-style: Train an image encoder and text encoder together. Images and their descriptions end up in the same embedding space. "A photo of a cat" and an actual cat photo have similar vectors.
ViT (Vision Transformer): Treat image patches like text tokens. Apply transformer architecture directly to images. Pre-train on massive image datasets.
Connector modules: Bridge between frozen vision encoders and language models. Map visual representations into the LLM's token space.
The vision encoder does the heavy lifting of understanding pixels. The LLM does reasoning about what it sees.
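The connector idea above often reduces to a learned projection between embedding spaces. The sketch below illustrates the shape arithmetic only; the dimensions (1024 for the vision encoder, 4096 for the LLM) and the random weights are assumptions for illustration, not taken from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

vision_dim, llm_dim = 1024, 4096  # illustrative sizes, not from a real model

# Output of a frozen vision encoder: one feature vector per image patch.
patch_features = rng.normal(size=(256, vision_dim))

# The connector: a learned projection into the LLM's embedding space.
# (Real connectors range from a single linear layer to small MLPs.)
W = rng.normal(size=(vision_dim, llm_dim)) * 0.01

visual_tokens = patch_features @ W  # now shaped like text token embeddings
print(visual_tokens.shape)  # (256, 4096)
```

Once projected, these 256 vectors can be concatenated with text embeddings and fed through the language model unchanged.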
Audio and speech
Audio works similarly:
- Waveform to spectrogram: Convert raw audio to frequency representations
- Chunk into frames: Divide into short time segments
- Encode to tokens: Neural encoder produces audio tokens
- Process with LLM: Same transformer handles audio alongside text
This enables:
- Speech transcription (audio → text)
- Voice understanding (audio + text → response)
- Audio generation (text → audio)
Models like Whisper specialize in audio-to-text. GPT-4o processes audio natively alongside text and images.
Training multimodal models
Training combines data from multiple modalities:
- Image-text pairs: photos with captions, diagrams with explanations
- Interleaved documents: web pages with images and text mixed
- Instruction data: "What's in this image?" → description
- Cross-modal tasks: describe images, answer questions about charts, follow visual instructions
The model learns to connect modalities: that "red" in text corresponds to certain pixel values, that a graph line going up means "increasing."
What multimodal enables
With multiple modalities:
- Document understanding: Read PDFs with charts, diagrams, and text together
- Visual reasoning: Analyze images, answer questions about what's shown
- Creative work: Understand and generate visual content
- Robotics and embodiment: Process camera feeds, sensor data
- Accessibility: Describe images for visually impaired users
The same reasoning capabilities that work for text now work for any tokenizable input.
Current limitations
Multimodal models aren't perfect:
- Spatial reasoning: Struggle with precise locations, counts, small details
- OCR errors: May misread text in images
- Hallucination: Can "see" things that aren't there
- Resolution limits: May miss fine details due to patch size
- Compute cost: Visual tokens are expensive; long videos are prohibitive
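A back-of-envelope calculation shows why video is so costly. The figures here are assumptions for illustration (256 visual tokens per frame, sampling at just one frame per second), not measurements from any particular model.

```python
tokens_per_frame = 256  # assumed, e.g. a 16x16 patch grid per frame
fps = 1                 # even a very sparse 1 frame per second
minutes = 10

frames = fps * 60 * minutes
total = frames * tokens_per_frame
print(total)  # 153600 visual tokens for a 10-minute clip
```

That is already far beyond many context windows, before any text tokens are added, which is why long-video understanding typically relies on aggressive frame sampling or token compression.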
Progress is rapid, but modalities beyond text remain harder. Text is the native language; other modalities are learned translations.