How do LLMs reason?

LLMs reason better when they "think out loud." Chain-of-thought prompting and reasoning models show that more compute at inference time improves complex problem-solving.

Why does "think step by step" make AI smarter?

Ask an LLM a complex question directly, and it might fail. Ask it to explain its reasoning step by step, and it often succeeds. The same model, the same knowledge, dramatically different results.

This is chain-of-thought (CoT) prompting: getting models to show their work. It's not a trick. It reflects something fundamental about how these models reason.

The problem with direct answers

LLMs generate token by token, left to right. Each token is based on what came before. When forced to answer immediately, the model must compress all reasoning into choosing the first token of its answer.

Q: If a train travels 60 mph for 2.5 hours, how far does it go?
A: 150 miles

The model had to "know" 150 immediately, with no scratch work.

For simple questions, this works. For complex reasoning, it's like solving calculus in your head without writing anything down. Possible sometimes; unreliable often.

Chain of thought: external working memory

When the model "thinks out loud," its own generated text becomes working memory:

Q: If a train travels 60 mph for 2.5 hours, how far does it go?

A: Let me work through this step by step.
- Speed: 60 miles per hour
- Time: 2.5 hours  
- Distance = Speed × Time
- Distance = 60 × 2.5
- 60 × 2 = 120
- 60 × 0.5 = 30
- 120 + 30 = 150
The train travels 150 miles.

Each step becomes tokens that the model can attend to when generating the next step. The reasoning is externalized, visible, checkable.
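This loop can be sketched in code. The model call is stubbed with canned steps here (a real system would call an LLM API), but the structure is the point: each generated step is appended to the context that conditions the next step.

```python
# Sketch of chain-of-thought as external working memory.
# `generate_step` is a hypothetical stand-in for one model call;
# it is stubbed with fixed steps so the loop is runnable.

STUBBED_STEPS = [
    "Speed: 60 miles per hour",
    "Time: 2.5 hours",
    "Distance = Speed x Time = 60 x 2.5",
    "60 x 2 = 120; 60 x 0.5 = 30; 120 + 30 = 150",
    "The train travels 150 miles.",
]

def generate_step(context: str, step_index: int) -> str:
    # A real model would condition on `context`; we return canned steps.
    return STUBBED_STEPS[step_index]

def chain_of_thought(question: str) -> str:
    context = f"Q: {question}\nA: Let me work through this step by step.\n"
    for i in range(len(STUBBED_STEPS)):
        step = generate_step(context, i)  # each step attends to all prior steps
        context += f"- {step}\n"          # generated text becomes working memory
    return context
```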

Why does this work?

Several factors:

Serial computation: Complex problems require multiple reasoning steps. CoT provides the "space" for those steps to happen sequentially.

Error correction: With visible intermediate steps, the model can notice and correct mistakes. Direct answers have no self-check opportunity.

Pattern activation: The format of step-by-step reasoning resembles training data showing problem-solving. It triggers helpful patterns.

Decomposition: Breaking problems into subproblems makes each step easier. "What's 60 × 2.5?" is harder than "What's 60 × 2? What's 60 × 0.5? What's their sum?"
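The decomposition factor is plain arithmetic and can be made concrete:

```python
# Decomposition in miniature: 60 x 2.5 is easier as two sub-products.
speed = 60
whole_hours_part = speed * 2       # 60 x 2 = 120
half_hour_part = speed * 0.5       # 60 x 0.5 = 30
distance = whole_hours_part + half_hour_part  # 120 + 30 = 150
```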

Reasoning models: thinking deeply

Recent models (like OpenAI's o1) take this further. Instead of just chain-of-thought in the visible output, they do extensive internal reasoning before responding.

The model might:

  • Explore multiple approaches
  • Verify its own conclusions
  • Reconsider and revise
  • Spend many more tokens thinking than answering

This "test-time compute" (computation spent during inference rather than training) significantly improves performance on hard problems.

Standard model: Think briefly → Answer
Reasoning model: Think at length → Verify → Revise → Think more → Answer
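The reasoning-model pattern above can be sketched as a propose/verify/revise loop. All three inner functions are hypothetical stand-ins (stubbed here so the code runs), not a real API; in practice each would be a model call consuming extra inference-time tokens.

```python
# Sketch of a think -> verify -> revise loop (test-time compute).
# `propose`, `verify`, and `revise` are hypothetical stubs.

def propose(question: str) -> float:
    return 140.0  # deliberately wrong first draft

def verify(question: str, answer: float) -> bool:
    # A real verifier would be another model call or a checker.
    return answer == 60 * 2.5

def revise(question: str, answer: float) -> float:
    return 60 * 2.5  # recompute from scratch

def reasoning_model(question: str, max_rounds: int = 3) -> float:
    answer = propose(question)
    for _ in range(max_rounds):      # spend extra compute before answering
        if verify(question, answer):
            break
        answer = revise(question, answer)
    return answer
```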

Techniques for better reasoning

Several approaches encourage reasoning:

Zero-shot CoT: Just add "Let's think step by step" to the prompt.

Few-shot CoT: Show examples of step-by-step reasoning before the question.

Self-consistency: Generate multiple reasoning chains, take the majority answer.

Tree of thought: Explore multiple reasoning paths, backtrack when stuck.

Verification: Ask the model to check its own answer; let it revise.

Decomposition: Break complex questions into simpler sub-questions.
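Of these, self-consistency is the simplest to show in code: sample several independent reasoning chains, then take the majority final answer. The chain sampler below is a stub with fixed outputs (a real one would call a model at nonzero temperature).

```python
from collections import Counter

# Self-consistency sketch. `sample_chain` is a hypothetical stand-in
# for running one full reasoning chain and extracting its final answer.

def sample_chain(question: str, seed: int) -> float:
    # Stub: most chains reach 150; every third one makes an arithmetic slip.
    return 150.0 if seed % 3 else 130.0

def self_consistency(question: str, n_samples: int = 5) -> float:
    answers = [sample_chain(question, s) for s in range(n_samples)]
    majority, _count = Counter(answers).most_common(1)[0]
    return majority
```

A single chain can be derailed by one early mistake; majority voting over independent chains washes out uncorrelated errors.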

Limits of reasoning

Chain-of-thought isn't magic:

  • Token cost: Extensive reasoning uses many tokens, increasing cost and latency
  • Faithfulness: Stated reasoning may not reflect actual "thinking"; models can confabulate explanations
  • Ceiling: More thinking time helps, but doesn't grant new capabilities
  • Error propagation: Mistakes in early steps compound
  • Problem fit: Works best for problems with clear step structure; less helpful for intuitive leaps

Reasoning improves reliability, not raw capability. A model that can't solve a problem won't solve it by thinking longer, but a model that sometimes can will succeed more often.

The future of reasoning

Reasoning capability is a frontier. Current directions:

  • Longer thinking: Models that can "ponder" for minutes, not seconds
  • Learned reasoning: Training models to develop their own reasoning strategies
  • Tool-augmented reasoning: Using code execution and search within reasoning chains
  • Verifiable reasoning: Formal methods to check reasoning validity

The insight that inference-time computation matters is reshaping how we think about AI capability. Training determines potential; inference determines realization.
