Temperature controls randomness in text generation. Low temperature means predictable, focused responses. High temperature means creative, varied ones.
Why does the same prompt sometimes give different answers?
When a model generates text, it doesn't just pick the most likely next token. It samples from a probability distribution. This controlled randomness produces variety.
Temperature is the dial that controls this randomness. Low temperature makes the distribution sharper, concentrating probability on likely tokens. High temperature flattens the distribution, giving unlikely tokens more chance.
How temperature works
The model outputs raw scores (logits) for each possible next token. Before sampling, these scores are divided by the temperature value, then converted to probabilities with the softmax function.
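Here is a minimal sketch of that scaling step in Python. The logits are made up for illustration; a real model scores tens of thousands of vocabulary tokens at once.

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Divide logits by temperature, then convert to probabilities with softmax."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Made-up logits for four candidate next tokens.
logits = [4.0, 3.0, 2.0, 1.0]

for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {np.round(apply_temperature(logits, t), 3)}")
# T=0.5 concentrates almost all probability on the top token;
# T=2.0 spreads it much more evenly across the candidates.
```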
Temperature = 0: Always pick the highest-probability token (greedy decoding; implementations special-case this rather than dividing by zero). Deterministic in principle.
Temperature = 0.7: Moderate randomness. Usually sensible with some variety.
Temperature = 1.0: Standard randomness. Probabilities used as-is.
Temperature = 2.0: High randomness. Less likely tokens get much more chance.
Lower temperature means safer, more predictable text. Higher temperature means riskier, more surprising text.
When to use different temperatures
Low temperature (0-0.3):
Factual questions
Code generation
Data extraction
Anything where there's a "right" answer
Medium temperature (0.5-0.8):
General conversation
Explanations
Problem-solving
Most everyday use
High temperature (1.0+):
Creative writing
Brainstorming
Exploring alternatives
When you want surprises
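If you want to codify these starting points in an application, a simple lookup works. The exact numbers below are judgment calls drawn from the ranges above, not fixed rules:

```python
# Rough temperature defaults by task type; tune per use case.
TEMPERATURE_DEFAULTS = {
    "factual_qa": 0.2,
    "code_generation": 0.2,
    "data_extraction": 0.0,
    "conversation": 0.7,
    "explanation": 0.6,
    "creative_writing": 1.0,
    "brainstorming": 1.1,
}

def pick_temperature(task: str) -> float:
    # Fall back to a middle-of-the-road value for unknown tasks.
    return TEMPERATURE_DEFAULTS.get(task, 0.7)
```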
The determinism question
With temperature=0, is output fully deterministic? Mostly, but not always.
Sources of non-determinism:
GPU floating-point arithmetic is not associative, so results can vary slightly with execution order
Batch composition might affect results
API providers may use sampling internally even at temperature=0
Model updates can change behavior
If you need reproducibility, some APIs offer seed parameters. Even then, exact reproduction isn't guaranteed across different hardware or model versions.
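As a concrete example, the OpenAI Python SDK exposes an optional seed parameter on chat completions; the sketch below assumes that SDK, and the model name is just a placeholder. Other providers offer similar but not identical options.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Name three prime numbers."}],
    temperature=0,
    seed=42,  # best-effort determinism, not a guarantee
)
print(response.choices[0].message.content)
```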
The sampling process in detail
Full generation with sampling:
Model outputs logits (raw scores) for all vocabulary tokens
Divide logits by temperature
Apply softmax to convert to probabilities
Apply top-k filtering (keep only the k most probable tokens)
Apply top-p filtering (keep the most probable tokens until their cumulative probability reaches p)
Renormalize remaining probabilities
Sample one token from this distribution
Repeat for next token
Each step shapes the distribution. Temperature first, then filtering, then sampling. The final token could be any of the survivors, weighted by probability.
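Put together, the pipeline is short enough to write out. This is a simplified NumPy sketch with made-up logits and arbitrary k and p cutoffs, not any particular library's implementation:

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Pick one token index: temperature -> softmax -> top-k -> top-p -> sample."""
    logits = np.asarray(logits, dtype=np.float64)

    if temperature == 0:
        return int(np.argmax(logits))  # greedy: skip sampling entirely

    # Steps 1-3: scale by temperature, convert to probabilities.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Step 4: top-k — zero out everything outside the k most probable tokens.
    order = np.argsort(probs)[::-1]
    probs[order[top_k:]] = 0.0

    # Step 5: top-p — keep tokens until cumulative probability reaches p.
    order = order[:top_k]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # always keep at least one token
    probs[order[cutoff:]] = 0.0

    # Steps 6-7: renormalize the survivors and sample one of them.
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Illustrative logits for a tiny 6-token vocabulary.
logits = [5.0, 4.2, 3.1, 1.0, 0.5, -2.0]
print([sample_next_token(logits, temperature=0.7, top_k=4, top_p=0.9) for _ in range(10)])
```

Running the last line a few times shows the same survivors appearing in different orders, weighted by their renormalized probabilities.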
Temperature and quality
Higher temperature doesn't mean better or worse. It means different.
Too low: repetitive, boring, gets stuck in loops
Too high: incoherent, random, loses the thread
The sweet spot depends on the task. There's no universally optimal temperature. Experimentation reveals what works for your specific use case.
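A quick way to run that experiment is to send the same prompt at a few temperatures and compare the outputs side by side. The sketch below again assumes the OpenAI Python SDK; the model name and prompt are placeholders.

```python
from openai import OpenAI

client = OpenAI()
prompt = "Suggest a name for a coffee shop run by robots."

for temperature in (0.2, 0.7, 1.2):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    print(f"T={temperature}: {response.choices[0].message.content}")
```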
Temperature Comparison
[Interactive demo: adjust the temperature and see how the same prompt produces different outputs.]
Temperature is a symptom
The need for temperature reveals something about how LLMs work. They don't compute "the answer." They compute a probability distribution over possible answers. Sampling from that distribution is where the specific response emerges.
This is fundamentally different from a calculator or search engine, which return definite results. The LLM always sees multiple possibilities. Temperature determines how it navigates them.