Temperature controls randomness in text generation. Low temperature means predictable, focused responses. High temperature means creative, varied ones.
Why does the same prompt sometimes give different answers?
When a model generates text, it doesn't just pick the most likely next token. It samples from a probability distribution. This controlled randomness produces variety.
Temperature is the dial that controls this randomness. Low temperature makes the distribution sharper, concentrating probability on likely tokens. High temperature flattens the distribution, giving unlikely tokens more chance.
How temperature works
The model outputs raw scores (logits) for each possible next token. Before sampling, these scores are divided by the temperature value, then converted to probabilities with the softmax function.
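Here is a minimal sketch of that scaling step in Python. The logits are made up for illustration; a real model scores tens of thousands of vocabulary tokens at once.

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Divide logits by temperature, then convert to probabilities with softmax."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Made-up logits for four candidate next tokens.
logits = [4.0, 3.0, 2.0, 1.0]

for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {np.round(apply_temperature(logits, t), 3)}")
# T=0.5 concentrates almost all probability on the top token;
# T=2.0 spreads it much more evenly across the candidates.
```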
Temperature = 0: Always pick the highest-probability token (greedy decoding; implementations special-case this rather than dividing by zero). Deterministic in principle.
Temperature = 0.7: Moderate randomness. Usually sensible with some variety.
Temperature = 1.0: Standard randomness. Probabilities used as-is.
Temperature = 2.0: High randomness. Less likely tokens get much more chance.
Lower temperature means safer, more predictable text. Higher temperature means riskier, more surprising text.
When to use different temperatures
Low temperature (0-0.3):
Factual questions
Code generation
Data extraction
Anything where there's a "right" answer
Medium temperature (0.5-0.8):
General conversation
Explanations
Problem-solving
Most everyday use
High temperature (1.0+):
Creative writing
Brainstorming
Exploring alternatives
When you want surprises
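If you want to codify these starting points in an application, a simple lookup works. The exact numbers below are judgment calls drawn from the ranges above, not fixed rules:

```python
# Rough temperature defaults by task type; tune per use case.
TEMPERATURE_DEFAULTS = {
    "factual_qa": 0.2,
    "code_generation": 0.2,
    "data_extraction": 0.0,
    "conversation": 0.7,
    "explanation": 0.6,
    "creative_writing": 1.0,
    "brainstorming": 1.1,
}

def pick_temperature(task: str) -> float:
    # Fall back to a middle-of-the-road value for unknown tasks.
    return TEMPERATURE_DEFAULTS.get(task, 0.7)
```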
The determinism question
With temperature=0, is output fully deterministic? Mostly, but not always.
Sources of non-determinism:
GPU floating-point arithmetic is not associative, so results can vary slightly with execution order
Batch composition might affect results
API providers may use sampling internally even at temperature=0
Model updates can change behavior
If you need reproducibility, some APIs offer seed parameters. Even then, exact reproduction isn't guaranteed across different hardware or model versions.
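As a concrete example, the OpenAI Python SDK exposes an optional seed parameter on chat completions; the sketch below assumes that SDK, and the model name is just a placeholder. Other providers offer similar but not identical options.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Name three prime numbers."}],
    temperature=0,
    seed=42,  # best-effort determinism, not a guarantee
)
print(response.choices[0].message.content)
```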
The sampling process in detail
Full generation with sampling:
Model outputs logits (raw scores) for all vocabulary tokens
Divide logits by temperature
Apply softmax to convert to probabilities
Apply top-k filtering (keep only the k most probable tokens)
Apply top-p filtering (keep the most probable tokens until their cumulative probability reaches p)
Renormalize remaining probabilities
Sample one token from this distribution
Repeat for next token
Each step shapes the distribution. Temperature first, then filtering, then sampling. The final token could be any of the survivors, weighted by probability.
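Put together, the pipeline is short enough to write out. This is a simplified NumPy sketch with made-up logits and arbitrary k and p cutoffs, not any particular library's implementation:

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Pick one token index: temperature -> softmax -> top-k -> top-p -> sample."""
    logits = np.asarray(logits, dtype=np.float64)

    if temperature == 0:
        return int(np.argmax(logits))  # greedy: skip sampling entirely

    # Steps 1-3: scale by temperature, convert to probabilities.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Step 4: top-k — zero out everything outside the k most probable tokens.
    order = np.argsort(probs)[::-1]
    probs[order[top_k:]] = 0.0

    # Step 5: top-p — keep tokens until cumulative probability reaches p.
    order = order[:top_k]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # always keep at least one token
    probs[order[cutoff:]] = 0.0

    # Steps 6-7: renormalize the survivors and sample one of them.
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Illustrative logits for a tiny 6-token vocabulary.
logits = [5.0, 4.2, 3.1, 1.0, 0.5, -2.0]
print([sample_next_token(logits, temperature=0.7, top_k=4, top_p=0.9) for _ in range(10)])
```

Running the last line a few times shows the same survivors appearing in different orders, weighted by their renormalized probabilities.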
Temperature and quality
Higher temperature doesn't mean better or worse. It means different.
Too low: repetitive, boring, gets stuck in loops
Too high: incoherent, random, loses the thread
The sweet spot depends on the task. There's no universally optimal temperature. Experimentation reveals what works for your specific use case.
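A quick way to run that experiment is to send the same prompt at a few temperatures and compare the outputs side by side. The sketch below again assumes the OpenAI Python SDK; the model name and prompt are placeholders.

```python
from openai import OpenAI

client = OpenAI()
prompt = "Suggest a name for a coffee shop run by robots."

for temperature in (0.2, 0.7, 1.2):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    print(f"T={temperature}: {response.choices[0].message.content}")
```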
Temperature Comparison
[Interactive demo: adjust the temperature and see how the same prompt produces different outputs.]
Temperature is a symptom
The need for temperature reveals something about how LLMs work. They don't compute "the answer." They compute a probability distribution over possible answers. Sampling from that distribution is where the specific response emerges.
This is fundamentally different from a calculator or search engine, which return definite results. The LLM always sees multiple possibilities. Temperature determines how it navigates them.