
Inference-Time Compute

Fundamentals

Additional computational effort spent during a model's response generation, such as chain-of-thought reasoning or search, to improve output quality at the cost of speed and resources.

Inference-time compute refers to the strategy of allocating more computational resources during the generation phase (inference) rather than during training, in order to produce higher-quality outputs. Instead of simply generating the first plausible answer, the model spends additional time reasoning, exploring multiple solution paths, or verifying its own outputs.
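The "explore multiple solution paths, then verify" idea can be sketched as best-of-N sampling: draw several candidate answers and keep the one a verifier scores highest. This is a minimal toy sketch, not a real system; `generate_candidate` and `verify` are hypothetical stand-ins for sampling from an LLM and for a learned reward model or programmatic checker.

```python
import random

def generate_candidate(problem: str, rng: random.Random) -> int:
    """Toy stand-in for one sampled model response.

    A real system would sample a full answer from an LLM; here we
    return a noisy guess around the true answer (42) for illustration."""
    return 42 + rng.choice([-2, -1, 0, 0, 0, 1, 2])

def verify(problem: str, answer: int) -> float:
    """Toy verifier: higher score means a more plausible answer.

    Real systems use a learned reward model or a programmatic checker;
    here we pretend we can score distance to a known target."""
    return -abs(answer - 42)

def best_of_n(problem: str, n: int = 16, seed: int = 0) -> int:
    """Spend extra inference compute: sample n candidates, keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [generate_candidate(problem, rng) for _ in range(n)]
    return max(candidates, key=lambda a: verify(problem, a))

print(best_of_n("What is 6 * 7?", n=16))
```

Raising `n` trades latency and cost for a better chance that at least one sampled path is correct, which is the core inference-time compute trade-off.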

This approach gained prominence with reasoning models like OpenAI's o1 and o3, DeepSeek-R1, and Claude's extended thinking mode. These models use techniques such as chain-of-thought reasoning, internal deliberation, and test-time search to work through complex problems step by step. The key insight is that some problems benefit more from "thinking longer" at inference time than from training on more data.
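A concrete version of combining chain-of-thought with extra samples is self-consistency: sample several independent reasoning paths and majority-vote their final answers. The sketch below is illustrative only; `sample_answer` is a hypothetical placeholder for sampling a full reasoning trace from a model and parsing out its final answer, with an assumed per-sample accuracy `p_correct`.

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random, p_correct: float) -> str:
    """Toy stand-in for one chain-of-thought sample.

    A real system would sample a reasoning trace from an LLM and extract
    the final answer; here the 'model' is right with probability p_correct
    and otherwise returns a scattered wrong answer."""
    return "42" if rng.random() < p_correct else str(rng.randint(0, 99))

def self_consistency(question: str, n: int = 25, seed: int = 0,
                     p_correct: float = 0.6) -> str:
    """Sample n independent reasoning paths and majority-vote the answers."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng, p_correct) for _ in range(n))
    answer, _count = votes.most_common(1)[0]
    return answer

print(self_consistency("What is 6 * 7?"))
```

Because correct reasoning paths tend to agree while errors scatter, the majority answer is usually more reliable than any single sample, at the cost of `n` times the generation compute.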

Inference-time compute represents a paradigm shift in AI scaling. While traditional scaling laws focused on increasing model size and training data, inference-time scaling shows that even smaller models can match larger ones on difficult tasks when given enough compute budget at generation time. The trade-off is increased latency and cost per query, making it most valuable for tasks where accuracy matters more than speed.

Last updated: March 1, 2026