Test-time compute is the idea that you can improve a model’s answers by letting it do more work at inference, instead of making the model bigger. Generating a longer chain of thought, sampling many solutions and taking a majority vote, or searching over candidate steps all spend more compute per question in exchange for higher accuracy.
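To make the majority-vote idea concrete, here is a minimal sketch of sampling several answers to one question and keeping the most common one. It is an illustration under stated assumptions, not a reference implementation: `make_toy_sampler` is a hypothetical stand-in for a real model call, and its 40% per-sample accuracy and its answer strings are made-up numbers chosen only to show the trend.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def self_consistency(
    sample_answer: Callable[[str], str], question: str, n_samples: int
) -> str:
    """Spend n_samples model calls on one question, then vote."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return majority_vote(answers)


def make_toy_sampler(
    correct: str, wrong: List[str], p_correct: float, rng: random.Random
) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: right with probability p_correct,
    otherwise a uniformly random wrong answer. Not a real model API."""

    def sample_answer(question: str) -> str:
        if rng.random() < p_correct:
            return correct
        return rng.choice(wrong)

    return sample_answer


def estimate_accuracy(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the vote over n_samples picks the right answer."""
    rng = random.Random(seed)
    # Assumed toy setting: 40% correct, errors spread over three wrong answers.
    sampler = make_toy_sampler("42", ["41", "43", "44"], p_correct=0.4, rng=rng)
    hits = sum(
        self_consistency(sampler, "toy question", n_samples) == "42"
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # More samples per question means more inference compute per question.
    for n in (1, 5, 25, 125):
        print(f"samples per question = {n:3d}  accuracy ~ {estimate_accuracy(n):.2f}")
```

In this toy setting the vote helps because the correct answer is the single most likely one even though it is right less than half the time, and the wrong answers are scattered. If a model's most common answer is itself wrong, sampling more only makes the vote more confidently wrong.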
This is the second scaling axis behind reasoning models. Pre-training scaling grows the weights; test-time scaling grows the amount of thinking per query. The two are complementary, and reasoning models are partly the result of training a model to use test-time compute well.
Useful background: Snell et al. (2024) show that optimally scaling test-time compute can, in some settings, beat simply using a much larger model; a Hugging Face overview surveys the main techniques; and Zeng et al. (2025) study how to revisit and refine reasoning at inference.
