A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
Abstract
We propose a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates N candidate solutions, and then chooses the best one via a multiple-round knockout tournament where each pair of candidates is compared K times and only the winners move on to the next round. In a minimalistic implementation, both stages can be executed with a black-box LLM alone and nothing else (e.g., no external verifier or reward model), and a total of N × (K + 1) highly parallelizable LLM calls are needed for solving an input problem. Assuming that a generated candidate solution is correct with probability $p_{\mathrm{gen}} > 0$ and that a comparison between a pair of correct and incorrect solutions identifies the right winner with probability $p_{\mathrm{comp}} > 0.5$ (i.e., better than a random guess), we prove theoretically that the failure probability of the proposed algorithm decays to zero exponentially with respect to N and K: $\Pr(\text{final output is incorrect}) \le (1 - p_{\mathrm{gen}})^N + \lceil \log_2 N \rceil \, e^{-2K(p_{\mathrm{comp}} - 0.5)^2}$. Our empirical results on the challenging MMLU-Pro benchmark validate the technical assumptions, as well as the efficacy of the proposed algorithm and the gains from scaling up its test-time compute.
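The two-stage procedure in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` and `compare` are hypothetical stand-ins for the two LLM calls (candidate generation, and pairwise comparison returning whether the first candidate is judged better).

```python
def knockout_best_of_n(generate, compare, problem, N=8, K=3):
    """Two-stage sketch of the algorithm described in the abstract.

    Stage 1: draw N candidate solutions.
    Stage 2: knockout tournament; each pairing is decided by a
    majority vote over K independent comparisons.
    `generate(problem)` and `compare(problem, a, b)` are assumed
    wrappers around LLM calls, not part of any specific API.
    """
    candidates = [generate(problem) for _ in range(N)]  # stage 1
    while len(candidates) > 1:
        next_round = []
        # pair up adjacent candidates; an odd one out gets a bye
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            # compare(...) -> True iff `a` is judged the better solution
            wins_a = sum(compare(problem, a, b) for _ in range(K))
            next_round.append(a if wins_a > K / 2 else b)
        if len(candidates) % 2 == 1:
            next_round.append(candidates[-1])
        candidates = next_round
    return candidates[0]
```

With N candidates, the tournament runs ceil(log2 N) rounds and N − 1 pairings in total, so the K comparisons per pairing plus the N generations give roughly the N × (K + 1) LLM calls stated in the abstract, all of which parallelize within each stage and round.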
Community
This work proposes a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates multiple candidate solutions, and then selects a single final output via a knockout tournament of pairwise comparisons among the candidates. Under certain technical assumptions about the input problem and the LLM(s), the failure probability of the proposed algorithm provably decays to zero exponentially with respect to the amount of test-time compute.
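The exponential decay can be checked numerically by evaluating the bound from the abstract. The helper below and the example parameter values (p_gen = 0.2, p_comp = 0.7) are illustrative choices, not figures reported in the paper.

```python
import math

def failure_bound(N, K, p_gen, p_comp):
    """Upper bound on the failure probability from the abstract:
    (1 - p_gen)^N + ceil(log2 N) * exp(-2 K (p_comp - 0.5)^2).
    The first term bounds the chance that no candidate is correct;
    the second bounds the chance a correct candidate loses some round.
    """
    generation_term = (1 - p_gen) ** N
    tournament_term = math.ceil(math.log2(N)) * math.exp(
        -2 * K * (p_comp - 0.5) ** 2
    )
    return generation_term + tournament_term

# Illustrative (hypothetical) parameters: scaling up K shrinks the bound.
for K in (10, 50, 100):
    print(K, failure_bound(N=16, K=K, p_gen=0.2, p_comp=0.7))
```

Note that the two terms pull in different directions: increasing N shrinks the generation term but grows the ceil(log2 N) factor, so K must also grow for the overall bound to keep decreasing.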
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning (2024)
- Learning How Hard to Think: Input-Adaptive Allocation of LM Computation (2024)
- Enhancing Mathematical Reasoning in LLMs by Stepwise Correction (2024)
- A Comparative Study on Reasoning Patterns of OpenAI's o1 Model (2024)
- Scaling LLM Inference with Optimized Sample Compute Allocation (2024)
- Keep Guessing? When Considering Inference Scaling, Mind the Baselines (2024)
- What Makes Large Language Models Reason in (Multi-Turn) Code Generation? (2024)