The emergence of reasoning-focused large language models (LLMs) such as DeepSeek R1 and OpenAI o1/o3 marks a significant milestone in AI development. These models have moved from research and development (R&D) projects into real-world production, and they highlight a broader shift in how models are built. Traditionally, the focus has been on pre-training: building and training models from scratch on large corpora. The rise of reasoning LLMs, however, underscores the growing importance of post-training and inference optimization, where models are refined and enhanced after initial training to strengthen their reasoning capabilities and deliver the performance, efficiency, and applicability that practical deployments demand.
In the article “Reinforcement Learning and LLM Post-Training: How LLMs Gain Reasoning Capabilities?”, we discussed how post-training helps LLMs develop consistent reasoning behaviors and generate structured responses, particularly long chain-of-thought (CoT) reasoning traces that lead to correct answers on complex problems. The flip side is that generating long CoT traces substantially increases compute demands during inference.
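To get a feel for the scale of this cost, consider a rough back-of-envelope sketch (the per-token latency and token counts below are illustrative assumptions, not measurements):

```python
# Rough back-of-envelope: decoding cost grows linearly with trace length.
# The per-token latency here is an illustrative assumption, not a benchmark.
per_token_seconds = 0.02       # assumed decode latency per generated token

direct_answer_tokens = 50      # short answer with no reasoning trace
long_cot_tokens = 4_000        # long chain-of-thought trace

print(f"direct answer: {direct_answer_tokens * per_token_seconds:.0f}s")  # ~1s
print(f"long CoT:      {long_cot_tokens * per_token_seconds:.0f}s")       # ~80s
```

Under these assumptions, a long reasoning trace costs roughly two orders of magnitude more decode time than a direct answer, which is why inference-time compute must be spent deliberately.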
However, scaling inference-time compute offers a promising avenue for further enhancing reasoning capabilities. A key advantage of inference-time techniques is that they require neither retraining the model nor altering its parameters: they treat the LLM as a black box, which makes them highly compatible with how LLM-based agents operate. LLM-based agents often use LLMs for more than direct output generation; complex tasks such as task decomposition, multi-step reasoning, and planning typically require further processing of the generated results. This is precisely where inference-time compute (or test-time compute) comes into play.
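To make this black-box property concrete, below is a minimal sketch of one well-known inference-time technique, self-consistency: sample several completions and majority-vote on the final answer. The `generate` callable is a hypothetical stand-in for any sampling-based LLM API; no model weights are touched.

```python
import random
from collections import Counter
from typing import Callable

def self_consistency(generate: Callable[[str], str],
                     prompt: str,
                     n_samples: int = 8) -> str:
    """Sample several completions from a black-box LLM and majority-vote.

    `generate` is any callable mapping a prompt to a sampled answer string
    (e.g., a wrapper around an LLM API called with temperature > 0).
    The model's parameters are never retrained or modified.
    """
    answers = [generate(prompt) for _ in range(n_samples)]
    # Return the answer that appears most often across the samples.
    best_answer, _ = Counter(answers).most_common(1)[0]
    return best_answer

if __name__ == "__main__":
    # Toy stand-in for an LLM: a noisy solver that is right most of the time.
    def noisy_generate(prompt: str) -> str:
        return random.choice(["42", "42", "42", "41"])

    print(self_consistency(noisy_generate, "What is 6 * 7?"))
```

Because the model is only queried, never modified, the same wrapper works with any provider's API, which is also why such techniques slot naturally into agent frameworks.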
As reasoning-focused LLMs continue to evolve, the emphasis on post-training and inference optimization is becoming increasingly critical. By leveraging inference-time techniques, we can unlock deeper reasoning capabilities, enabling LLMs to deliver more accurate, efficient, and practical solutions for real-world applications.
This article delves into the major strategies for optimizing inference-time compute, exploring techniques that balance performance, cost, and scalability. By reviewing these approaches, we aim to provide a comprehensive picture of how different inference-time techniques improve LLM outputs, with a focus on reasoning and alignment capabilities. Portions of this content were initially published in Sections 8 and 9 of Agents in the Era of Large Language Models: A Systematic Overview (II), with revisions and updates for clarity and completeness.
Inference-time computing has emerged as a powerful paradigm for enabling language models to "think" more deeply and methodically about complex challenges, much like expert human reasoning. Inference-time compute refers to the computational resources—such as processing power, memory, and energy—required to generate predictions or outputs from a trained machine learning model when it is applied to new, unseen data. This phase, known as inference, is distinct from the training phase, where the model learns patterns from large datasets. Efficient inference-time compute is crucial for deploying AI applications in real-world scenarios, including image recognition, natural language processing, and autonomous vehicles.