27 Nov 2025
I wanted to understand how you can turn a standard language model into a reasoning model. In this post, we will try to build tree of thoughts reasoning on top of a base instruction-tuned model.
Tree of thoughts search is a test-time compute reasoning strategy. This means we are “buying intelligence” by spending more compute and generating more tokens during inference in order to produce better results.
Tree of thoughts divides a problem into steps, and at each step the algorithm explores multiple possible solutions. The model then scores each partial solution and expands only the most promising branches of the search tree.
The figure below illustrates how the search works. We start with the task description at the top of the tree. The task is divided into two steps (max_depth=2). Step 1 is to calculate (2 + 1). Step 2 is to multiply the result from the first step by 3.
At step 1, the model generates k=3 solutions and scores them. It picks the two (beam_width=2) results with the highest scores and expands them at step 2. Finally, the model returns the solution with the highest score as the final answer.
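Before involving a real model, the listing below is a toy sketch of this beam search over the example from the figure. The candidate steps and their scores are hard-coded, made-up values purely for illustration; in the actual implementation later in the post, the model generates and scores them.

# Toy sketch of the beam search from the figure above; candidate steps and
# scores are hard-coded, made-up values, purely to show the control flow.
beam_width = 2

# step 1: k=3 candidate first steps, each with a (made-up) score
step1 = [("2 + 1 = 3", 0.9), ("2 + 1 = 4", 0.2), ("2 * 1 = 2", 0.4)]

# keep only the beam_width highest-scoring partial solutions
beams = sorted(step1, key=lambda c: c[1], reverse=True)[:beam_width]

# step 2: expand every surviving beam with candidate continuations and scores
step2 = []
for partial, _ in beams:
    step2.append((partial + "\n" + "3 * 3 = 9", 0.95))
    step2.append((partial + "\n" + "3 * 3 = 6", 0.10))

# the highest-scoring full path is the final answer
best_path, _ = max(step2, key=lambda c: c[1])
print(best_path)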
Let’s see how we can build this in Python using Hugging Face’s transformers.
We start by defining a wrapper class around AutoModelForCausalLM called TreeOfThoughtsReasoner.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
class TreeOfThoughtsReasoner:
def __init__(self, model_id: str):
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
self.model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map=self.device
        )

This class implements a query_model method, which simply calls the underlying model
with a given prompt and returns k responses. Setting do_sample=True means that instead of
always picking the most probable next token, we sample from the next-token probability distribution with
temperature=0.7. This helps us get diverse outputs.
def query_model(
self, prompt: str, k: int, max_tokens: int = 200
) -> list[str]:
messages = [
{
"role": "system",
"content": "You are a helpful assistant.",
},
{
"role": "user",
"content": prompt,
},
]
txt = self.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
input = self.tokenizer(txt, return_tensors="pt").to(self.device)
output = self.model.generate(
**input,
max_new_tokens=max_tokens,
temperature=0.7,
do_sample=True,
num_return_sequences=k,
pad_token_id=self.tokenizer.eos_token_id
)
# to remove the input prompt from the output
input_len = input.input_ids.shape[1]
# return a list of k outputs
return [
self.tokenizer.decode(
out[input_len:], skip_special_tokens=True
).strip()
for out in output
        ]

The get_next_steps method gives the model a partial solution of the original task and
asks the model to provide the next logical step.
At the beginning, the partial solution is just an empty string. The method retrieves k
different next steps and appends them to the current partial solution.
def get_next_steps(
self,
current_state: str,
task_description: str,
current_depth: int,
max_depth: int,
k: int = 3,
) -> list[str]:
# given the current partial solution,
# generate k possible next steps
prompt = f"""Task: {task_description}
Current partial solution: {current_state}
Based on this, write the single most logical next step.
There will be {max_depth} steps in total.
Right now, you need to generate the step {current_depth}.
Output ONLY the next logical step. Keep it brief."""
possible_steps = self.query_model(prompt, max_tokens=100, k=k)
        return [current_state + "\n" + step for step in possible_steps]

The evaluate_state method uses the model to self-evaluate the current partial
solution based on the description of the original task. It returns a score
between 0 and 1. The higher the score, the better the partial solution is.
def evaluate_state(
self, current_state: str, task_description: str
) -> float:
# ask the model to score the current partial solution
prompt = f"""Task: {task_description}
Proposed partial solution: {current_state}
Evaluate if this solution is on the right track to solving the task.
Consider logical consistency and constraints.
Give a score between 0.1 and 1.0 where 1.0 is perfect.
Output ONLY the number."""
response = self.query_model(prompt, max_tokens=10, k=1)[0]
try:
# extract the first floating point number found in the text
score = float(re.findall(r"0\.\d+|1\.0|0", response)[0])
except Exception:
score = 0.5
        return score

The solve method implements the search algorithm:

1. Generate k different first steps to solve the task.
2. Assign a score to each of them and pick the two (beam_width=2) highest-scoring solutions.
3. For each of them, generate k different second steps.
4. Assign scores to all the k * beam_width second-step solutions, pick the best, and expand them.
5. Repeat until max_depth, then return the full solution with the highest score.

    def solve(
self,
task: str,
max_depth: int = 3,
beam_width: int = 2
) -> str:
# start with 1 empty state
current_beams = [""]
for current_depth in range(1, max_depth + 1):
all_candidates = []
for beam in current_beams:
next_steps = self.get_next_steps(
beam, task, current_depth, max_depth, k=3
)
for possible_step in next_steps:
score = self.evaluate_state(possible_step, task)
all_candidates.append(
{"score": score, "step": possible_step}
)
# keep only the top `beam_width` candidates
# for the next depth level
all_candidates.sort(key=lambda x: x["score"], reverse=True)
current_beams = [
candidate["step"]
for candidate in all_candidates[:beam_width]
]
        return current_beams[0]  # Return the single best path

We can run the tree of thoughts search on the simple arithmetic problem from the figure at the beginning of the post.
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tot_engine = TreeOfThoughtsReasoner(model_id)
problem = "Calculate (2 + 1) * 3"
print(tot_engine.solve(problem, max_depth=3, beam_width=2))
# Next logical step:
# (2 + 1) = ?
# Solution:
# 3
# Step 2:
# 3 * 3 = ?
# 3 * 3 = 9

I made a small benchmark where I asked both the base model and the tree of thoughts reasoning model to solve 30 different arithmetic problems of the form $(a + b) \times c$, where $a, b, c$ are random integers between 1 and 100. The base model and the tree of thoughts model achieved 7% and 16% accuracy, respectively.
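The benchmark loop can be sketched roughly as follows. The extract_last_int helper is an assumption made for this sketch (it simply treats the last integer in the model's answer as its final result), and the base model baseline is the same loop with tot_engine.query_model(problem, k=1)[0] in place of solve.

import random
import re

def extract_last_int(text: str) -> int | None:
    # treat the last integer in the answer as the model's final result
    numbers = re.findall(r"-?\d+", text)
    return int(numbers[-1]) if numbers else None

n_problems = 30
correct = 0
for _ in range(n_problems):
    a, b, c = [random.randint(1, 100) for _ in range(3)]
    problem = f"Calculate ({a} + {b}) * {c}"
    answer = tot_engine.solve(problem, max_depth=3, beam_width=2)
    if extract_last_int(answer) == (a + b) * c:
        correct += 1

print(f"Tree of thoughts accuracy: {correct / n_problems:.0%}")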
The notebook with all the code is available on my GitHub.