27 Nov 2025
I wanted to understand how you can turn a standard language model into a reasoning model. In this post, we will try to build tree of thoughts reasoning on top of a base instruction-tuned model.
Tree of thoughts search is a test-time compute reasoning strategy. This means we are “buying intelligence” by spending more compute and generating more tokens during inference in order to produce better results.
Tree of thoughts divides a problem into steps, and at each step the algorithm explores multiple possible solutions. The model then scores each partial solution and expands only the most promising branches of the search tree.
The figure below illustrates how the search works. We start with the task description at the top of the tree. The task is divided into two steps (max_depth=2). Step 1 is to calculate (2 + 1). Step 2 is to multiply the result from the first step by 3.
At step 1, the model generates k=3 solutions and scores them. It picks the two (beam_width=2) results with the highest scores and expands them at step 2. Finally, the model returns the solution with the highest score as the final answer.
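Before involving a real model, the listing below is a toy sketch of this beam search over the example from the figure. The candidate steps and their scores are hard-coded, made-up values purely for illustration; in the actual implementation later in the post, the model generates and scores them.

# Toy sketch of the beam search from the figure above; candidate steps and
# scores are hard-coded, made-up values, purely to show the control flow.
beam_width = 2

# step 1: k=3 candidate first steps, each with a (made-up) score
step1 = [("2 + 1 = 3", 0.9), ("2 + 1 = 4", 0.2), ("2 * 1 = 2", 0.4)]

# keep only the beam_width highest-scoring partial solutions
beams = sorted(step1, key=lambda c: c[1], reverse=True)[:beam_width]

# step 2: expand every surviving beam with candidate continuations and scores
step2 = []
for partial, _ in beams:
    step2.append((partial + "\n" + "3 * 3 = 9", 0.95))
    step2.append((partial + "\n" + "3 * 3 = 6", 0.10))

# the highest-scoring full path is the final answer
best_path, _ = max(step2, key=lambda c: c[1])
print(best_path)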
Let’s see how we can build this in Python using Hugging Face’s transformers.
We start by defining a wrapper class around AutoModelForCausalLM called TreeOfThoughtsReasoner.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
class TreeOfThoughtsReasoner:
def __init__(self, model_id: str):
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.tokenizer = AutoTokenizer.from_pretrained(model_id)
self.model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map=self.device
        )

This class implements a query_model method, which simply calls the underlying model
with a given prompt and returns k responses. Setting do_sample=True means that instead of
always picking the most probable next token, we sample from the next-token probability distribution with
temperature=0.7. This helps us get diverse outputs.
def query_model(
self, prompt: str, k: int, max_tokens: int = 200
) -> list[str]:
messages = [
{
"role": "system",
"content": "You are a helpful assistant.",
},
{
"role": "user",
"content": prompt,
},
]
txt = self.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
input = self.tokenizer(txt, return_tensors="pt").to(self.device)
output = self.model.generate(
**input,
max_new_tokens=max_tokens,
temperature=0.7,
do_sample=True,
num_return_sequences=k,
pad_token_id=self.tokenizer.eos_token_id
)
# to remove the input prompt from the output
input_len = input.input_ids.shape[1]
# return a list of k outputs
return [
self.tokenizer.decode(
out[input_len:], skip_special_tokens=True
).strip()
for out in output
        ]

The get_next_steps method gives the model a partial solution of the original task and
asks the model to provide the next logical step.
At the beginning, the partial solution is just an empty string. The method retrieves k
different next steps and appends them to the current partial solution.
def get_next_steps(
self,
current_state: str,
task_description: str,
current_depth: int,
max_depth: int,
k: int = 3,
) -> list[str]:
# given the current partial solution,
# generate k possible next steps
prompt = f"""Task: {task_description}
Current partial solution: {current_state}
Based on this, write the single most logical next step.
There will be {max_depth} steps in total.
Right now, you need to generate the step {current_depth}.
Output ONLY the next logical step. Keep it brief."""
possible_steps = self.query_model(prompt, max_tokens=100, k=k)
        return [current_state + "\n" + step for step in possible_steps]

The evaluate_state method uses the model to self-evaluate the current partial
solution based on the description of the original task. It returns a score
between 0 and 1. The higher the score, the better the partial solution is.
def evaluate_state(
self, current_state: str, task_description: str
) -> float:
# ask the model to score the current partial solution
prompt = f"""Task: {task_description}
Proposed partial solution: {current_state}
Evaluate if this solution is on the right track to solving the task.
Consider logical consistency and constraints.
Give a score between 0.1 and 1.0 where 1.0 is perfect.
Output ONLY the number."""
response = self.query_model(prompt, max_tokens=10, k=1)[0]
try:
# extract the first floating point number found in the text
score = float(re.findall(r"0\.\d+|1\.0|0", response)[0])
except Exception:
score = 0.5
        return score

The solve method implements the search algorithm:

1. Generate k different first steps to solve the task.
2. Assign a score to each of them and pick the two (beam_width=2) highest-scoring solutions.
3. For each of them, generate k different second steps.
4. Assign scores to all the k * beam_width second-step solutions, pick the best, and expand them.
5. Repeat until max_depth, then return the full solution with the highest score.

    def solve(
self,
task: str,
max_depth: int = 3,
beam_width: int = 2
) -> str:
# start with 1 empty state
current_beams = [""]
for current_depth in range(1, max_depth + 1):
all_candidates = []
for beam in current_beams:
next_steps = self.get_next_steps(
beam, task, current_depth, max_depth, k=3
)
for possible_step in next_steps:
score = self.evaluate_state(possible_step, task)
all_candidates.append(
{"score": score, "step": possible_step}
)
# keep only the top `beam_width` candidates
# for the next depth level
all_candidates.sort(key=lambda x: x["score"], reverse=True)
current_beams = [
candidate["step"]
for candidate in all_candidates[:beam_width]
]
        return current_beams[0]  # Return the single best path

We can run the tree of thoughts search on the simple arithmetic problem from the figure at the beginning of the post.
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tot_engine = TreeOfThoughtsReasoner(model_id)
problem = "Calculate (2 + 1) * 3"
print(tot_engine.solve(problem, max_depth=3, beam_width=2))
# Next logical step:
# (2 + 1) = ?
# Solution:
# 3
# Step 2:
# 3 * 3 = ?
# 3 * 3 = 9

I made a small benchmark where I asked both the base model and the tree of thoughts reasoning model to solve 30 different arithmetic problems of the form $(a + b) \times c$, where $a, b, c$ are random integers between 1 and 100. The base model and the tree of thoughts model achieved 7% and 16% accuracy, respectively.
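The benchmark loop can be sketched roughly as follows. The extract_last_int helper is an assumption made for this sketch (it simply treats the last integer in the model's answer as its final result), and the base model baseline is the same loop with tot_engine.query_model(problem, k=1)[0] in place of solve.

import random
import re

def extract_last_int(text: str) -> int | None:
    # treat the last integer in the answer as the model's final result
    numbers = re.findall(r"-?\d+", text)
    return int(numbers[-1]) if numbers else None

n_problems = 30
correct = 0
for _ in range(n_problems):
    a, b, c = [random.randint(1, 100) for _ in range(3)]
    problem = f"Calculate ({a} + {b}) * {c}"
    answer = tot_engine.solve(problem, max_depth=3, beam_width=2)
    if extract_last_int(answer) == (a + b) * c:
        correct += 1

print(f"Tree of thoughts accuracy: {correct / n_problems:.0%}")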
The notebook with all the code is available on my GitHub.