How NuminaMath Won the 1st AIMO Progress Prize
Introduction
The AI Math Olympiad (AIMO) prize is a competition that aims to drive the open development of AI models that excel in mathematical reasoning. In this blog post, we share how we won the first AIMO progress prize with our model, NuminaMath 7B TIR.
The AI Math Olympiad (AIMO) Prize
The AIMO Prize is a grand prize of $5M that will be awarded to whoever creates an AI model that can win a gold medal in the International Mathematical Olympiad (IMO). Alongside the grand prize, AIMO has introduced a series of progress prizes to mark milestones toward this goal. The first progress prize was held as a Kaggle competition, with problems that are less challenging than those in the IMO but are at the level of IMO preselection.
Our Winning Solution
Our solution to the first AIMO progress prize consisted of three main components:
- A recipe to fine-tune DeepSeekMath-Base 7B to act as a "reasoning agent" that can solve mathematical problems via a mix of natural language reasoning and the use of the Python REPL to compute intermediate results.
- A novel decoding algorithm for tool-integrated reasoning (TIR) with code execution feedback to generate solution candidates during inference.
- A variety of internal validation sets that we used to guide model selection and avoid overfitting to the public leaderboard.
The Training Recipe
Our fine-tuning recipe was largely based on the MuMath-Code paper, which involves training the model in two stages:
Stage 1: Fine-tune the base model on a large, diverse dataset of natural language math problems and solutions
- We used a dataset of several hundred thousand problem-solution pairs, covering topics from high school mathematics to competition-level mathematics.
- The checkpoint produced by this stage, rather than the base model, is the starting point for Stage 2.
Stage 2: Fine-tune the model from Stage 1 on a synthetic dataset of tool-integrated reasoning
- We used a synthetic dataset built from the roughly 60,000 problems described in the Tool-Integrated Reasoning section below, where each problem is decomposed into a sequence of rationales, Python programs, and their outputs.
- Training on rationales interleaved with programs and their outputs is what lets the final model use the Python REPL to compute intermediate results.
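To make the Stage 2 data shape concrete, here is a minimal sketch of how one tool-integrated sample could be represented and flattened into training text. The `TirStep` and `render_tir_sample` names, the "Problem:" and "Final answer:" labels, and the fence markers are assumptions made for illustration; they are not the exact template we used.

```python
from dataclasses import dataclass
from typing import List

# Built programmatically so the marker can appear inside this example.
FENCE = "`" * 3


@dataclass
class TirStep:
    rationale: str  # natural language reasoning that motivates the program
    program: str    # Python source the model is expected to write
    output: str     # text captured when the program was executed


def render_tir_sample(problem: str, steps: List[TirStep], answer: str) -> str:
    """Flatten a problem and its steps into one training string (hypothetical layout)."""
    parts = [f"Problem: {problem}"]
    for step in steps:
        parts.append(step.rationale)
        parts.append(f"{FENCE}python\n{step.program}\n{FENCE}")
        parts.append(f"{FENCE}output\n{step.output}\n{FENCE}")
    parts.append(f"Final answer: {answer}")
    return "\n".join(parts)


sample = render_tir_sample(
    "What is the sum of the first 10 positive integers?",
    [TirStep("We can sum the range directly.", "print(sum(range(1, 11)))", "55")],
    "55",
)
print(sample)
```

Keeping the program output in the training text means the model sees the same rationale, code, output rhythm during training that it meets at inference time.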
Good Data is All You Need
For the dataset, we drew heavily on the approaches of DeepSeekMath and other researchers, and scaled them up significantly. The result is a fine-tuning dataset of several hundred thousand problem-solution pairs, covering topics from high school mathematics to competition-level mathematics.
Chain of Thought
This dataset consists of several hundred thousand problems, each with a solution written in Chain of Thought (CoT) style. The sources range from Chinese high school math exercises to US and international mathematics olympiad competition problems.
Tool-Integrated Reasoning
Tool-integrated reasoning (TIR) plays a crucial role in this competition. However, collecting and annotating such data is both costly and time-consuming. To address this, we selected approximately 60,000 problems from the Numina dataset, focusing on those with numerical outputs, most of which are integers.
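The selection step can be sketched as a simple filter that keeps only problems whose reference answer parses as an integer. This is an illustrative sketch, not our actual pipeline: the record layout (`problem` and `answer` keys) and both function names are assumptions.

```python
from typing import Dict, Iterable, List, Optional


def parse_integer_answer(answer: str) -> Optional[int]:
    """Return the answer as an int if it denotes an integer, else None."""
    text = answer.strip().replace(",", "")
    try:
        return int(text)
    except ValueError:
        pass
    try:
        value = float(text)
    except ValueError:
        return None  # fractions, LaTeX, or free text are not handled here
    return int(value) if value.is_integer() else None


def keep_integer_problems(problems: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Keep records whose 'answer' field is an integer, storing it as an int."""
    kept = []
    for record in problems:
        value = parse_integer_answer(record["answer"])
        if value is not None:
            kept.append({**record, "answer": value})
    return kept


records = [
    {"problem": "Compute 6 * 7.", "answer": "42"},
    {"problem": "Compute 1 / 2.", "answer": "0.5"},
    {"problem": "Compute 3000 / 3.", "answer": "1,000.0"},
    {"problem": "Simplify the ratio.", "answer": "\\frac{1}{2}"},
]
filtered = keep_integer_problems(records)
print([r["answer"] for r in filtered])
```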
Taming the Variance with Self-Consistency and Tool Integrated Reasoning (SC-TIR)
As other competitors noted, this competition posed several challenges with respect to model submission and evaluation:
- The evaluation API provides problems in random order, so tactics like early stopping produce high variance: one run may have more hard problems at the start, which leaves less time for the remainder (and vice versa).
- Most innovations in LLM inference require access to modern GPUs, so standard methods like Flash Attention 2 or torch.compile do not work on T4 GPUs.
To handle this, we took a different approach based on tool-integrated reasoning:
- For each problem, copy the input N times to define the initial batch of prompts to feed vLLM.
- Sample N diverse completions until a complete block of Python code is produced.
- Execute each Python block and concatenate the output, including tracebacks if they appear.
- Repeat M times to produce a batch of generations of size N and depth M, allowing the model to self-correct code errors using the traceback.
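The loop above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our submission code: `generate` stands in for a batched vLLM sampling call, `fake_generate` is a toy stub that first writes buggy code and then corrects it after seeing a traceback, and `run_python` executes code in-process with no sandbox or timeout, which a real system would need.

```python
import contextlib
import io
import re
import traceback
from typing import Callable, List

FENCE = "`" * 3  # built programmatically so the marker can appear in this example
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_python(code: str) -> str:
    """Run code and return its stdout, followed by the traceback if it raised."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__main__"})  # not sandboxed: illustration only
    except Exception:
        buffer.write(traceback.format_exc())
    return buffer.getvalue()


def sc_tir(problem: str, generate: Callable[[List[str]], List[str]], n: int, m: int) -> List[str]:
    """Grow n traces for up to m rounds of generate, execute, append output."""
    traces = [problem] * n          # copy the input N times
    active = list(range(n))         # traces that still produce code
    for _ in range(m):              # repeat M times
        if not active:
            break
        completions = generate([traces[i] for i in active])
        still_active = []
        for i, completion in zip(active, completions):
            traces[i] += completion
            blocks = CODE_RE.findall(completion)
            if blocks:              # execute the code and feed the output back
                traces[i] += f"\n{FENCE}output\n{run_python(blocks[-1])}{FENCE}\n"
                still_active.append(i)
        active = still_active
    return traces


def fake_generate(prompts: List[str]) -> List[str]:
    """Toy stand-in for the model: fixes its code once it sees a traceback."""
    outs = []
    for prompt in prompts:
        code = "print(6 * 7)" if "Traceback" in prompt else "print(6 * seven)"
        outs.append(f"\nCompute the product.\n{FENCE}python\n{code}\n{FENCE}")
    return outs


traces = sc_tir("What is 6 times 7?", fake_generate, n=4, m=2)
print(traces[0])
```

In the toy run, the first round fails with a `NameError`, the traceback is appended to the trace, and the second round produces working code. A trace that emits no code block is treated as finished and is not sampled again.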
Avoiding the Curse of Overfitting
Overfitting to the public leaderboard is a common risk in Kaggle competitions, and even more so when the test set is just 50 problems. In addition, the rules allowed at most two submissions per day, making a robust internal validation dataset crucial for pacing our development.
To guide model selection, we used four internal validation sets to gauge the performance of our models on math problems of varying difficulty:
- AMC (83 problems): We picked all the problems from AMC12 2022 and AMC12 2023 and kept those that can be converted to integer outputs.
- AIME (90 problems): We picked all the problems from AIME 2022, AIME 2023, and AIME 2024 to measure how well our models could perform on difficult problems, as well as to gauge the most common failure modes.
- MATH level 4 (754 problems): We retained only the problems with integer outputs, to simplify majority voting and mimic competition evaluation.
- MATH level 5 (721 problems): We retained only the problems with integer outputs, to simplify majority voting and mimic competition evaluation.
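Restricting every set to integer answers makes majority voting an exact-match count. The sketch below shows one way to score a validation set this way; the function names are hypothetical, and a candidate of `None` is assumed to mean that no integer could be parsed from a generation.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(candidates: Sequence[Optional[int]]) -> Optional[int]:
    """Return the most frequent integer candidate, ignoring unparsed ones."""
    votes = Counter(c for c in candidates if c is not None)
    if not votes:
        return None
    # Ties go to the candidate that was seen first.
    return votes.most_common(1)[0][0]


def accuracy(predictions: Sequence[Optional[int]], references: Sequence[int]) -> float:
    """Fraction of problems whose voted answer equals the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Each inner list holds the candidate answers sampled for one problem.
sampled = [[42, 42, 7, None], [3, 5, 5], [None, None]]
references = [42, 5, 10]
predictions = [majority_vote(c) for c in sampled]
print(predictions, accuracy(predictions, references))
```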
Other Ideas We Tried
We also tried a few approaches that were ultimately discarded in favor of the MuMath-Code recipe:
- Training a pure CoT model and using majority voting for evaluation
- Training an MMOS model to solve problems with Python in a single step
- Applying Kahneman-Tversky Optimisation (KTO) to new completions sampled from the SFT model
Numina's Future - Looking for Contributors and Partners!
Following Numina's initial success in winning the AIMO 2024 progress prize, we now aim to pursue our mission of fostering the development of artificial and human intelligence in the field of mathematics. You can visit our website to learn more about our projects, and feel free to drop us a note at [email protected].
Acknowledgements
We thank Thomas Wolf and Leandro von Werra for enabling the Numina and Hugging Face collaboration. We also thank Hugo Larcher for helping make the GPUs go brrrr on the Hugging Face cluster, Colin Raffel for his advice on model merging methods, and Omar Sanseviero for feedback on the blog post.
We also want to express our gratitude to Mistral.ai, General Catalyst, Answer.AI, and the Beijing International Center for Mathematical Research @ Peking University, who supported the project from the beginning.
Finally, we thank the AIMO Prize team for launching such an exciting and inspiring competition!
Code Blocks
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModel, AutoTokenizer


class MathDataset(Dataset):
    """Wraps a list of {'input': str, 'label': int} records for a classifier."""

    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        # Pad and truncate to a fixed length so the default collate function can stack items.
        inputs = self.tokenizer(
            item['input'],
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': inputs['input_ids'].flatten(),
            'attention_mask': inputs['attention_mask'].flatten(),
            'label': torch.tensor(item['label']),
        }


class MathModel(nn.Module):
    """BERT encoder with a two-class linear head."""

    def __init__(self):
        super().__init__()
        # AutoModel loads the bare encoder, whose output exposes pooler_output.
        self.bert = AutoModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        pooled_output = self.dropout(outputs.pooler_output)
        return self.classifier(pooled_output)


# Placeholder records; replace with your own data.
data = [{'input': 'What is 2 + 2?', 'label': 1}]

model = MathModel()
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
dataset = MathDataset(data, tokenizer)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Train the model
python train.py --model math_model --data data.json --batch_size 32 --epochs 10
# Evaluate the model
python evaluate.py --model math_model --data data.json --batch_size 32
# Use the model for inference
python inference.py --model math_model --input input.txt --output output.txt
Note: The code blocks are just examples and may need to be modified to fit your specific use case.
Source: https://huggingface.co/blog/winning-aimo-progress-prize




