Hugging Face’s Open R1 project has completed its first major milestone, releasing a fully open reproduction of the distillation pipeline that powers DeepSeek-R1. The project, which began in February 2025, now includes a curated reasoning dataset of 350,000 verified traces and a 7-billion-parameter model that matches or exceeds DeepSeek’s own distilled 7B model on several key benchmarks.
The news matters because DeepSeek-R1, released by the Chinese AI lab in January 2025, was a breakthrough in chain-of-thought reasoning. But the weights were released under a permissive license without the training data, the reward model, or the reinforcement learning pipeline. Researchers could use the model, but they could not study how it was built or reproduce the results. Open R1 sets out to close that gap.
The project’s plan of attack, laid out in the GitHub repository, breaks the R1 reproduction into three steps. Step 1, now complete, replicates the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1. Step 2 aims to replicate the pure reinforcement learning pipeline that DeepSeek used to create R1-Zero, the model that learns reasoning entirely through RL applied to a base model, without supervised fine-tuning. Step 3 would show that the full pipeline from base model to RL-tuned model works end-to-end.
The centerpiece of Step 1 is the Mixture-of-Thoughts dataset, released on May 26, 2025. It contains 350,000 reasoning traces distilled from DeepSeek-R1, spanning mathematics, coding, and science. Each trace is verified, meaning the dataset is not just raw generations but curated examples that produce correct answers. The dataset is designed to teach language models to reason step by step.
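To make "verified" concrete, here is a minimal sketch of an answer-matching filter. It is an illustration under stated assumptions, not the project's curation code: the record fields (`trace`, `reference`), the `Answer:` marker, and the helper names are all hypothetical.

```python
import re

# Hypothetical format: the field names and the "Answer:" marker are assumptions
# for illustration, not the actual schema of the Mixture-of-Thoughts dataset.
ANSWER_RE = re.compile(r"Answer:\s*(.+?)\s*$")


def extract_answer(trace):
    """Return the text after the final 'Answer:' marker, or None if absent."""
    match = ANSWER_RE.search(trace.strip())
    return match.group(1) if match else None


def keep_verified(records):
    """Keep only the records whose extracted answer equals the reference."""
    return [
        rec for rec in records
        if extract_answer(rec["trace"]) == rec["reference"]
    ]


records = [
    {"trace": "2 + 2 is 4.\nAnswer: 4", "reference": "4"},
    {"trace": "A rough guess.\nAnswer: 5", "reference": "4"},
    {"trace": "No final answer given.", "reference": "4"},
]
verified = keep_verified(records)
print(len(verified), "of", len(records), "traces kept")
```

Exact string matching is the simplest possible check. Code traces would more plausibly be verified by running test cases, which this sketch does not attempt.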
The resulting model, OpenR1-Distill-7B, was trained on the same base model as DeepSeek’s distilled 7B: Qwen2.5-Math-7B. The training used supervised fine-tuning with the Mixture-of-Thoughts dataset, running on eight H100 GPUs with 80GB of memory each. The results, published in the repository, show OpenR1-Distill-7B scoring 52.7 on AIME 2024, compared to 51.3 for DeepSeek-R1-Distill-Qwen-7B. On GPQA Diamond, it scores 52.8 versus 52.4. On LiveCodeBench v5, it scores 39.4 versus 37.4. DeepSeek’s model still leads on MATH-500, 93.5 to 89.0, a 4.5-point gap that is the widest of the four.
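The per-benchmark margins are easy to check by hand. The short script below recomputes them from the scores quoted above. The dictionary layout and function name are just for illustration.

```python
# Scores as quoted in the article: (OpenR1-Distill-7B, DeepSeek-R1-Distill-Qwen-7B).
scores = {
    "AIME 2024": (52.7, 51.3),
    "MATH-500": (89.0, 93.5),
    "GPQA Diamond": (52.8, 52.4),
    "LiveCodeBench v5": (39.4, 37.4),
}


def deltas(table):
    """Return OpenR1 minus DeepSeek for each benchmark, rounded to one decimal."""
    return {
        name: round(open_r1 - deepseek, 1)
        for name, (open_r1, deepseek) in table.items()
    }


margins = deltas(scores)
for name, delta in margins.items():
    print(f"{name}: {delta:+.1f}")
```

A positive value means OpenR1-Distill-7B is ahead. A negative one means DeepSeek's distilled model is.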
The project has also released auxiliary datasets. The CodeForces-CoTs dataset, released March 11, 2025, contains 10,000 competitive programming problems and 100,000 solutions distilled from R1. A 7B Qwen model trained on this dataset can outperform Claude 3.7 Sonnet on IOI24, a new benchmark of very hard problems from international olympiads. A 32B model trained on the same data outperforms DeepSeek-R1 itself. The OpenR1-Math-220k dataset, released February 10, 2025, contains 220,000 math traces and produces models that match DeepSeek’s distilled ones on math benchmarks.
This is not a trivial replication. DeepSeek’s tech report, which the project uses as a guide, describes a multi-stage pipeline that includes cold-start data collection, reinforcement learning with rule-based rewards, rejection sampling, and distillation. Each stage requires careful engineering. The GRPO (Group Relative Policy Optimization) training scripts in the repository, for example, use TRL’s vLLM backend to scale training across multiple nodes, with support for sandboxed code execution via E2B and Morph providers. The repository includes Slurm scripts for multi-node training, YAML configs for different model sizes, and detailed instructions for setting up the environment with CUDA 12.4, vLLM 0.8.5, and FlashAttention.
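The "group relative" part of GRPO can be sketched in a few lines: several completions are sampled for the same prompt, and each completion's reward is normalized against the mean and standard deviation of its group. The snippet below shows that idea only. It is not the TRL implementation, which may differ in details such as the variance estimator, and the function name and `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and standard deviation.

    `rewards` holds the scalar rewards of all completions sampled for one prompt.
    `eps` avoids division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two correct (reward 1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline comes from the group itself, no separate value network is needed. When every completion in a group earns the same reward, all advantages are zero and the prompt contributes no learning signal.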
The project is opinionated about its approach. It sets ChatML as the default chat template during training for base models such as Llama that ship without one. It overrides the chat template for DeepSeek’s distilled models because their default template omits the reasoning between the <think> and </think> tags, which interferes with the format reward function. These are the kinds of practical details that a research paper glosses over but a reproduction project must solve.
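To see why the tags matter, consider a simplified format reward. It is a sketch, not the repository's function: the exact pattern, the newline conventions, and the name `format_reward` are assumptions. It pays out only when a completion opens with a reasoning block wrapped in both tags and then gives an answer.

```python
import re

# Assumed layout: "<think>\n" + reasoning + "\n</think>\n" + answer.
# The pattern used by the Open R1 repository may be stricter or different.
THINK_RE = re.compile(r"^<think>\n.*?\n</think>\n.+", re.DOTALL)


def format_reward(completion):
    """Return 1.0 if the completion follows the assumed think/answer layout, else 0.0."""
    return 1.0 if THINK_RE.match(completion) else 0.0


well_formed = "<think>\n2 + 2 = 4\n</think>\n4"
missing_open_tag = "2 + 2 = 4\n</think>\n4"

print(format_reward(well_formed))
print(format_reward(missing_open_tag))
```

If the chat template drops or rewrites part of the reasoning block, the text the reward function sees no longer matches the pattern, and an otherwise good completion is scored as malformed.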
What Open R1 shows is that DeepSeek’s distillation results are not a fluke of proprietary infrastructure or secret data. The same distillation recipe works on open models, with open data, on eight H100 GPUs. The gap between a frontier lab’s internal pipeline and what the open-source community can build is shrinking.
The open question is whether Step 2 will be as successful. Reproducing R1-Zero, the pure RL pipeline, requires curating large-scale datasets for math, reasoning, and code that can provide reliable reward signals. DeepSeek used rule-based reward functions that check for correct answers and formatting, not learned reward models. That approach is simpler but still requires the right data at scale. The Open R1 project has not announced a timeline for Step 2.
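A rule-based accuracy reward can be as small as a deterministic comparison between the model's final answer and a reference. The sketch below is one hypothetical way to write such a check for numeric answers, treating "0.5" and "1/2" as equal. It does not reproduce DeepSeek's or Open R1's reward code, and the function names are illustrative.

```python
from fractions import Fraction


def parse_number(text):
    """Parse a decimal or fraction string into an exact Fraction, or None on failure."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None


def accuracy_reward(predicted, reference):
    """Return 1.0 when the answers agree, 0.0 otherwise.

    Numeric answers are compared by value. Anything else falls back to exact
    string comparison, so algebraically equal expressions can still be rejected.
    """
    p, r = parse_number(predicted), parse_number(reference)
    if p is None or r is None:
        return 1.0 if predicted.strip() == reference.strip() else 0.0
    return 1.0 if p == r else 0.0


print(accuracy_reward("0.5", "1/2"))
print(accuracy_reward("x+1", "1+x"))
```

The second call shows the limit of so simple a rule: equivalent symbolic answers written differently score zero. Problems whose answers can be checked mechanically are what make a rule like this a reliable reward signal.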
For now, the project has done something concrete: it has taken a model that the community could only use and turned it into a pipeline that the community can study, modify, and build on. The Mixture-of-Thoughts dataset and the OpenR1-Distill-7B weights are on the Hugging Face Hub. The training scripts are in the repository. Any researcher with eight H100s can reproduce the results. That is the point.