The standard recipe for improving a language model’s reasoning is expensive: collect hundreds of training samples, generate thousands of rollouts, then update weights or optimize prompts. The cost is baked into the assumption that improvement requires scale.
A new paper posted to arXiv challenges that assumption. It introduces CORE (Contrastive Reflection), which takes a different approach: instead of accumulating data, it accumulates insights.
The algorithm works by comparing a model’s successful and unsuccessful reasoning traces on a given problem. From that comparison, it generates a short natural-language “insight”: a compact description of a reasoning strategy or constraint that separates the correct path from the wrong one. That insight gets stored and fed into the prompt on future attempts.
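To make the loop concrete, here is a minimal sketch of the contrast-then-store idea described above. It is not the paper's implementation: the names `Trace`, `InsightMemory`, `contrastive_reflect`, and `fake_reflect` are hypothetical, the prompt wording is invented for illustration, and the `reflect` callable stands in for whatever language model call produces the insight.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """One reasoning attempt on a problem, labeled by whether it was correct."""
    reasoning: str
    correct: bool


@dataclass
class InsightMemory:
    """Hypothetical store of natural-language insights, kept outside the model."""
    insights: List[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        insight = insight.strip()
        if insight and insight not in self.insights:
            self.insights.append(insight)

    def build_prompt(self, problem: str) -> str:
        # Prepend stored insights to the problem on future attempts.
        if not self.insights:
            return problem
        bullets = "\n".join(f"- {text}" for text in self.insights)
        return f"Insights from earlier attempts:\n{bullets}\n\nProblem: {problem}"


def contrastive_reflect(
    problem: str,
    traces: List[Trace],
    reflect: Callable[[str], str],
    memory: InsightMemory,
) -> bool:
    """Compare successful and failed traces; store one insight if a contrast exists."""
    successes = [t.reasoning for t in traces if t.correct]
    failures = [t.reasoning for t in traces if not t.correct]
    if not successes or not failures:
        return False  # Nothing to contrast, so no insight is generated.
    reflection_prompt = (
        f"Problem: {problem}\n\n"
        "Successful reasoning:\n" + "\n".join(successes) + "\n\n"
        "Unsuccessful reasoning:\n" + "\n".join(failures) + "\n\n"
        "In one sentence, state the strategy or constraint that separates them."
    )
    memory.add(reflect(reflection_prompt))
    return True


def fake_reflect(prompt: str) -> str:
    # Assumption: a canned reply standing in for a real model call.
    return "Check units before combining quantities."


memory = InsightMemory()
traces = [
    Trace("Converted 2 km to 2000 m, then added 500 m to get 2500 m.", correct=True),
    Trace("Added 2 and 500 directly to get 502.", correct=False),
]
stored = contrastive_reflect(
    "What is 2 km + 500 m in metres?", traces, fake_reflect, memory
)
prompt = memory.build_prompt("What is 3 km + 250 m in metres?")
print(prompt)
```

In this sketch, a problem where every trace succeeds or every trace fails yields no insight, since there is nothing to contrast. Whether CORE handles that case the same way is not stated in the summary above.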
The results are striking. Across four reasoning tasks, CORE outperforms both parametric methods like GRPO and non-parametric baselines like episodic RAG and MemRL, all while using fewer rollouts. With as few as five training samples, it matches or exceeds the gains of methods that require hundreds.
This is not a marginal efficiency gain. It is a structural shift in how models can improve. The insight is not a weight update or a cached trace. It is an abstraction, written in natural language, that the model can reuse and combine. The paper shows that CORE is also more context-efficient than its peers, requiring fewer prompt tokens to store the same knowledge.
What makes CORE surprising is that it works at all. The received wisdom in reasoning research is that models need many examples to generalize. CORE suggests that the bottleneck is not the number of examples but the ability to extract the right signal from the contrast between success and failure. If the paper’s results hold up, a single well-chosen insight can do more than a large batch of additional rollouts.
For builders, the implication is practical. CORE opens a path to rapid, interpretable model improvement without the infrastructure cost of RL pipelines or the brittleness of prompt optimization. The insights are human-readable. They can be inspected, edited, and combined. The algorithm is non-parametric, meaning the model’s weights stay untouched.
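Because the insights are plain text, inspecting, editing, and combining them needs nothing more than ordinary file handling. The sketch below illustrates that point under stated assumptions: the JSON file layout and the helper names `merge_insights`, `save_insights`, and `load_insights` are hypothetical and do not come from the paper.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def merge_insights(*sources: Iterable[str]) -> List[str]:
    """Combine insight lists, dropping duplicates that differ only in case or spacing."""
    seen = set()
    merged: List[str] = []
    for source in sources:
        for insight in source:
            key = " ".join(insight.lower().split())
            if key and key not in seen:
                seen.add(key)
                merged.append(insight.strip())
    return merged


def save_insights(path: Path, insights: List[str]) -> None:
    # Indented JSON keeps the file easy to read and edit by hand.
    path.write_text(json.dumps(insights, indent=2), encoding="utf-8")


def load_insights(path: Path) -> List[str]:
    return json.loads(path.read_text(encoding="utf-8"))


# Two illustrative insight lists, as if gathered on different tasks.
task_a = ["Check units before combining quantities."]
task_b = [
    "check units before  combining quantities.",
    "Test the edge case n = 0 first.",
]

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "insights.json"
    save_insights(store, merge_insights(task_a, task_b))

    # A human reviewer rewrites one insight; no model weights are involved.
    edited = load_insights(store)
    edited[1] = "Test boundary cases such as n = 0 first."
    save_insights(store, edited)

    final = load_insights(store)

print(final)
```

The whole improvement state here lives in one small text file, which is what makes it auditable in a way that a weight update is not.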
The paper is available on arXiv at http://arxiv.org/abs/2605.28742v1. The question it leaves open is whether the approach scales to tasks where the reward signal is noisy or the reasoning traces are long. That is the next test.